Journal on Data Semantics XV

Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris...

Author: Stefano Spaccapietra

29 downloads 1635 Views 4MB Size Report

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!

Report copyright / DMCA form

DOWNLOAD PDF

Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany

6720

Stefano Spaccapietra (Ed.)

Journal on Data Semantics XV

13

Editor-in-Chief Stefano Spaccapietra EPFL-IC – Database Laboratory Lausanne, Switzerland E-mail: [email protected]

ISSN 0302-9743 e-ISSN 1611-3349 (LNCS) ISSN 1861-2032 e-ISSN 1861-2040 (Journal on Data Semantics) ISBN 978-3-642-22629-8 e-ISBN 978-3-642-22630-4 DOI 10.1007/978-3-642-22630-4 Springer Heidelberg Dordrecht London New York CR Subject Classification (1998): I.2.4, H.3.5, H.2.8, I.2.3, H.2.3, H.3.3, D.3

© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)

The LNCS Journal on Data Semantics

Computerized information handling has changed its focus from centralized data management systems to decentralized data-exchange facilities. Modern distribution channels, such as high-speed Internet networks and wireless communication infrastructure, provide reliable technical support for data distribution and data access, materializing the new, popular idea that data may be available to anybody, anywhere, anytime. However, providing huge amounts of data on request often turns into a counterproductive service, making the data useless because of poor relevance or inappropriate level of detail. Semantic knowledge is the essential missing piece that allows the delivery of information that matches user requirements. Semantic agreement, in particular, is essential to meaningful data exchange. Semantic issues have long been open issues in data and knowledge management. However, the boom in semantically poor technologies, such as the Web and XML, has boosted renewed interest in semantics. Conferences on the Semantic Web, for instance, attract big crowds of participants, while ontologies on their own have become a hot and popular topic in the database and artiﬁcial intelligence communities. Springer’s LNCS Journal on Data Semantics aims at providing a highly visible dissemination channel for remarkable work that in one way or another addresses research and development related to the semantics of data. The target domain ranges from theories supporting the formal deﬁnition of semantic content to innovative domain-speciﬁc application of semantic knowledge. This publication channel should be of the highest interest to researchers and advanced practitioners working on the Semantic Web, interoperability, mobile information services, data warehousing, knowledge representation and reasoning, conceptual database modeling, ontologies, and artiﬁcial intelligence. Topics of relevance to this journal include: • • • • • • • • • • • • •

Semantic interoperability, semantic mediators Ontologies Ontology, schema and data integration, reconciliation and alignment Multiple representations, alternative representations Knowledge representation and reasoning Conceptualization and representation Multimodel and multiparadigm approaches Mappings, transformations, reverse engineering Metadata Conceptual data modeling Integrity description and handling Evolution and change Web semantics and semi-structured data

VI

• • • • • • •

The LNCS Journal on Data Semantics

Semantic caching Data warehousing and semantic data mining Spatial, temporal, multimedia and multimodal semantics Semantics in data visualization Semantic services for mobile users Supporting tools Applications of semantic-driven approaches

These topics are to be understood as speciﬁcally related to semantic issues. Contributions submitted to the journal and dealing with semantics of data will be considered even if they are not from the topics in the list. Although the journal issues look like the books from the well-known Springer LNCS series, the mode of operation is that of a journal. Contributions are freely submitted by authors and reviewed by the Editorial Board. Contributions may also be invited, and nevertheless carefully reviewed, as in the case for issues that contain extended versions of the best papers from major conferences addressing data semantics issues. Special issues, focusing on a speciﬁc topic, are coordinated by guest editors once the proposal for a special issue is accepted by the Editorial Board. Finally, it is also possible that a journal issue be devoted to a single text. The Editorial Board comprises an Editor-in-Chief (with overall responsibility), a Co-editor-in-Chief, and several members. The Editor-in-Chief has a fouryear mandate. Members of the board have a three-year mandate. Mandates are renewable and new members may be elected at any time. We are happy to welcome you to our readership and authorship, and hope we will share this privileged contact for a long time. IMPORTANT NOTE This is the last JoDS issue published as an LNCS volume. Starting in 2012 JoDS will become a Springer journal with a regular, quarterly, publication schedule. All information about the “new” journal will be available on Springer’s website. A new Call for Papers will very soon give to prospective authors instructions for submission. Stefano Spaccapietra Editor-in-Chief http://lbd.epﬂ.ch/e/Springer/

Previous Issues

• JoDS I:

Special Issue on Extended Papers from 2002 Conferences, LNCS 2800, December 2003 Co-editors: Sal March and Karl Aberer • JoDS II: Special Issue on Extended Papers from 2003 Conferences, LNCS 3360, December 2004 Co-editors: Roger (Buzz) King, Maria Orlowska, Elisa Bertino, Dennis McLeod, Sushil Jajodia, and Leon Strous • JoDS III: Special Issue on Semantic-Based Geographical Information Systems, LNCS 3534, August 2005 Guest Editor: Esteban Zim´ anyi • JoDS IV: Normal Issue, LNCS 3730, December 2005 • JoDS V: Special Issue on Extended Papers from 2004 Conferences, LNCS 3870, February 2006 Co-editors: Paolo Atzeni, Wesley W. Chu, Tiziana Catarci, and Katia P. Sycara • JoDS VI: Special Issue on Emergent Semantics, LNCS 4090, September 2006 Guest Editors: Karl Aberer and Philippe Cudre-Mauroux • JoDS VII: Normal Issue, LNCS 4244, November 2006 • JoDS VIII: Special Issue on Extended Papers from 2005 Conferences, LNCS 4830, February 2007 Co-editors: Pavel Shvaiko, Mohand-Sa¨ıd Hacid, John Mylopoulos, Barbara Pernici, Juan Trujillo, Paolo Atzeni, Michael Kifer, Fran¸cois Fages, and Ilya Zaihrayeu • JoDS IX: Special Issue on Extended Papers from 2005 Conferences (continued), LNCS 4601, September 2007 Co-editors: Pavel Shvaiko, Mohand-Sa¨ıd Hacid, John Mylopoulos, Barbara Pernici, Juan Trujillo, Paolo Atzeni, Michael Kifer, Fran¸cois Fages, and Ilya Zaihrayeu • JoDS X: Normal Issue, LNCS 4900, February 2008

VIII

Previous Issues

• JoDS XI:

Special Issue on Extended Papers from 2006 Conferences, LNCS 5383, December 2008 Co-editors: Jeﬀ Z. Pan, Philippe Thiran, Terry Halpin, Steﬀen Staab, Vojtech Svatek, Pavel Shvaiko, and John Roddick • JoDS XII: Normal Issue, LNCS 5480, July 2009 • JoDS XIII: Special Issue on Semantic Data Warehouses, LNCS 5530, August 2009 Guest Editors: Esteban Zim` anyi and Il-Yeong Song • JoDS XIV: Normal Issue, LNCS 5880, November 2009 Co-editor: Lois Delcambre

JoDS Volume XV

This volume of JoDS presents articles selected from among 25 full-paper submissions received in response to two calls for contributions issued in 2009 and 2010. Reviews of submitted papers resulted in requests for major revisions of several papers. Reviews of revised versions eventually led to the acceptance of ﬁve regular papers for publication. They are listed in the table of contents hereinafter. This volume also hosts, for the ﬁrst time in JoDS history, a special report on the Ontology Alignment Evaluation Initiative, an event that has been held once a year in the last ﬁve years and has attracted considerable attention from the ontology community. We would like to thank authors of all submitted papers as well as all the reviewers who contributed to improving the papers through their detailed comments. This is the last volume of JoDS to be published as an LNCS volume. The next issue of JoDS will be as a regular Springer journal, published quarterly starting in 2012. Please look for the Call for Papers that will announce the “new” journal, whose deﬁnition will introduce some signiﬁcant changes to the LNCS JoDS. One thing will not change: the goal to be a well-known publication venue for high-quality contributions. Meanwhile, we hope you’ll enjoy reading this volume. Stefano Spaccapietra Editor-in-Chief http://lbdwww.epﬂ.ch/e/Springer/

Reviewers

We are very grateful to the external reviewers listed below who helped the editorial board in the reviewing task: Mahmoud Awat Jes´ us Berm´ udez Silvana Castano Ronald Chenu Marco Comerio Tudor Groza Ivan Jureta Domenico Lembo Sotirios Liaskos Jose Antonio Fernandes de Macedo Ming Mao Chun-Kit Ngan Rebecca Ong Matteo Palmonari Gabriella Pasi H´ector P´erez-Urbina Chiara Renso Ernesto Jimenez Ruiz Hao Xu

George Mason University, USA Basque Country University, Spain University of Milan, Italy University of Trento, Italy University of Milan-Bicocca, Italy The University of Queensland, Australia University of Namur, Belgium Universit` a di Roma La Sapienza, Italy York University, Toronto, Canada Federal University of Ceara, Brazil SAP Labs, USA George Mason University, USA ISTI CNR, Italy University of Milan-Bicocca, Italy University of Milan-Bicocca, Italy Oxford University Computing Laboratory, UK ISTI CNR, Italy Universitat Jaume I, Castell´on, Spain University of Trento, Italy

JoDS Editorial Board

Editor-in-Chief Stefano Spaccapietra

EPFL, Switzerland

Coeditor-in-Chief Lois Delcambre

Portland State University, USA

Members Carlo Batini Alex Borgida Shawn Bowers Tiziana Catarci David W. Embley Jerome Euzenat Dieter Fensel Fausto Giunchglia Nicola Guarino Jean-Luc Hainaut Ian Horrocks Arantza Illarramendi Larry Kerschberg Michael Kifer Tok Wang Ling Shamkant B. Navathe Antoni Oliv´e Jos´e Palazzo M. de Oliveira Christine Parent Klaus-Dieter Schewe Heiner Stuckenschmidt Pavel Shvaiko Katsumi Tanaka Yair Wand Eric Yu Esteban Zim´ anyi

Universit` a di Milano-Bicocca, Italy Rutgers University, USA University of California Davis, USA Universit` a di Roma La Sapienza, Italy Brigham Young University, USA INRIA Alpes, France University of Innsbruck, Austria University of Trento, Italy National Research Council, Italy FUNDP Namur, Belgium University of Manchester, UK Universidad del Pa´ıs Vasco, Spain George Mason University, USA State University of New York at Stony Brook, USA National University of Singapore, Singapore Georgia Institute of Technology, USA Universitat Polit`ecnica de Catalunya, Spain Universidade Federal do Rio Grande do Sul, Brazil Universit´e de Lausanne, Switzerland Massey University, New Zealand University of Mannheim, Germany TasLab, Trento, Italy University of Kyoto, Japan University of British Columbia, Canada University of Toronto, Canada Universit´e Libre de Bruxelles (ULB), Belgium

Table of Contents

Regular Papers Capturing Rhetoric and Argumentation Aspects within Scientiﬁc Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tudor Groza, Siegfried Handschuh, and Stefan Decker A Reference Model for Semantic Peer-to-Peer Networks . . . . . . . . . . . . . . . Abdul-Rahman Mawlood-Yunis, Michael Weiss, and Nicola Santoro Discovery of Probabilistic Mappings between Taxonomies: Principles and Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . R´emi Tournaire, Jean-Marc Petit, Marie-Christine Rousset, and Alexandre Termier TwigTable: Using Semantics in XML Twig Pattern Query Processing . . . Huayu Wu, Tok Wang Ling, Bo Chen, and Liang Xu Database Semantics Recovery through Analysis of Dynamic SQL Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anthony Cleve, Jean-Roch Meurisse, and Jean-Luc Hainaut

1 37

66

102

130

Special Report Ontology Alignment Evaluation Initiative: Six Years of Experience . . . . . J´erˆ ome Euzenat, Christian Meilicke, Heiner Stuckenschmidt, Pavel Shvaiko, and C´ assia Trojahn

158

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

193

Capturing Rhetoric and Argumentation Aspects within Scientific Publications Tudor Groza, Siegfried Handschuh, and Stefan Decker DERI, National University of Ireland, Galway, IDA Business Park, Lower Dangan, Galway, Ireland {tudor.groza,siegfried.handschuh,stefan.decker}@deri.org http://www.deri.ie/

Abstract. Dissemination can be seen as a communication process between scientists, in which they expose and support their ﬁndings, while discussing claims stated in related scientiﬁc publications. Often this discourse structure is hidden in the semantics expressed by the publication’s content and thus hard to discover by the reader. Externalization, the process of articulating tacit knowledge into explicit concepts, holds the key to knowledge creation. Consequently, the knowledge becomes crystallized, thus allowing it to be shared with and by others. In this paper, we present SALT (Semantically Annotated LATEX), a semantic authoring framework that aims at deﬁning a clear formalization for externalizing the knowledge captured within rhetorical and argumentation discourses. SALT follows a layered approach with the goal of providing a comprehensive domain-independent model for scientiﬁc publications, based of three ontologies: (i) the Document Ontology, capturing the linear structure of the publication, (ii) the Rhetorical Ontology, modeling the rhetorical and argumentation, and (iii) the Annotation Ontology, linking the rhetoric and argumentation to the publication’s structure and content. SALT can be used independently of the writing environment. As proof-of-concept, we show its application in LATEX, based on a special LATEX syntax and in MS Word 2003, using visual controls. The resulting semantic documents can be used in a variety of applications, one of them being brieﬂy detailed in this paper. Finally, in addition to a detailed discussion on the state-of-the-art, the paper presents the evaluation we have carried out, to analyze the framework’s soundness, suitability for the task and its general usability during the authoring process. Keywords: Semantic Authoring, Rhetorical Structure, Argumentation, Ontologies.

1

Introduction

Dissemination, an important phase of scientiﬁc research, can be seen as a communication process between scientists. They expose and support their ﬁndings,

Corresponding author.

S. Spaccapietra (Ed.): Journal on Data Semantics XV, LNCS 6720, pp. 1–36, 2011. c Springer-Verlag Berlin Heidelberg 2011

2

T. Groza, S. Handschuh, and S. Decker

while discussing claims stated in related scientiﬁc publications. The communication takes place over the course of several publications, where each paper itself contains a rhetorical discourse structure laying out supportive evidence for the raised claims. This discourse structure is captured in the semantics expressed by the publication’s content and thus hard to be completely unveiled by the reader. Externalization, as deﬁned by Nonaka [1], represents the process of articulating tacit knowledge into explicit concepts. As such, it holds the key to knowledge creation. Consequently, the knowledge becomes crystallized, thus allowing it to be shared with and by others. Although made explicit, the externalized knowledge is dependent on the degree of formalization. In the case of the rhetorical and argumentation discourse based on claims, it can be a couple of keywords, or a weakly structured text, both possibly including direct references to the publications stating the actual claims. The World Wide Web represents an essential factor in the dissemination of scientiﬁc work in many ﬁelds. At the same time, its exponential growth is reﬂected in the substantial increase of the amount of scientiﬁc research being published. As an example, in the biomedical domain, the well-known MedLine1 now hosts over 18 million articles, having a growth rate of 0.5 million articles / year, which represents around 1300 articles / day [2]. Semantic Web technologies, with semantic metadata as foundation, have been proved to help at alleviating, at least partially, the above mentioned issues. Used in particular contexts, semantic metadata enables a more fertile search experience, complementing full text search with search based on diﬀerent facets (e.g., one can search for a publication by a speciﬁc author and with some speciﬁc keywords in its title). In addition, subject to its richness, it can also leverage links between publications, e.g., citation networks. In this paper we propose a comprehensive method for externalizing the rhetorical and argumentation discourse captured in the content of scientiﬁc publications and exposing it as semantic metadata. One of the key elements for succeeding stands in a proper decomposition of the publication into elementary knowledge units. These will enable the re-structuring of the content into a knowledge network, encapsulating the rhetoric in its vertices, and acting as realizations of associative trails [3]. Lifted at a global level, outside one publication, these elementary units will form argumentative discourse networks that span across multiple publications. Consequently, we will experience a much richer set of semantic metadata (both shallow and deep) that will improve the current ways of searching for and browsing scientiﬁc publications. The materialization of our proposal is the SALT (Semantically Annotated LATEX) semantic authoring framework. SALT follows a layered approach with the goal of providing a comprehensive domain-independent model for scientiﬁc publications. It is based on three ontologies: (i) the Document Ontology, capturing the linear structure of the publication, (ii) the Rhetorical Ontology, modeling the rhetorical and argumentation, and (iii) the Annotation Ontology, linking the rhetorical and argumentation to the publication’s structure and content. SALT 1

http://medline.cos.com/

Capturing Rhetoric and Argumentation Aspects

3

can be used independently of the writing environment. As proof-of-concept, we have developed a special LATEX syntax and a MS Word plugin to accommodate the annotation of the publications’ content with SALT instances. The resulting semantic documents can be used in a variety of applications, one of them being brieﬂy detailed later in this paper, i.e., a semantic claim federation infrastructure. The remainder of the paper is structured as follows: in Sect. 2 we list a series of objectives considered throughout the development of the framework, while in Sect. 3 we provide a detailed description of SALT. Sect. 4 shows an example of SALT implementation following a manual annotation approach, Sect. 5 details the evaluation we have performed both for the model and the manual annotation and Sect. 6 brieﬂy introduces one possible SALT application. Before concluding in Sect. 8, we perform an extensive analysis of the related work, in Sect. 7.

2

Motivation and Objectives

The need of a comprehensive approach for capturing the rhetoric present within scientiﬁc publications was our main motivation. Such an approach should consider the rhetorical roles that a text span can carry at diﬀerent granularity levels, and not focus on a particular one, in addition to modeling the underlying textual structure, for an appropriate provenance description. The materialization of the approach as semantic metadata will support the author during the authoring process of a publication, by oﬀering the means for structuring the argumentative thread in a coherent manner. Additionally, it will also support the reader in her quest for a richer and more semantic search and browse experience, by representing a valuable input for the corresponding systems. To have a better understanding of our goals, we present below a series of objectives that we have considered throughout the development of SALT, from two perspectives: (i) modeling, and (ii) application as an authoring framework. Modeling objectives. From the modeling perspective, we aim at making a clear decoupling of the textual content (including its intrinsic structure) from the content semantics (in our case, the rhetoric and argumentation facets). At the same time, capturing the original structure of the publication’s content represents an important element, as it provides the provenance information for the actual semantics. The rhetorical aspects of the publication can be observed at diﬀerent granularity levels. For example, an entire paragraph in an introduction could be interpreted as the authors’ motivation, while a phrase within the same introduction, could act as one of the publication’s claims. It is crucial to acquire and maintain these granularity levels, together with the relations among them, as they support the structuring of the content’s coherence and argumentation. Finally, the model’s applicability should be general enough to cover any type of scientiﬁc publication, in any domain.

4

T. Groza, S. Handschuh, and S. Decker

Authoring framework objectives. From an authoring perspective, we focused on two particular objectives: (i) to allow an incremental externalization of the rhetorical aspects, without imposing a certain minimum, and (ii) to provide elements, always present in scientiﬁc publications, with no user eﬀort (e.g., the usual publication metadata). During the last decade, several approaches having the same abstract goal were researched, e.g., ABCDE [4], or ScholOnto [5]. However, they usually focus on some of the objectives listed above, e.g., only on modeling a coarse-grained rhetorical structure, or, are tailored towards a speciﬁc domain. While we did build our approach on the previous research, we aimed at improving the expressiveness of the concepts to be modeled, in addition to trying to cover the entire list of objectives. As we shall see, the result is a rich and comprehensive model, compatible with most of the existing approaches.

3

The SALT Ontologies

SALT (Semantically Annotated LATEX)2 represents an authoring framework that captures the rhetoric and argumentation present within the content of scientiﬁc publications. SALT is based on a set of three ontologies, designed following the objectives mentioned in the previous section. Fig. 1 summarizes the three ontology layers of SALT: – the Document Ontology, modeling the linear structure of the publication, including the temporal and identiﬁcation aspects; – the Rhetorical Ontology, capturing the rhetorical and argumentation structures within publications, and

Fig. 1. SALT Ontological framework layers 2

One should note that although the name suggests a strong connection to LATEX, it is actually independent of any writing environment. The model conceptualizes, in general, the semantics captured in publications, and can be used and applied within any writing environment. We did start with LATEX, and for historical reasons we have decided to keep using the name.

Capturing Rhetoric and Argumentation Aspects

5

– the Annotation Ontology, creating a semantic bridge between the rhetoric and the actual content, enriched with additional shallow metadata. While, generally, attaching instances of such ontologies to text spans leads to the process of annotation, the message that we want to convey by using this particular naming scheme is that we consider the rhetoric to be an annotation for the document. In addition, the annotation layer represents an entry point for coupling other diﬀerent types of annotations for the document, such as domainspeciﬁc concepts or linguistic features. Finally, this layering allows us to achieve our ﬁrst mentioned objective of decoupling the publication’s content from the rhetoric and argumentation. 3.1

The Rhetorical Ontology

Since the main focus of our work is the rhetoric and argumentation aspects of scientiﬁc publications, we start by discussing the Rhetorical Ontology. Throughout its development, we were driven by the second modeling objective, i.e., acquire and maintain the diﬀerent granularity levels of the rhetorical aspects. Consequently, the resulted ontology revolves around two abstract entities, i.e., the discourse knowledge item (or epistemic item) and the rhetorical block. These two entities represent the two levels of granularity targeted by the model: (i) a ﬁne-grained level that creates a semantic wrap around short pieces of individual text – text spans varying from part of a sentence to a couple of sentences less than a paragraph – and (ii) a coarse-grained level that tags paragraphs according to their rhetorical role. To have a better understanding of the discourse knowledge items, without providing a formal deﬁnition, we consider them to be pieces of text that transpire the author’s contribution in the respective publication. This contribution can take several forms. It can represent a claim, i.e., the author’s original contribution motivating the work behind the publication, it can be a position (positive, negative or neutral), i.e., the author’s statement in regards to someone else’s claims, or it can act as an argument that supports the claims stated by the author. Concisely, the discourse knowledge item is an abstract term for all the concepts in the model dealing with the ﬁne-grained semantic structure, i.e., rhetorical elements and argumentation elements. Fig. 2 depicts the overall structure of the Rhetorical Ontology. This is split into three main parts: (i) the rhetorical blocks side, dealing with the coarse-grained rhetorical structure of the publication’s content, (ii) the rhetorical relations side, modeling the ﬁne-grained rhetorical structure of the publication’s content, and (iii) the argumentation side, describing the argumentation captured in the publication’s content. To provide a comprehensive description of the ontology, we will detail separately each part, in the order speciﬁed above. Rhetorical blocks are text spans within the publication having the length of a paragraph and an associated rhetorical role in the overall discourse. Our model proposes nine types of rhetorical blocks: Abstract – provides a short overview on the publication’s content (implicit rhetorical block).

6

T. Groza, S. Handschuh, and S. Decker

Fig. 2. SALT Rhetorical Ontology

Motivation – provides the reason motivating the research presented in the paper Scenario – shows real-world use cases in which the solution provided by the publication is applicable Contribution – describes the solution proposed by the publication Evaluation – describes how the research presented in the paper was evaluated, including its results Discussion – details interesting issues discovered while researching the publication’s topic Background – builds a view over the related work in the ﬁeld Conclusion – provides a conclusive overview of the publication Entities – deﬁnes the list of cited references (implicit rhetorical block). From the list above, two blocks, Abstract and Entities (References), have an intrinsic rhetorical nature and are present in all scientiﬁc publications. As it can be observed, the proposed rhetorical blocks emulate a best-practice structure for any scientiﬁc publication. Thus, from the applicability point of view (cf. also our third modeling objective), since all the concepts carry an abstract semantics, they can be used in any scientiﬁc domain. Nevertheless, for a more speciﬁc deﬁnition of the semantics, some domains might require a specialization of some of the terms. For example, in the biomedical ﬁeld, one might deﬁne Evaluation as a pair of Experiments and Results, but this does not conﬂict with our model, as it just provides a more domain-speciﬁc deﬁnition of it. The Rhetorical relations part consists of Rhetorical Elements and the actual concepts denoting Rhetorical Relations. Both main concepts, together with their specializations, are deﬁned according to the Rhetorical Structure of Text (RST) theory [6]. For the sake of completeness we will re-visit the deﬁnitions in the context of the ontological concepts.

Capturing Rhetoric and Argumentation Aspects

7

A Rhetorical Element can be specialized into a Nucleus or a Satellite. The Nucleus represents a text span playing an important role in the understanding an the entire text. It it essential to the writer’s purpose and reader’s understanding. The Satellite represents a text span playing a secondary role and supporting the Nucleus with additional information. It is often incomprehensible without the nucleus. Usually a piece of text comprising only nuclei (therefore by omitting the satellites) can be understood to some extent. For example, if we consider the sentence: [We have designed a one step RT-PCR assay that requires only a 5h turn-time around ]1 , [thus, making it appropriate for the rapid testing of clinical specimens.]2 its ﬁrst part represents the nucleus, while the latter represents the satellite. If we now focus on the publication as a knowledge spring, rather than on its low-level linguistic structure, we can deﬁne a Claim as a nucleus leveraged to the level of the entire publication, representing a discourse knowledge item that encapsulates the essence of the publication. Similarly, we will have a Support that is a corresponding Satellite for a Claim. For connecting the rhetorical elements deﬁned above, we need to introduce the concept of Rhetorical Relation. This is a linguistic relation between two or more discourse knowledge items having a particular rhetorical role. As an example, consider the following text: When [an ontology is selected as the underlying knowledge base,]1 [the Lexicon Builder automatically extracts entities out of the ontology to build the Lexicon.]2 The second element of the sentence represents a Claim, being part of the publication’s contribution. Consequently, the ﬁrst element will be a Support, connected to the Claim via the Circumstance rhetorical relation, signaled by the presence of the when discourse marker3 . Besides the pair of connecting elements, it is extremely important to understand also the position of the author (or writer) and of the reader, with regards to the intrinsic rhetorical role of the relations. Generally, the author uses a particular rhetorical relation, in a certain context, to suggest something to the reader. Thus, the construction Claim – Rhetorical Relation – Support should be interpreted as follows: the author uses the Support in conjunction with the Rhetorical Relation to attract attention to (or to suggest something about) the Claim. As an example, in the case of the Condition rhetorical relation, the writer suggests the reader that the realization of the Claim depends on the realization of the Support. Considering all these elements, we will use the following abbreviations for deﬁning each of the 12 rhetorical relations introduced by our model: (i) C will represent the Claim, (ii) S will represent the Support, (iii) W will represent the writer, and (iv) R will represent the reader; Also, in the examples we will give 3

For modeling purposes, it is not important to understand what is the role of a discourse marker, or what kinds of discourse makers signal the rhetorical relations.

8

T. Groza, S. Handschuh, and S. Decker

for each deﬁnition, we will show which is the discourse marker that signals the rhetorical relation. In this way we are able to represent the two connecting elements of the relation. Antithesis – C and S are in contrast. Because of the incompatibility that arises from the contrast, one cannot have a positive regard for both of those situations. Comprehending S and the incompatibility between the situations, increases R’s positive regard for C. Example: [Together with the lexical and syntactic information of the parsed queries, semantic information in the knowledge bases can also be utilized to help formulate the formal query,]1 but [how to accomplish this is still an open problem.]2 Cause – C presents a situation that could have caused the situation presented in S. The situation presented in C is more central to W’s purposes than is that presented in S Example: Because [the annotation process is interleaved with the writing process in the LATEXenvironment,]1 [the targeted PDF document does not even exist yet.]2 Circumstance – S sets a framework in the subject matter within which R is intended to interpret C Example: When [an ontology is selected as the underlying knowledge base,]1 [the Lexicon Builder automatically extracts entities out of the ontology to build the Lexicon.]2 Concession – W acknowledges a potential or apparent incompatibility between C and S. Recognizing the compatibility between C and S increases R’s positive regard for C. Example: Although [the ﬁrst natural language interface system came out more than three decades ago,]1 [a fully portable and widely used system for formalized knowledge bases is still unavailable.]2 Condition – The realization of C depends on the realization of S. Example: If [the input query is classiﬁed into some category,]1 [AquaLog will process it correctly.]2 Consequence – S presents a situation that could have caused the agent of the action in C to perform that action. Without the presentation of S, R might not regard the action as motivated. C is more central to W’s purposes in putting forth the C-S combination than S is. Example: [The user-deﬁned synonyms are not mandatory for the Lexicon, and all the mandatory contents are extracted in a totally automatic way.]1 Therefore, [the construction of the Lexicon is portable.]2 Elaboration – S presents additional detail about the situation or some element of subject matter which is presented in C or indiferentially accessible in C. Example: [In SALT, this is covered elsewhere (see the next section), which is why our A is the Abstract of the document.]1 Furthermore, [we extend ABCDE with the concepts Motivation, Scenario and Conclusion.]2 Finally, [the Argumentative part of the ontology allows the further modeling of scientiﬁc discourse in the form of Arguments and Counter Arguments.]3

Capturing Rhetoric and Argumentation Aspects

9

Evidence – R’s comprehending S increases R’s belief of C. Example: Since [one word may match multiple OntoEnties,]1 [there may be diﬀerent combinations and multiple valid OntoTriple result sets.]2 Means – S presents a method or instrument which tends to make realization of C more likely. Example: By [physically sorting data on the subject column,]1 [subject-subject joins between two tables, a very frequent operation, can be realized in linear time.]2 Preparation – S precedes C on the text. S tends to make R more ready, interested or oriented for reading C. Example: [We note A (c1 , c2) the set of acyclic paths between c1 and c2 and HSO : A (c1 , c2 ) —> B the function]1 such that [HSO(p) is true.]2 Purpose – S is to be realized through the activity in C. Example: In order to [help bridge the gap between user vocabulary and ontology vocabulary,]1 [general dictionary WordNet is utilized.]2 Restatement – S restates C, where S and C are of comparable bulk. C is more central to W’s purpose than E is. Example: [To translate a natural language query into a SPARQL query, we must ﬁnd the targets,]1 i.e. [the words that correspond to the variables after SELECT in the resultant SPARQL query.]2

Fig. 3. SALT Rhetorical annotation example

Fig. 3 depicts an example of rhetorical annotation. The entire block is considered to be a Contribution rhetorical block, while at a more ﬁne-grained level, one can observe how the rhetorical elements are delimited and the rhetorical relations connecting them. The markup present at the start of an arrow denotes a Claim, while the one present at the end of the arrow denotes the Support. The last part of the Rhetorical Ontology is the Argumentation. Our argumentation model follows closely the initial approach introduced by IBIS [7] and then developed further by DILIGENT [8]. Consequently, the concepts present in our ontology are perfectly compatible, from the semantics point of view, with the original concepts. The novelty that our approach brings, is the placing of the argumentation in the greater context of structuring the publication’s coherence and rhetoric, by enabling the connection between argumentation elements

10

T. Groza, S. Handschuh, and S. Decker

and the underlying rhetorical structure, built from rhetorical elements and their associated relations. In addition, it is important to note that one of the main goals with regards to argumentation was to provide a quite abstract form on it, similar to an upper level ontology, which can then be customized based on the application domain. The four main concepts representing the core of our model are: Statement, Issue, Position and Argument. A typical argumentation ﬂow would start with one expressing a Statement, which could raise an Issue, or about which others could state their Positions. On its turn, a Position could take the form of a request, or could agree, disagree or be neutral in regards to an issue or a statement. The existing issues or positions could then be argumented (Argument), either by providing a Justification, or by raising a Challenge which would leave the argumentative thread open for debate.

Fig. 4. Example of argumentative discourse network

The model allows the link between a Rhetorical Element and a Position or an Issue, by means of the represents relation. Therefore, a claim can be seen as the author’s position with regards to a statement or an issue, or as an issue per se. Both cases help in building argumentative discourse networks that span across multiple publications. These start from a claim stated in a publication, which is then seen (raised) as an issue in a diﬀerent publication, as shown in Fig. 4. This last publication will be provide arguments or state positions about the issue, positions that can be on their own turns, claims of the current publication. The comprehensive modeling of the communication that spans across multiple publications (by means of argumentation elements) raises a series of interesting challenges. On the one hand, from a conceptual perspective, the elements presented above do enable the capturing of argumentative discourse networks that emerge from the exchange of ideas between publications, and especially from the comparisons present in the publications’ background. On the other hand, more from an authoring perspective, one needs also to consider how is the provenance of the rhetorical elements handled across multiple publications, or how is the managing of ”duplicate” rhetorical elements present in diﬀerent publications done (e.g., the same abstract claim can be found in diﬀerent publications, authored by diﬀerent researchers, and maybe having a slightly diﬀerent textual representation, yet representing the same idea). Consequently, we have

Capturing Rhetoric and Argumentation Aspects

11

decide to give SALT a rather local scope, i.e., by modeling the rhetorical and argumentation elements from the single (local) publication point of view, and deal with the global perspective (and the above mentioned challenges) by means of the KonneXSALT model, described in [9]. 3.2

The Document Ontology

Following the objective of modeling the environment in which the rhetoric and argumentation takes place, for ensuring a proper provenance description, we have developed the Document Ontology. The role of this ontology is to capture the linear structure of the publication, in addition to the identiﬁcation and revisioning information of the publication’s content. The goal was to capture the content at diﬀerent granularity levels, from coarse-grained consisting in the publication as a whole, to ﬁne-grained comprising sentences or text chunks. Also, since we envision the use of the ontology instances without the duplication of the actual content (i.e., hasText-like properties), we adopted a pointer-based approach for mapping the instances to the textual content. Fig. 5 depicts the structure of the ontology, which is composed of the following concepts: Publication – represents the root of the model (and associated tree structure). Each publication for which one creates SALT instances, will have one Publication instance, enriched with property values and relations to the other concept instances present in the model. Section – represents the next level of granularity. To be able to create a full tree, the Sections may have sub-sections which represents on their own turn Sections again. In terms of properties, Sections have a title, optionally an order number, Paragraphs, and may contain Tables and Figures. Paragraph – if the previous two concepts modeled the linear structure of the publication, Paragraph is the ﬁrst concept to model the content. Paragraphs are usually part of a Section, they have an order number, a start pointer and an end pointer. These two values should locate the entire content of the paragraph in the corresponding publication.

Fig. 5. SALT Document Ontology

12

T. Groza, S. Handschuh, and S. Decker

Sentence – represents the decomposition of Paragraphs into subunits. Similarly to these, Sentences also have start and end pointers, with the same interpretations. TextChunk – discourse knowledge items can have diﬀerent lengths between parts of sentence to multiple sentences, less than one full paragraph. That is why, we have introduced the TextChunk concept, to act as an intermediary element. If the discourse knowledge item does not ﬁt perfectly into a Sentence or a Paragraph, then it should be modeled as a TextChunk. Since it represents conceptually the same underlying idea as the previous two concepts, i.e., a piece of the publication’s content, this concept will have the same properties. Table and Figure – model the corresponding concepts of a table and of a ﬁgure, part of the publication. Although we did not put a particular emphasis on these two, one could envision many properties that can be attached to them, such that they provide a rich information to complete the overall image of the publication’s model. Examples of such properties could be: the context reference or the surrounding paragraphs. Revision – captures the idea of change within the other concepts that link to it. Thus, a Sentence that was modiﬁed between two versions of the same publication, will automatically have linked such a Revision instance, pointing to the most current content. Practically, this concepts targets the simple textual revision of the publication’s content. However, if used in a proper environment, it will enable the capture of the evolution of the document at every level, and thus, the exploration of the way in which the rhetoric and argumentation evolved based on the diﬀerent revisions that the document had suﬀered. We have introduced the concept for completeness purposes, yet its actual use depends on the speciﬁc system implementation. As a remark, the shallow metadata attached to some of the enumerated concepts (e.g., title, authors, numbers) is part of the Annotation Ontology. This is because we wanted a clear separation between the linear structure of the publication and its shallow metadata. The identiﬁcation aspects are implicitly present in the framework through the instances URIs. The scope of the identiﬁcation should generally be considered as local, except at the Publication level, where it can also take a global scope. The provenance information is modeled in two ways: (i) at the root level, the Publication has attached a location property, while (ii) at ﬁner-grained levels, each concept has attached a pair of pointers (StartPointer, EndPointer) indicating the exact location of the instance’s content in the text. The actual encoding of these pointers is left unspeciﬁed, as one can choose diﬀerent options, ranging from simple text oﬀsets, special encodings dependent on the internal document structure, or implementations of the XPointer framework [10,11]. As it can be observed, the goal of this ontology is to model the linear structure of scientiﬁc publications, only, and in its current form cannot be used to model other types of ”documents”, such as Wiki pages. While conceptually, our terms emulate also the tree structure of a Wiki page, the exact semantics we have

Capturing Rhetoric and Argumentation Aspects

13

assigned to them, does not describe accurately the domain. In addition, although we capture some temporal aspects by means of Revisions, we treat the authors of such revisions in a quite static manner. We consider a publication to have a ﬁxed set of authors, with a global scope, and we don’t model at a ﬁne-grained level which author wrote which section (or revision), aspect which is important to be captured within the Wiki environment. Nevertheless, our model follows to some extent the W3C Document Object Model4 , with the Publication representing a form of Document and the Sections acting as Nodes.

Fig. 6. Example of linear and rhetorical block structures

To understand the relation between the linear and the rhetorical structures of the publication, Fig. 6 depicts an example. On the left side we present the usual linear structure, divided into sections, while on the right side, the rhetorical block structure is shown, with the rhetorical blocks grouping together scattered parts of the linear structure, based on their rhetorical role. Thus, the ﬁrst paragraph might have a motivation role, as well as the ﬁrst paragraph of the second section. The only two ”sections” that are the same in both the linear and the rhetorical structures are the Abstract and the References, because on their intrinsic rhetorical role. 3.3

The Annotation Ontology

The connection between the rhetoric and the publication content and structure is realized by the Annotation Ontology. The ontology consists of only two main concepts, as depicted in Fig. 7: Annotation – representing the concept that builds the links between layers CitationContext – as a specialization of Annotation, modeling the surrounding of the citation in the publication’s content. The role of these two concepts will be better understood in the context of the following section, where we discuss the way in which the links between the layers are realized. 4

http://www.w3.org/DOM/

14

T. Groza, S. Handschuh, and S. Decker

Fig. 7. SALT Annotation Ontology

In addition, the ontology also includes, virtually, properties and concepts describing the shallow metadata of the publication and of the cited references. We consider these properties as being virtual, because they are not actually part of the ontology, but deﬁned by other ontologies, such as FOAF5 , DublinCore6 , SWRC [12] and the Bibliographic Ontology (BIBO)7 . By adopting this approach, we wanted to re-use well-known and widely used entities, rather than “re-inventing the wheel”. Therefore, the usual shallow metadata of a publication is modeled in our case by means of: (i) the dcmi:creator – foaf:Person couple, to model the author of the publication, and (ii) dcmi:date, dcmi:description, foaf:topic and dcmi:title, to describe the publication’s date, topic, description and title, respectively. At the same time, for capturing the publication type and environment, we used a combination of SWRC and BIBO terms, such as: bibo:AcademicArticle, bibo:Chapter, bibo:Proceedings, bibo:Journal, or swrc:Conference. Last, but not least, the Annotation Ontology provides the gate to attaching domain knowledge to the publication, via the Annotation concept. We discuss more about this in the next section. 3.4

Connecting the Layers

In this section we describe how the three layers previously described connect to each other, to form the overall model of the publication. There are two linking perspectives: (i) the rhetorical (and argumentation) – annotation – document link, and (ii) the publication – shallow metadata link. Fig. 8 depicts the ﬁrst linking perspective. The two anchor points between the three layers are represented by the hasAnnotation and annotates 5 6 7

http://www.foaf-project.org/ http://dublincore.org/ http://bibliontology.com/

Capturing Rhetoric and Argumentation Aspects

15

Fig. 8. Rhetorical – annotation – document link

properties. Thus, any of the rhetorical elements, rhetorical blocks or argumentation elements will have an Annotation instance attached, which in its turn will annotate the corresponding TextChunk, Paragraph or Sentence. One interesting remark would be regarding the domain knowledge coupling. As it can be observed, one can assign a topic8 to an Annotation instance, from a chosen domain ontology. The way in which this assignment can be interpreted is twofolded. On one side, the topic can be seen as an annotation on the rhetorical, e.g., the claim is about that topic, while of the other side, it can be seen as an annotation on the document, e.g., the sentence is about that topic, and it can be found in the document at a particular location. This modeling approach oﬀers advantages especially for inferencing, as it places the domain knowledge at the intersection of the rhetorical with the provenance. However, attaching a topic to the publication’s content can be done without the existence of a rhetorical element. One can simply instantiate an Annotation concept, attach the topic to it and relate the annotation to a chosen TextChunk. To have a better view on how this linking works, Fig. 9 shows an annotation example. It starts at the rhetorical level, from a rhetorical relation (in

Fig. 9. Example of the rhetorical – annotation – document link 8

Please note that we have chosen the very generic ”topic” term to denote any domain knowledge entity. Thus, such a topic can represent a concept or instance of any external domain ontology, e.g., Gene, Protein, etc.

16

T. Groza, S. Handschuh, and S. Decker

Fig. 10. Publication – shallow metadata link

this case Purpose), that connects a claim to a support (their textual counterparts are given in the ﬁgure). Each rhetorical element is then linked to an Annotation instance in the annotation level, and each annotation instance annotates a TextChunk instance, that has start and end pointers attached. At the end, in this particular example, the two text chunks are connected to a Sentence instance, as they are both part of the same sentence. The second linking perspective is the publication – shallow metadata link, as depicted in Fig. 10. As we insisted on decoupling completely the structural information from any other information, the shallow metadata is not part of the document level, but of the annotation level. Thus, one can see how the shallow metadata properties, described in the previous section, are attached to the Publication concept. A more interesting aspect is the citation of references. Instead of just providing the relation between the publication and reference via the cites property, we decided to model also the citation context, as it can bring added value if used properly. Therefore, we have introduced the CitationContext entity that connects the Publication to the TextChunk in which this is actually referred. The Publication will have a pair of cites and hasCitationContext properties for each existing reference. As a consequence, one will be able to easily observe when a reference is present in a publication, without a proper citation (i.e., just added at the end).

4

Creating SALT Metadata

Manual creation of semantic metadata is generally seen as a cumbersome task, as people don’t see its immediate beneﬁts, unless it takes place in a highly interactive environment. Therefore, one way of smoothing the manual annotation process is by creating an interactive environment that will provide immediate feed-back to the user, and rewarding her for the added eﬀort. Obviously, the main drawback of such an approach is that it adds to the already long list of applications that a typical researcher uses, yet another one, which comes with an associated learning curve.

Capturing Rhetoric and Argumentation Aspects

17

Instead of developing a new environment, and with the same goal of reducing the annotation overhead, we propose the interleaving of the writing process with the annotation process, thus enabling concurrent authoring. The main advantage of our proposal is that the author will continue to use the same environment with which she is already familiar, and create annotations with the “tools” provided by the respective environment. For example, in the case of LATEX, the author is able to use any LATEX editor and the usual LATEX commands, plus in addition a small set of commands that we provide for the annotation of the argumentation and rhetorical. This realizes the perfect mixture for a smooth and uninterrupted writing process, associated with a small learning curve. In addition, it follows closely our objective of enabling an incremental externalization of the rhetorical aspects, as the writer can choose at any time what to annotate, or if to annotate at all. For the proof-of-concept, we chose two widely used writing environments, in which we embedded annotation mechanisms for the creation of SALT metadata: (i) LATEX, and (ii) MS Word 2003. Both mechanisms respect the incremental approach and achieve the second authoring objective of providing some elements with no user eﬀort, i.e., they extract the document structure and shallow metadata automatically, thus allowing the user to focus only on the rhetorical annotation. 4.1

Creating SALT Metadata in LATEX

LATEX is a high-quality typesetting system that allows authoring of documents in a programmatic manner. Instead of following a visual (component-driven) approach like MS Word, LATEX introduces a series of commands that the writer uses to produce the formatting and style of the text. Therefore, the LATEX author will be familiar and comfortable with this programmatic approach while writing the actual content of the publication. The usual process of authoring a publication in LATEX starts with “programming” the content of the publication, then is followed by the compilation of the publication’s “source”. The output of such a compilation can take several forms. Generally, when it comes to scientiﬁc publications, the Portable Document Format (PDF) is widely used. Other formats include PostScript (PS ﬁles) or DVI (DeVice Independent ﬁles). LATEX provides the most natural environment for the manual creation of semantic metadata. While using other writing environments, the annotation would impose a serious overhead, due to its characteristic “programming” feature, in LATEX this overhead, although still existing, is reduced to the minimum. In addition, the LATEX author has a special mindset, as one could generalize that the LATEX commands represent themselves an annotation for the content. This is the main reason for us to choose this particular writing environment as a playground for our semantic authoring framework, and to enable concurrent authoring, i.e., interleaving the writing process with the annotation process. To facilitate the actual SALT annotations, we introduced a series of own LATEX commands, representing the counter-part of each of the three sides of the Rhetorical Ontology:

18

T. Groza, S. Handschuh, and S. Decker

Fig. 11. Example of SALT annotation in MS Word

Rhetorical blocks annotation is realized via a set of LATEX environments, one for each type of rhetorical block, except for the implicit ones, i.e., Abstract and Entities. E.g.: \begin{motivation} . . . \end{motivation} Rhetorical elements annotation is materialized via a set of individual commands, such as: \claim for claims, \support for supports, etc. Rhetorical relations annotation is done as well based on a set of individual commands, one for each type of relation, e.g., \antithesis, \circumstance or \purpose. Argumentation elements annotation follows again the same approach, e.g., \position In addition to the annotation of these elements, we have also introduced a special command for connecting rhetorical elements or simple text spans to domain knowledge: \assignTopic. As already mentioned, the result of compiling a set of LATEX sources is a complete new document, having a diﬀerent encoding. Here, we focused on the PDF format, as it is the only one able to capture both the content and the associated metadata within the same document. Hence, the result of compiling a set of SALT annotated LATEX sources will be a semantic PDF document, containing SALT metadata instances in addition to the actual content. At the same time, in this particular case, besides the metadata explicitly created via markup (by using the above-listed commands), we also automatically extract the shallow metadata (e.g., title or authors) and the linear structure of the publication. Consequently, the compilation process will result into a full-ﬂedged SALT instance model of the publication. 4.2

Creating SALT Metadata in MS Word

Unlike the previously described LATEX environment, MS Word follows a strictly visual, component-based paradigm. The author of a publication is focused only

Capturing Rhetoric and Argumentation Aspects

19

on writing the actual content, while the formatting and style is mostly done by the environment with small hints from the writer. To cover also this shift in paradigm and have a comprehensive view on the creation of semantic metadata in diﬀerent environments, we have developed a MS Word module that allows the author to perform concurrent authoring, thus enriching their publications with SALT instances. Following the visual approach, our module provides the necessary visual components to create the markup, in forms of a contextual menu, a side panel and its associated bar menu options. Fig. 11 depicts the SALT visual components in the MS Word environment and an example of claim annotation. From the annotation result perspective, there are two main diﬀerences between the annotation in LATEX and the one in MS Word. Firstly, the resulting document in the latter case is a MS Word DOC ﬁle, as opposed to PDF, which is also able to capture both the metadata and the publication content. Secondly, due to implementation diﬀerences, the shallow metadata and the linear structure of the publication need to be explicitly marked-up in MS Word, as opposed to LATEX where they are automatically extracted at compilation time. This last remark is especially important because the diﬀerence is also made during the evaluation of the annotation task.

5

Evaluation

This section describes the evaluation we have performed for the overall framework, in the context of a manual annotation experiment. We start by introducing the general setting, then detail the evaluation results, based on the given tasks and ﬁnally we conclude with the ﬁndings resulted from the evaluation. 5.1

General Setting

Evaluation goals. We targeted two main categories. The ﬁrst category is the framework design, analyzing its soundness, the resulting beneﬁt and suitability for the task (i.e., semantic enrichment of scientiﬁc publications). The second category looks at the annotation task per se, considering its complexity, comfort and collateral eﬀects. A third and special interleaved category is the learning aspect, i.e., what is the learning curve of the framework as a result of the annotation task. Evaluators. A group of 15 researchers with knowledge of Semantic Web and Semantic Annotation technologies was used. They were well-used to reading and writing scientiﬁc publications and volunteered for the evaluation. Guidelines. Written guidelines (13 pages) were provided, comprising a detailed terminology, deﬁnitions for rhetorical elements, relations and blocks — including examples — and detailed guidelines for the LATEX and MS Word markup. Training. The participants received as training a 30 minutes presentation which consisted of a high-level overview of the SALT model, deﬁnitions and examples for the key concepts used throughout the evaluation (such as rhetorical element, rhetorical relation, etc), and speciﬁcally for rhetorical relations the actual discourse markers that signal the relations with associated examples.

20

T. Groza, S. Handschuh, and S. Decker

Materials and procedure. Each participant chose and annotated an own publication, thus resulting an annotation corpus of 15 publications. The experiment consisted of them reading the publication and performing a series of annotation tasks: (i) three general tasks: annotation of rhetorical blocks, annotation of claims and supports, and deﬁnition of the rhetorical relations connecting the claims and supports; (ii) one speciﬁc task for LATEX users: attaching domain knowledge concepts to claims, and (iii) two speciﬁc tasks for MS Word users: annotation of shallow metadata (title, authors, aﬃliation), and annotation of the linear structure of the publication. Evaluation measures. We used a comprehensive questionnaire (38 questions) with Likert-scale and free text answers, to measure the satisfaction of the participants. An important remark should be noted here. As the evaluators were the authors of the publications used throughout the experiment no inter-annotator agreement measures could have been applied, especially since in this particular case, the annotation is a highly subjective task. Additionally, our goal was not to measure the correctness of the annotation or the agreement between the participants on the annotations, but rather the authors’ comfort in using our annotation framework, together with the framework’s soundness and resulting beneﬁt. Thus the only applicable measure is rating this comfort and satisfaction is by using questionnaires. 5.2

Evaluation Results

Framework design analysis. The framework design analysis looks at three major aspects: – the framework’s suitability for enriching publications within the process of writing, and for annotation in general, or speciﬁc for LATEX or MS Word, – the framework’s soundness, i.e., which are the annotation elements that seem to be more plausible for the annotation task, or which are the ones easier to work with, and – the resulting beneﬁt, i.e., which annotation elements have an unclear beneﬁt. Figs. 12 and 13 depict the results of the framework’s overall suitability. We can observe that while the vast majority (75%) of the subjects consider SALT to

Fig. 12. SALT suitability for enriching publication with in the process of writing

Capturing Rhetoric and Argumentation Aspects

21

Fig. 13. SALT suitability for the annotation task

be well adapted for the semantic enrichment of publications within the process of writing, the situation is diﬀerent when considering the framework’s intuitiveness. This translates into the authors understanding the goals of the enrichment process and concepts, but only after consulting the guidelines and receiving training. More concisely, SALT’s annotation concepts are well adapted for semantic authoring, but are less intuitive than we assumed, especially due to the lack of use of the rhetorical linguistic terms on a daily basis. Some of the comments we have received, pointed to the fact that the framework provides too many possibilities to be intuitive and easy to grasp. The second aspect, the suitability for the annotation task, refers to the means of performing the annotation. Here, at the ﬁrst glance, the results are not as clear as we have expected. Our assumption was that both for LATEX and MS Word the implementation mechanisms blend naturally within the writing environment. Consequently, we expected the subjects to have a clear (positive) opinion on this matter. Unfortunately, only 66.6% considered the SALT methods to be wellsuited for annotation. We found two reasons for this result. In the general case, the subjects remarked the complexity of the framework, while in the speciﬁc case the issues were directly related to the imperfections of the implementation. Fig. 14 A shows that the plausibility of terminology and soundness of the model were considered high by 91% of the participants for rhetorical blocks, whereas fewer and fewer participants said this about the more complex annotation tasks — 66.7% found annotation of claims and support plausible while only 33% found rhetorical relations annotation plausible. Fig. 14 B supports the same tendency from the perspective of the actual annotation mechanism. 75% of the participants worked best with rhetorical blocks, while 41.7% worked best with claims and supports, and only 25% worked best with rhetorical relations. At the same time, Fig. 14 C presents the same trend from yet another angle: none of the participants found the beneﬁt of simple annotations (rhetorical blocks and claim/support) unclear, while 25% were unsure about the beneﬁt of rhetorical relations. This general tendency was expected — it seems intuitive that simpler annotations are easier to grasp, and as a result their usefulness is much more apparent. As a consequence, we see the development of tools which aid the user in the authoring process as crucial.

22

T. Groza, S. Handschuh, and S. Decker

Fig. 14. [A] – SALT soundness; [B] – SALT easiest to grasp concepts; [C] – SALT unclear beneﬁt

The annotation of domain knowledge within SALT, which is not on the scale of complexity of the other annotations, but rather a separate entity, was not well received by most participants — no participants found it plausible, while 33.3% found an unclear beneﬁt in the annotation of domain knowledge. The following section covers the possible reasons for this. Annotation task analysis. Similarly to the framework design analysis, the annotation task analysis looks a three majors aspects: – the annotation complexity, i.e., how complex did the subjects ﬁnd the annotation task when it dealt with each of the concepts captured by SALT, – the annotation comfort, i.e., if the participants considered the annotation to introduce unnecessary interruption or found it unnecessary complex, and – the collateral eﬀect, i.e., what were the collateral eﬀects that emerged from the annotation process. A general remark observed about the annotation complexity (depicted in Fig. 15) was that with increasing complexity of the annotation task at hand, the participants overall impression decreased. The annotation of rhetorical blocks, which is also the simplest of the tasks, was deemed most intuitive, while the annotation of rhetorical relations at the other end of the scale was deemed not intuitive, but instead rather complex by most participants. Outside the scale, we have two more elements. On one side, we ﬁnd again the annotation of domain knowledge. Following the tendency introduced in the previous section, the task of marking domain knowledge concepts was found very complex, if not impossible. There were two main reasons signaled to explain the results: (i) the lack of a single ontology to cover the domain of the publication, or of a mixture of several ontologies, although in this case we would face a diﬀerent

Capturing Rhetoric and Argumentation Aspects

23

Fig. 15. Annotation task complexity

issue, respectively, how to choose the right concept, especially in the presence of a particular concept in multiple ontologies, and (ii) the multitude of annotation possibilities and lack of an automatic tool to perform this task. On the other side, we have the annotation of shallow metadata and linear structure (speciﬁc for MS Word) which was found extremely intuitive and easy by all the participants.

Fig. 16. Annotation comfort

The second aspect we were interested in was the annotation comfort, which we tried to measure by checking whether the annotation task introduces unnecessary interruption or if it is too complex. As also observed in the framework’s suitability, the annotation task seems to be quite complex. In this case, as opposed to the conceptual perspective, the problem arises from the complexity of the annotation mechanism, which obviously leads then to unnecessary interruption. These elements are pointed out in the results presented in Fig. 16. Unfortunately, an extra item driving the results in the negative direction was the imperfection of the implementation, with the most of the participants signaling it to us as a major cause for the interruption. The last aspect we followed in this category of annotation task analysis is the collateral eﬀects manifestation (Fig. 17). The focus was on three eﬀects that we assumed they might emerge during the annotation experiment. The ﬁrst eﬀect was the spotting of weak points in the paper. Our belief was that by annotating claims and rhetorical relations between them, the participant will develop

24

T. Groza, S. Handschuh, and S. Decker

Fig. 17. SALT Annotation collateral eﬀects

a better understanding in how to make the strong points of the publication more prominent, and consequently how to spot which are the weak points of the publication. According to the results, the vast majority of participants (75%) experienced this eﬀect. The second eﬀect was the improving of the understanding in writing a publication. Following the same ideas mentioned above, we believed that the familiarization with the rhetorical relations will improve the way in which the subjects can structure their argumentation. In this case, the results are even better than in the previous, with 83.3% of participants agreeing on the eﬀect. For the last eﬀect we were examining, one which we thought will be the most prominent amongst all, i.e., improving the paper structure, the results were less satisfactory. The understanding of the rhetorical blocks in addition to the rhetorical relations, should have led to a better structuring of the overall publication, in parallel to the argumentation line. Apparently, the eﬀect did emerge, but only for 66.7% of the participants. One possible cause for the low percentage is the participant’s conservatism, i.e., once they learn a certain structure for scientiﬁc publications, it is hard to make them re-think it. Learning aspect. The last category, followed as part of the evaluation goals, was the learning curve. We wanted to observe and measure how does the learning aspect evolve in two particular stages: the initial stage, before the participants become familiarized with the framework and the mid and long-term stages (Fig. 19), after the familiarization process was completed. As we can see from Fig. 18, for the initial stage, we ﬁnd the same consistent results as in the case of the framework’s intuitiveness, i.e., most of the participants ﬁnd the

Fig. 18. SALT learning curve – initial stage

Capturing Rhetoric and Argumentation Aspects

25

Fig. 19. SALT learning curve – mid and long-term stages

annotation with SALT not really intuitive, and obviously they require the presence of the documentation. Positive results are, on the other hand, found in the second stage, where all the participants felt and immediate learning eﬀect and said that they could cope with SALT after not using it for a period of time. Practically, we are dealing here with a signiﬁcant change in the psychological context. In the beginning, before reading anything about it, the complexity of SALT transforms it into an unintuitive annotation method, whereas once the learning eﬀect emerges, this complexity decreases, and the participants realize that the concepts used by SALT represent natural entities used inherently when authoring a publication, but rarely actually mapped to a certain terminology. This leads us to believe that with appropriate tools, SALT could be easily adopted by any researcher. 5.3

Overall Findings

The overall ﬁndings of the evaluation can be split into two categories that complement each other: ﬁnding about the conceptual model and ﬁndings about the developed annotation mechanisms. Regarding the conceptual model, we can conclude that although, at a ﬁrst glance, it is not intuitive, one can easily learn it, adopt it and use it on a regular basis. As we saw, the complexity of the concepts increase with the decrease of their granularity. Nevertheless, the beneﬁt an author receives as a result of the annotation eﬀort tends to follow an inverse direction. Fig. 20 depicts a relative evolution of this beneﬁt, having as parameter the annotation eﬀort9 . As stated in our objectives, SALT is an incremental annotation framework. With no eﬀort, it can already deliver value, from the implicit linear document structure and shallow metadata. By making a simple annotation step (for claims / supports), the value increases proportionally, and the tendency continues with the rhetorical relations and, ﬁnally, with the domain knowledge. At the same time, we believe that with the use of automatic extraction tools, the eﬀort decreases while keeping the beneﬁt constant. 9

The graph represents an approximation of the beneﬁt – eﬀort pair, and in reality is not linear. We chose to show it as linear for an easier understanding of the evolution of beneﬁt, when mapped to the results of the experiment.

26

T. Groza, S. Handschuh, and S. Decker

Fig. 20. SALT incremental approach

The need for tools brings us to the second category of ﬁndings. The manual annotation mechanisms we have provided were appropriate but not suﬃcient to guide the user during the authoring and annotation process. This leads to the need of automatic extraction mechanisms embedded within the authoring environment which will give the author immediate reward for her eﬀort, in addition to an added value feeling. The presence of such tools will improve the user’s understanding of the conceptual model, which in turn will provide the author with the incentive of using the entire range of entities captured by the model via the tools. One aspect that this evaluation does not cover, is the beneﬁt that the presence of the SALT annotation could bring not only to the authors of scientiﬁc publications, but also the readers or reviewers of these publications. We did study, partially, this aspect in a diﬀerent context, however, as the focus of this work is on the authoring side, we have decide it not to feature it here. We can mention, nevertheless, that, readers of the enriched publications found an immediate beneﬁt from the SALT annotations, as they were able to align directly their perception on the publication with the authors’ perspective, by means of the externalized rhetorical structures. On the other hand, the way in which the markup is being presented to the reader also inﬂuences her reaction, which takes us back to the need for proper tool support.

6

Applications of SALT Metadata

Previously, we described the mechanism for manually creating SALT annotations. In this section, we focus on one application that uses to its full potential the metadata resulted from the annotation process. This application represents a semantic claim federation infrastructure, KonneXSALT . KonneXSALT [9] was mostly motivated by the need of an unique (global) access point to the metadata created and embedded into the semantic documents, by means of the mechanisms previously described. This resumed to providing a global storage for this metadata and proper ways for exposing it. The result, a semantic claim federation infrastructure, provides the means for

Capturing Rhetoric and Argumentation Aspects

27

Fig. 21. Browsing argumentative discourse networks in KonneXSALT

ﬁnding claims in scientiﬁc publications and support for browsing argumentative discourse networks, starting from a particular claim. Unlike any other search engine, KonneXSALT was conceived to act as a look-up service for externalized knowledge and realize eﬃciency through the minimization of the data to be indexed. In addition, it represents a pure Semantic Web application, as it handles only RDF data in the form of SALT instances. KonneXSALT aims at reducing the complexity of the search operation by relying on the authors’ incentive for externalizing their knowledge and on minimizing the search space only to this knowledge. As shown in Sect. 5, the beneﬁt an author receives by annotating their publications with SALT is proportional to the eﬀort they put in the annotation process. For the lowest possible level of annotation (i.e., claims and positions), the beneﬁt is already clear: one can easily ﬁnd the claims stated in one’s publications and browse the argumentation line in the related work starting from such a claim (as depicted in Fig. 21). At the same time, the author themselves might ﬁnd interesting connections between their claims and others, about which they might have not been aware. From an eﬃciency point of view, by indexing only the claims in publications, we restrain the search space to the minimum. Detailing the technical aspects of KonneXSALT is out of the scope of this paper. Nevertheless, it is worth mentioning that, except for the textual resources which are indexed with the help of a set of Lucene indexes10 , everything else is stored as RDF graphs into a Sesame11 repository. Practically, the entire information ﬂow within the application is realized via RDF graphs. At the same time, the external access to the metadata is also by default machine-processable. In order to keep this access uniform, with no duplicates and at the same time to make the metadata human readable, the KonneXSALT Web front-end is developed as a set of XSL sheets attached to the resulting RDF graphs. Consequently the burden of transformation is moved from the server to the client: the actual 10 11

http://lucene.apache.org/ http://www.openrdf.org/

28

T. Groza, S. Handschuh, and S. Decker

process of transformation is executed by the local web browser via its built-in XSLT processor. This method helps us to avoid duplication of data and provides ﬂexibility for both machines and humans, because it can be easily read by a human, and at the same time be analyzed by an application or fetched by a crawler.

7

Related Work

Relevant related work for our framework includes a long list of research areas. Practically, each ontology layer, and in the case of the Rhetorical Ontology each side of it, represents a research direction in itself. Therefore, we will take a stepwise approach and, in the following sections, we will analyze one direction at a time. At the end, we perform an overall comparative study of the models that are the closest to our goals and conceptualization and show where our approach brings novelty as opposed to them. As a note, it is worth mentioning that in this section we have considered only the eﬀorts that follow a model-driven approach to discourse representation. 7.1

Modeling the Rhetoric

Lately, several models for describing and structuring the scientiﬁc discourse were developed. Most of the models partly overlap on two aspects: (i) they have the same goal of modeling the rhetoric captured in the publications’ content, and (ii) they all share, at the abstract level, a core group of concepts representing the epistemic items, but use diﬀerent terminologies. The main diﬀerence is given by the types of relations connecting the discourse knowledge items and the use (or not) of the items’ placement in the linear structure of the publication. One of the ﬁrst models was introduced by Teufel [13] and tried to categorize phrases from scientiﬁc publications into seven categories based on their rhetorical role. The categories represented a block structure of the publication, similar to the Rhetorical Blocks in our Rhetorical Ontology. Their goal was to use extracted phrases, tagged with these categories, to realize summarization. The seven categories proposed by Teufel were: Aim, Textual, Own, Background, Contrast, Basis and Other. This approach is mainly driven by the polarity of the discourse knowledge items exposed by a publication. And while the authors make use of the linear structure of the publication (by means of statements mentioning it and not the structure per se), they don’t consider any explicit rhetorical relations that could appear between the extracted statements. Implicitly, one could argue that such relations are encoded in the categories proposed by the approach. One item that it is interesting to notice is the mixt approach followed, in terms of content granularity. The statements they extract can have the length of a sentence, and at the same time, if their underlying category is the same, they can be clustered into blocks, with the block inheriting the statement’s category. The most comprehensive model for rhetorical and argumentation structuring of scientiﬁc publications was introduced by Harmsze [14]. She focused on developing a modular model for the creation and evaluation of scientiﬁc articles.

Capturing Rhetoric and Argumentation Aspects

29

Although the corpus used as a foundation for analysis was on the subject of experimental molecular dynamics, the resulted model is uniformly valid for any scientiﬁc domain. The author splits the discourse representation in two main parts: a coarsegrained structure formed by modules and the links that one can create between these modules. In the following we will detail each of the parts. The six modules proposed by Harmsze are: Meta-information, Positioning, Methods, Results, Interpretation and Outcome. To connect the above listed modules, the author introduces two types of relations: (i) organizational links, (ii) scientiﬁc discourse relations. The organizational links provide the reader with the means to easily navigate between the modules composing the scientiﬁc publication. They connect only modules as entire entities and do not refer to the segments encapsulated in them, which in turn would identify the content (e.g., hierarchical, proximity or range-based links). On the other hand, the links between segments of modules are split into two main categories: relations based on the communicative function, with the goal of increasing the reader’s understanding and maybe acceptance of the publication’s content (e.g., elucidation or argumentation), and content relations, allowing the structuring of the information ﬂow within the publication’s content (e.g., resolution or causality). Although very well detailed and comprehensive, this model is unfortunately not easily applicable in practice, due to its high level of abstraction, especially in case of relations. Nevertheless, parts of the model will be present (implicitly or explicitly) in the approaches that were later developed. Such an approach, following a much more focused scope was the one of Buckingham Shum et al. [5]. They were the ﬁrst to propose a detailed model for rhetorical and argumentation structuring in scientiﬁc publications, using as links between the discourse knowledge items Cognitive Coherence Relations [15]. As opposed to the previous models, Buckingham Shum et al. do not make use of a coarse-grained rhetorical structure, or of the linear structure of the discourse, but concentrate strictly on organizing the coherence among the content segments. Their research resulted in a series of tools for the annotation, search and visualization of scientiﬁc publications, which represent our main inspiration. The most elementary items of their Discourse Ontology are the atomic nodes, that represent short pieces of text, succinctly summarizing the authors’ contribution in the publication. The granularity of these nodes is left for decision to the author, and thus, can vary from parts of sentences to blocks of sentences. Nodes can have several types (e.g., Data, Language, Theory), encoded in the links that connects them. In addition to the nodes, the model contains also two kinds of composite elements: (i) sets that group several nodes sharing a common type (or theme), and (ii) claim triples formed by linking sets or atomic nodes. In terms of relations, the Discourse Ontology comprises six main types: (i) causal links, e.g., predicts or envisages; (ii) problem related links, e.g., addresses or solves; (iii) similarity links, e.g., is identical to or shares issues with; (iv) general links, e.g., is about or impairs; (v) supports / challenges links, e.g.,

30

T. Groza, S. Handschuh, and S. Decker

proves or refutes; (vi) taxonomic links, e.g., part of or example of. Each relation has attached a polarity, positive or negative, and a speciﬁc weight. The link’s polarity denotes explicitly the author’s position in regards to particular statements present in the related work, similar to the approach of Teufel [13]. At the same time, the weight indicates how strong or weak is the author’s position. For example, the causal links envisages and causes have both a positive polarity, but diﬀerent weights, the former being considered weaker that the latter. Similarly, is unlikely to aﬀect and prevents have a negative polarity with diﬀerent weights, again the latter being considered stronger than the former. One of the latest models for discourse representation was proposed by de Waard [4]. They started with a rhetorical block structure for scientiﬁc publications called ABCDE, very similar to the IMRAD (Introduction, Material and Methods, Results and Discussion)12 structure. The title holds the acronym of the ﬁve types of blocks present in the model: (i) Annotations, (ii) Background, (iii) Contribution, (iv) Discussion, and (v) Entities. At a later stage, the authors enriched their model with a ﬁne-grained representation of the discourse, by adding a set of eight relations connecting elementary discourse knowledge items [16]. The connectors represents a mixture of argumentative and cognitive coherent relations, as follows: (i) proves / refutes, (ii) supports / contradicts, (iii) agrees / disagrees, and (iv) suggests / suggests that not. The authors suggest that this set is used as a common denominator for the many possible nuances of certainty of each type, that can be found in the scientiﬁc literature. This resumes basically at creating a uniform level for the weights attached to the relations, as opposed to the approach followed by Shum et al.. At the same, compared to the same model, the polarity attached to the relations is also made explicit. 7.2

Modeling the Argumentation

The ﬁrst foundational argumentation model ever proposed was IBIS (Issue Based Information Systems) [7]. IBIS introduced a methodology for argumentationbased decision making in information systems. In IBIS the key element was represented by an Issue. Each Issue can have multiple Positions, that are statements asserted to resolve the Issue. On their turn, Positions can have one or more Arguments that support or object to them. In terms of relations, the model consists of eight distinctive links that connect these three main concepts: Generalizes, Specializes, Questions, Is-suggested-by, Responds-to, Supports, Objects-to and Replaces. A typical IBIS discussion involves one raising an Issue, which is then open for debate. The methodology contains no stopping rule, or any way of signaling the solution to the Issue by some particular Position or agreement. Rather, the goal of the methodology is to create a common understanding on that Issue and to give each stakeholder the chance to express her opinion. At a later stage, IBIS served as main inspiration to a long list of models. The ﬁrst to take IBIS to the next level were Conklin et al. [17] with their gIBIS. Their 12

http://www.uio.no/studier/emner/hf/imk/MEVIT4725/h04/resources/imrad. xml

Capturing Rhetoric and Argumentation Aspects

31

approach followed closely the original IBIS model and based on a graph-oriented visualization, applied the methodology in team-based deliberation [17] and then in modeling exploratory policy discussions [18]. Other models inspired by IBIS are: – The Compendium methodology [19], that follows a semiotical approach for dealing with knowledge visualization and design rationale, while complementing argumentation with Cognitive Coherence Relations (as described in the previous section); – DILIGENT (distributed, loosely controlled and evolving engineering of ontologies) [8], which applies argumentation in ontology engineering, – SWAN (Semantic Web Applications in Neuromedicine) [20], modeling social debate in Alzheimer research and, in general, in Neuromedicine, – CiTO (Citation Typing Ontology)13 , describing the nature of reference citations in research articles All the models share conceptually the same ideas, but use diﬀerent terminologies, or to be more precise, shape the abstract concepts like Issue or Position to terms speciﬁc to the particular ﬁeld, as Discourse Element, Comment or Idea. At the same time, driven by the application use case, they are enriched with complementary models as SIOC [21], FOAF, DublinCore, Creative and Science Commons or the Open Biomedical Ontologies14. 7.3

Document Modeling

One can look at document modeling from diﬀerent perspectives, depending on the instantiation (or role) that the document concept takes. In our case, we are particularly interested in modeling the linear structure 15 of the document, where the document has the role of (or is instantiated as a) publication. Regarding this topic, we found two related eﬀorts to be particularly relevant. The ﬁrst, and closest to our goals, is the Liquid Publications project, while the second, only conceptually similar to our model, is the SIOC (Semantically-Interlinked Online Communities) initiative. The Liquid Publications projects [22] aims at changing the way in which we publish, evaluate and consume the knowledge emerged from research. The authors propose a quite controversial model for structuring and managing publications, associated with a series of dissemination rules. Inspired from physics, they suggest three possible transition states through which a publication can pass, i.e., gas, liquid and solid, each having attached some constraints. The main idea is to hinder the dissemination of research results by creating publications from 13 14 15

http://imageweb.zoo.ox.ac.uk/pub/2009/citobase/cito-20090415-1.2-RC3/ cito-content/owldoc/ http://www.obofoundry.org/ Although the intrinsic structure of a document is, in most cases, a tree comprising hierarchically organized composites and elementary objects, we usually traverse it in a depth-ﬁrst manner, thus creating a linear serialization of this structure. Consequently, we use the term linear structure to denote this particular serialization of the actual tree structure of a document.

32

T. Groza, S. Handschuh, and S. Decker

already existing publications plus an added delta. Thus, instead of publishing incremental deltas in separate publications, people would maintain ﬂowing publications, with interested readers being updated when the state of publications changes. In order to realize the management of these incremental deltas, their model targets a ﬁne-grained level description of the publication’s content, using as main entities Scientiﬁc Knowledge Objects [23]. The model has several levels of organization: (i) the data representation level organizing the data at the ﬁnest-grained level, (ii) the knowledge representation level composing and ordering data to create new knowledge, and (iii) the collection representation level performing a categorization of related data and knowledge into collections. Compared to our document modeling approach, their approach is more abstract and can cover not only publications but any other tree-based document structure. The concepts present in the Document Ontology may be seen as specializations of their entities, thus providing a more focused perspective, that ﬁts better in our overall image of scientiﬁc publication modeling. The second document modeling eﬀort we have mentioned is the Semantically Interlinked Online Communities (SIOC) initiative [21]. SIOC aims to enable the integration of online community information by providing an ontology for representing rich data from Social Web sites in RDF. The ontology consists of the SIOC Core ontology and three ontology modules: SIOC Access, SIOC Services and SIOC Types. The SIOC Core ontology deﬁnes the main concepts and properties required to describe information from online communities on the Semantic Web. This was created with the terms used to describe web-based discussion areas such as blogs and message boards: namely Site, Forum and Post. Users create Posts organized in Forums which are hosted on Sites. Post can reply to other Posts. Higher level concepts (data spaces, containers and content items) were added to SIOC as it evolved. By using these classes and related properties SIOC allows us to structure the information in online community sites and distinguish between diﬀerent kinds of Social Web objects. At the ﬁrst sight one might not see the similarities between our document model and SIOC. Nevertheless, if we abstract from the speciﬁc context, i.e scientiﬁc publications and social media sites, we observe that, for example, a Forum is a document, that has multiple composites as Posts, which on their turn have several Topics. This is the same modeling approach as in our Publication – Section – Paragraph case. Consequently, the LiquidPub model represents a good abstraction for both our Document Ontology as well as for SIOC. 7.4

Overall Comparison of the Most Relevant Models

For building a comprehensive comparison of the most relevant related eﬀorts, we created a list of common features, based on which we will perform the overall analysis. The feature list consists of the following elements: – Coarse-grained rhetorical structure – identiﬁes the existence of a coarsegrained rhetorical structure modeling part in the overall approach. Its goal is

Capturing Rhetoric and Argumentation Aspects

33

Fig. 22. Comparative overview of related approaches

–

– – – –

– –

to capture the semantics of larger blocks of the publication’s content that have a rhetorical role. Fine-grained rhetorical structure – considers the smallest content items carrying a rhetorical role and between which, usually, relations are materialized. Relations – looks at the types of relations used in the model for linking the ﬁne-grained structure in an unitary block. Polarity – speciﬁes if the model includes explicitly the polarity (i.e., positive or negative) of the relations. Weights – speciﬁes if the model considers explicitly the weights of the relations Provenance – indicates whether the model encapsulates also the provenance information and attaches it to the ﬁne-grained rhetorical structure (e.g., the accurate localization of the text chunk that represents the textual counterpart of the rhetorical element). Shallow metadata support – shows if the model has embedded support for shallow metadata (e.g., authors, titles, etc) Domain knowledge – analyses the close coupling of the model to particular domain knowledge areas.

Fig. 22 presents concisely our approach compared with six models we have chosen as being the most related. Considering this overview, we believe that we have created a well-balanced model for the externalization of the rhetoric and argumentation hidden in the content of the scientiﬁc publications. By following a novel and unique layered approach, our model is the only one to clearly decouple the rhetoric and argumentation from the provenance information, from the shallow metadata and from domain knowledge, while still covering all these aspects. Therefore, we are able to provide an opportunity for an independent evolution of each of the layers, as well as their replacement with diﬀerent ones, according to the speciﬁc needs imposed by the use case.

34

T. Groza, S. Handschuh, and S. Decker

Similar to Harmsze and de Waard, we opted for a combined coarse-grained and ﬁne-grained design of the rhetoric, thus capturing the semantics of the content at diﬀerent granularity levels. However, we consider our model to be more expressive than the two previously mentioned, due to our mixture of argumentative and rhetorical relations. Speciﬁcally from the ﬁne-grained rhetorical structure perspective (i.e., rhetorical elements and rhetorical relations holding among them), our model can be considered complementary to the ScholOnto or the SWAN models. While our rhetorical relations provide the scaﬀolding for capturing the content’s coherence, the argumentative relations that the two models implement, complete the semantics of the content with its argumentative side16 . Considering the polarity and weights features, we observed the tendency of using implicit polarities for the rhetorical relations, i.e., do not explicitly categorize the relations into positive and negative rhetorical relations (the only exception being ScholOnto), with no weights attached (again with one exception – ScholOnto). From our perspective, the SALT rhetorical relations do not really carry a speciﬁc polarity, as they rather model the rhetorical coherence of the content, while the argumentation concepts have an implicit polarity based on their semantics. At the same time, we believe that the weights should be used only in conjunction with a computational linguistic approach, as using such discrete quantiﬁers does not really provide any added value for the author / reader. Finally, we look at the last three aspects mentioned in the overview, i.e., provenance, shallow metadata support and domain knowledge. Starting with the last two, it is quite obvious that most of the approaches (including ours) have shallow metadata support, and follow an open direction with regards to domain knowledge (with the exception of SWAN that was built speciﬁcally for the biomedical domain, and CiTO that does not consider it at all). On the other hand, the provenance information provides an interesting perspective, especially since it has the important role of relating an ontological instance to its textual representation. There are, in principle, two ways of dealing with this aspect: (i) duplicating the textual content in the instance information (e.g., by using a text property), or (ii) following a pointer-based approach, in which instead of duplicating the content, one uses a pointer speciﬁcation framework to describe the exact localization of the textual content. This is one of the aspects in which SALT brings innovation. Without considering the approaches that don’t mention provenance (e.g., de Waard or Harmsze), SALT is the only framework that explicitly models the provenance information as pointers in the original document.

8

Conclusions

In this paper we provided a comprehensive description of the SALT framework. SALT enables the externalization of the argumentation and rhetoric captured 16

As already presented, the argumentation side of SALT is represented by a set of concepts, rather than relations. We considered that by following this approach, we open the door for a richer model, to be further extended by domain speciﬁc applications.

Capturing Rhetoric and Argumentation Aspects

35

in the publication’s content. The novelty of the framework is its layered organization, based of three ontologies: the Document Ontology, the Annotation Ontology and the Rhetorical Ontology, which should be interpreted as modeling the rhetorical and argumentation as an annotation for the publication’s structure and content. This also enables the maintenance of the provenance information close to the rhetorical and argumentation. SALT can be used for the semantic enrichment of scientiﬁc publications independently of the authoring environment. Nevertheless, as a proof-of-concept, we developed the appropriate mechanisms for using it within LATEX and MS Word. In the case of LATEX we introduced a series of commands that enable concurrent authoring, i.e., inter-leaving the writing process with the semantic annotation process. The result of the compilation of LATEX sources is a semantic PDF document that keep together the actual content of the publication and the SALT instances. In the other case, MS Word, we adopted a visual-oriented approach in which several visual controls were developed to be used by the authors to mark their publications. The evaluation we have performed revealed a series of interesting results. Firstly, SALT follows an incremental approach to semantic annotation, the beneﬁt the user receives being proportional with the eﬀort she puts into the annotation process. Secondly, the framework is both sound and suitable for enriching publications with semantic metadata. Last, but not least, although seen as quite complex and to introduce unnecessary interruption, the annotation task has some collateral eﬀects (such as, spotting weak points in the publication or improving the understanding of the publication) that can increase the author’s incentive of performing it.

Acknowledgments The work presented in this paper has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).

References 1. Nonaka, I., Takeuchi, H.: The Knowledge-Creating Company: How Japanese Companies Create the Dynamics of Innovation. Oxford University Press, Oxford (1995) 2. Tsujii, J.: Reﬁne and PathText, which combines Text Mining with Pathways. Keynote at Semantic Enrichment of the Scientiﬁc Literature 2009, SESL 2009 (2009) 3. Bush, V.: As We Think. The Atlantic Monthly 176(1), 101–108 (1945) 4. de Waard, A., Tel, G.: The ABCDE format - enabling semantic conference proceeding. In: Proceedings of 1st Workshop: SemWiki 2006 - From Wiki to Semantics, Budva, Montenegro (2006) 5. Shum, S.J.B., Uren, V., Li, G., Sereno, B., Mancini, C.: Modeling naturalistic argumentation in research literatures: Representation and interaction design issues. Int. J. of Intelligent Systems 22(1), 17–47 (2006)

36

T. Groza, S. Handschuh, and S. Decker

6. Mann, W.C., Thompson, S.A.: Rhetorical Structure Theory: A theory of text organization. Technical Report RS-87-190, Information Science Institute (1987) 7. Kunz, W., Rittel, H.: Issues as elements of information system. Working paper 131, Institute of Urban and Regional Development, University of California (1970) 8. Tempich, C., Pinto, H.S., Sure, Y., Staab, S.: An Argumentation Ontology for Distributed, Loosely-controlled and evolvInG Engineering processes of oNTologies (DILIGENT). In: G´ omez-P´erez, A., Euzenat, J. (eds.) ESWC 2005. LNCS, vol. 3532, pp. 241–256. Springer, Heidelberg (2005) 9. Groza, T., Handschuh, S., M¨ oller, K., Decker, S.: KonneX-SALT: First Steps Towards a Semantic Claim Federation Infrastructure. In: Bechhofer, S., Hauswirth, M., Hoﬀmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 80–94. Springer, Heidelberg (2008) 10. DeRose, S., Maler, E., Daniel Jr., R.: Xpointer xpointer() scheme (2002) 11. Grosso, P., Maler, E., Marsh, J., Walsh, N.: Xpointer element() scheme (2003) 12. Sure, Y., Bloehdorn, S., Haase, P., Hartmann, J., Oberle, D.: The SWRC Ontology – Semantic Web for Research Communities. In: Bento, C., Cardoso, A., Dias, G. (eds.) EPIA 2005. LNCS (LNAI), vol. 3808, pp. 218–231. Springer, Heidelberg (2005) 13. Teufel, S., Carletta, J., Moens, M.: An annotation scheme for discourse-level argumentation in research articles. In: Proc. of the 9th Conf. on European Chapter of the ACL, pp. 110–117. ACL, Morristown (1999) 14. Harmsze, F.A.P.: A modular structure for scientiﬁc articles in an electronic environment. PhD thesis, University of Amsterdam (2000) 15. Sanders, T.J.M., Spooren, W.P.M., Noordman, L.G.M.: Coherence Relations in a Cognitive Theory of Discourse Representation. Cognitive Linguistics 4(2), 93–133 (1993) 16. de Waard, A., Kircz, J.: Modeling Scientiﬁc Research Articles – Shifting Perspectives and Persistent Issues. In: Proc. of the 12th Int. Conf. on Electronic Publishing, ElPub 2008 (2008) 17. Conklin, J., Begeman, M.L.: gIBIS: A Hypertext Tool for Team Design Deliberation. In: HYPERTEXT 1987: Proceeding of the ACM Conference on Hypertext, pp. 247–251. ACM Press, New York (1987) 18. Conklin, J., Begeman, M.L.: gIBIS: A Hypertext Tool for Exploratory Policy Discussion. In: CSCW 1988: Proceedings of the 1988 ACM Conference on ComputerSupported Cooperative Work, pp. 140–152. ACM Press, New York (1988) 19. Mancini, C., Shum, S.B.: Modelling discourse in contested domains: a semiotic and cognitive framework. International Journal of Human-Computer Studies 64(11), 1154–1171 (2006) 20. Ciccarese, P., Wu, E., Wong, G., Ocana, M., Kinoshita, J., Ruttenberg, A., Clark, T.: The SWAN biomedical discourse ontology. J. of Biomedical Informatics 41(5), 739–751 (2008) 21. Breslin, J., Decker, S., Harth, A., Bojars, U.: SIOC: An Approach to Connect Web-Based Communities. The Int. J. of Web-Based Communities 2(2) (2006) 22. Casati, F., Giunchiglia, F., Marchese, M.: Liquid Publications: Scientiﬁc Publications meet the Web. Technical Report DIT-07-073, Informatica e Telecomunicazioni, University of Trento (2007) 23. Giunchiglia, F., ChenuAbente, R.: Scientiﬁc Knowledge Objects V.1. Technical Report DISI-09-006, Ing. e Scienza dell’Informazione, Univ. of Trento (2009)

A Reference Model for Semantic Peer-to-Peer Networks Abdul-Rahman Mawlood-Yunis1 , Michael Weiss2 , and Nicola Santoro1 1

2

School of Computer Science, Carleton University {armyunis,santoro}@scs.carleton.ca Department of Systems and Computer Engineering, Carleton University 1125 Colonel By Drive, Ottawa, ON, Canada K1S 5B6 [email protected]

Abstract. Current research directions in Semantic Peer-to-Peer(SP2P) networks are evolving to combine two complementary technologies: Peerto-Peer(P2P) networks and formally-structured information, Ontology. SP2P systems incorporate several additional features not present in P2P networks. However, the current SP2P research eﬀorts have generated many and diverse realizations and architectures. This diversity in implementation and architecture in turn has led to an ambiguity and incompatibility in deﬁning domain abstracts and concepts and as such has hampered progress in this area. For instance, system comparison as well as their translation into practical implementation have been hindered. In this work, we describe a reference model for SP2P systems in an eﬀort to model the emerging decentralized computing paradigm in a generic and high level abstraction. The potential contribution of the reference model to the advancement of the current SP2P systems spans various areas. These include: 1) an establishment of common terminologies for the domain. This leads to better understanding and communication among members of the community. 2) providing guidelines for comparison among individual systems. Individual systems could be compared with each other in terms of their compliance with the generic model, and their implementation of the generic features. Keywords: System modeling and architecture, Ontology, Semantic overlay networks, P2P, Information system.

1

Introduction and Motivation

The decentralization of control, the autonomy and the dynamicity of peers and the eﬀective sharing of resources makes P2P networking attractive for large-scale distributed systems and applications. However, data and resource descriptions held by peers in a P2P networks lack explicit semantic. Attempts to solve the problems pertaining to the lack of data semantic have focused on explicating the meaning of the information content, i.e., semantics augmentation. The backbone for exploring semantic based solutions is ontology, which is about deﬁning a common conceptualization of the domain of interest plus a commitment of the involved parties to the conceptualization [21]. Using ontology for modeling S. Spaccapietra (Ed.): Journal on Data Semantics XV, LNCS 6720, pp. 37–65, 2011. c Springer-Verlag Berlin Heidelberg 2011

38

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro Table 1. SP2P System Types and Instances SP2P Types System Instances P2P Knowledge management KEx[6] P2P Information System P2PSLN [25], Observer [40], P2PISM [52] P2P Database coDB [18], Piazza [26], Hyperion [32], PeerDB [43] P2P Emergent Semantics Chatty Web [3], DisES [17] P2P Semantic Web BiBSter [23], Edutella [42], Somewhere [47] P2P Web Services ESTEEM[5]

information resources or resource descriptions, concepts are deﬁned in terms of their properties and relations to other concepts; concept deﬁnitions provided elsewhere on the Web or foreign peer repositories are reused using metadata; and new facts are inferred using the existing ones [20]. In order to harness the power of P2P networks, current research directions in P2P computing are evolving to combine two complementary technologies: P2P networks, and ontologies. From this combining emerges Semantic Peer-to-Peer systems (SP2P). SP2P systems represent the next step in the evolution of P2P networks as they incorporate several additional features that are not present in P2P networks. These include 1) formally-structured information (ontology), 2) local mapping, and 3) semantic based routing; see section 3 for more details. The incorporation of ontologies in P2P networks has been previously reported in the scientiﬁc literature in various research precedents: creation of semantic networks on existing P2P networks which has been referred to as semantic overlay network; semantic-based query routing; adaptive query routing; etc. SP2P systems which employ ontologies include several types: P2P knowledge management systems, P2P Databases, P2P Semantic Web, P2P Emergent Semantics, P2P Web Service, and P2P Information Systems. Table 1 lists these system types. In SP2P systems resource are stored in numerous peers to be queried for [42]. Query execution process in SP2P networks is comprised of several steps [3,25,26]. Peers join network after ﬁnding the ﬁrst peer with compatible knowledge representation. That is, peers establish mappings to semantically related peers where mapping refers to semantic relationship between concepts from different ontologies. Subsequently, peers submit queries to their neighboring peers using concepts in their own personalized local ontologies. Upon receiving a query, each peer starts processing the query locally. If the concepts used to formulate the query are compatible with concepts in its local ontology, it sends back query results to the querying peer (query initiator). Otherwise, it routes the query to other peers for which they have a direct mapping, after invoking the mapping component. Query forwarding will continue, until either: (1) the query reaches the query initiator, (2) the query exceeds a speciﬁed number of query forwards (”hops”) or (3) the time to live for the query message expires. The querying peer collects all answers returned, and evaluates them. If the answers are

A Reference Model for Semantic Peer-to-Peer Networks

39

satisfactory, then the query initiator will inform the neighbors involved about the result. Thus, the entire translation paths will be informed of the result of a successful query. Ontologies are advantageous in P2P networks because they provide the possibility for improving search and content retrieval. While most successful P2P networks are used for exchanging music and streaming ﬁles, e.g., BitTorrent, eMule, KaZaA, SP2P networks will open up new possibilities beyond ﬁle-sharing and streaming. SP2P networks can enable richer and more useful descriptions of peers, services, and shared artifacts. This will facilitate new ways of sharing knowledge, data management, and collaborative working within academic communities, research labs, universities, various emergency service departments, and hospitals and pharmacies [3,32]. Current research eﬀorts on SP2P systems, however, have generated many diverse realizations and architectures. This diversity has, in turn, led to ambiguity and incompatibility in deﬁning domain abstracts and concepts and, as such, has hampered progress in this area. For instance, system comparisons as well as their translation into practical implementation have been hindered. This diversity of SP2P implementations results from the variety of backgrounds (e.g., knowledge and database management, information retrieval, P2P) of the diﬀerent researchers and the still nascent state of the ﬁeld. In this work we describe a reference model for SP2P systems in an eﬀort to model the emerging decentralized computing paradigm in a generic and high level abstraction. This work is an extension of preliminary work on the reference model for SP2P systems [36]. Two essential parts, model applicability and model development method parts, are been developed and added to our previous work. Detailed problem context, along with more complete model concept deﬁnitions are also described. The promising feature of the model is the high level abstraction away from implementation details as such a particular system, e.g. Chatty Web, Piazza, can be instantiated from the reference model components. The potential contribution of the reference model to the advancement of the current SP2P development includes the following. First, common terminologies are established for the domain. This will lead to better understanding and communication among members of the community. Second, guidelines are provided for comparison among individual systems. Individual systems can now be compared with each other in terms of their compliance with the generic model and their implementation of the generic features. The rest of this work is organized as follows: In Section 2, related work will be reviewed. In Section 3, some features of SP2P systems are brieﬂy described. In Section 4, the development method used for identifying the model components, features and properties is described. In Section 5, the key constructs of SP2P systems are discussed and a class diagram for the reference model is built. In Section 6, the model applicability and validation is shown using a representative set of semantic SP2P systems. Finally, in Section 7 the paper is concluded and future research directions are identiﬁed.

40

2

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro

Related Work

To the best of our knowledge, there are only few works which directly address the problem of building reference models for P2P networks. Some of these works are described here. In [1] a reference model for unstructured P2P networks have been presented. In addition to identifying core components of P2P networks, [1] discusses the network’s essential design decisions. It also provides a brief comparison of some relevant P2P networks. Similarly, a reference model for structured P2P networks have been provided in [13]. From high-level abstraction view, we consider the current work to be an extension/adaptation of the mentioned P2P reference models to a new environment. In this new environment semantic aspects play essential roles in modeling and building P2P network. That is, In addition to the components described in [1,13], components such as semantic mapping, semantic neighborhood, query formulator, semantically enhanced resource description/representation are SP2P’s speciﬁc model components. Other related works are [34,48]. In [48] authors show only preliminary steps toward modeling semantic overlay networks. The eﬀorts in [34], on the other hand, is more spent on discussing diﬀerent query routing strategies rather than generic model. There are also some related works in a closely related domain, i.e. grid domain, for example [45]. These works were helpful for understanding system layers and describing components from high level perspective.

3

Diﬀerences between P2P and SP2P Systems

SP2P is a latest development in P2P networking progress. SP2P systems incorporate several additional characteristics not present in P2P networks. We reviewed existing SP2P systems [3,6,17,25,23,32,40,49,52] and other research works on semantic P2P systems [8,9,24,30,33], and came to the conclusion that there are several characteristics that distinguish P2P systems from SP2P systems. This include: 1) formally-structured information, 2) local mapping, 3) autonomous peer resource management, and 4) semantic based routing. Data or information managed by peers in SP2P systems is structured and formal (e.g. meta-data about learning objects in Edutella [42] and domain ontologies in observer [40]). The purpose of formally-structured data is to enrich data semantics and support inferences which in turn improve search performance and the quality of retrieved information. Local mapping in SP2P systems is used as a translational capability to forward queries between the peers under the conditions when the peers possess diﬀerent data schema or knowledge representations. Autonomous peer resource management pertains to peers’ control over own resources. That is, in contrast to conventional P2P networks, resources in SP2P are neither replicated nor assigned to other peers in the network in order to be used by network peers for processing queries. This is because the focus of SP2P systems are mostly applications where replication of resource is not permissible. Examples of such application include a collaboration between diﬀerent health departments, research labs, and universities where replication of resource is not

A Reference Model for Semantic Peer-to-Peer Networks

41

permissible [14,32,26,42]. However, in semantically-enhanced P2P ﬁle sharing systems this characteristic can be relaxed. Query routing in SP2P systems is diﬀerent than non-semantic P2P systems. This is mainly due to the fact that SP2P systems are unstructured P2P networks. In other words, the unstructured SP2P systems are diﬀerent than structured P2P networks such as Chord [12] or Pastry [46] and other distributed hash table based systems. In SP2P, semantic based peer selection procedure relates peers with similar domain knowledge, and form semantic neighborhood, and the semantic neighborhood is used for query routing process. We see the above described system aspects to be prominent characteristics that diﬀerentiate SP2P systems from the conventional P2P systems like [19,22,41].

4

Model Development Method

In this section we describe the model development method that has been used for identifying the model components, features and properties. The study started with a thorough review of relevant works on SP2P networking, semantic mapping, and SP2P simulations and modeling. The literature review revealed that there are in existence diﬀerent SP2P systems and types (e.g. Piazza, Chatty Web, KEx, Somewhere, Hyperion, PeerDB, coDb, Esteem, Observer, Edutella). These systems are incompatible with each other, employ diﬀerent architectures, and were developed by professionals from diﬀerent backgrounds. The review provided us with necessary information to start developing a reference model for SP2P systems. In order to obtain key constructs of the reference model, the primary focus was put on the identiﬁcation of prominent or distinctive features of existing SP2P systems. The features are user-visible aspects or characteristics of prominent SP2P systems and other related work. They deﬁne both common aspects of the SP2P systems as well as the diﬀerences between them. The applied method is known as a Feature-Oriented Domain Analysis(FODA)[31]. We choose several SP2P systems and prominent research works to extract the distinctive features of SP2P systems. Four of these systems (Chatty Web [3], KEx [6], P2PSLN [25] and Piazza [26]) and their corresponding component, component properties and component relationships are described below. We focus on these systems for their contribution, novelty and types, i.e, Emergent semantics, Chatty Web; Knowledge management, KEx; Information retrieval, P2PSLN; and Database, Piazza. The examples demonstrate two important aspects of the described reference model: ﬁrst, they illustrate our analysis and examination approach of the existing SP2P systems, and second, the distinctive features of these systems form the basis for the described reference model. The detailed information about component features, properties, and relationships as well as class models of SP2P systems selected for deriving reference model components are as follow:

42

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro

4.1

KEx [6]

– Peers In KEx a peer manages a set of resources, and own a unique id. A peer also has reference to other peers with similar knowledge, and a schema description to be used for identifying peers with similar knowledge. – Resources Each node stores context, data, and meta-data. The context is the perspective that peers have on the domain knowledge. Meta-data refers to links to other resources and mappings to context stored at other peers. KEx uses XMLSchema speciﬁcation to represent contexts. Special hierarchal notation called CTXML [7] is used for the context representation. – Query maker KEx has a special component called query maker. The query maker is used to select concepts from local repository and to compose queries. Queries are simple search messages. That is, queries are composed without the use of query specialized language. Query content is made of one or more local concepts and context, called focus in KEx. The concepts describe the resource that the user needs to retrieve. The context explicates query content semantic; it is the entire path(s) from the concept(s) to the root of the concept hierarchy. For a detailed example on how concepts are selected, queries are composed and posed on the neighboring peers, the readers are encouraged to see [6]. – Peer federation and knowledge links In KEx, semantic neighborhood takes two forms: peer federation and knowledge links. First, peer federation refers to peers with semantic compatible resources that agree to act as one entity, i.e. answer query requests from other peers, group together and form a federation. Peers form federations with their acquaintance(peers with similar schema). Second, knowledge links refer to peers that are able to discover the semantic relationship between their own resources and other peers’ resources during query answering and mapping. Peers store the discovered information and use it to build a network of knowledge links. A peer p incorporates into its semantic knowledge network another peer p¯, if concepts of peer p¯ correspond, completely or partially (see [7]), to concepts of peer p. – Semantic mapping KEx implements a highly expressive matching algorithm. It allows for the representation of relations between concepts at diﬀerent abstract levels, i.e. synonyms, hypernym, hyponym, disjoint, and compatible relation. Mapping is performed on the service provider side, queried peer. Each provider, after receiving a query applies a runtime match to the query’s focus, this way interpreting the query from its perspective. The matching algorithm compares both the syntax and semantics of query concepts to concepts of local context; a concept syntax is a concept label, and concept semantic is a tree path from the concept to the root of the tree. The matching algorithm is based on WordNet use to resolve semantic mapping. That is, to determine the diﬀerent abstract relation between concepts.

A Reference Model for Semantic Peer-to-Peer Networks

43

– Query routing KEx uses adaptive query forwarding strategy. Peers send queries to other peers known through discovery method as well as peers known through previous interactions, i.e. knowledge links. A knowledge link refers to extra information, peer classiﬁcation information, that a peer saves in its the local context about other peers. Query forwarding is controlled using several query control forwarding policies. These include time-to-live (TTL), the number of hops or the list of peers already reached. Even though KEx does not have a policy for preventing queries from rotation around cycles in the network, one can see that implementing such a policy is feasible. This is could be achieved through applying one or more cycle prevention policies, similar to query control forwarding policies applied in KEx. – Query answerer In KEx the service provider peers match query content concepts against their local ontologies in order to come up with proper answers, and the system user determines the correctness (incorrectness) of the query answer. This is because what is considered to be correct by a provider is not necessarily correct by the querying peer. Once the system user decides on the correctness of the answer, the related document is provided directly to the querying peer. That is, in KEx querying peers receive answers to their queries directly form queried peers, service providers. Figure 1 summarizes KEx’s essential classes, class properties and their association that we have developed from the system description.

Fig. 1. KEx classes, class properties and their association that we derived from the description of the system

44

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro

4.2

P2PSLN [25]

– Peer A Peer in a P2PSLN is an active and intelligent soft device that contributes data or information to overall system resources. – Resources Resources in P2PSLN are XML ﬁles along with XML Schemas to describe and constrain the content of XML data ﬁles. – Query formulator P2PSLN provides a graphical interface for posing queries on peers. Queries are composed using special query language that has been developed for P2PSLN. Query contents are simple text keywords. – Semantic neighborhood In P2PSLN peers are linked to each other based on the semantic relation between their schema elements, schema structures and descriptions. That is, a peer p constructs a semantic link with another peer p¯ if its XML schema elements, schema structure, and semantic description are related to peer p¯. Semantic links have types, and there are eight diﬀerent link types. Link types represent semantic relation strength. Two peers have equal relation types, the highest strength relation, when their schema elements, schema structure and description are equal. The link relation type becomes empty when there is no semantic relation between two peers, and any other relation lies between equal and empty two link types. P2PSLN uses both selective joining, described in section 5, and mapping to create semantic neighborhoods. Peers use selective joining method to build their initial semantic relations. The initial relations are then followed by schema exchange and semantic mappings in order for furthermore improvement be determined. – Semantic mapping P2PSLN provides three diﬀerent semantic mappings, namely node, path and clique semantic mappings. The mapping considers not only peers’ XML Schema, but schema structures as well. Mapping is carried out on the sender side, querying peer. That is query initiator peer performs the mappings. P2PSLN makes use of the global dictionary, WordNet, to implement mappings. The similarity degree between a set of terms, e.g. synonymy, abbreviations, that has been deﬁned in the global dictionary indicates P2PSLN’s support for a detailed and involving mappings. P2PSLN check mappings for corruption. Peers that participate frequently in answering queries but whose answers are incorrect are double checked for schema changes. If changes are detected, the querying peer updates its schema mapping, semantic link type, and the similarity degree between the two peers. – Query routing P2PSLN uses adaptive query routing. That is, querying peers selects proper successors for query forwarding. Query forwarding is controlled using TTL query forward control policy. P2PSLN does not provide information on how it is handling the issue of query rotations along the network cycles.

A Reference Model for Semantic Peer-to-Peer Networks

45

– Query answer evaluation In P2PSLN, query answers are evaluated for more than just whether or not they are correct. Query answer evaluation covers several aspects. The aspects are query response time, traﬃc overhead, precision, etc. Query results are returned directly to the querying peer by peers who hold the answers, and the system user decides on the appropriateness of query answers. The essential classes, class properties and their association that we have developed from the system description is summarized in Figure 2.

Fig. 2. P2PSLN classes, class properties and their association that we have developed from the system description

4.3

Piazza [26]

– Peer Peers have unique id, contribute resources to the overall system, and select other peers to connect with. Piazza peers do not use profile matching to connect to other peers. This is because, Piazza uses P2P infrastructure, but lacks the dynamic property of P2P networking. – Resources In Piazza the resources contributed by each peer include: 1. data instance, e.g. XML or RDF data instances, and 2. data models, e.g. XML schemas or OWL ﬁles. Peers may also supply computed data, cached answers to queries. – Query formulator Piazza uses the XQuery language for composing queries. Queries in Piazza are always posed from the perspective of a given peer’s schema, which deﬁnes the preferred terminology of the user. Queries are comprised of three

46

–

–

–

–

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro

constructs: variables, predicates, and equivalent class deﬁnitions. XQuery uses XPath expressions to bind XML nodes to the variables. A predicate speciﬁes a condition for variable bounding; they are deﬁned in the XQuery WHERE clause. The equality operators used by predicates in the queries specify the equivalence classes between query variables. Piazza uses OWL’s owl:equivlentClass construct to deﬁne equivalency between two classes. Semantic neighborhood In Piazza, a semantic neighborhood of peer p comprises all nodes that are related to elements of p’s schema by semantic mappings. When a new peer is added to the system, it will connect to a subset of the existing network peers that are semantically related. Peers are free to choose with whom they would like to establish semantic connections. Semantic Mapping In Piazza mapping is carried out using queries (views). Piazza distinguishes between two diﬀerent levels of mapping: data-instance mapping level and schema/ontology mapping level. At the schema level, mapping has two characteristics: 1. it is highly expressive, and 2. it has the ability to transform schema structures. Mapping can be highly expressive in that mapping may involve attributes or class correspondence, class containment, subsume, overlap, or disjoint. Support for transforming schema structure of one peer to those in a second peer is due to mapping’s ability in Piazza to combine multiple entries and to break a single entry into multiple fragments. At the data-instance level, mapping is based on the concordance table. The concordance table in turn is assumed to be constructed using the existing matching algorithms and techniques. Mappings are directional, and are carried out by both the query sender and the query receiver. Analyzing mapping for its correctness has been acknowledged by Piazza creators as an area that needs to be addressed in future work. Query routing Piazza employs adaptive query routing by posing queries only over semantic neighbors. Given a query q posed over the schema of node p, Piazza rewrite q into a query q¯ over the neighbors of p using local mappings. In Piazza, query forward termination is accomplished using mapping paths exploration. That is, query forwarding terminates when no useful paths remain to be traversed. Piazza does not adhere to the P2P network characteristics completely. For example, a Piazza network is cycle free. Hence, the problem of repeated query processing is a non-issue in Piazza. Query answerer Piazza relies on the system user to evaluate query answers. Thus, the system user determines correctness (incorrectness) of query answers. Piazza is concerned with two aspects of query answers: the percentage of total relevant answers that have been retrieved, i.e. recall rate, and the length of time that is needed for correct answers to be retrieved. In Piazza, answers are directly sent to the querying peers.

Figure 3 summarizes Piazza’s essential classes, class properties and their association that we have developed from the system description.

A Reference Model for Semantic Peer-to-Peer Networks

47

Fig. 3. Piazza classes, class properties and their association that we have developed from the system description

4.4

Chatty [3]

– Peer In Chatty Web each peer p has a unique id, maintaining a database according to a schema. Peers are able to identify their schema, either by explicitly storing it or by keeping a pseudo unique schema identiﬁer, obtained for example by hashing. – Peer resources In Chatty Web a resource could be a database, a website or set of ﬁles or documents a peer maintains in a P2P network. In its system data model, however, Chatty Web considers each peer to maintain a database according to its schema where the schema consists of a single relational table, and the data that a peer stores consists of a set of tuples t1 , t2 , ... tr of the same type. – Query formulator Peers compose queries using their local concepts. Queries are basic relational algebra operators, e.g. selection, projection, and mapping, that is, specialized query encoding languages such as SQL or XQuery are not considered. Queries are issued to any peer through a query message. The query message holds a query identiﬁer, query content, querying peer address and translation trace to keep track of the translations already performed. – Semantic neighborhood Chatty Web Peers learn about each others’ schema through ﬂooding networks with ping messages and receiving pong messages. Peers are able to learn about each others’ schema because they incorporate their schema

48

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro

identiﬁer into the pong message. Each peer maintains a neighborhood n of semantically relevant peers. There are two types of peers in the neighborhood of peer p: i. those that share the same schema with peer p and ii. those that have a diﬀerent schema. A peer p includes another peer p¯ with a diﬀerent schema into its neighborhood n if it knows how to translate queries against its own schema to queries against the foreign schema. – Semantic mapping Chatty Web regards mapping as an establishment of local agreements between peers. Chatty Web does not implement its own mapping component; rather, it relies on reusing exiting mappings. Chatty Web, however, does not put any restrictions on mappings, e.g. its expressiveness level, in order to be reused. Chatty Web envisions mapping as partial translations between schemas that is been carried out using views. That is, transformation operation provides a view of schema SP 2 according to schema SP 1. Hence, transformation is performed on the query sender side. Mapping incorrectness is counted for in Chatty Web and the standard maximum-likelihood technique is used for translation error rate estimation in the system. – Query routing Information obtained by applying diﬀerent quality mapping assessment is used to direct searches in the network; i.e., the adaptive query routing strategy is applied. Chatty Web avoids processing the same query more than once by incorporating query forwarding path information into the query structure. Query forwarding leads to query content changes. The change is the result of query transformation among peers with diﬀerent schemas. Query changes are used for query forward termination. Query forwarding stops after a query becomes too diﬀerent, either from a syntactic or a semantic point of view, from its original. – Query answerer In Chatty Web, the querier peer evaluates query results automatically. An answer is considered to be correct if the received document conforms to the peer’s local document categorization. Furthermore, Chatty Web calculates semantic relation values between query answer concepts and querying peer’s local concepts to determine semantic mapping correctness along query forwarding cycle. Peers that provide answers send their answers directly to the querying peer. Figure 4 shows Chatty Web’s essential classes, class properties and their associations that we have developed from the system description. 4.5

Edutella

Edutella [42] enables sharing of learning resources among distributed and independent educational resource providers. Edutella uses the JXTA [28] protocol for its P2P network infrastructure and RDF [29] to describe resources. Queries are routed using the JXTA group construct. Edutella is not used for deriving the reference model constructs, but rather is utilized for validating the proposed reference model(See Section 6). For Edutella’s class model see [42].

A Reference Model for Semantic Peer-to-Peer Networks

49

Fig. 4. Chatty Web’s essential classes, class properties and their associations

5

SP2P Reference Architecture

There are many and diverse SP2P system realizations and architectures. This is primarily due to the involvement of a variety of researchers from diﬀerent backgrounds into a still recent and evolving area. The proposed SP2P reference model meets the essential requirements of the generic architecture, i.e., it models the essential aspects of the existing systems. A particular SP2P system, e.g. Chatty Web, Piazza, can be instantiated from the reference model components. The model is a high level abstraction which hides implementation detail from the implementers. However, it is deﬁned in a way which makes deriving concrete systems possible. Systems built based on this model should be easy to change and modify. The SP2P reference model is made of seven key constructs. The constructs are Peers p, Resources r, Query Formulator qf, Semantic Neighborhood sn, Mapping m, Routing t, and Query Answerer qa: SP2P = . The seven constructs comprise the minimum components required for any SP2P system and it can be used for assessing SP2P system conformance. In the following we will describe each of these model constructs. In Table 2, a brief deﬁnitions of some key model abstractions are provided for quick referencing. The model constructs are represented in UML classes and diagrams. This will enable transparent translation of model constructs into code using high level programming languages such as Java or C++. In [35] we translated the UML class diagrams into Java classes and packages and various SP2P systems have simulated. 5.1

Peers

A Peer P= represents an active object or an entity in the network. Each peer has a unique identiﬁcation id, a set of resources r that it manages, a

50

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro Table 2. Deﬁnitions of some of the SP2P model concepts

Model concept A peer Autonomous joining

Short definition of the model concept Represents an active object or an entity in the network A process in which a peer autonomously select which other peers it is going to connect with Mapping The semantic relationship between concepts from independent information sources (ontologies) Network degree The constrain on the number of relations a peer could make Peer discovery Process or protocol in which peers discover their acquaintance(peers with similar proﬁle) Peer neighbors The references a peer possesses to other peers in the network Peer proﬁle The description of peer’s domain knowledge, the information content oﬀered by a peer, description of peer’s schema, or expertise and services a peer provide comprise peer’s proﬁle Semantic neighborhood Connected peers with compatible information resources comprise semantic neighborhood Similarity function A function to measure the strength of the relation, the semantic aﬃnity, between any two proﬁles

proﬁle o and a set of neighbors n, i.e., references to other peers in the network. Example of the resources that a peer could manage include sets of documents ﬁles, bibliography ﬁles or learning objects ( video, audio, image, etc) that peers willing to exchange. The proﬁle is the description of peer’s domain knowledge, expertise or services and is used in building semantic neighborhood. For example, a subset of data model’s key concepts could comprise peer’s proﬁle. Figure 5 is a class view of a peer construct for the proposed SP2P reference architecture. 5.2

Resources

The Resources r= are one of the fundamental building blocks of any SP2P system. Peer resources comprise data model or ontology, dm, the actual data, i, and meta-data, md. Examples of resource artifacts include

Fig. 5. Peer construct

A Reference Model for Semantic Peer-to-Peer Networks

51

sets of documents ﬁles or media objects that peers need to exchange. Peers could have their data represented in diﬀerent data model. Examples of data model include Relational table, XML Schema, RDF data model, and OWL ﬁle. Meta-data such as an ontology name space or peers knowledge about other peers’ resources is a reference to other external resources available on the network (see [8] for more information on reference to external resources). In contrast to conventional P2P networks, resources in SP2P are neither replicated nor assigned to other peers in the network in order to be used by network peers for processing queries. In cases where SP2P is used for ﬁle sharing, this characteristic might not hold. The choice of data model is important, and SP2P systems could be diﬀerentiate from each other based on the choice of the data model. This is due to the following two features of the highly structured data: 1) Support for Semantics The choice of data model determines data semantic transparency. Semantic transparency in turn enables automatic machine processing of data as well as improving query result precision. 2) Support for Inferences The choice of data model determines the extent of the system’s ability to answer queries. For example, data models such as RDF and OWL support knowledge inferences. Systems with these types of data model are able to answer queries where information is not explicitly stored in their repository. This might be diﬃcult for other systems with diﬀerent data models to do so. Figure 6 is a class view of the Resource construct for the proposed SP2P reference architecture.

Fig. 6. Resource construct

5.3

Query Formulator

Query Formulator qf =<sc, cq, pq, q, l>, often a graphical user interface component, is a separate component on top of the resource layer for constructing queries. Peers use their own Query formulator to select concepts, sc, from the local resource repositories, compose queries content, cq, and place queries, pq, on the neighboring peers n.

52

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro

Fig. 7. Query formulator construct

Query objects, q, are diverse based on the system’s endorsement for the query’s explicit semantics, i.e. peer’s data model (see subsection 5.2). For example, query content could incorporate references to local or global ontologies for supporting query concept meanings, or when a tree-like data representation is used as a resource, e.g. XML format, a query concept could be substituted by a tree path. A tree path refers to the concept, its ancestors, and descendant concepts. Another important aspect relevant to the query formulation module is the query language, l. The choice of the query language restricts the explicit semantic of query content. Figure 7 represents a class view of the Query Formulator construct for the proposed SP2P reference architecture. 5.4

Semantic Neighborhood

Discovering and grouping together peers with compatible semantic information, i.e. forming semantic neighborhood sn = , is a distinguishing characteristic of SP2P systems. That is, SP2P network topology is unstructured and semantic based. Two popular methods for forming a semantic neighborhood include: Autonomous Joining(aj): Peers autonomously select which other peers they are going to connect with. Peers are responsible for identifying semanticallyrelated peers, and constructing semantic mapping(s) between their own information resources or ontology and ontologies of related peers when their domain representations are diﬀerent. Peer Discovery (v): Peers exchange their proﬁle o and use similarity function sim to discover semantically related peers. The exchange of proﬁles can happen at network startup time or when new peers join an already established semantic based network. Peers interested in connection with other peers broadcast their proﬁle and relevant peers respond to the querying peer by sending their id and proﬁle. Querying peer computes the strength of the relation, i.e. the semantic aﬃnity between the two proﬁles, and either accepts or drops the answer for connection. Peers can have only a limited number of connections. The network degree d represents this limitation in the model. Figure 8 represents a class view of the Semantic Neighborhood construct for the proposed SP2P reference architecture. The two methods have some common behaviors, compute proﬁle similarity. The common behaviors need to be put in a common super class to be inherited and used by individual sub-classes for easy implementation.

A Reference Model for Semantic Peer-to-Peer Networks

53

Fig. 8. Semantic neighborhood construct

5.5

Routing

Routing t= is an essential component of any SP2P system. The Routing component is responsible for delivering query content q from the query initiator pi , to one or more query receiver in neighborhood n. There are three different design aspects relevant to routing queries in SP2P systems. These aspects are: i. query forwarding strategy, fs, ii. cycle handling, ch, and iii. routing termination policy, tt. Existing SP2P systems have diﬀerent takes on these aspects. Below, each of the issues is described brieﬂy. Forwarding strategy (fs): There are several routing methods for SP2P networks. These include: ﬂooding, expertise-based selection, adaptive query routing [34,50,49,51]. These strategies are diﬀerent from each other, among other things on their usage of number of messages (queries) and time eﬃciency in retrieving query answers. Adaptive query routing (see e.g [34] for discussion on adaptive query routing strategies), is the most widely used technique. SP2P systems with an adaptive routing strategy utilize learning techniques to enable eﬃcient routing, in other words, peers use their past interaction experience to determine future query routing. In this regard, each peer could consider only its own experience in making decisions on future routing, or in addition to its own experience, it could make use of other peers’ recommendation as well. The central idea in the adaptive strategy technique is to make use of the extra information existing in the network, e.g. information about the peers that most likely provide correct query answers, to send queries only to the most relevant peers (experts). Cycle handling (ch): Another important issue of querying SP2P systems is how to deal with query repetitions. Repetitions are commonly identiﬁed by using either query unique identiﬁers (qid) and/or query path information (path). A peer may receive the same query from diﬀerent paths or via a cycle in the network. Alternatively, a peer could receive a more speciﬁc query (or a more general one) via diﬀerent paths or cycles in the network after multiple translations by semantically related peers. The way repeated queries are dealt with has an impact on the number of query message exchanges and result completeness. While terminating already seen queries can preclude the opportunity to provide some important

54

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro

answers, processing repeated queries increases the number of query messages a system would exchange. Query termination policy (tt): When query forwarding is going to stop is another important matter of routing queries in SP2P systems. Current common techniques for stopping query forwarding depend on either counting the number of hops or setting the query time-to-live (TTL). Using the hop-counting approach, a system administrator sets the length of a network path that a query message could traverse before terminating. On the other hand, the TTL approach is time based: a query message could traverse the network for the period of time that is speciﬁed in the query. As a query message traverses the network, its TTL value decreases. When the TTL value becomes zero, message forwarding stops. Note that, these techniques have an impact on the query results that could be obtained. For instance, peers will continue to forward queries to their related neighbors even when they already have answers to the query as long as the speciﬁed constraints permit. As a result the number of query results will be aﬀected. Yet, another query forwarding termination policy is to use query content to decide whether the Query should be forwarded or not. Using such a policy Queries will not be forwarded when their content becomes empty. Query contents become empty as result of concepts dropping during translation process. Based on the dropping/not dropping uncomprehended Query concepts during Query translation, we divided SP2P systems into two groups: Irreducible SP2P system (IRSP2P) and Reducible SP2P system (RSP2P) and they are deﬁned as follow: Definition 1. IRSP2P system is an SP2P system with the property that it does not discard uncomprehended query concepts during query translation and forwarding among multiple peers. Definition 2. RSP2P system is an SP2P system with the property that it does discard uncomprehended query concepts during query translation and forwarding among multiple peers. Piazza system, for example, belongs to IRSP2P group and Chatty Web is an instance of RSP2P systems. Figure 9 represents a Router Class and its associated forwarding policy. 5.6

Query Answerer

Query answerer (Qa)= is concerned with two important aspects of query answers: i. query answer evaluation, ae, and ii. query answer selection, as. Query answers need to be evaluated for their correctness, in other words, correctness (incorrectness) of query answers needs to be determined. L¨oser et al. [34] suggest that the query result evaluation strategy, among others, is an important aspect of adaptive query routing in semantic overly networks. For the SP2P systems to be dependable, they need to employ correct result evaluation function. Incorrect evaluation functions could prevent semantically related peers from teaming-up together. This in turn, based on the working application, could have far reaching consequences on the performance and dependability of the system.

A Reference Model for Semantic Peer-to-Peer Networks

55

Fig. 9. Routing construct

The way answer evaluations are determined, qd, could be automatic or manual. In manual query answer determination, system users decide on the correctness (incorrectness) of query answer. Automatic query answer determination is about the system peer’s ability to conclude the query answer’s correctness (incorrectness). In the latter case, the system designer needs to design a set of criteria to empower SP2P systems with the ability to decide on the correctness (incorrectness) of query answers. An example of such measurement includes calculating the semantic relation between query answer concepts and query concepts. The SP2P system’s answer correctness is evaluated using common precision and/or recall metrics. Answer selection (AS): , on the other hand, deﬁnes a set of criteria for selecting an answer when multiple correct answers are generated for a single query – each from a correct translation sequence. This could include Answer Precision ap, the length of mapping path, lp, and the level of trust the querying peer has in the peers participating in the result generation, i.e. peer weight w. Another important element of query answer handling is peers’ capacity on partial answer integration, pai. Some query results might be partial answers, hence the need for the peers’ ability to integrate multiple partial answers. That is, Peers need to be able combine partial answers data and give a uniform view of the results to users and other peers. Answers could arrive, av, to the querying peer either directly or indirectly. Direct answers are those answers that responding peers provide directly to the querying peer and without passing through intermediary peers. Indirect answers refer to those that travel along query mapping path to reach the querying peer. Figure 10 is a Query Answerer construct of the proposed SP2P Model. 5.7

Mappings

The semantic mapping m=<me, mi, mc, mw, mm> refers to semantic relationship between concepts from independent information sources (ontologies). It is a fundamental design building block for any SP2P system, and a topic undergoing

56

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro

Fig. 10. Query answerer construct

in-depth research. Using semantic mapping with SP2P systems involves decision making on various issues including mapping expressiveness, me, mapping implementation, mi, mapping correctness, mc, mapping ownership, mw, and mapping maintenance, mm. Below, a short description of each of these mapping constructs is highlighted. At the beginning of mapping procedure, a check is required to determine whether the query should be just passed or needs to be translated into neighboring ontology concepts. Further, SP2P systems may support query reformulation, i.e. splitting the query, or completely reordering the query in a totally diﬀerent but equivalent query. In such cases the mapping component needs an additional query evaluator procedure to support query evaluation and reformulation. Mapping expressiveness(me): Semantic mapping in its simplest form could be just a matter of ﬁnding query concept synonyms among diﬀerent ontologies. In more expressive mappings, logical relations are used for ﬁnding relationships among concepts, concept properties and attributes. The set of logical relations commonly used to deﬁne relationships among the peers’ ontology concepts are {≡, , , ∗, ⊥}. In this case, c1 ≡ c2 means that the two concepts are synonyms. In other words, c1 and c2 are diﬀerent concepts with similar or identical meanings and are interchangeable. For example, notebook and laptop are synonyms concepts. The relation c1 c2 means c1 is hypernym of c2 . That is, c1 is more generic or broad than c2 . For example, the system software concept is more generic or broad than the operating system concept. The relation c1 c2 , means that the c1 have a hyponymyous relationship with c2 , i.e. c2 is more generic or broad than c1 . For example, the book concept is less generic than the publication concept. The relation ⊥ means that two concepts have no semantic relation with each other. For example, bank as ﬁnancial institution and bank as river-bank have no semantic relation. Any other relations between concepts other than those described above can be captured by the ∗ relation.

A Reference Model for Semantic Peer-to-Peer Networks

57

Mapping expressions have an eﬀect on the extent of query results. They could increase or decrease the extent of query results based on the permissible logical expressions of the mappings. Systems demanding exact mappings could relax some of their constraints (i.e. allow for less restricted mapping logics to take place) to increase query recall. For example, let us assume that the University concept from one ontology and an Educational Institute concept from a second ontology are synonymous, i.e. University ≡ Educational Institute and mapping operation returns a value equal to 1, map(University, Educational Institute)=1.0. This assumption is valid since both concepts can be mapped to a common concept, Institute. Now let us consider the following query been posed on either ontologies. Query:

list the name of all Research Institutes in the area.

The restricted query result will be null, since no semantic relationship between Research Institute and University or Educational Institute can be asserted. However, if we relax the synonymous relationship between University and Educational Institute to a related, i.e. the relationship between University and Educational Institute have been deﬁned and the operation map(University, Educational Institute)=0.25, the result of the previous query will be a set of University names which they might carry out some research. This is because the relationship between Research Institute and Educational Institute will be asserted, both Research Institute and Educational Institute are Institute. Information on the mapping operation’s numerical result can be found in [37]. Mapping implementation(mi): How mapping is carried out is an important design issue. Peers could use a local copy of thesauruses such as WordNet, build their own dictionaries, construct mapping tables or when feasible, exchange ontologies (schemas) to translate concepts between ontologies. The choice of the approach to carry out mapping is aﬀected by the scope of the application. For small and domain speciﬁc applications, peers could exchange local ontologies or build their own local dictionaries for translation. Larger and semantically acute applications on the other hand, may require local thesauruses which are capable of performing some inference rather than just simple concept-to-concept mappings associated with local dictionaries and tables. Mappings could be carried out automatically, semi-automatically or manually. Mapping correctness measurement(mc): Correct semantic mapping is fundamental to SP2P systems. Various research eﬀorts have been devoted to the classiﬁcation of possible types of faults, measuring the quality of the mapping and estimating of information loss during query translation. The correctness of mapping is measured in two diﬀerent ways: numerical and logical measurement. Numerical measurement pertains to the numerical values returned from the mapping tool. For example, a mapping operation could conclude that the semantic relationship between a Laptop concept and a Notebook concept is equal to 1.0: map(c1,c2)=1.0, and the semantic relationship between an Operating system concept and a Software concept is equal to 0.5: map(c3,c4)=0.5, or some other

58

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro

values. A detailed example related to the numerical values use in the mapping operation can be found in [37]. If the numerical value returned from a mapping operation is ≥ δ (threshold), mapping is considered to be correct. The numerical values associated with semantic relationships between ontology concepts are system designer decision. For example when a concordance table is used by an SP2P system for mapping, the values assigned to the relationships between any two concepts in the table will be declared and latter used in the mapping process. The logical measurement, on the other hand, is the logical relationships that have been concluded during the mapping operation, that is, whether or not the relationship between two concepts satisﬁes one of these logical operations {≡, , , ∗, ⊥}. For example, the logical relationship between publication and book is . The two methods could be modiﬁed such that the logical relation could return numerical values and vice versa. Mapping ownership (mw): An important decision that SP2P system designers have to make is who (i.e., sender or receiver peer) is going to carry out the mapping. That is, whether query translation takes place before sending the query or after receiving the query. This is important because it will have an eﬀect on query routing, to the extent that the querying peer will ﬁrst perform mapping and then submit to only semantically related peers (i.e. if the outcome of mapping is above a certain threshold). This constraint can be used as a strategy for terminating query forwarding. Since the receiving peer performs mappings after receiving a query, this means that any query could be posed to any peer (i.e., there is no restriction on query forwarding). Query receiving peers either answer queries (i.e., if they could translate them to their local representation), or forward them to some other peers. Mapping maintenance (mm): Recently, several studies have focused on the mapping maintenance issue and its eﬀects on the SP2P systems reliability [4,11,37,38]. These studies have concluded that mapping between diﬀerent ontologies (schemas) need maintenance. This is because mapping could get outdated as a result of ontology changes. Outdated mapping puts the entire system at risk of failure. Hence, there is a need for 1. semantic mapping maintenance, 2. mapping corruption detection, and 3. mapping corruption tolerance: Mapping maintenance is needed to prevent it from corruption, corruption detection is required so it can be ﬁxed, and lastly, mapping corruption tolerance is necessary in order to limit the level of the damage that mapping corruption may have done to the system. Figure 11 is a class view of the Mapping construct for the proposed SP2P reference architecture. In Figure 12 all model constructs are put together and the reference model class diagram has been created. The class diagram shows the dependency between model components. The model diagram encompasses, among others, four enumeration classes. The SP2P implementer may choose one or more of the enumerated options and ignore others. For example, an SP2P implementer may decide to only implement ”hopCounting” and not ”TTL” or ”destinationArrival” to stop query forwarding. The sequence of interaction between model components during query processing is represented in Figure 13.

A Reference Model for Semantic Peer-to-Peer Networks

Fig. 11. Mapping construct

Fig. 12. SP2P class model

59

60

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro

Fig. 13. The sequence of interaction between model components during query processing

6

Model Applicability and Validation

In order to show the model applicability, the system architectures described in the paper (i.e., KEx, P2PSLN, Piazza, and Chatty web system architectures) are mapped onto the reference model. Individual systems are checked on whether they comply with the generic model, and how they implement the generic features. Table 3 illustrates that the described state-of-the-art systems posses the model’s component, however, they are diﬀerent on the implementation of the component properties and component relations. The Table also manifests the comparative advantages (disadvantages) that the systems have over one another in relation to the model. Furthermore, Table 3 shows that the identiﬁed components, features and properties of the reference model exists in the Edutella system as well. Edutella system was not used for deriving the reference model components, yet it has the components and functionality need in SP2P systems. In other words, Edutella illustrates that the described reference model being valid and rigorous.

autonomous joining X peer discovery network degree X

autonomous joining peer discovery X network degree (implicit) similarity function X

autonomous joining X peer discovery network degree X

similarity function

expressiveness (low) implementation (reuse existing mapping) correctness ownership (sender) maintenance X query forwarding cycle handling

autonomous joining X peer discovery (implicit) network degree X

similarity function X

expressiveness implementation correctness X ownership (sender) maintenance X query forwarding cycle handling (implicit) query termination (implicit) evaluation selection X determination (manual) precision recall integration arrival (indirect) Piazza system

evaluation selection determination X precision X recall X integration arrival (indirect) Edutella System

evaluation selection X determination (manual) precision recall X integration X arrival (indirect) P2PSLN system

query termination (implicit)

expressiveness implementation (global dictionary) correctness ownership (sender) maintenance query forwarding cycle handling

evaluation selection X determination (manual) precision X recall integration X arrival (indirect) KEX system

query termination

expressiveness (high) implementation (WordNet) correctness X ownership (receiver) maintenance X query forwarding cycle handling

Query Semantic Mapping Routing formulator neighborhood

evaluation selection X determination (automatic) precision X recall X integration X arrival (direct) Chatty Web system

query termination (implicit) query termination

expressiveness (high) implementation (concordance table) correctness ownership (sender/receiver) maintenance X query forwarding cycle handling

similarity function

autonomous joining X peer discovery network degree X

query language X

Resources

similarity function

query language

query language

query language

query language

select concepts compose queries place queries

select concepts compose queries place queries

Peer

select concepts X compose queries place queries

select concepts X compose queries place queries

select concepts X compose queries place queries

data instance

data instance X

data instance

data instance X

data instance X

id neighbors proﬁle (implicit) resource data model meta-data

id neighbors proﬁle resource data model meta-data

id neighbors (implicit) proﬁle resource data model meta-data

id neighbors proﬁle resource data model meta-data

id neighbors (implicit) proﬁle resource data model meta-data

Table 3. Concepts and relationships manifested in ﬁve SP2P systems

A Reference Model for Semantic Peer-to-Peer Networks 61

Query answerer

62

7

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro

Conclusion

In this research work, we have identiﬁed that current SP2P research eﬀorts have generated many and diverse realizations and architectures. This diversity in implementation and architectures has led to an ambiguity and incompatibility in deﬁning SP2P abstracts and concepts. We have described a reference model for SP2P systems in an eﬀort to model the emerging decentralized computing paradigm in a generic and high level abstraction. The model contributes to the advancement of the current SP2P networks in diﬀerent ways. The minimum necessary constructs to build the SP2P networks and their related design issues have been identiﬁed and deﬁned. This can enable researchers and network architects to focus on core components of SP2P systems and their related design issues. The reference model also can reduce the conceptual ambiguity of semantics and meanings of SP2P systems constructs. Furthermore, the model helps building new systems and simulations seamlessly; we were able to transform the model diagrams into the implementation and simulate diﬀerent SP2P systems (Chatty web, Piazza, P2PSLN). The simulations show that SP2P systems built based on the reference model would be easy to change and modify. The simulation and results are presented in [35]. The described reference model is an essential step toward building a comprehensive API for SP2P networks. We consider building such an API to be an important future work. Combining prominent features from diﬀerent SP2P systems to come up with new systems is yet another promising future work. lastly, a comprehensive model validation is a viable future work. Currently, the model validation limited to one system, Edutella system. The model validation can be checked further by showing how the various concepts and relationships manifested in other systems can be mapped to the model features, properties and associations.

References 1. Aberer, K., Alima, L.O., Ghodsi, A., Girdzijauskas, S., Haridi, S., Hauswirth, M.: The essence of P2P: a reference architecture for overlay networks. In: Proc. Fifth IEEE International Conference on P2P Computing, pp. 11–20 (2005) 2. Aberer, K., Cudr´e-Mauroux, P., Hauswirth, M., Van Pelt, T.: GridVine: Building Internet-Scale Semantic Overlay Networks. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 107–121. Springer, Heidelberg (2004) 3. Aberer, K., Cudr´e-Mauroux, P., Hauswirth, M.: Start making sense: The Chatty Web approach for global semantic agreements. Journal of Web Semantics 1(1), 89–114 (2003) 4. An, Y., Borgida, A., Mylopulos, J.: Discovery and maintaining Semantic Mappings between XML Schemas and Ontologies. Journal of Computing Science and Engineering 2(1), 44–73 (2008) 5. Bianchini, D., De Antonellis, V., Melchiori, M., Salvi, D., Bianchini, D.: Peerto-peer semantic-based web service discovery: state of the art, Technical Report, Dipartimento di Elettronica per l’Automazione Universit` a di Brescia (2006)

A Reference Model for Semantic Peer-to-Peer Networks

63

6. Bonifacio, M., Bouquet, P., Mameli, G., Nori, M.: Peer-mediated distributed knowledge management. In: van Elst, L., Dignum, V., Abecker, A. (eds.) AMKM 2003. LNCS (LNAI), vol. 2926, pp. 31–47. Springer, Heidelberg (2004) 7. Bouquet, P., Don` a, A., Scraﬁni, L., Zanobini, S.: ConTcXtualizcd Local Ontology Speciﬁcation via CTXML. In: AAAI Workshop on Meaning Negotiation, pp. 64–72 (2002) 8. Castano, S., Ferrara, A., Montanelli, S.: H-Match: an Algorithm for Dynamically Matching Ontologies in Peer-based Systems. In: The 1st VLDB Int. Workshop on Semantic Web and Databases (SWDB), pp. 231–250 (2003) 9. Castano, S., Montanelli, S.: Enforcing a Semantic Routing Mechanism based on Peer Context Matching. In: Proc. of the 2nd Int. ECAI Workshop on Contexts and Ontologies: Theory, Practice and Applications (2006) 10. Choi, N., Song, I., Han, H.: A survey on ontology mapping. SIGMOD Rec. 35(3), 34–41 (2006) 11. Colazzo, D., Sartiani, C.: Mapping Maintenance in XML P2P Databases. In: Bierman, G., Koch, C. (eds.) DBPL 2005. LNCS, vol. 3774, pp. 74–89. Springer, Heidelberg (2005) 12. Dabek, F., Brunskill, E., Kaashoek, M.F., Karger, D.: Building peer-to-peer systems with Chord, a distributed lookup service. In: Proc. 8th Wshop. Hot Topics in Operating Syst. (HOTOS-VIII) (May 2001) 13. Dabek, F., Zhao, B., Druschel, P., Kubiatowicz, J., Stoica, I.: Towards a Common API for Structured Peer-to-Peer Overlays. In: Kaashoek, M.F., Stoica, I. (eds.) IPTPS 2003. LNCS, vol. 2735. Springer, Heidelberg (2003) 14. Doulkeridis, C., Vlachou, A., Nørv˚ ag, K., Kotidis, Y., Vazirgiannis, M.: Eﬃcient search based on content similarity over self-organizing P2P networks. Peer-to-Peer Networking and Applications Journal (2010), doi:10.1007/s12083-009-0058-2 15. Ehrig, M.: Ontology alignment: bridging the semantic gap. Springer publishing, Heidelberg (2007) 16. Euzenat, J., Shvaiko, P.: Ontology matching. Springer publishing, Heidelberg (2007) 17. Fergus, P., Mingkhwan, A., Merabti, M., Hanneghan, M.: Distributed emergent semantics in P2P networks. In: Proc. of the Second IASTED International Conference on Information and Knowledge Sharing, pp. 75-82 (2003) 18. Franconi, E., Kuper, G., Lopatenko, A., Zaihrayeu, I.: Queries and updates in the coDB peer to peer database system. In: Proc. of 30th International Conference on Very Large Databases VLDB 2004, pp. 1277–1280 (2004) 19. Freenet, http://www.freenetproject.org 20. Guarino, N.: Formal ontology and information systems. In: Proc. of Formal Ontology in Information Systems, pp. 3–15 (1998) 21. Gruber, T.R.: The Role of Common Ontology in Achieving Sharable, Reusable Knowledge Bases. In: Proc. of the 2nd International Conference on Principles of Knowledge Representation and Reasoning, pp. 601–602 (1991) 22. Gnutella, http://www.gnutella.wego.com 23. Haase, P., Broekstra, J., Ehrig, M., Menken, M.R., Mika, P., Olko, M., Plechawski, M., Pyszlak, P., Schnizler, B., Siebes, R., Staab, S., Tempich, C.: Bibster – A Semantics-based Bibliographic Peer-to-Peer System. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 122–136. Springer, Heidelberg (2004)

64

A.-R. Mawlood-Yunis, M. Weiss, and N. Santoro

24. Haase, P., Siebes, R., van Harmelen, F.: Peer Selection in Peer-to-Peer Networks with Semantic Topologies. In: Bouzeghoub, M., Goble, C.A., Kashyap, V., Spaccapietra, S. (eds.) ICSNW 2004. LNCS, vol. 3226, pp. 108–125. Springer, Heidelberg (2004) 25. Hai, Z., Liu, J., Feng, L., Sun, X., He, C.: Query Routing in a Peer-to-Peer Semantic Link Network. Computational Intelligence 21(2), 197–216 (2005) 26. Halevy, A., Ives, Z., Mork, P., Tatarinov, I.: Piazza: Mediation and integration infrastructure for semantic web data. Web Semantics: Science, Services and Agents on the World Wide Web 2(1), 155–175 (2004) 27. http://www.cnscenter.future.co.kr/hot-topic/p2p.html 28. https://www.jxta.dev.java.net/ 29. http://www.w3.org/RDF/ 30. Joseph, S.: Neurogrid: Semantically Routing Queries in Peer-to-Peer Networks. In: Proc. Intl. Workshop on Peer-to-Peer Computing (2002) 31. Kang, K.: Feature-Oriented Domain Analysis. Technical Report No. CMU/SEI-90TR-21, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA (1990) 32. Kementsietsidis, A., Arenas, M., Miller, R.: Managing Data Mappings in the Hyperion Project. In: Proc.of the 19th Intl. Conf. on Data Engineering, pp. 732–734 (2003) 33. Liu, L., Xu, J., Russell, D., Antonopoulos, N.: Self-Organization of Autonomous Peers with Human Strategies. In: Proc. of ICIW 2008, pp. 348–357 (2008) 34. L¨ oser, A., Staab, S., Tempich, C.: Semantic social overlay networks. IEEE Journal on Selected Areas in Communications 25(1), 5–14 (2007) 35. Mawlood-Yunis, A.-R., Weiss, M., Santoro, N.: From P2P to Reliable Semantic P2P Systems. Peer-to-Peer Networking and Applications Journal, special issue, 1–19 (2010), doi:10.1007/s12083-009-0066-2 36. Mawlood-Yunis, A.-R., Weiss, M., Santoro, N.: Reference Model for Semantic Peerto-Peer Networks. In: Babin, G., Kropf, P., Weiss, M. (eds.) E-Technologies: Innovation in an Open World. Lecture Notes in Business Information Processing, vol. 26, pp. 319–334. Springer, Heidelberg (2009) 37. Mawlood-Yunis, A.-R., Weiss, M., Santoro, N.: Fault-Tolerant Emergent Semantics in P2P Networks. In: Cardoso, J., Lytras, M. (eds.) Semantic Web Engineering in the Knowledge Society, pp. 161–187. IGI Global (2008) 38. Mccann, R., Alshebli, B., Le, Q., Nguyen, H., Vu, L., Doan, A.: Mapping maintenance for data integration systems. In: Proc. of the 31st International Conference on VLDB, pp. 1018–1029 (2005) 39. McDougall, P.: The power of peer-to-peer. Information week (August 28, 2000), http://www.informationWeek.com 40. Mena, E., Illarramendi, A., Kashyap, V., Sheth, A.P.: OBSERVER: an approach for query processing in global information systems based on interpretation across pre-existing ontologies. Distributed and Parallel Databases 8(2), 223–271 (2000) 41. Napster, http://www.napster.com 42. Nejdl, W., Wolf, B., Qu, C., Decker, S., Sintek, M., Naeve, A., Nilsson, M., Palm´er, M., Risch, T.: EDUTELLA: a P2P networking infrastructure based on RDF. In: Proc. of the 11th International Conference on World Wide Web, pp. 604–615 (2002) 43. Ng, W.S., Ooi, B.C., Tan, K.-L., Zhou, A.: PeerDB: a P2P-based system for distributed data sharing. In: Proc. of 19th International Conference on Data Engineering, pp. 633–644 (2003) 44. Oram, A. (ed.): Peer-to-Peer: harnessing the power of Disruptive Technologies. O’Reilly and Associates, Inc. publishing, Sebastopol (2001)

A Reference Model for Semantic Peer-to-Peer Networks

65

45. Parashar, M., Member, S., Browns, J.C.: Conceptual and Implementation Models for the Grid. Proc. of IEEE Journal 93(3), 653–668 (2005) 46. Rowstron, A., Druschel, P.: Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In: Liu, H. (ed.) Middleware 2001. LNCS, vol. 2218, pp. 329–350. Springer, Heidelberg (2001) 47. Rousset, M., Chatalic, P., Adjiman, P., Chatalic, P., Goasdoue, F., Simon, L.: Somewhere in the Semantic Web. In: Intl. Workshop on Principles and Practice of Semantic Web Reasoning, pp. 84–99 (2006) 48. Schmitz, C., L¨ oser, A.: How to model Semantic Peer-to-Peer Overlays? In: Proc. P2PIR Workshop, Informatik, vol. (1), pp. 12–19 (2006) 49. Staab, S., Stuckenschmidt, S.: Semantic Web and Peer-to-Peer. Springer Publishing, Heidelberg (2006) 50. Tempich, C., Staab, S., Wranik, A.: Distributed semantic query: Remindin’: semantic query routing in peer-to-peer networks based on social metaphors. In: Proc. of the 13th International Conference on World Wide Web, pp. 640–649 (2004) 51. Voulgaris, S., Kermarrec, A., Massouli´e, L., van Steen, M.: Exploring Semantic Proximity in Peer-to-Peer Content search. In: 10th International Workshop on Futeru Trends in Distriubed Computing Systems (2004) 52. Zaihrayeu, I.: Towards Peer-to-Peer Information Management Systems. PhD Dissertation, International Doctorate School in Information and Communication Technologies, DIT - University of Trento (2006)

Discovery of Probabilistic Mappings between Taxonomies: Principles and Experiments R´emi Tournaire1 , Jean-Marc Petit2 , Marie-Christine Rousset1 , and Alexandre Termier1 1

Universit´e de Grenoble, UJF/ Grenoble INP / UPMF / CNRS, LIG UMR 5217 681, rue de la Passerelle, 38402 St-Martin d’H`eres Cedex, France [email protected] 2 Universit´e de Lyon, CNRS INSA-Lyon, LIRIS UMR 5205 69621 Villeurbanne Cedex, France

Abstract. In this paper, we investigate a principled approach for deﬁning and discovering probabilistic mappings between two taxonomies. First, we compare two ways of modeling probabilistic mappings which are compatible with the logical constraints declared in each taxonomy. Then we describe a generate and test algorithm which minimizes the number of calls to the probability estimator for determining those mappings whose probability exceeds a certain threshold. Finally, we provide an experimental analysis of this approach.

1

Introduction

The decentralized nature of the development of Web data management systems makes inevitable the independent construction of a large amount of personalized taxonomies used for annotating data and resources at the Web scale. Taxonomies are hierarchical structures appropriate to data categorization and semantic annotation of resources. They play a prominent role in the Semantic Web since they are central components of OWL [13] or RDF(S) [33] ontologies. A taxonomy constrains the vocabulary used to express metadata or semantic annotations to be classes that are related by structural relationships. Taxonomies are easy to create and understand by humans while being machine interpretable and processable thanks to a formal logical semantics supporting reasoning capabilities. In this setting, establishing semantic mappings between taxonomies is the key to enable collaborative exchange of semantic data. In this work we focus on discovering mappings between populated taxonomies like Web Directories or folksonomies used for example for organizing musical songs. Web Directories like the Yahoo! Directory1 and the Google directory2 (see Figure 2) can be considered as taxonomies because they are hierarchical 1 2

dir.yahoo.com dir.google.com

S. Spaccapietra (Ed.): Journal on Data Semantics XV, LNCS 6720, pp. 66–101, 2011. c Springer-Verlag Berlin Heidelberg 2011

Discovery of Probabilistic Mappings between Taxonomies

67

(a) Taxonomy T1

(b) Taxonomy T2 Fig. 1. 2 Taxonomies and associated instances

structures categories organizing web pages. Folksonomies like those represented in Figure 1 are small taxonomies created by ﬁnal users and populated with musical ﬁles, for example. Providing mappings between two taxonomies makes possible an exchange of documents (instances) by querying one taxonomy with the vocabulary of the other one. For instance, mappings between Yahoo! and Google directories allow to query both of them for pages about “Autos” while covering both the pages indexed by Yahoo! and those indexed by Google. With folksonomies of Figure 1, a local query of songs about “XXth Vocal” may be completed by songs populating the distant class XXth Opera thanks to a provided mapping representing the fact that XXth Opera is included in XXth Vocal. We claim that such an application of joint querying requires a grounded and semantic way for deﬁning the mappings and measuring their degree of conﬁdence. Manually ﬁnding such mappings is clearly not possible at the Web scale. Therefore, the automatic discovery of semantic mappings is the bottleneck for

68

R. Tournaire et al.

(a) Sample directory from Yahoo!

(b) Sample directory from Google

Fig. 2. Top levels of Yahoo! and Google directories

scalability purposes. Many techniques and prototypes have been developed to suggest candidate mappings between several knowledge representations including taxonomies, ontologies or schemas (see [48,53] for surveys). Particular Focus and Applications of the Proposed Approach Most of the proposed approaches rely on evaluating the degree of similarity between the elements (e.g., classes, properties, instances) of one ontology and the elements of another ontology. Many diﬀerent similarity measures are proposed and often combined. Most of them are based on several syntactic, linguistic or structural criteria to measure the proximity of the terms used to denote the classes and/or their properties within the ontology. Some of them exploit characteristics of the data declared as instances of the classes (e.g. [17,12,34,16]). Almost all the existing matching systems return for every candidate pair of elements a coeﬃcient in the range [0,1] which denotes the strength of the semantic correspondence between those two elements [24,41,8]. Those coeﬃcients are the basis for yearly international comparative evaluation campaigns [22,20]. Those approaches usually consider each candidate mapping in isolation. In particular, they do not take into account possible logical implications between mappings, which can be inferred from the logical inclusion axioms declared between classes within each ontology. This raises a crucial issue: the similarity coeﬃcients returned by the existing ontology or schema matching systems cannot be interpreted as probabilities of the associated mappings. On the other hand, some approaches for detecting semantic mappings by logical reasoning have been proposed (e.g., [52]). By construction, such logical methods are limited to discover mappings that are certain.

Discovery of Probabilistic Mappings between Taxonomies

69

We claim that uncertainty is intrinsic to mapping discovery. It is ﬁrst due to the methods employed for detecting them. Another important reason is that the mappings are usually interpreted as simple semantic relations such as subsumption or equivalence relations between classes, which is often an oversimpliﬁcation of the complex overlapping relation existing in the reality between two classes of diﬀerent and independenty developed ontologies. In this article, we focus on inclusion mappings. Inclusion mappings are of primary importance for ontology alignment applications, but they have received only little attention until now. In addition, inclusion mappings with a probabilistic semantics are useful and easily interpretable by human or automatic systems, because they combine several theoretical and practical advantages: 1. Inclusion mappings are more likely to happen than equivalences (theoretically at least twice), and so our method discovers more information than methods based on equivalences. 2. Inclusion mappings are logically weaker than equivalences but have a clear logical semantic that is directly useful for integration and query rewriting settings. For example, the inclusion mapping Google/Software/Games Yahoo/Entertainment

3.

4.

5.

6.

allows to make a query about Yahoo/Entertainment (or any of its superclasses) and leads to answer this query by considering that all pages in the category Google/Software/Games should be in the answer. In this way, an alignment between Yahoo! and Google directories constituted by inclusion mappings may enable an access to all referenced pages of both Yahoo! and Google with a single query. A large amount of relevant mappings (like the above mapping) could not have been discovered by methods that return only equivalences. The probabilities associated to mappings can be used by others systems that require probabilities, for example in order to know what are the most reliable knowledge. Properly handling uncertainty is a key point for the Semantic Web due to the heterogeneous and unreliable nature of available data. In particular, probabilistic inclusion mappings directly ﬁt the framework of probabilistic databases [18] and probabilistic reasoning [1]. These works contribute to probabilistic query answering which constitutes a major way to exchange data between heterogeneous schemas and ontologies. In a probabilistic query answering setting, users can obtain probabilistic answers to their queries, thanks to a probabilistic reasoning using probabilistic mappings. From an user point-of-view, a probability has a well-understood meaning when standing for conﬁdence: a probability of 0.5 means there is an even chance the mapping would be correct. Probability theory is quite simple, general, popular and grounded for dealing with uncertainty, and the generality and reusability is a key point in computer science. Finally, from a theoretical point of view in the ontology matching domain, there is a need to better understand the fundations of the uncertainty in data and schema integration [54].

70

R. Tournaire et al.

Main Points and Organization of this Article In this paper, we propose an approach to discover automatically probabilistic inclusion mappings between classes of taxonomies, in order to query multiple folksonomies or Web Directories in a way that is grounded, robust and easily interpretable. First, we investigate and compare two instance-based ways of modeling probabilistic mappings which are compatible with the logical constraints declared in each taxonomy. In those two probabilistic models, the probability of a mapping relies on the joint probability distribution of the involved classes. They diﬀer on the property of monotony of the corresponding probability function with respect to the logical implication. For estimating the mappings probabilities, we follow a Bayesian approach to statistics by exploiting the description of the instances categorized in each taxonomy as observations for the involved classes. The point is that to estimate the joint probability distribution of two classes C1 and C2 of diﬀerent taxonomies, we have to determine among the instances that are declared in C1 the ones that can be classiﬁed in C2 (based on their description), and similarly for the classiﬁcation in C2 of instances that are declared in C1 . Diﬀerent classiﬁers can be used for that purpose. Based on the above probabilistic setting, we have designed, implemented and experimented a generate and test algorithm for discovering the mappings whose probability is greater than a given threshold. In this algorithm, the monotony of the probability function is exploited for avoiding the probability estimation of as many mappings as possible. We have performed thorough experiments on controlled synthetic data to measure the performances of such a systematic approach in fonction of the number of possible mappings and the number of instances per classes. We have also performed qualitative experiments to measure the impact of the classiﬁers used to estimate the probabilities on the precision and recall of the mappings returned by our algorithm. The paper is organized as follows. Section 2 presents the formal background and states the problem considered in this paper. Section 3 is dedicated to the deﬁnition and computation of mapping probabilities. In Section 4, we present the algorithm that we propose for discovering mappings with high probabilities (i.e., greater than a threshold). Section 5 surveys the quantitative and qualitative experiments that we have done on synthetic controlled data. Finally, in Section 7, we compare our approach to existing works and we conclude.

2

Formal Background

We ﬁrst deﬁne taxonomies as a graphical notation and its interpretation in the standard ﬁrst-order logical semantics, on which the inheritance of instances is grounded. Then, we deﬁne mappings between taxonomies as inclusion statements between classes of two diﬀerent taxonomies. Finally, we set the problem statement of matching taxonomies that we consider in this paper.

Discovery of Probabilistic Mappings between Taxonomies

2.1

71

Taxonomies: Classes and Instances

Given a vocabulary V denoting a set of classes, a taxonomy TV is a Directed Acyclic Graph (DAG) where each node is labelled with a distinct class name of V, and each arc between a node labelled with C and a node labelled by D represents a specialization relation between the classes C and D. Each class in a taxonomy can be associated with a set of instances which have an identiﬁer and a content description modeled with an attribute-value language. By a slight abuse of notation, we will speak of the instance i to refer to the instance identiﬁed by i. Figure 1 shows two samples of taxonomies related to the Music domain. Bold arrows are used for representing specialization relations between classes, and dashed arrows for membership relation between instances and classes. In both taxonomies, some instances, with attribute-value description denoted between brackets, are associated to classes. For example, #102 is an instance identiﬁer and [Wagner, Tristan und Isold, ...] its associated description. The instances that are in the scope of our data model can be web pages (whose content description is a set of words) identiﬁed by their URL, RDF resources (whose content description is a set of RDF triples) identiﬁed by a URI, or audio or video ﬁles identiﬁed by a signature and whose content description may be attribute-value metadata that can be extracted from those ﬁles. We consider only boolean attribute-value description. Such a description could be obtained by discretization of attribute-value pairs given in a more complex language, like in Figure 1 where textual values are used. We consider that, possibly after a preprocessing which is out of the scope of this paper, the instances are described in function of a ﬁxed set of boolean attributes {At1 , . . . , Atm }. Then, for an instance i, its description, denoted descr(i), is a vector [a1 , . . . , am ] of size m such that for every j ∈ [1..m], aj = 1 if the attribute Atj belongs to the content description of i, and aj = 0 otherwise. Taxonomies have a logical semantics which provides the basis to deﬁne formally the extension of a class as the set of instances that are declared or can be inferred for that class. 2.2

Logical Semantics

There are several graphical or textual notations for expressing the specialization relation between a class C and a class D in a taxonomy. For example, in RDF(S) [33] which is the ﬁrst standard of the W3C concerning the Semantic Web, it is denoted by (C rdf s:subclassOf D). It corresponds to the inclusion statement C D in the description logics notation. Similarly, a membership statement denoted by an isa arc from an instance i to a class C corresponds in the RDF(S) notation to (i rdf :type C), and to C(i) in the usual notation of description logics. All those notations have a standard model-theoretic logical semantics based on interpreting classes as sets: an interpretation I consists of a non empty domain of interpretation ΔI and a function .I that interprets each class as a non empty subset of ΔI , and each instance identiﬁer as an element of ΔI . The classes declared

72

R. Tournaire et al.

in a taxonomy are interpreted as non empty subsets because they are object containers. According to the unique name assumption, two distinct identiﬁers a and b have a distinct interpretation (aI = bI ) in any interpretation I. I is a model of a taxonomy T if: – for every inclusion statement E F of T : E I ⊆ F I , – for every membership statement C(a) of T : aI ∈ C I . An inclusion G H is inferred by a taxonomy T (denoted by T |= G H) iﬀ in every model I of T , GI ⊆ H I . A membership C(e) is inferred by T (denoted by T |= C(e)) iﬀ in every model I of T , eI ∈ C I . Let D be the set of the instances associated to a taxonomy T . The extension of a class C in T , denoted by Ext(C, T ), is the set of instances for which it can be inferred from the membership and inclusion statements declared in the taxonomy that they are instances of C: Ext(C, T ) = {d ∈ D/ T |= C(d)} 2.3

Mappings

The mappings that we consider are inclusion statements involving classes of two diﬀerent taxonomies T1 and T2 . To avoid ambiguity and without loss of generality, we consider that each taxonomy has its own vocabulary: by convention we index the names of the classes by the index of the taxonomy to which they belong. For instance, when involved in a mapping, the class XXth Opera of the taxonomy T2 of Figure 1 will be denoted by XXth Opera2 while the class XXth V ocal of the taxonomy T1 will be denoted by XXth V ocal1 . Mappings between T1 and T2 are of the form A1 B2 or A2 B1 where A1 and B1 denote classes of T1 and A2 and B2 denote classes of T2 . For a mapping m of the form Ai Bj , its left-hand side Ai will be denoted lhs(m) and its right-hand side will be denoted rhs(m). A mapping Ai Bj has the same meaning as a specialization relation between the classes Ai and Bj , and thus is interpreted in logic in the same way, as a set inclusion. The logical entailment between classes extends to logical entailment between mappings as follows. Definition 1 (Entailment between mappings). Let Ti and Tj be two taxonomies. Let m and m be two mappings between Ti and Tj : m entails m (denoted m m ) iﬀ every model of Ti , Tj and m is also a model of m . It is straightforward to show that is a (partial) order relation on the set M(Ti , Tj ) of mappings between the two taxonomies Ti and Tj . If m m , we will say that m is more speciﬁc than m (also that m entails m ) and that m is more general than m (also that m is an implicate of m). The following proposition characterizes the logical entailment between mappings in function of the logical entailment between the classes of their left hand sides and right hand sides.

Discovery of Probabilistic Mappings between Taxonomies

73

Proposition 1. Let m and m be two mappings between two taxonomies. Let Ti be the taxonomy of lhs(m), and Tj the taxonomy of rhs(m). m m iﬀ – lhs(m) and lhs(m ) are classes of the same taxonomy Ti , and – Ti |= lhs(m ) lhs(m) , and – Tj |= rhs(m) rhs(m ) For example, two mappings between taxonomies T1 and T2 of Figure 1 are illustrated in Figure 3: – the mapping XXth Opera2 XXth V ocal1 is more speciﬁc than the mapping XXth Opera2 XXth Century1 , – and the mapping RecentClassical2 XXth Instrumental1 is more speciﬁc than the mapping Ravel2 Classical M usic1.

Fig. 3. 2 mappings between T1 and T2

2.4

Problem Statement

Among all possible mappings between two taxonomies, we want to determine those that are the most probable given the descriptions of the instances associated to each class of the taxonomies. More precisely, the main problem addressed in this paper is the design of an eﬃcient generate and test algorithm which minimizes the number of calls to the probability estimator for determining those mappings whose probability exceeds a certain threshold. The mappings returned by this algorithm will be said probabilistically valid (valid for short). Two subproblems are emphasized. The ﬁrst subproblem to handle is the choice of an appropriate probabilistic model for deﬁning the probability of a mapping. As mentioned in the introduction, a probabilistic semantics of mappings cannot be independent of the logical semantics. In particular, it is expected that a mapping logically entailed by a mapping with a high probability (i.e., whose probability exceed a threshold) will also get a high probability. The second subproblem is then to ﬁnd a good probability estimator to compute mapping probabilities, given two taxonomies and the description of their instances.

74

3 3.1

R. Tournaire et al.

Mapping Probabilities: Models and Estimation Modeling Probabilistic Mappings

We have considered two relevant probabilistic models for modeling uncertain mappings. They are both based on the discrete probability measure deﬁned on subsets of the sample set representing the set of all possible instances of the two taxonomies. From now on, we will denote P r(E) the probability for an instance to be an element of the subset E. The ﬁrst model deﬁnes the probability of a mapping Ai Bj as the conditional probability for an instance to be an instance of Bj knowing that it is an instance of Ai . It is the natural way to extend the logical semantics of entailment to probabilities. The second model comes directly from viewing classes as subsets of the sample space: the probability of Ai Bj is the probability for an element to belong to the set Ai ∪ Bj , where Ai denotes the complement set of Ai in the sample set. These two models are described in the following deﬁnition. Definition 2 (Two probabilities for a mapping). Let m be a mapping of the form Ai Bj . – Its conditional probability, denoted Pc (m), is deﬁned as: Pc (m)=P r(Bj |Ai ). – Its implication probability Pi (m), is deﬁned as: Pi (m) = P r(Ai ∪ Bj ). The following proposition states the main (comparative) properties of these two probabilistic models. In particular, they both meet the logical semantics for mappings that are certain, and they can both be equivalently expressed using joint probabilities. Proposition 2. Let m be a mapping between two taxonomies Ti and Tj . The following properties hold: 1. Pi (m) ≥ Pc (m). 2. If m is a certain mapping (i.e., Ti Tj |= m): Pc (m) = Pi (m) = 1. 3. Pi (m) = 1 + P r(lhs(m) ∩ rhs(m)) − P r(lhs(m)) 4. Pc (m) =

P r(lhs(m)∩rhs(m)) P r(lhs(m))

They diﬀer on the monotony property w.r.t the (partial) order corresponding to logical implication (cf. Deﬁnition 1): Pi veriﬁes a property of monotony whereas Pc veriﬁes a property of weak monotony as stated in the following theorem. Theorem 1 (Property of monotony). Let m and m two mappings. 1. If m m then Pi (m) ≤ Pi (m ) 2. If m m and lhs(m) = lhs(m ) then Pc (m) ≤ Pc (m )

Discovery of Probabilistic Mappings between Taxonomies

75

The proof [57] results from Proposition 1 and Proposition 2 which relate mappings with the classes of their left hand sides and right hand sides for logical entailment and probabilities respectively, and from considering (declared or inherited) class inclusions within each taxonomy as statements whose probability is equal to 1. 3.2

Bayesian Estimation of Mappings Probabilities

As shown in Proposition 2, the computation of Pi (m) and Pc (m) relies on computing the set probability P r(lhs(m)) and the joint set probability P r(lhs(m) ∩ rhs(m)). These values are unknown and must be estimated. They are the (unknown) parameters of the underlying Bernoulli distributions modeling the membership function to a set as a random variable taking only two possible values 0 or 1. Following the Bayesian approach to statistics [14], we model these (unknown) parameters as continuous random variables, and we use observations to infer their posterior distribution from their prior distribution. In the absence of any particular knowledge, the prior distribution is usually set to the uniform distribution. In probability theory, a natural way of estimating the value of a parameter modeled by a random variable is to take its expected value. All this is summarized in the following deﬁnition. Definition 3 (Bayesian estimator of P r(E)). Let E be a subset of the sample set Ω. Let O be a sample of observed elements for which it is known whether they r(E), is the belong or not to E. The Bayesian estimator of P r(E), denoted P expected value of the posterior distribution of P r(E) knowing the observations on the membership to E of each element in O, and setting the prior probability of a random set to 12 , and of the intersection of two random sets to 14 . Setting the prior probabilities to 12 and 14 depending on whether E is a class or a conjunction of classes corresponds to the uniform distribution of instances among the classes. Let Ext(E, O) be the set of observed instances of O that are recognized to be instances of E. According to a basic theorem in probability theory (Theorem 1, page 160, [14]), if the prior distribution of the random variable modeling P r(E) is a Beta distribution of parameters α and β, then its posterior distribution is also a Beta distribution the parameters of which are: α+|Ext(E, O)| and β +|O|. The Beta distribution is a family of continuous probability distributions parameterized by two parameters α and β which play an important role in Bayesian statistics. If its two parameters are equal to 1, it corresponds to the uniform disα tribution for the associated random variable. Its expected value is: α+β . In our setting, the set O is the union of the two (possibly disjoint) sets Oi and Oj of instances observed in two distinct taxonomies Ti and Tj . This raises the issue of computing the set Ext(E, Oi ∪ Oj ), specially when E is the conjonction of a class Ci of the taxonomy Ti and a class Dj of the other taxonomy Tj . In this case: i ∩ Dj , Oi ∪ Oj ) = Ext(C i , Oi ∪ Oj ) ∩ Ext(D j , Oi ∪ Oj ) Ext(C

76

R. Tournaire et al.

Since the two taxomomies have been created and populated independently by diﬀerent users, the only information that can be extracted from those two taxonomies are the extensions of each class within each taxonomy: Ext(Ci , Ti ) and Ext(Dj , Tj ). By construction, it is likely that their intersection contains very few instances or even no instance at all. Therefore, we use automatic classiﬁers to compute Ext(E, O). The machine learning community has produced several types of classiﬁers that have been extensively (theoretically and experimentally) studied (see [44] for a survey) and have been made available through open source platforms like Weka [61]. They are based on diﬀerent approaches (e.g., Naive Bayes learning, decision trees, SVM) but they all need a training phase on two sets of positive and negative examples. Let C be a classiﬁer. Let E be a class of one of the two taxonomies that we denote by Ti , the other one being denoted Tj . For computing Ext(E, O) we follow the same approach as [17]: – C is trained on the descriptions of the elements of the two sets Ext(E, Ti ) and Oi \Ext(E, Ti ) taken as the sets of positive and negative examples respectively, – C is then applied to each instance of Oj to recognize whether it belongs to E or not. As a result, the following theorem provides a simple way to compute the Bayesian c (m) of the two probabilities Pi (m) and Pc (m) deﬁned estimations Pi (m) and P in Deﬁnition 2. Theorem 2 (Estimation of mapping probabilities). Let m : Ci Dj be a mapping between two taxonomies Ti and Tj . Let O be the union of instances i , O)|, Nj = |Ext(D j , O)| and observed in Ti and Tj . Let N = |O|, Ni = |Ext(C i ∩ Dj , O)|. Nij = |Ext(C 1+Nij i − 1+N 4+N 2+N 1+Nij 2+N × 1+N 4+N i

– Pi (m)

=1+

c (m) – P

=

1+N

ij i It is worth comparing the (Bayesian) ratios 1+N 2+N and 4+N appearing in the c (m) in Theorem 2 with the corresponding formulas for computing Pi (m) and P Nij Ni ratios N and N that would have been obtained by following the standard (frequency-based) approach of statistics (as it is the case for instance in [17]). The corresponding ratios converge to the same expected value when there are many instances, but the Bayesian ratios are more robust to a small number of instances. In contrast with the frequency-based approach, they are deﬁned even in the case where no instance is observed: their respective values (i.e., 12 and 1 ) in this particular case correspond to the probability of random sets and the 4 joint probability of of two random sets respectively for a uniform distribution of instances in the sample set.

Discovery of Probabilistic Mappings between Taxonomies

4

77

Discovery Process of Probabilistic Mappings

Given two taxonomies Ti and Tj (and their associated instances), let M(Ti , Tj ) be the set of all mappings from Ti to Tj (i.e., of the form Ci Dj ). The ProbaMap algorithm determines all mappings m of M(Ti , Tj ) that verify two probabilisticbased criterion of Pu (m) ≥ Su and Pc (m) ≥ Sc , where Su and Sc are two thresholds in [0, 1]. Candidate mapping generation - The principle of ProbaMap algorithm is to generate mappings from the two sets of classes in the two taxonomies ordered according to a topological sort [10]. Namely, (see the nested loops (Line 4) in Algorithm 1) it generates all the mappings Ci Dj by enumerating the classes Ci of Ti following a reverse topological order and the classes Dj of Tj following a direct topological order. The following proposition is a corollary of Proposition 1. Proposition 3. Let Ti and Tj two taxonomies. Let ReverseT opo(Ti ) be the sequence of classes of Ti resulting from a reverse topological sort of Ti . Let T opo(Tj ) be the sequence of classes of Tj resulting from a topological sort of Ti . Let m : Ci Dj and m : Ci Dj two mappings from Ti to Tj . If m is an implicant of m (i.e., m m), then Ci is before Ci in ReverseT opo(Ti ) or Ci = Ci and Dj is before Dj in T opo(Tj ). This proposition ensures that the mappings are generated in an order that respect the logical implication, according to the knowledge of the taxonomies. Pruning the candidate mappings to test - Based on the monotony property of the probability function Pi (Theorem 1), every mapping m implicant of a mapping m such that Pi (m) < Su veriﬁes Pi (m ) < Su . Therefore, in ProbaMap, we prune the probability estimation of all the implicants of every m such that Pi (m) < Su . We shall use the notation Implicants(m) to denote the set of all mappings that are implicants of m. Similarly, based on the property of weak monotony of the probability function Pc (Theorem 1), when a tested candidate c (m) < Sc we prune the probability estimation of all mapping m is such that P the implicants of m having the same left-hand side as m. We shall denote this set: Implicantsc(m). Based on Proposition 1, Implicants(m) and Implicantsc(m) can be generated from Ti and Tj . The ProbaMap Algorithm - The resulting ProbaMap algorithm is described in Algorithm 1, in which – primitive functions like Implicants and Implicants c returns the implicants of the mapping in argument and the implicants having the same class at the left-hand side. – Topo andReverseTopo return the sequences of classes of a taxonomy that respect the logical implication order (respectively in the direct and reverse directions). – exti and extj represent the two maps that store the extension of each class of respectively Ti and Tj , or the extensions obtained by classiﬁcation when it is enabled.

78

R. Tournaire et al.

Algorithm 1 returns mappings directed from Ti to Tj . In order to obtain all valid mappings, it must be applied again by swapping its inputs Ti and Tj . Algorithm 1. ProbaMap Require: Taxonomies (DAG) Ti , Tj , thresholds Sc , Si , Instances maps exti , extj c (m) ≥ Sc } Ensure: return {m ∈ M(Ti , Tj ) such that Pi (m) ≥ Si and P 1: MV al ← ∅ 2: MNV al ← ∅ 3: for all Ci ∈ ReverseTopo(Ti ) do 4: for all Dj ∈ Topo(Tj ) do 5: let m = Ci Dj 6: if m ∈ MNV al then 7: if Pi (m, exti , extj , Ti , Tj ) ≥ Si then c (m, exti , extj , Ti , Tj ) ≥ Sc then 8: if P 9: MV al ← MV al ∪ {m} 10: else 11: MNV al ← MNV al ∪Implicants c(m, Tj ) {Pruning using the weak monotony} 12: else 13: MNV al ← MNV al ∪Implicants(m, Ti , Tj ) {Pruning using the strong monotony} 14: return MV al

The discovered valid mappings are stored in the set MN V al . Mappings that are pruned are stored in the set MN V al . The nested loops in Line 4 in Algorithm 1 generate all the mappings Ci Dj from the most general to the most speciﬁc mapping. Based on this, Proposition 3 shows that this algorithm maximizes the number of pruning. When a mapping is generated and if it is not already pruned, it is ﬁrstly tested with regard to Pi (m) ≥ Si in Line 7, then if it is positive, it is tested with Pc (m) > Sc in Line 8. In the case Pi (m) < Si , all the implicants of m are marked as pruned (Line 13), thanks to the strong monotony property of Pi . If Pi (m) ≥ Si but Pc (m) < Sc , then the property of weak monotony conducts to prune all the implicants of m with the same left-hand side, in Line 11 and they are added to MN V al . Hence, MN V al set is kept up to date by containing all the pruning mappings from the beginning. The general structure for the implementation of ProbaMap is pictured in Figure 4. 1. Computation of the transitive logical closure of the two taxonomies and their instances given as input. (in order to obtain quickly the class extensions and to optimize the pruning process) 2. Automatic classiﬁcation step in order to merge the instances of the two taxonomies. This step has been detailed in section 3.4. It can be disabled by the user. 3. Core of ProbaMap: generation-and-test of candidate mapping, according to the two thresholds as input parameters.

Discovery of Probabilistic Mappings between Taxonomies

79

Fig. 4. Structure of ProbaMap

The output is directly the set of mappings from the ﬁrst taxonomy to the second one for which Pi and Pc both exceed their respective thresholds Su and Sc . We insist on the fact that, once the parameters are set, the discovery process is fully automatic. As many matching methods provide equivalences as their results, we can add a postprocessing for ProbaMap in order to obtain equivalences logically implied by the inclusion mappings discovered. In this case, the thresholds Su and Sc should be lowered with respect to the discovery of pure implications, because equivalences are stronger relations than implicants, and then less likely to happen.

5

Experiments

We have performed three series of experiments : 1. on controlled synthetic data 2. on OAEI directory benchmark 3. on Web Directories, with a comparison with the method SBI [34] This section is structured according to these three series. 5.1

Experiments on Synthetic Data

For the purpose of systematic testing of our approach in various conditions, we have evaluated Algorithm 1 on synthetic data on which we can control important parameters and guarantee structural or distributional properties. Diﬀerent analysis have been conducted. We have measured the impact of the probabilistic models and of the thresholds involved in the validity criteria on

80

R. Tournaire et al.

the precision of the results and on the pruning ratio. The pruning ratio is the ratio of mappings that are pruned by adding in lines 13 and 11 Implicant(m) (or Implicantc(m)) to the set MN V al of unvalid mappings without estimating their probabilities. We have also measured the qualitative and quantitative impact of the choice of a classiﬁer. Automatic classiﬁcation is at the core of the estimation of Pi and c with the computation of Ext(C i ∩ Dj , O) (see Theorem 2). For evaluating P the quality of our results, we use the standard criteria of precision and recall [58]. Recall is the ratio of returned results that are expected w.r.t. all expected results. Precision is the ratio of returned results that are expected w.r.t. all returned results. In order to measure the robustness we have tested the robustness of Algorithm 1 to noisy data. Finally, we have conducted experiments on real-world data to check the scalability of the Algorithm 1 on large taxonomies and to make qualitative measurements. We ﬁrst describe the principles and the process of the data generator on which we have conducted the diﬀerent experiments. Then we describe the experimental protocol that we have followed. Finally, we summarize the main experimental results that we have obtained on synthetic and real-world data. Synthetic data generation is divided into three steps : generation of taxonomies with ﬁxed sizes, generation of the expected mappings to discover, and population of each class by generating a ﬁxed number of instances and associated description. Generation of taxonomies - Given constraints on number of classes n1 and n2 , we generate the structure of the two respective taxonomies T1 and T2 as a forest of general trees (unconstrained in-degrees) by using a Boltzmann sampler for unlabelled trees described in [19]. We use a reject method to get random forests with n1 and n2 nodes. This method is simple and eﬃcient while guaranteeing an uniform distribution among the trees with the same number of nodes. Then, we label each node by a distinct class name. In our experiments, we set n1 = n2 so the two taxonomies have the same size, which is the unique parameter of the taxonomies generation. Mappings generation - We initialize the generation of mappings to be discovered MG with a set MS of seed mappings, whose size is either ﬁxed or randomly chosen between 1 and the common size of the taxonomies. Each mapping m ∈ MS is generated by a random choice for the two classes lhs(m) and rhs(m) in T1 and T2 , or in T2 and T1 , depending on the mapping direction which is randomly chosen too. We reject mappings which logically entail class inclusions that are not entailed within each taxonomy (i.e., we forbid generated mappings to modify the knowledge of each taxonomy). The set MG of all mappings to discover will then be the set of mappings that can be logically entailed by MS and the two taxonomies.

Discovery of Probabilistic Mappings between Taxonomies

81

Following [21], the computation of precision and recall will be based on MG . Let R be the result of Algorithm 1. The recall is the proportion of mappings of MG actually returned by Algorithm 1: Recall =

MG ∩ R MG

The precision is the proportion of returned mappings that are actually in MG : P recision =

MG ∩ R R

Instances and description generation - For this step, we consider the two taxonomies and the mappings between them as a whole DAG of classes. The acyclicity of that graph is guaranteed by the constraints imposed in the production of the set MG of generated mappings described above. We ﬁrst generate a set of boolean attributes suﬃcient to associate a minimal intentional description of each class respecting the semantic partial order conveyed by the above DAG structure. Then, we use this intentional knowledge to generate accordingly the description of the instances with which we populate each class in the taxonomies. Generation of the intentional description of classes: We traverse the DAG of classes according to a reverse topological order [10] starting from the most general classes that constitute the level 0, and we iterate the following process for generating the intention of classes as sets of attributes: - For each class Ci0 of level 0, we generate a disjoint set of distinct attributes At0i and we set the intention of Ci0 , denoted Int(Ci0 ), to be At0i . - For each class Cij of level j (according to the reverse topogical order), we generate a set Atji of novel attributes (disjoint from the set of existing attributes) with a size ﬁxed to the out degree of Cij in the DAG of classes, and we set Int(Cij ) to be Atji ∪ Int(Cij−1 ), where the Cij−1 are the successors of Cij in k k the DAG. Population of classes: Let {At1 , . . . , Atm } be the set of attributes generated at the previous step. We populate each class with nP instances, and we associate to them descriptions that respect the corresponding intentional description, as follows: For each class C, each of its instances is described by a boolean vector [a1 , . . . , am ] obtained by: - setting to 1 each ai such that the corresponding attribute Ati is in the intention of the class C, - randomly setting the other values aj to 0 or 1. This way, by construction, all the instances in the extension of a class have in common that all the attributes in Int(C) are present in their description. In Section 5.1 we will use an oracle classiﬁer which classiﬁes an instance i in the class C iﬀ all the attributes of the intention Int(C) of the class C are present in the description of i.

82

R. Tournaire et al.

The results of data generation can be summarized into a table Tdata with m + nC columns where m is the number of generated attributes and nC is the number of generated classes, and each tuple [a1 , . . . am , c1 , . . . , cnc ] concatenates the description [a1 , . . . am ] of an instance in terms of attributes, and its categorization [c1 , . . . , cnc ] with respect to the classes: for each i ∈ [1..nC ] ci is equal to 1 if i ∈ Ext(C) and to 0 if it is not the case. Connection with Armstrong relations - Data generation in our context represents quite challenging issue when compared to other synthetic data generation, such as functional Armstrong relation for functional dependencies3 [4], or transactional databases for frequent itemsets [49]. Roughly speaking, we borrow the same principles than those developed for Armstrong relations [25]. Each generated mapping should be satisﬁed by the generated data, and each mapping that is not generated should be contradicted by the generated data. With regard to the Armstrong relations, our generation tackles additional issues. Firstly, the structure may be generated whereas Armstrong relations suppose the structure (schemas) given. Secondly, for relational databases, it is enough to generates two tuples that contradict one dependency for ensuring that the dependency is not satisﬁed. In our case where mapping probabilities are estimated from statistics on class extensions, the amount of instances that contradict a mapping has a strong impact on its validity, then we can not only generate one instance for each mapping to be contradicted. Experimental protocol - We ﬁrst explain the goals of the successive experiments that we have performed. For an overview of the complete setting, the connection between ProbaMap and the generator is pictured in Figure 5. Distinguishing between the parameters of ProbaMap (two thresholds for Pc and Pi and the kind of classiﬁer) and thoses of the generator is a key point. The ﬁrst goal is to analyze the impact on the precision of the thresholds Sc , Su , with the purpose to ﬁx appropriate thresholds for the next experiments. c and Pi on The second goal is to analyze the impact of the probabilities P the pruning ratio of the algorithm. The purpose is to determine among three validity criteria the one oﬀering the best performances : using Pi alone, using Pc alone, or using both of them. The third goal is to analyse and compare the impact both on precision/recall and on total running time of three real classiﬁers (Naive Bayes, C4.5 and SVM) for estimating the probabilities. The purpose is to determine the classiﬁer oﬀering the best tradeoﬀ between quality of results and running time. Note that we do not take the learning time of classiﬁers into account because we consider that this task can be precomputed for each taxonomy. 3

An Armstrong relation for a set of functional dependencies is a relation that satisﬁes each dependency implied by the set and does not satisfy any dependency that is not implied by it.

Discovery of Probabilistic Mappings between Taxonomies

83

Fig. 5. Structure of ProbaMap

The last goal with synthetic data is to analyse the robustness of the approach to noisy data. For all the experiments on synthetic data presented in this section, each point is obtained by averaging the results of 100 runs. For each run, a new synthetic dataset is generated with the appropriate parameters. Note that in our experiments we generate taxonomies with few dozens of classes. The number of random taxonomies of such sizes can be counted in billions. Thus, averaging over 100 runs for a point does not prevent from local variations, leading to curves that are not smooth. Our algorithm is written in Java and compiled using Sun Java version 1.6. We run all the tests on a quad-core Intel Q6700 Xeon at 2.66 GHz with 4 GB of memory. The OS is Ubuntu Linux 8.10. For all the experiments measuring run times, only one instance of our program and the OS are running on the machine, to avoid memory contention eﬀects with other programs that would aﬀect the results. Experimental results 1 - Impact of thresholds on precision We compare the inﬂuence of the thresholds Sc and Su associated to probabilities c and Pi on the quality of the results returned by Algorithm 1. For doing so, P c ≥ Sc and Pi ≥ Su . we run the algorithm with the validity criterion: P In this experiment, the computation of probabilities is performed using the oracle classiﬁer. The parameters in the synthetic generator are deﬁned such that |M(T1 , T2 )| = 320. We set the number of seed mappings |MS | = 4. Note that by logical entailment the total number |MG | of mappings to be discover may be much greater. For each couple of threshold (Sc , Su ) ∈ [0.78 : 0.995]2, we compute

84

R. Tournaire et al.

1

Precision contours Oracle classifier

0.98

0.99 0.98 0.97 0.96 0.95 0.94 0.93

0.96 0.94

Su

0.92 0.9 0.88 0.86 0.84 0.82 0.8 0.78 0.78 0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98 Sc

1

Fig. 6. Precision w.r.t. Sc and Su thresholds

the precision and the recall of the results of Algorithm 1. We observed that the recall remains constant at 1.0 independently of values of Sc and Su . This is because thanks to the oracle classiﬁer, estimated probabilities for the mappings of MG are very close to 1, and superior to all experimented thresholds, leading to a perfect recall. Thus, we only show the results for precision in Figure 6. The ﬁgure shows the contours of the diﬀerent precision levels, from a precision of 0.93 to a precision of 0.99. From the shape of these contours, it is clear that c and Pi have an inﬂuence on precision. As the relation Pi ≥ P c holds both P (Proposition 2), under the diagonal Pi has no inﬂuence on precision. c c is more discriminant than Pi . The ﬁgure shows that P The probability P inﬂuences the precision for a large range of values of the threshold Sc , while Pi only has an inﬂuence for very high values of Su . We have observed the estimated probabilities for diﬀerent mappings, and found that there is an important gap in c between valid and invalid mappings. This gap is much smaller the values of P for Pi . Pi giving higher probability values to invalid mappings, this explains why it can only have an inﬂuence on precision at very high Su values. Based on the curve of Figure 3, we ﬁx the thresholds at (Sc = 0.83, Su = 0.96) for experiments where the classiﬁer used to estimate the probabilities is the oracle. This gives a good precision of 0.95, and maps to a region where Pi has an inﬂuence on the quality of results. For the experiments in which a real classiﬁer is used to estimate the probabilities, we ﬁx the thresholds at (Sc = 0.85, Su = 0.90) to be tolerant to classiﬁcation errors. 2 - Impact of probabilities on pruning ratio We now study the impact of three probabilistic criteria for testing the validity of a mapping on the pruning ratio performed by Algorithm 1 : 1. using only Pi by setting Sc = 0 (bypass) 2. using only Pc by setting Su = 0 (bypass) 3. using both Pi and Pc Figure 7 shows the ratio of pruning made by the algorithm using the diﬀerent validity criteria, w.r.t. a naive approach not doing any pruning.

Discovery of Probabilistic Mappings between Taxonomies

85

Pruning factor w.r.t. number of candidate mappings Oracle classifier 0.5

Pu>Su Pc>Sc Pu>Su and Pc>Sc

0.45

Ratio

0.4

0.35

0.3

0.25

0.2 0

500

1000

1500 2000 2500 number of mappings

3000

3500

4000

c , Pi and P c and Pi Fig. 7. Pruning factor for validity computation with P

The validity computation using only Pi is the one that prunes the least mappings, computing probabilities for about 40% of all mappings of M(Ti , Tj ). Both c and P c and Pi does more prunings and obtain a signiﬁcant reduction of the P c obtains slightly better results than using search space. Combining Pi and P c ≥ Sc and Pc alone, so for the remainder of this experiments section, we use P Pi ≥ Su as validity criterion. It allows to compute the probabilities for only 20% of the mappings of M(Ti , Tj ), when the number of candidate mappings is high. For example, a search space of 3500 mappings is pruned up to 22% by combining Pc and Pi , that implies that there are only 770 examinated mappings for which Pi and Pc are estimated. 3 - Impact of the classifiers In this subsection, we replace the oracle classiﬁer with a real classiﬁer. We compare the results given by three well-known classiﬁers: Naive Bayes [44], C4.5 [47] Computation time (s) w.r.t. candidate mappings number 200 ins/class - 100 runs/pt 140

Naive Bayes C4.5 SVM

Computation time (s)

120 100 80 60 40 20 0 0

500

1000 1500 2000 Candidate mappings number

2500

3000

Fig. 8. Computation time (s) for diﬀerent classiﬁers

86

R. Tournaire et al.

and SVM [27]. We use the Weka implementation of these classiﬁers and have interfaced it with our code. The comparisons of running times are shown in Figure 8 and in log scale in Figure 9. A ﬁrst conclusion is that the running times are polynomial in the number of mappings, and are very similar, with Naive Bayes being slightly slower than C4.5 and SVM. Comparisons for precision and recall are shown in respectively Figure 10 and Figure 11. Naive Bayes has both the worst recall and the worst precision, the choice is thus between C4.5 and SVM. They seem to have similar results. However, the learning time of SVM (not shown here) is much longer than the learning time of C4.5. We thus choose C4.5 for further experiments, and analyse the impact of the number of instances per class on the classiﬁcation performance of Algorithm 1 with C4.5. Computation time (s) w.r.t. candidate mappings number - Logarithmic scale 200 inst/class - 100 runs/pt 5

Naive Bayes C4.5 SVM

log(computation time)

4

3

2

1

0

-1 5.5

6

6.5 7 log(candidate mappings number)

7.5

8

Fig. 9. Computation time in log scale for diﬀerent classiﬁers

Precision w.r.t. candidate mappings number 200 inst/class - 100 runs/pt 1

NB C4.5 SVM

0.95 0.9

Precision

0.85 0.8 0.75 0.7 0.65 0.6 0

500

1000 1500 2000 Candidate mappings number

2500

Fig. 10. Precision for diﬀerent classiﬁers

3000

Discovery of Probabilistic Mappings between Taxonomies

87

Recall w.r.t. candidate mappings number 200 inst/class - 100 runs/pt 1 0.99 0.98 0.97 Recall

0.96 0.95 0.94 0.93 0.92 NB C4.5 SVM

0.91 0.9 0

500

1000

1500

2000

2500

3000

Candidate mappings number

Fig. 11. Recall for diﬀerent classiﬁers

Computation time (s) w.r.t. nb inst/class Classifier:C4.5 - 100 runs/pt 18 16

Computation time (s)

14 12 10 8 6 4 2 0 0

50

100

150 200 250 300 number of instances/class

350

400

450

Fig. 12. C4.5 - Computation time (s) - impact of number of inst/class

We vary the number of instances per class nP between 10 and 450. The results for computation time, precision and recall are shown in Figures 12, 13, and 14. In this experiment, the number of classes and of mapping is constant, hence the number of classiﬁcations to perform is linear in the number of instances. The C4.5 algorithm takes linear time in the number of instances. In fact, all instances are classiﬁed into each class at the beginning of ProbaMap, this is also perfectly coherent. This linearity is also the case for Algorithm 1, as shown by Figure 9. Increasing the number of instances per class only increases slightly precision, whereas it strongly improves recall. The most important point to note is that excellent values of precision and recall are obtained with as few as 50 instances per class, as expected, with a use of Bayesian approach of statistics.

88

R. Tournaire et al.

Precision w.r.t. nb inst/class Classifier: C4.5 - 100 runs/pt 1 0.95 0.9

Precision

0.85 0.8 0.75 0.7 0.65 0.6 0

50

100

150 200 250 300 number of instances/class

350

400

450

Fig. 13. C4.5 - Precision - impact of number of inst/class Recall w.r.t. nb inst/class Classifier: C4.5 - 100 runs/pt 1

0.95

Recall

0.9

0.85

0.8

0.75 0

50

100

150 200 250 300 number of instances/class

350

400

450

Fig. 14. C4.5 - Recall - impact of number of inst/class

4 - Robustness to noisy data In order to test the robustness to noise of our algorithm, we deﬁne a new parameter θ corresponding to the quantity of noise to inject in the synthetic data. Each dataset produced by the synthetic data generator goes through a step of noise application, where each boolean corresponding to the value of an attribute for an instance can be reversed with a probability θ. The new dataset is then processed as usual by Algorithm 1. The variations of precision and recall for values of θ ∈ [0; 0.3] are show in Figure 15. The ﬁgure shows that recall gracefully degrades when noise increases. At 10% noise, the recall is nearly unaﬀected, at a value of 0.95. Values of noise superior to 15% have a more signiﬁcant impact and lead to poor recall. Precision, however, exhibits a diﬀerent behavior. It ﬁrst increases with noise, before abruptly decreasing for more than 24% of noise.

Discovery of Probabilistic Mappings between Taxonomies

89

Recall and precision w.r.t. noise level Classifier: C4.5 - 100 runs/pt 1 0.9 0.8 0.7 0.6 0.5 0.4 Recall Precision 0.3 0

0.05

0.1

0.15 Noise level

0.2

0.25

0.3

Fig. 15. C4.5 - Precision and recall - impact of noise level

In order to understand this phenomenon, we have investigated in details the classiﬁer results and the values of probabilities given to mappings. We found that for 0% noise, there are invalid mappings that are incorrectly given too high probabilities, and that appear as valid. This explains the non-perfect 0.88 precision value. The probability values for these mappings are close to the threshold. Increasing noise makes the classiﬁers more selective, and tends to decrease the values of all probabilities. So the probabilities of these invalid mappings go below the threshold for a moderate amount of noise, whereas the probabilities of valid mappings remain above the threshold. Thus the precision increases. 5.2

Real-World OAEI Data

We have made experiments on the directory set of OAEI [22,20] contest. This set is constituted by two large taxonomies of respectively 2857 and 6628 classes. For the contest, due to scalability issues, the taxonomies are split into the set of their branches, a small subset of which is given to the competitors for mapping alignment. In contrast, our algorithm is able to handle the two whole taxonomies, thus taking advantage of the complete structure of the taxonomies. It is important to note that without pruning, this would lead to a search space of 30 million mappings. For compensating the absence of available instances for these taxonomies, we use a method inspired by [52] to automatically populate the classes with synsets4 . We follow a similar method to Ctx-Match [52]. Intuitively, the principle of this population is based on associating each class C with a set of WordNet synsets (atomistic semantic unit in this thesaurus) that reﬂect the right meaning of the label of C in the context where it appears, i.e. the labels of the ancestor classes of C in its taxonomy. This help to disambiguate the meaning of a word: for instance, the label “Arizona” can correspond to a state of the U.S.A. or to a 4

In WordNet, a synset is a semantic unit, 2 synonyms correspond to the same synset.

90

R. Tournaire et al.

snake. If the Arizona class is a child class of “Animals”, the label “Animals” can be used to guess that “Arizona” means the snake species. On the two whole taxonomies, the population phase produces about 30000 instances and takes 5 hours while the mapping discovery algorithm itself only takes 11 minutes. These are very reasonable computational times for handling 30 million possible mappings. For evaluating the precision of the set of mappings discovered by our algorithm, we could only compute a lower bound based on the partial reference provided by OAEI. The results are promising, as for the thresholds Su and Sc respectively set to 0.9 and 0.8 we obtained a lower bound of precision of 67%5 . 5.3

Comparative Analysis on Collected Web Directories

In this section, we test ProbaMap on part of Internet directories from Yahoo! and Google (actually based on Dmoz) that are rooted with similar or very close label. These sub-directories are considered as taxonomies, and URLs referenced inside each class of the taxonomy as instances. The main diﬀerence with the sequence of experiments in the previous section is that the dataset contains original instances that are collected with their taxonomies, avoiding to process an artiﬁcial population. We compare our approach to the SBI algorithm of Ichise et al. [34,35], which is dedicated to the discovery of mappings between Internet directories, and the integration of such directories. Internet directories are trees of categories, which can be seen as taxonomies, categories being the classes. Each category contains a set of links (i.e. URLs to web sites), which can be seen as the instances of the class. Each link comes with a small text summary, whose words can be seen as instance attributes for classiﬁcation. Our datasets are corresponding locations in the Yahoo! and Google directories, that have also been used in the experiments of [34,35]: – Yahoo! : Recreation / Automotive & Google : Recreation / Autos – Yahoo! : Recreation / Outdoors & Google : Recreation / Outdoors – Yahoo! : Computers and Internet/Software & Google : Computers/ Software – Yahoo! :Arts / Visual Arts / Photography & Google : Arts /Photography The data from the directories was collected in June 2010, so is diﬀerent from the data of [35] and [34] which was collected in Fall 2001. Table 1 shows for each dataset the number of classes and instances in each class, and the number of instances shared between the Yahoo! and the Google directories. Two instances are considered shared if they correspond to the same URL in both directories. For a fair comparison, we have implemented both ProbaMap and the SBI algorithm in Java. 5

Fore more details, see http://disi.unitn.it/~pane/OAEI/2009/directory/result/

Discovery of Probabilistic Mappings between Taxonomies

91

Table 1. Statistics on data collected from subdirectories on Yahoo! and Google Yahoo! classes instances Autos Outdoors Software Photography

947 2428 323 168

4406 5511 2390 1851

Google classes instances 967 1441 2395 321

6425 13863 30140 3852

shared instances 837 623 572 286

Experimental protocol - An overview of the setting used for the comparative experiment is pictured in Figure 16. SBI takes as input one source taxonomy, one target taxonomy, and the instances declared for each class of both taxonomies. For each class Cs of the source taxonomy, SBI returns a rule Cs → Ctpredicted associating Cs to a target class Ctpredicted in the target taxonomy. In order to ﬁt the evaluation framework of SBI, we added a postprocessing to ProbaMap to obtain a similar form of results, i.e. a set of unique rules for each class of the source taxonomy. The complete process is the following: 1. Application of ProbaMap on T1 and T2 2. For each class C1 of T1 , among all C2 for which the two mappings C1 C2 and C2 C1 have been c (C1 C2 ), P c (C2 C1 )) discovered, select the class C2 for which min(P has the highest value. 3. For each class C1 of T1 , if there is no rule for C1 , associate to C1 the rule of its closest ancestor in T1

Fig. 16. Setting used for the comparative experiment between SBI and ProbaMap

92

R. Tournaire et al.

In this way we obtain an unique rule C1 → C2 for each class of T1 , like the SBI system. As there are a good proportion of shared instances for each directory that makes the classiﬁcation step not mandatory, we can and we will compare SBI against both ProbaMap versions with and without the classiﬁcation step. For the version without classiﬁcation, preliminary tests lead to set Sc = 0.6 and Su = 0.9. SVM is set to be the default classiﬁer by using the SMO implementation in weka. When using classiﬁcation, because of the small values of the joint probabilities of classes, we set Su = Sc = 0. In this case, the postprocessing step is modiﬁed to take into account the probabilities Pc and Pi of the reverse mappings. Comparative qualitative evaluation - The goal of our experiments is to compare the quality of Internet directories alignment for ProbaMap and SBI. For the discovery of mappings, ProbaMap and SBI receive a “training” set of instances which is a subset of the shared and annotated instances. The test set used for evaluation of the discovered mappings is constituted by the remaining instances among the shared ones. In the case where ProbaMap is set to use classiﬁcation, the training set is extended with all the non shared instances. The ratio of the number of shared instances used for training among all shared instances is a controlled variable on the experiments we have conducted. The classiﬁcation is performed using the SVM implementation SMO [27] in Weka [61], where the classiﬁcation attributes for an instance are the words of its summary. The experiments on controlled data have shown that SVM and C4.5 are comparable in quality but that C4.5 is faster than SVM. Nevertheless, we have conducted this experiment with SVM which is expected to perform better on sparse data. The evaluation is done by using the test set of instances. Each instance of this set belongs to a class Cs of the source taxonomy. Hence, we can compare: – the class Ctpredicted of the instance, predicted by the output rule Cs → Ctpredicted – the class Ct in which the instance is declared in the target taxonomy (each instance of the test set is common to the two taxonomies) The accuracy measures the ratio of the instances in the test set for which Ctpredicted = Ct . Accuracy is a standard micro-averaged evaluation measure which is based on instances, whereas recision and recall are macro-averaged measures based on mappings themselves. As there is no reference of mappings provided for the considered Yahoo! and Google subdirectories, but a suﬃcient proportion of instances are shared by them, we use the accuracy criterion to evaluate ProbaMap in this experiment. This enables also to ﬁt the evaluation framework of SBI (described in [35]). Results - The averaged results in Table 2 show that ProbaMap outperforms SBI in average, and that ProbaMap with classiﬁcation outperforms ProbaMap without classiﬁcation. The average is computed on the four directories of the Table 1 and for each direction of alignment.

Discovery of Probabilistic Mappings between Taxonomies Table 2. Averaged accuracy for SBI and ProbaMap Ratio of shared instances provided SBI ProbaMap ProbaMap + classif for training 0.5 0.9

0.23 0.29

0.28 0.33

0.33 0.38

Accuracy for Autos: Yahoo! into Google 1 ProbaMap ProbaMap + classif SBI

Accuracy

0.8

0.6

0.4

0.2

0 0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0.8

0.9

Training set size (ratio w.r.t. common instances)

(a) Autos: Yahoo! into Google Accuracy for Photography: Yahoo! into Google 1

ProbaMap ProbaMap + classif SBI

Accuracy

0.8

0.6

0.4

0.2

0 0.1

0.2

0.3 0.4 0.5 0.6 0.7 Training set size (ratio w.r.t. common instances)

(b) Photography: Google into Yahoo! Fig. 17. Comparative accuracy results (1)

93

94

R. Tournaire et al.

Accuracy for Outdoors: Google into Yahoo! 1

ProbaMap ProbaMap + classif SBI

Accuracy

0.8

0.6

0.4

0.2

0 0.1

0.2

0.3 0.4 0.5 0.6 0.7 Training set size (ratio w.r.t. common instances)

0.8

0.9

0.8

0.9

(a) Outdoors: Google into Yahoo! Accuracy for Software: Yahoo! into Google 1

ProbaMap ProbaMap + classif SBI

Accuracy

0.8

0.6

0.4

0.2

0 0.1

0.2

0.3 0.4 0.5 0.6 0.7 Training set size (ratio w.r.t. common instances)

(b) Software: Google into Yahoo! Fig. 18. Comparative accuracy results (2)

In particular, ProbaMap with classiﬁcation signiﬁcantly outperforms ProbaMap without Classiﬁcation (about 10% better) and SBI (about 20% better) for the datasets Autos and Photography, whatever the size of the training set. For the directories Software and Outdoors, ProbaMap without classiﬁcation and SBI both provide lower results. SBI performs a little better (no more than 5%). In these two cases, there are initially few instances by class, and then the classiﬁcation step allows to improve the results : ProbaMap with classiﬁcation outperforms SBI on the Software directory up to 60% of instances for the training set, whereas ProbaMap without classiﬁcation does not. For the Outdoors results pictured in Figure 18(a), ProbaMap with classiﬁcation is better for a small

Discovery of Probabilistic Mappings between Taxonomies

95

training set (≤ 20% of the shared instances). The reason is that, for this particular dataset, a very small ratio of instances is classiﬁed and then the extensions of classes do not change much after the classiﬁcation step. We have also checked that all these results both for ProbaMap and ProbaMap with classiﬁfcation are rougly better in average than the results of a modiﬁed version of SBI using Naive Bayes classiﬁcation so-called SBI-NB [35]. More precisely, both versions of Probamap perform better for 6 of the 8 tested directories. Note however that we are provided results for SBI-NB with an old dataset collected in 2004, so the comparison is not totally fair. Finally, SBI and ProbaMap (without classiﬁcation) both take a few seconds on each of the directories of Figure 1). The version of ProbaMap with classiﬁcation requires additionnal time (several minutes) for classifying each instance into each class, but this signiﬁcantly improves the quality of the results. These experiments show that the mapping discovery method that we propose gives good results on real-world datasets, and can take advantage of classiﬁcation techniques to compensate small training set sizes. This is an important quality for real world taxonomies built by diﬀerent people, that are unlikely to have many instances in common.

6

Related Work

As outlined in the introduction, semantic mappings are the glue for data integration systems. A wide range of methods of schema/ontology matching have been developed both in the database and the semantic web communities [23]. One of the principles widely exploited is terminological comparison of the labels of classes with string-based similarities or lexicon-based similarities (like WordNet) (e.g., TaxoMap [32], H-MATCH [8]) . Another widely used principle is structure comparison between labeled graphs representing ontologies (e.g., OLA [24]), based on similarity ﬂooding [43]. The underlying idea is that two elements of two distinct ontologies are similar when their adjacent elements are similar. This is applied by spreading similarities in the spirit of how IP packets ﬂood the network in broadcast communication. A category of methods (e.g., FCA-merge [55], [12]) are called “extensional” because they are instance-based: they rely on instances of classes to compute matches between them. It is shown in [36] that the reliability of instance-based methods depends on the applications and on the kind of correspondences that are considered. The instance-based method SBI [34] with which we have compared our method is based on the computation of the Fleiss Kappa coeﬃcients that are symmetric and then they cannot be interpreted as probabilities on inclusion mappings. Most of the instance-based methods make use of classiﬁers using a corpus of labelled instances (e.g., LSD [16], SemInt [38], GLUE [17],[45]). It should be pointed out that GLUE discovers symmetric correspondences that have no formal semantics. The associated conﬁdence value for each discovered correspondence are not real probabilities as they are based on similarities like the Jaccard

96

R. Tournaire et al.

similarity. In contrast, we consider mappings that denote inclusions between classes of diﬀerent taxonomies, and we deﬁne a formal probabilistic semantics for these inclusion mappings. sPLmap [45] is an approach for schema matching, and oPLmap [46] is a variant for ontology matching, that rely on classiﬁers and deal with probabilities. sPLMap ﬁnds the best set of mappings M between two schemas S, T that maximizes the probability that the tuples in the rewritten schema S using M (denoted SM ) are plausible in T and vice versa. At the end, each returned mapping is associated with a probability based on the conditional probability formula. In contrast with our work, sPLmap computes probabilities by combining scores provided by several weighted classiﬁers (working on text or data values of tuples). Other approaches have been investigated with machine learning techniques using a corpus of schema matches (e.g., [40],[60]). The work introduced in [60] uses classiﬁers directly on correspondences, by representing each correspondence in a vector space constructed from instances features. A training set of true correspondences should be provided. Then, for a tested correspondence between two classes A and B, the similarities between (i) the instances of A, (ii) the instances of B, and (iii) the instances of all examples correspondences allow to give to the tested correspondence a position in the correspondence space. Hence the classiﬁer can be used to determine if the tested correspondence is relevant or not, according to its position in the correspondence space and the learned examples. In fact, most of the existing matchers combine these elementary approaches in diﬀerent ways (e.g., COMA++ [3] and COMA [15], Cupid [41], H-MATCH [8], Lily [59], S-Match [30], Clio [9]). In [28], an extension of existing matching methods is proposed by considering the k ranked best alignments. The top-k alignment ranking is combined with the schema to match in order to generate the ﬁnal alignment that globally maximizes its score.

7

Conclusion and Future Work

It is standard practice for ontology and schema matchers to associate numbers with the candidate mappings they propose. The uncertainty is intrinsic to correspondences discovery because two classes or properties (for example) independently created are unlikely to exactly match. As stated in [54], there is still a need to better understand the foundations of modeling uncertainty that is primary important for improving the detection of correspondences causing inconsistencies, e.g., via probabilistic reasoning, or to identify where the user feedback is maximally useful, and for improving the quality of the interpretation of correspondences. However, those uncertainty coeﬃcients do not have a probabilistic meaning and are just used for ranking. In contrast, our approach promotes a probabilistic semantics for mappings and provides a method to compute mapping probabilities based on the descriptions of instances categorized in each ontology. It is

Discovery of Probabilistic Mappings between Taxonomies

97

important to note that even if we use similar classiﬁcation techniques as [17], we use them for computing true probabilities and not similarity coeﬃcients. The most distinguishing feature of our approach is that it bridges the gap between logic and probabilities by providing probabilistic models that are consistent with the logical semantics underlying ontology languages. Therefore, our approach generalizes existing works based on algebraic or logical representation of mappings as a basis for reasoning (e.g., Ctx-Match [52], Clio [9]). The work in [29] introduces a framework for modeling and evaluating automatic semantic reconciliation. It provides a formal model for semantic reconciliation and theoretically analyses the factors that impact the eﬀectiveness of matching algorithms. The proposed formal model borrows from fuzzy set theory for handling uncertainty by combinations of similarity measures and good properties. This work is not connected probability but is complementary to the approach in this thesis. The mappings that are returned by our algorithm can be exploited for mapping validation by probabilistic reasoning in the line of what is proposed in [6]. More generally, our approach is complementary of the recent work that has been ﬂourishing on probabilistic databases [5,11]. In particular, it ﬁts into the general framework set in [18] for handling uncertainty in data integration, for which it provides an eﬀective way for computing mapping probabilities. ProbaMap can be generalized to perform without instances at all, by using techniques that lead to correctly estimate P (A1 ) and P (A1 ∩ B2 ) (for a mapping A1 B2 ). Such a generalization should take care that the distributions of A and of A ∩ B are considered and estimated in a probabilistic framework. For example, a linguistic resource can be used for estimating thoses probabilities by taking the labels of A and B and returning probabilities for A and A ∩ B, not only coeﬃcients. This requires that the techniques to obtain such probabilities from the linguistic resource should respect some properties that ﬁt with the probability theory. The experiments that we have conducted on both real-world and controlled synthetic data have shown the feasibility and the scalability of our approach. In contrast with our approach that prunes the search space by an appropriate ordering for generating and testing the mappings, a lot of existing approaches compute a similarity matrix for all pairs of classes. The ordering used in SBI is not logically consistent. Several complex matching system use the logical consistency as a ﬁlter or a way to check their results, but not for pruning the search space. Up to now, the OAEI context does not focus on scalability, except for isolate cases like the anatomy 2006 6 dataset which contains large ontologies in the anatomy domain. Methods like PRIOR [42] and H-MATCH [7] has shown good performances on this dataset. In particular, the PRIOR system makes use of information retrieval techniques, by indexing a proﬁle for each class (or any ontology entity), and by using a query answering algorithm to rank and ﬁnd the best matches for a particular class by making a query about its own proﬁle. Note that the anatomy dataset does not contain any instance. 6

http://oaei.ontologymatching.org/2006/anatomy/

98

R. Tournaire et al.

Another possible approach investigated in [31] for handling large taxonomies is based on partitioning. In contrast, our approach is scalable while keeping the whole logical structure, which can be potentially lost by partitioning taxonomies. Perspectives We envision two main perspectives. First, we will study a setting for probabilistic query answering, in the spirit of probabilistic databases [18]. In such a setting, probabilities of mappings will be used by the query answering algorithm to give some probability values for each answer. In particular, it should be interesting to focus on reasoning-based query algorithm like in the Somewhere [2] setting. There are existing works on introducing probabilities in logic and inference process. Probability Logic [1] ﬁts our proposed model in the way that the classical implication and the conditional are both present in the language. Probabilistic description logics [37,6] are also based on conditional formulas too. Based on inference rules and their properties about probabilities, such probabilistic and logical framework can be extended to do probabilistic reasoning involving probabilistic mappings. Our second pespective tackles the central issue in ontology matching (see [54]), that is to provide a formal semantics to coeﬃcients returned as output of existing alignment methods. The idea is to design and implement a post-processing step in order to transform the returned coeﬃcients into coeﬃcients that can be interpreted as probabilities, i.e. that respect the property of monotony (Theorem 1). For this purpose, we plan to use the similarity ﬂooding principle [43], in the spirit of N2R [51] and OLA [24]. Coeﬃcients for mappings are initialized by those returned by the existing method to postprocess. The properties of monotony are then translated into strong inﬂuences between coeﬃcients of mappings connected by an entailment relation. These inﬂuences between mappings coeﬃcients are ﬁnally modeled by (non-linear) equations involving the maximum function. Like in N2R, the main issue would be to ﬁnd an iterative algorithm ensured to converge towards a ﬁxpoint that is the solution of the equation system.

References 1. Adams, E.: A Primer of Probability logic, CSLI. Stanford University, Stanford (1998) 2. Adjiman, P., Chatalic, P., Goasdou´e, F., Rousset, M.C., Simon, L.: Distributed reasoning in a peer-to-peer setting: Application to the semantic web. Journal of Artiﬁcial Intelligence Research (JAIR) 25, 269–314 (2006) 3. Aumueller, D., Do, H.H., Massmann, S., Rahm, E.: Schema and ontology matching with COMA++. In: SIGMOD 2005: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 906–908. ACM, New York (2005) 4. Beeri, C., Dowd, M., Fagin, R., Statman, R.: On the structure of Armstrong relations for functional dependencies. Journal of the ACM (JACM) 31(1), 30–46 (1984)

Discovery of Probabilistic Mappings between Taxonomies

99

5. Benjelloun, O., Sarma, A.D., Halevy, A.Y., Widom, J.: ULDBs: Databases with uncertainty and lineage. In: VLDB, pp. 953–964 (2006) 6. Castano, S., Ferrara, A., Lorusso, D., N¨ ath, T.H., M¨ oller, R.: Mapping validation by probabilistic reasoning. In: Bechhofer, S., Hauswirth, M., Hoﬀmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 170–184. Springer, Heidelberg (2008) 7. Castano, S., Ferrara, A., Messa, G.: Results of the H-MATCH ontology matchmaker in OAEI 2006. In: Proceedings of the ISWC 2006 Workshop on Ontology Matching, Athens, GA, USA (2006) 8. Castano, S., Ferrara, A., Montanelli, S.: H-MATCH: an algorithm for dynamically matching ontologies in peer-based systems. In: SWDB, pp. 231–250 (2003) 9. Chiticariu, L., Hern´ andez, M.A., Kolaitis, P.G., Popa, L.: Semi-automatic schema integration in clio. In: VLDB, pp. 1326–1329 (2007) 10. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. The MIT Press, Cambridge (September 2001) 11. Dalvi, N.N., Suciu, D.: Answering queries from statistics and probabilistic views. In: VLDB, pp. 805–816 (2005) 12. David, J., Guillet, F., Gras, R., Briand, H.: An interactive, asymmetric and extensional method for matching conceptual hierarchies. In: EMOI-INTEROP Workshop, Luxembourg (2006) 13. Dean, M., Schreiber, G.: OWL web ontology language reference. W3C recommendation, W3C (February 2004) 14. Degroot, M.H.: Optimal Statistical Decisions (Wiley Classics Library). WileyInterscience, Hoboken (April 2004) 15. Do, H.H., Rahm, E.: COMA - a system for ﬂexible combination of schema matching approaches. In: VLDB (2002) 16. Doan, A., Domingos, P., Levy, A.Y.: Learning mappings between data schemas. In: Proceedings of the AAAI 2000 Workshop on Learning Statistical Models from Relational DatA (2000) 17. Doan, A., Madhavan, J., Domingos, P., Halevy, A.Y.: Learning to map between ontologies on the Semantic Web. In: WWW, pp. 662–673 (2002) 18. Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. In: VLDB, pp. 687–698 (2007) 19. Duchon, P., Flajolet, P., Louchard, G., Schaeﬀer, G.: Boltzmann samplers for the random generation of combinatorial structures. Comb. Probab. Comput. 13(4-5), 577–625 (2004) 20. Euzenat, J., Ferrara, A., Hollink, L., Isaac, A., Joslyn, C., Malais, V., Meilicke, C., Nikolov, A., Pane, J., Sabou, M., et al.: Results of the ontology alignment evaluation initiative 2009. In: Fourth International Workshop on Ontology Matching, Washington, DC (2009) 21. Euzenat, J.: Semantic Precision and Recall for Ontology Alignment Evaluation. In: IJCAI, pp. 348–353 (2007) 22. Euzenat, J.: Ontology alignment evaluation initiative (July 2008), http://www.oaei.ontologymatching.org/ 23. Euzenat, J., Shvaiko, P.: Ontology matching. Springer, Heidelberg (2007) 24. Euzenat, J., Valtchev, P.: Similarity-based ontology alignment in OWL-Lite. In: ECAI, pp. 333–337 (2004) 25. Fagin, R.: Horn clauses and database dependencies. J. ACM 29(4), 952–985 (1982) 26. Fellbaum, C.: WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, Cambridge (May 1998)

100

R. Tournaire et al.

27. Flake, G.W., Lawrence, S.: Eﬃcient SVM regression training with SMO. Mach. Learn. 46(1-3), 271–290 (2002) 28. Gal, A.: Managing uncertainty in schema matching with top-k schema mappings. Journal on Data Semantics 6 (2006) 29. Gal, A., Anaby-Tavor, A., Trombetta, A., Montesi, D.: A framework for modeling and evaluating automatic semantic reconciliation. The VLDB Journal 14(1), 50–67 (2005), http://www.portal.acm.org.gate6.inist.fr/ citation.cfm?id=1053477 30. Giunchiglia, F., Shvaiko, P., Yatskevich, M.: S-match: an algorithm and an implementation of semantic matching. In: Bussler, C.J., Davies, J., Fensel, D., Studer, R. (eds.) ESWS 2004. LNCS, vol. 3053, pp. 61–75. Springer, Heidelberg (2004) 31. Hamdi, F., Safar, B., Reynaud, C., Zargayouna, H.: Alignment-based Partitioning of Large-scale Ontologies. In: Guillet, F., Ritschard, G., Zighed, D.A., Briand, H. (eds.) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol. 292, pp. 251–269. Springer, Heidelberg (2010), http://www.hal.inria.fr/inria-00432606/en/ 32. Hamdi, F., Zargayouna, H., Safar, B., Reynaud, C.: TaxoMap in the OAEI 2008 alignment contest. In: Ontology Alignment Evaluation Initiative (OAEI) 2008, Campaign - Int. Workshop on Ontology Matching (2008) 33. Hayes, P. (ed.) RDF Semantics. W3C Recommendation, World Wide Web Consortium (February 2004), http://www.w3.org/TR/rdf-mt/ 34. Ichise, R., Takeda, H., Honiden, S.: Integrating multiple internet directories by instance-based learning. In: International Joint Conference on Artiﬁcial Intelligence (IJCAI), vol. 18, pp. 22–30 (2003) 35. Ichise, R., Hamasaki, M., Takeda, H.: Discovering relationships among catalogs. In: Suzuki, E., Arikawa, S. (eds.) DS 2004. LNCS (LNAI), vol. 3245, pp. 371– 379. Springer, Heidelberg (2004), http://www.citeseerx.ist.psu.edu/viewdoc/ summary?doi=10.1.1.121.5336 36. Isaac, A., van der Meij, L., Schlobach, S., Wang, S.: An empirical study of instancebased ontology matching. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudr´e-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 253–266. Springer, Heidelberg (2007) 37. Koller, D., Levy, A., Pfeﬀer, A.: P-CLASSIC: a tractable probablistic description logic. In: Proceedings of the National Conference on Artiﬁcial Intelligence, pp. 390–397 (1997) 38. Li, W.S., Clifton, C.: SEMINT: a tool for identifying attribute correspondences in heterogeneous databases using neural networks. Data Knowl. Eng. 33(1), 49–84 (2000) 39. Lin, F., Sandkuhl, K.: A survey of exploiting wordnet in ontology matching. In: Artiﬁcial Intelligence in Theory and Practice II, pp. 341–350 (2008) 40. Madhavan, J., Bernstein, P.A., Doan, A., Halevy, A.: Corpus-based schema matching. In: International Conference on Data Engineering, pp. 57–68 (2005) 41. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with cupid. The VLDB Journal, 49–58 (2001), http://www.citeseer.ist.psu.edu/ madhavan01generic.html 42. Mao, M., Peng, Y.: PRIOR system: Results for OAEI 2006. In: Proceedings of the Ontology Alignment Evaluation Initiative, pp. 165–172 (2006) 43. Melnik, S., Garcia-Molina, H., Rahm, E., et al.: Similarity ﬂooding: A versatile graph matching algorithm and its application to schema matching. In: Proceedings of the International Conference on Data Engineering, pp. 117–128 (2002)

Discovery of Probabilistic Mappings between Taxonomies

101

44. Mitchell, T.: Machine Learning. McGraw-Hill Education (ISE Editions) (1997), http://www.amazon.ca/exec/obidos/ redirect?tag=citeulike09-20&path=ASIN/0071154671 45. Nottelmann, H., Straccia, U.: Information retrieval and machine learning for probabilistic schema matching. Information Processing and Management 43(3), 552–576 (2007) 46. Nottelmann, H., Straccia, U.: A probabilistic, logic-based framework for automated web director alignment. In: Ma, Z. (ed.) Soft Computing in Ontologies and the Semantic Web. Studies in Fuzziness and Soft Computing, vol. 204, pp. 47–77. Springer, Heidelberg (2006) 47. Quinlan, R.J.: C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning. Morgan Kaufmann, San Francisco (January 1993), http:// www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20\&path=ASIN/ 1558602380 48. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB Journal 10(4), 334–350 (2001) 49. Ramesh, G., Maniatty, W., Zaki, M.J.: Feasible itemset distributions in data mining: theory and application. In: PODS, pp. 284–295 (2003) 50. Resnik, P.: Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artiﬁcial Intelligence Research 11(95), 130 (1999) 51. Sa¨ıs, F., Pernelle, N., Rousset, M.C.: Combining a logical and a numerical method for data reconciliation. In: Spaccapietra, S. (ed.) Journal on Data Semantics XII. LNCS, vol. 5480, pp. 66–94. Springer, Heidelberg (2009) 52. Seraﬁni, L., Bouquet, P., Magnini, B., Zanobini, S.: An algorithm for matching contextualized schemas via SAT. In: Proceedings of CONTEXT 2003 (2003) 53. Shvaiko, P., Euzenat, J.: A survey of schema-based matching approaches. In: Spaccapietra, S. (ed.) Journal on Data Semantics IV. LNCS, vol. 3730, pp. 146–171. Springer, Heidelberg (2005) 54. Shvaiko, P., Euzenat, J.: Ten challenges for ontology matching. In: Meersman, R., Tari, Z. (eds.) OTM 2008, Part II, pp. 1164–1182. Springer, Heidelberg (2008) 55. Stumme, G., Maedche, A.: FCA-MERGE: Bottom-Up Merging of Ontologies. In: Proc. of the 17th International Joint Conference on Artiﬁcial Intelligence, pp. 225–234 (2001) 56. Tournaire, R., Petit, J.M., Rousset, M.C., Termier, A.: Discovery of Probabilistic Mappings between Taxonomies: Principles and Experiments (technical report) (2009), http://www.membres-liglab.imag.fr/tournaire/longpaper.pdf 57. Tournaire, R., Rousset, M.C.: D´ecouverte automatique de correspondances entre taxonomies - internal report (in french) (2008), http://www.membres-liglab.imag.fr/tournaire/irap08.pdf 58. Van Rijsbergen, C.J.: Information retrieval. Butterworths, London (1975) 59. Wang, P., Xu, B.: Lily: Ontology alignment results for OAEI 2009. Shvaiko, et al [SEG+ 09] (2009) 60. Wang, S., Englebienne, G., Schlobach, S.: Learning concept mappings from instance similarity. In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 339–355. Springer, Heidelberg (2008), http://www.portal.acm.org.gate6.inist.fr/citation.cfm?id=1483184 61. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)

TwigTable: Using Semantics in XML Twig Pattern Query Processing Huayu Wu, Tok Wang Ling, Bo Chen, and Liang Xu School of Computing, National University of Singapore {wuhuayu,lingtw,chenbo,xuliang}@comp.nus.edu.sg

Abstract. In this paper, we demonstrate how the semantic information, such as value, property, object class and relationship between object classes in XML data impacts XML query processing. We show that the lack of using semantics causes diﬀerent problems in value management and content search in existing approaches. Motivated on solving these problems, we propose a semantic approach for XML twig pattern query processing. In particular, we design TwigTable algorithm to incorporate property and value information into query processing. This information can be correctly discovered in any XML data. In addition, we propose three object-based optimization techniques to TwigTable. If more semantics of object classes are known in an XML document, we can process queries more eﬃciently with these semantic optimizations. Last, we show the beneﬁts of our approach by a comprehensive experimental study1.

1

Introduction

XML query processing has been studied for over a decade. In most XML query languages, e.g., XPath [4] and XQuery [5], queries are expressed as twig patterns. Finding all occurrences of a twig pattern query in an XML document is considered the core operation for XML query processing. This process is also referred as twig pattern query processing or twig pattern matching. More background on XML queries and twig pattern matching is discussed in Section 2.1. Normally an XML query is composed of structural search and content search. Consider an XPath query Q1 that ﬁnds the subject name of the book with the title of “Network”, issued to the XML data in Fig. 1: Q1: //subject[//book/title=“Network”]/name In this query, //subject[//book/title]/name is a structural search, aiming to ﬁnd all matches in the document that satisfy this structural constraint; while the predicate title=“Network” is a content search, which ﬁlters the structural search 1

This is an extended and updated version of the previously published paper [30]. The major extension includes a discussion on semantics in XML, a technique to process queries across multiple twig patterns, discussions on potential problems of optimized tables, a new optimization scheme with semantics of relationship, and a comparison with the schema-aware relational approach in experiments.

S. Spaccapietra (Ed.): Journal on Data Semantics XV, LNCS 6720, pp. 102–129, 2011. c Springer-Verlag Berlin Heidelberg 2011

Using Semantics in XML Twig Pattern Query Processing

103

result based on the speciﬁed value comparison. Most state-of-the-art XML query processing approaches only focus on employing eﬀective index, e.g., inverted lists (details shown in Section 2.2), to improve the eﬃciency of performing joins between nodes in more complex twig pattern queries, without distinguishing between structural search and content search. This attempt is proven eﬃcient for structural search without considering values. However, due to the diﬀerent characteristics between leaf value nodes and internal non-value nodes in XML data, using inverted lists to manage values and to process content search in the same way as structural search will cause problems in managing tremendous number of inverted lists and performing costly structural join for content search. Besides the ineﬃciency in content search, which is caused by ignoring the semantic information of value and non-value nodes, existing approaches may also suﬀer from eﬃciency problems in structural search. To look deeper into the semantics of non-value document nodes, we can ﬁnd that a non-value document node may further correspond to an object or a property. Most real life queries aim to ﬁnd desired objects based on the value predicates on their properties. However, none of existing approaches takes semantics of object and property into account when they manage inverted list index and process queries. The ignorance of such semantics would result in scanning many useless labels in inverted lists during structural search. The details are discussed in Section 3.2 and Section 5.1. In this paper, we propose a semantic approach for twig pattern query processing. Motivated by solving the problems caused by a lack of using semantics on object, property and value in existing approaches, we propose semantics-based relational tables incorporated with inverted lists of tags to aid twig pattern query processing. In particular, relational tables are used to store values, while inverted lists are used to index internal document nodes, including property nodes and object nodes, but not values nodes. We design TwigTable algorithm to perform content search and structural search separately with the two kinds of indexes in twig pattern matching. Content search is performed by table selection before structural search. Because content search is always a predicate between a property and a value, after performing content search the size of the inverted list of the relevant property node is reduced due to the selectivity of the predicate, and the twig pattern query can be simpliﬁed by removing value comparisons. Matching a simpliﬁed twig pattern with reduced inverted lists for several query nodes will reduce the complexity of structural search, and thus improve the twig pattern matching performance. Finally, the semantic tables can help to extract actual values to answer the queries that ask for property values or object details, which is not eﬃcient to achieve in other structural join based algorithms. We also need to highlight that the relational tables are constructed based on semantic information such as the relationship among object, property and value. The semantics of property is common for any XML document, i.e., the parent node of each value must be the property of that value. Based on this default semantic information, we initially store each value with its property in the corresponding property table. When more of an object’s semantics is known,

104

H. Wu et al. bookstore (1:5000,1) subject (2:307,2) name (3:6,3)

“computer” (4:5,4)

……

books (7:306,3) book (8:29,4)

book (30:55,4)

……

publisher title author price quantity publisher title author author price quantity (9:12,5) (13:16,5) (17:20,5)(21:24,5)(25:28,5) (31:34,5) (35:38,5) (39:42,5) (43:46,5) (47:50,5) (51:54,5) “Hillman” “Network” “Green” 45 30 “Elco” “Database” “White” “Brown” 35 15 (10:11,6) (14:15,6) (18:19,6)(22:23,6)(26:27,6) (32:33,6) (36:37,6) (40:41,6) (44:45,6) (48:49,6) (52:53,6)

Fig. 1. The bookstore document with all nodes labeled by containment scheme

we propose three optimization techniques to change the tables to be object based. We will show that using object-based tables, a query can be processed even more eﬃciently. In a word, with more semantic information known, our approach is more eﬃcient to process twig pattern queries. The rest of the paper is organized as follows. We ﬁrst describe some background information in Section 2. After that we revisit related work and discuss the motivation of our research in Section 3. The TwigTable algorithm with three semantic optimizations is presented in Section 4 and Section 5. We present the experimental results in Section 6 and conclude our work in Section 7.

2 2.1

Background Data Model and Twig Pattern Query

Normally an XML document is modeled as an ordered tree, without considering ID references. Fig. 1 (ignoring node labels) shows the tree structure of a bookstore document. In an XML tree, the internal nodes represent the elements and attributes in the document, and the leaf nodes represent the data values which are either a text in an element or an attribute value. Thus a node name is a tag label, an attribute name or a value. Edges in an XML tree reﬂect element-subelement, element-attribute, element-value, and attribute-value pairs. Two nodes connected by a tree edge are in parent-child (PC) relationship, and the two nodes on the same path are in ancestor-descendant (AD) relationship. The core query pattern in most standard XML query languages (e.g., XPath and XQuery) is also in a tree-like structure, which is often referred as a twig pattern. In particular, an XPath query is normally modeled as a twig pattern query, and an XQuery query is normally modeled as several twig patterns

Using Semantics in XML Twig Pattern Query Processing

105

subject book

name

book

book author

title

author

title

title “Network”

“Network”

(a) Twig pattern for XPath query Q1

t1

t2

(b) XQuery query Q2 can be expressed by joining the two twig patterns t1 and t2 based on author value

Fig. 2. Example twig pattern expressions

linked by joins. For example, the XPath query Q1 in Section 1, i.e., //subject[//book/title=“Network”]/name, can be represented as a twig pattern query in Fig. 2(a). Example 1 shows the twig patterns for an XQuery expression. Example 1. A query to ﬁnd the titles of all books written by the author of the book “Network” can be expressed by XQuery: Q2: For $a in distinct-values(doc(“bookstore.xml”) //book[title=“Network”]/author) For $b in doc(“bookstore.xml”)//book Where $b/author=$a Return $b/title A typical XQuery processor normally models a query by twig patterns to further process it. The XQuery expression Q2 can be transformed into two twig patterns t1 and t2, which are linked by a join on the author nodes, as shown in Fig. 2(b). In a twig pattern query, an edge can be either single-lined or double-lined, which constraints the two matched nodes in either a PC relationship or an AD relationship. Since a twig pattern normally models an XPath expression, we allow the leaf nodes of a twig pattern query to also be a range value comparison or even a conjunction/disjunction of several value comparisons, if the corresponding XPath expression contains such predicates. For example, the twig pattern representation of the XPath query //book[price>20 and price<30]/title contains a conjunction of value comparison “>20 and <30” under the query node price. The process to ﬁnd all the occurrences of a twig pattern in an XML document is called twig pattern matching. A match of a twig pattern Q in a document tree T is identiﬁed by a mapping from the query nodes in Q to the document nodes in T, such that: (i) each query node either has the same string name as or is evaluated true based on the corresponding document node, depending on whether the query node is an element/attribute node or a value comparison; (ii) the relationship between the query nodes at the ends of each “/” or “//” edge in Q is satisﬁed

106

H. Wu et al.

by the relationship between the corresponding document nodes. Matching Q to T returns a list of n-ary tuples to answer Q, where n is the number of nodes in Q and each tuple (a1 , a2 ,..., an ,) consists of the document nodes a1 , a2 ,..., an , which identify a distinct match of Q in T. 2.2

Document Labeling and Inverted List

Checking whether two document nodes satisfy the PC or AD relationship speciﬁed in a twig pattern pattern query is essential to twig pattern query processing. To facilitate the PC and AD relationship checking, we normally assign a positional label (label for short, if no confusion arises) to each document node. There are multiple labeling schemes proposed for XML data, among which the containment scheme [37] and the preﬁx scheme [26] are most popular. There are also several dynamic encoding schemes [16] [33] to avoid re-labeling for dynamic documents. In this paper, we use the containment labeling scheme for illustration, as shown in Fig. 1. It can be replaced by other schemes as long as they are suﬃcient for PC and AD relationship determination. Labels are usually organized by inverted lists. Normally, for each type of document node, there is a corresponding inverted list to store the labels of all nodes of this type in document order. To process a query, only relevant inverted lists are scanned to perform structural join based on the query constraints. Because in most algorithms, an inverted list is scanned in a streaming fashion, it is also referred as a label stream, or simply a stream in some work. 2.3

Semantics of Object, Property and Value in XML Data

Object (or entity) is an important information unit in data management, e.g., the ER approach to design a relational database. In data-centric XML databases, object also plays an important role, as most XML queries ask about information of objects. An object normally has several properties to describe it from diﬀerent aspects. Property is also referred as attribute. To diﬀerentiate it from attribute type in document schema (e.g., DTD), we use the term property in this paper. In our work, we use semantics of value, property, object and relationship among objects to improve the performance of XML twig pattern query processing. Usually, such semantic information can be provided by the XML database designer. Next we brieﬂy discuss how to discover the semantic information, in case it is not available from database designers. Identifying values is trivial. Each non-tag text in an XML document is a value, and it must appear as a leaf node in the corresponding document tree. Furthermore, we can also infer that the parent node of each value node in an XML tree corresponds to the property of that value. For example, in Fig. 1 the “Network” is a value and its parent node title is the property of this value. This inference of value and property always holds for any document, regardless of whether more semantics is provided or not. Thus our basic algorithm TwigTable constructs relational tables based on property and value. Semantics on object contributes to our optimizations to further improve query processing performance, as presented in Section 5. However, to discover semantics

Using Semantics in XML Twig Pattern Query Processing

107

of object and relationship between objects is not so trivial. Intuitively, we may consider the parent node of each property as the corresponding object. For example, in the document in Fig. 1, the subject nodes and the book nodes are all objects, because they are parent nodes of properties such as name, publisher, title, etc. However, in some documents, properties do not directly connect to their objects. For example, if the document further groups the properties of a book as shown in the two examples in Fig. 3, the parent node of a property may not be the associated object. book

publisher

title

authors

“Elco” “Database” author

book

basicInfo

price quantity

author 35

“White” “Brown”

15

publisher

title

author

saleInfo

author price quantity

“Elco” “Database” “White” “Brown” 35

15

Fig. 3. Two alternative design of book in the bookstore document

Two or more objects that are related to each other are normally reside along the same path in an XML tree. However, the properties of a relationship are not easy to identify. Because of the hierarchical structure of XML data, a property of the relationship between two objects is normally stored under the deeper object of the relationship. This brings diﬃculty to distinguish it from the properties of that deeper object. Later in Fig. 11 we show an example document in which quantity is a property of the relationship between branch and book, but this property appears as a child node of the deeper object book, together with other book properties. The situation may be even more complex when dealing with ternary or n-ary relationships. There are many attempts to discover the semantics such as object and relationship between objects in an XML document. Generally, there are categorized into three classes. 1. Using available tools and information. There are semantic rich models that work as a schema for XML documents. For example, the ORA-SS [17] model can distinguish between properties, objects and relationships, as well as specify the degree of n-ary relationships and indicate if a property belongs to an object or a relationship. If such model is available alongside an XML document, we can easily discover useful semantic information. Also as semantic web is rapidly developed, there are many ontologies available for diﬀerent domains. By exploring the ontology of relevant domains, we can have desired semantic information of an XML document. 2. Mining schema or document. Liu et al. [18] infer objects by analyzing DTD. [9] and [35] propose algorithms to discover the semantics such as keys and

108

H. Wu et al.

functional dependencies in XML documents, which can help to identify objects and relationships. There are also many data mining techniques, such as the decision tree, that can be adopted to infer semantic information. 3. User interaction. In many web-based information management systems [10] [25], they use mass collaboration to seek feedback from users to improve semantics identiﬁcation. In this paper, we do not oﬀer any new technique to discover objects and relationships in an XML document. All the existing semantics discovery techniques can be used in our work for semantic optimizations. The guideline is that our algorithm is initially built on the semantics of property and value. This information can be correctly inferred from the document structure. When more semantics on object is known, we can incorporate such information into relational table construction, to further improve the performance of query processing.

3 3.1

Related Work and Motivation Related Work

In the early stage, there are many research eﬀorts on storing and querying XML data using RDBMS. In those relational approaches, they shred XML data into relational tables, and convert XML queries into SQL statements to query the database. The node-based approach [37] [13], the edge-based approach [11] and the path-based approach [34] [22] shred XML documents based on diﬀerent kinds of tree components. However, they all suﬀer from eﬃciency problems when dealing with structural search. The node-based approach and the edge-based approach need too many costly table joins to process a twig pattern query. The path-based approach is not eﬃcient to handle “//”-axis. The schema-aware decomposition methods [24] [6] are proven more eﬃcient than schemaless methods [27], but they are still not eﬃcient for structural search when the document structure is complex. For example, consider a fragment ... in the TreeBank data [28]. The schema-aware approach can hardly decide what tables between VP and NP should be joined and how many times to join them for the query VP//NP. In short, the relational approach is only suitable for the XML data with regular structure, because of its weakness in structural search. Later, many native approaches are proposed to process twig pattern queries. One direct native approach is the navigational approach, which traverses treestructured XML documents to ﬁnd the occurrences of query patterns. Similar to the navigational approach, some works ( [29] [23]) transform XML documents and twig pattern queries into sequences, and perform subsequence matching. Both the navigational approach and the subsequence matching approach require a high I/O cost, as the whole document will be considered during query processing. The structural join based approach is an important class of native approaches to process XML twig pattern queries, and it has attracted most research interest. In early work, Zhang et al. [37] proposed a multi-predicate merge join algorithm

Using Semantics in XML Twig Pattern Query Processing

109

based on the containment labeling of an XML document, and showed the superiority over relational approaches. Later an improved stack-based structural join algorithm is proposed by Al-Khalifa et al. [3]. These two algorithms, as well as most of prior works decomposed a twig pattern into a set of binary relationships, i.e., parent-child and ancestor-descendant relationships. Twig pattern matching are done by matching every binary relationship and combining these basic binary matches. The main problem of such approaches is that the intermediate result size may be very large, even when the input and ﬁnal result sizes are more manageable. To overcome this limitation, Bruno et al. [7] proposed a holistic twig join algorithm, TwigStack, which could avoid producing an unnecessarily large intermediate result. However, this algorithm is only optimal for twig pattern queries with only ancestor-descendent relationships. There are many subsequent works [19] [15] [8] [20] [14] [36] to optimize TwigStack, or extend TwigStack to solve diﬀerent kinds of problems. In particular, Lu et al. [19] introduced a list structure to make it optimal for queries containing parent-child relationships under non-branching nodes. TSGeneric [15] improved the query performance by indexing each inverted list and skipping labels within one list. Chen et al. [8] divided an inverted list into several sub-lists associated to each preﬁx path or each (tag, level) pair and pruned some sub-lists before evaluating the twig pattern. Lu et al. [20] used Extended Dewey labeling scheme and scanned only the labels of leaf nodes in a twig query. [14] and [36] extended twig pattern query to support OR-predicate and NOT-predicate separately. However, all these structural join based work only focus on structural search. For the value node in each query predicate, they normally treat it the same as element node and perform structural joins for the whole query structure. As a result, they suﬀer from several problems as mentioned in the next section. 3.2

Motivation

The structural join based approaches are proven eﬃcient in structural search. However, because they do not consider the semantics of value and other types of document nodes, they suﬀer from several problems during query processing. 1. Inverted list management. In most structural join based approaches, all nodes including elements, attributes and values in an XML tree are labeled and the labels of each type of nodes are organized in an inverted lists. When we build inverted lists for values, the number of diﬀerent values causes a problem of managing a tremendous number of inverted lists. Based on our investigation, a 100MB XML document contains over 300 thousand diﬀerent values, which correspond to 300 thousand inverted lists. This number will linearly increase according to the document size increase. 2. Advanced content search.Since twig pattern query normally models XPath expression, the advanced content search, such as numeric range search, containment search or even conjunction/disjunction of several value comparisons, which often appear in XPath query predicates may also appear as a leaf node in a twig pattern query. Without handling values specially, existing approaches have diﬃculty in supporting these advanced content search.

110

H. Wu et al.

For example, to process a query to ﬁnd the books with the price greater than 15, it is time consuming to get all the inverted lists with the numeric names greater than 15, and combine labels in them by document order, to perform this range search. Also structural join with inverted lists can hardly support containment search, such as //book[contains(@title, ‘XML query’)]/price. 3. Redundant search in inverted lists. Inverted lists for values do not have semantic meanings. This may cause redundant search during inverted list scanning. For example, when a query is interested in books with the price of 35 in the bookstore document, structural search scans the inverted list for the value node ‘35’ (denoted by T35 ). Since in T35 we do not diﬀerentiate whether a label corresponds to price or quantity, we need to check all labels in this inverted list though many of them stand for quantity of 35, and deﬁnitely do not contribute to the query result. 4. Actual value extraction. To answer a query, what we need is not twig pattern occurrences represented as tuples of labels, but value results. For example, after ﬁnding a number of occurrences of the twig pattern query in Fig. 2(a), we need to know the value under each name node. One major advantage of the structural join based approaches is that they only need to load relevant inverted lists to process a query, instead of scanning the whole document with high I/O cost. However, when a query asks for values of a certain property, after getting a set of resulting labels of that property from pattern matching, they cannot ﬁnd the child value under each label using inverted lists. To extract actual values, they have to read the document again, which violate the initial objective in I/O saving. Motivated on solving all these problems, we propose a semantic approach that uses both inverted lists and relational tables to perform twig pattern matching.

4

TwigTable Algorithm

4.1

An Overview of TwigTable

In TwigTable, we pay attention to the semantics of value during index construction. We maintain inverted list index only for structural nodes (internal nodes). For each value node, we store it into a relational table index2 with the label of its property node, instead of labeling it and putting its label into an inverted list as other approaches do. Then the number of inverted lists is limited to the number of diﬀerent element/attribute types in the document. Query processing in TwigTable includes three steps. In the ﬁrst step, we perform content search for the value comparisons in query predicates, using relational tables. After that, the query is rewritten by removing all value comparisons. In the second step, we perform structural search for the simpliﬁed query pattern, using 2

To avoid the overhead on maintaining relational tables, the relational table index can also be replaced by other types of index to bidirectional map values and their properties. However, the trade-oﬀ is the inconvenience to support advanced content search and to meet the requirements in our object-related optimizations.

Using Semantics in XML Twig Pattern Query Processing

111

any structural join algorithms, e.g., TwigStack. The last step is to return result to users. If the output node is a property type, the value result can be extracted from tables, instead of accessing the original document. 4.2

Document Parsing in TwigTable

When we parse an XML document, we only label elements and attributes, and put the labels into corresponding inverted lists. Values in the document are not labeled, instead we put them into relational tables together with the labels of their parent property nodes. Normally this parsing step is only executed once for an XML document, and after all relevant indexes are properly built during the parsing step, the system is ready to process twig pattern queries over the given document. The detailed algorithm Parser is presented in Algorithm 1. Algorithm 1. Parser Input: A SAX stream of the given XML document Output: A set of inverted lists and a set of relational tables 1: initialize Stack S 2: while there are more events in SAX stream do 3: let e = next event 4: if e is a start tag then 5: //step 1: label elements 6: create object o for e 7: assign label to o 8: push o onto S 9: for all attributes attr of e do 10: //attributes are parsed in the same way as elements. 11: assign label to attr 12: put label of attr into the inverted list Tattr 13: insert the label of attr and the value of attr into the table Rattr 14: end for 15: //step 2: put labels of elements into inverted lists 16: put label of o into the inverted list Te 17: else if e is a value then 18: set e to be the child value of the top object in S 19: else if e is an end tag then 20: // step 3: Insert values with their parent element into tables 21: pop o from S 22: if o contains a child value then 23: insert label of o together with its child value into table Re 24: end if 25: end if 26: end while

We use the SAX to read the input document and transform each tag and value into events. Line 3 captures the next event if there are more events in the SAX stream. Based on diﬀerent types of events, diﬀerent operations are performed accordingly. Line 4-16 are executed if the event e is a start tag. In this case, the ﬁrst two steps are triggered. The system ﬁrst constructs an object for this element and assigns a label to it. It then puts the label into the inverted list for that tag. A stack S is used to temporarily store the object so that when an end tag is reached, the system can easily tell on which object the operation will be executed. At line 9-14, the system analyzes the attributes for an element if any. Based on the same operating steps, it labels the attributes and puts labels into inverted lists. The attribute values are treated in the same way as element

112

H. Wu et al.

values. Line 17-18 is the case that the event is a value type. Then the value is simply bound to the top object in S for further table insertion. When the event is an end tag in line 19-25, the last step is performed, which is popping the top object o from S and inserting the label of o together with its value into the relational table for o, if it has a value. Example 2. After parsing the bookstore document, the new labeled document tree is shown in Fig. 4. Comparing to the document tree in Fig. 1, we can see that TwigTable does not label value nodes. The advantages of this attempt are reported in Section 6. bookstore (1:2600,1) subject (2:167,2)

……

books (5:166,3)

name (3:4,3) “computer”

publisher (7:8,5)

book (6:17,4)

book (18:31,4)

……

title author price quantity publisher title author author price quantity (9:10,5) (11:12,5)(13:14,5)(15:16,5) (19:20,5) (21:22,5) (23:24,5) (25:26,5) (27:28,5) (29:30,5)

“Hillman” “Network” “Green”

45

30

“Elco” “Database” “White” “Brown”

35

15

Fig. 4. The bookstore document with only internal nodes labeled, during TwigTable parsing

Some example relational tables which store data values are shown in Fig. 5. The name of each table is a property name, and each table contains two ﬁelds, the label of the property node and the corresponding child value. We call these tables property tables. Rpublisher

Rtitle

Rauthor

label

value

label

value

label

value

(7:8,5)

Hillman

(9:10,5)

Network

(11:12,5)

Green

(19:20,5)

Elco

(21:22,5)

Database

(23:24,5)

White

…

…

…

…

(25:26,5)

Brown

...

...

Fig. 5. Example property tables

Similar to inverted list, relational table also plays an index role to aid twig pattern query processing. As a result, how to cope with document updates is an important issue for both inverted lists and relational tables. Inverted lists

Using Semantics in XML Twig Pattern Query Processing

113

have been widely used and well studied for years. There are diﬀerent ways to maintain an inverted list for updates, e.g., employing a B+ tree on each list. Here we discuss the maintenance of relational tables when the XML document is updated. Actually the relational table is easy to maintain. If the document is dynamic with frequent updates, we can adopt a dynamic labeling scheme to label the document so that the update will not cause re-labeling of remaining document nodes. Thus, any deletion or insertion only brings the update of the relevant tuple, without aﬀecting other tuples. 4.3

Query Processing in TwigTable

As mentioned in Section 4.1, query processing in TwigTable contains three steps: content search and query rewriting, structural search, and value extraction. Theoretically, the ﬁrst two steps, i.e., content search and structural search, can be reordered. The reason why we perform content search before structural search is that content search normally results in high selectivity. By performing content search ﬁrst, we can reduce the complexity of structural joins. This is similar to selection push-ahead in relational query optimizers. The pseudo-code of TwigTable query processing is presented in Algorithm 2. Algorithm 2. TwigTable query processing Input: A query Q and necessary inverted lists and relational tables Output: A set of value results answering Q 1: //step 1: perform content search, construct new inverted lists and rewrite the query 2: while there are more value comparisons in predicates of Q do 3: let c be the next value comparison, and p be its property (parent element or attribute) 4: create a new inverted list Tp for p 5: select the labels based on c from the table Rp , and sort the resulting labels by document order 6: put the selected labels into Tp 7: rewrite the query to replace the sub-structure p/c by p 8: end while 9: //step 2: perform structural search on the rewritten query with new inverted lists 10: process the rewritten pattern of Q using any existing eﬃcient structural join algorithm like TwigStack, to get labels for output nodes 11: remove newly created inverted lists 12: //step 3: return query answer 13: extract actual values with labels from corresponding tables, if the output node is a property node; otherwise access the document to return subtrees.

We ﬁrst perform content search in Line 2-8. The algorithm recursively handles all value comparisons in two phases: creating new inverted lists based on the predicates and rewriting the query to remove the processed value comparisons. In more details, Line 3-6 execute SQL selection in the corresponding property tables based on each value comparison, and then put all the selected labels, which satisfy the value comparison, into the new inverted list for the corresponding property node. Line 7 rewrites the query in such a way that every value comparison and its parent property are replaced by a new query node which has an identical name as the corresponding new inverted list. The second step is using TwigStack or other eﬃcient structural join algorithms to process the simpliﬁed query with new inverted lists in Line 10-11. Last in line 13, we return result

114

H. Wu et al.

based on labels of output nodes. In particular, we can extract actual values from the corresponding table, if the output node is a property node. Example 3. We use the twig pattern query in Fig. 2(a) to illustrate how TwigTable works. In the ﬁrst step, TwigTable identiﬁes the only predicate with value comparison is title=“Network”. During content search, TwigTable executes an SQL selection in the table Rtitle to get all the labels of the element title which have a value of “Network”. Then we put the selected labels into a new inverted list for title, Ttitle , and rewrite the twig pattern query to replace the sub-structure of the node title and its child node “Network” by title’, as shown in Fig. 6(a). The new query node title’ corresponds to the newly created inverted list Ttitle , in which all labels satisfy the constraint title=“Network”. To clearly explain title’ in the rewritten query, we use titleN etwork in Fig. 6(a). Finally we use a twig pattern matching algorithm, e.g., TwigStack to process the rewritten query in Fig. 6(a), with the inverted list Ttitle for the node titleN etwork . subject

name

subject

book

titleNetwork

name

book

>20

(a) Rewritten query example (b) Invalid query example Fig. 6. A rewritten query example and an invalid twig pattern query example

As described in Section 2.1, twig pattern query is an intermediate query representation of formal XML query languages, e.g., XPath and XQuery. Since in the predicate of an XPath or XQuery query, a value must link to a property through an operator, e.g., price>20, the value comparison in a twig pattern query must be a child (‘/’), instead of a descendant (“//”) of an internal query node. A twig pattern expression shown in Fig. 6(b) is invalid, as the value comparison follows a “//” edge. Semantically, this query cannot be well interpreted; and practically, this query will never appear in XPath or XQuery expressions. We do not consider such invalid twig patterns, thus our algorithm can perform any content search using property tables. Note that TwigTable saves I/O cost in value extraction when the output node is a property node, because we do not need to visit the original document, but only access relevant relational tables to ﬁnd values. However, if the output node matches some internal nodes with subelements in the document, the result should be the whole subtree rooted at each matched node, instead of a single value. In this case, TwigTable has no advantage in result return over other approaches. 4.4

Analysis of TwigTable

Label and inverted list management. TwigTable combines values to their parent elements, and avoids labeling value nodes separately. Then the

Using Semantics in XML Twig Pattern Query Processing

115

number of labeled nodes in memory will be greatly reduced. Moreover, TwigTable puts values into relational tables, instead of maintaining separate inverted lists for them. Thus the problem of managing a tremendous number of inverted lists in previous work can be solved. Content search. TwigTable organizes values based on their property semantics in tables. When the value in a query predicate has diﬀerent semantic meaning, i.e., corresponds to diﬀerent properties, TwigTable only accesses the correct property table to search the value. In contrast, other approaches have to scan all such values to perform structural join, though many of them correspond to other properties and deﬁnitely do not contribute to the result. Inverted list searching reduction. Performing content search before structural search in TwigTable can signiﬁcantly reduce the size of relevant inverted lists. Consider the query in Fig. 2(a). Assume there is only one book called “Network”. If the number of diﬀerent books is b, the size of the inverted list for the property title is also b in previous approaches. We need O(b) to scan all the labels in the inverted list for title. TwigTable processes selection in advance, so that the new inverted list for title is created based on the value “Network”. In this case the new inverted list has only one label inside based on our assumption. Normally, when the selectivity of a property is high, like in this example, TwigTable can signiﬁcantly improve the eﬃciency of structural search by greatly reducing the inverted list size for this property. Advanced search support. Since TwigTable can use any existing RDBMS to manage property tables, all the advanced searches which are supported by the relational system are also supported in TwigTable. We can observe that sequential scans and structural joins for labels of both property node and value node in previous work are replaced by selections in semantic tables in TwigTable. Actually in any relational database system, such table selection can be done very eﬃciently. It is not surprising that replacing structural join by selection for content search will improve the overall performance. Generally, TwigTable gains beneﬁt from performing content search ahead of structural search, and then reduce the complexity of structural search. Thus most advantages discussed in this section hold only for queries with value predicates, which are commonly seen in real life. When a query does not have value comparison as predicate, we just follow any existing structural join algorithm to perform structural search directly. 4.5

Queries across Multiple Twig Patterns

A twig pattern can be used to model a simple query, e.g., a query that can be represented by XPath. When a query is more complex, we need to model it with multiple twig patterns and value-based joins are used to connect these twig patterns. One example is shown in Fig. 2(b). As pointed out by [7], structural join based algorithms can only eﬃciently process single-patterned queries. When a complex query involves several twig patterns, either from the same document or across diﬀerent documents, structural join based algorithms will fail to work.

116

H. Wu et al.

The reason why structural join based twig pattern matching algorithms cannot process queries involving several twig patterns is that those algorithms cannot perform value-based join between twig patterns using their inverted list indexes. One naive approach is to match diﬀerent twig patterns in such a query separately. By considering each query node that are involved in value-based join as an output node, they can then access the original document to retrieve the child values for these query nodes. Last, they join the matching results from diﬀerent twig patterns based on the retrieved values. Obviously this attempt is I/O costly, and also may produce a large size of intermediate result. In TwigTable, we introduce relational tables to store values. This structure oﬀers an opportunity to process queries across multiple twig patterns. We observe that a join operation between two twig patterns is based on a value comparison between two properties in the two twigs. Using property tables, we can easily perform the value based join. We use an example to illustrate how TwigTable is extended to process such queries. Example 4. Consider the query in Fig. 2(b). There are two twig patterns t1 and t2 are involved in this query. First, TwigTable estimates the selectivity of each twig pattern. In this case, obviously t1 has a higher selectivity. Then TwigTable matches t1 to the document, to get the value result of query node author. Due to the high selectivity on title=“Network” w.r.t. the data in Fig. 4, the matching result only returns one author name, which is “Green”. In the next step, TwigTable joins the value result from the ﬁrst twig pattern to the property table which corresponds to the joining node in the second twig pattern. In this example, we join the only value “Green” to Rauthor , to get a list of labels such that all of them correspond to the value “Green”. Finally, we form a new inverted list with the selected labels for the author node, and match t2 to the document. Discussion: TwigTable uses both inverted lists and tables for twig pattern matching, which oﬀers a good opportunity to process queries across pieces of a document or across diﬀerent documents. However, to process such a query, an optimizer is necessary to decide which twig pattern should be matched ﬁrst, to reduce the searching space for other twig patterns. Such an optimizer is quite similar to a relational optimizer, and needs to estimate the cost to matching each twig pattern, the selectivity of each twig pattern, and also the cost and the selectivity of each value-based join between twig patterns. In this paper, we show the capability of TwigTable to process queries across multiple twig patterns. How to generate an optimal query plan to evaluate such queries will be further investigated.

5

Semantic Optimizations

Tables in TwigTable are built based on the semantic relationship between property and value. That is why we call them property tables. Using property tables to perform content search may still not be eﬃcient enough in some cases. Since object is an important information unit for most queries, we can optimize the property tables to be object based, to further improve the performance.

Using Semantics in XML Twig Pattern Query Processing

5.1

117

Optimization 1: Object/Property Table

Motivation: Using property tables may still suﬀer from redundant search in relevant inverted lists. Consider the query in Fig. 2(a). Supposing there are b books in the bookstore and only one of them is called “Network”. After TwigTable rewrites the query in Fig. 6, the size of the inverted list for title is reduced to 1. However the size of the inverted list for book is still b, though we know one label in it matches the label in the title inverted list. To solve this eﬃciency problem, we propose an optimization scheme based on the object semantics. Optimization: Instead of storing each value with the label of its associated property node, we can put the property value and the label of the corresponding object node into relational tables. For example, in the bookstore document we put values of publisher, title and so forth with labels of the corresponding object book into object/property tables as shown in Fig. 7(a). The ‘label’ ﬁeld of each table stores the label of the object and the following ‘value’ corresponds the value of diﬀerent properties in diﬀerent tables. When we perform a content search, we can directly select the object labels in the corresponding object/property tables and construct a new inverted list for the object. To process the query in Fig. 2(a), we perform the content search using Rbook/title to restrict the book labels based on the condition on title value. After that the query can be further rewritten accordingly, as shown in Fig. 7(b), where Tbook is the new inverted list for the element book and we use booktitle=“N etwork” to explicitly explain book’. Here we not only reduce the size of Tbook , but also further reduce the number of structural joins and the number of query nodes by one. Then we can get better performance when we execute the simpliﬁed query. Rbook/publisher

Rbook/title

Rbook/author

label

value

label

value

label

value

(6:17,4)

Hillman

(6:17,4)

Network

(6:17,4)

Green

(18:31,4)

Elco

(18:31,4)

Database

(18:31,4)

White

…

…

…

…

(18:31,4)

Brown

…

…

(a) Tables in TwigTable Optimization 1

subject

name

booktitle=“Network”

(b) Rewritten for Fig 2(a)

query

Fig. 7. Tables and rewritten query under TwigTable Optimization 1

This object-level optimization is general to all queries with a predicate on a property to constraint an object. For the case that the same property type belongs to multiple object classes, this optimization gains even more beneﬁts, as it avoids accessing those property nodes that do not belong to the required object class by distinguishing their object semantics. Discussion: [Ordinal Column.] This optimization may lose order information for multi-valued properties. Such information may be important in some cases.

118

H. Wu et al.

For example, the order of authors is important, but from the book/author table we cannot tell which author comes ﬁrst for a certain book. To solve this limitation, we can simply add an additional column in the corresponding object/property table for such a multi-valued property, to indicate the ordinal information. 5.2

Optimization 2: Object Table

Motivation: It is quite normal that some queries contain multiple predicates on the same object. Consider the query shown in Fig. 8(a), which aims to ﬁnd the subject of the book with the title of “Network” and the price less than 40. To answer this query, Optimization 1 needs to ﬁnd the labels of the books whose title is “Network” and the labels of the books whose price is less than 40 separately using the object/property tables, and intersect them. With more semantic information, we know that title and price are both properties of the object book. If we have one table for this object that contains the both properties, books satisfying these two constraints can be found directly with one SQL selection. subject

name

book subject

title

price name

“Network”

booktitle=“Network”&price<40

<40

(a) Query with multiple value predicates under the same object

(b) Rewritten query for Fig 8(a) under Optimization 2

Fig. 8. Example query with multiple value predicates under the same object and its rewritten query in Optimization 2

Optimization: A simple idea is to merge the object/property tables in Optimization 1 for each object class. For multi-valued properties, such as author in our example, it is not practical to merge it with other properties. In this case, we can merge all the single-valued properties of an object into one object table and keep the object/property tables for multi-valued properties. The resulting tables for the object class book in the bookstore document under this optimization are shown in Fig. 9. In Rbook , each label of book is stored with all the single-valued property values of that book. When we process queries with multiple predicates on single-valued properties of an object, we can do a selection in that object table based on multiple constraints in one time. For example, to process the query in Fig. 8(a), we can select book labels based on the two predicates in Rbook with one SQL selection. Then the original query can be rewritten as shown in Fig. 8(b). Comparing with the Optimization 1 approach, we further simplify the query and prune intermediate results for the two predicates.

Using Semantics in XML Twig Pattern Query Processing Rbook

119

Rbook/author

label

publisher

title

price

quantity

label

value

(6:17,4)

Hillman

Network

45

30

(6:17,4)

Green

(18:31,4)

Elco

Database

35

15

(18:31,4)

White

…

…

…

…

…

(18:31,4)

Brown

…

…

Fig. 9. Tables for book in the bookstore document under TwigTable Optimization 2

Discussion: [Mixed table selection.] If the multiple predicates involve both single-valued properties and multi-valued properties, we can intersect the selection result from the object table for the single-valued properties, with the selection result from the object/property tables for the multi-valued properties. [Rare property.] Properties may optionally appear under the associated objects. In some cases, the occurrence of certain properties may be rare. We call such properties rare properties. Suppose in the bookstore document, only a few books have a second title, then second title is a rare property. If we put this rare property as a column in the book table, there will be too many NULL entries. Some relational database systems can deal with sparse attributes in the physical storage. In case some other systems do not have this function, we can maintain a separate table specially for all rare properties. The rare property table contains: the object name, the rare property name, the object label and the property value. Suppose in the bookstore document, second title is a rare property of book and sale region is a rare property of magazine, the example rare property table is shown in Fig. 10. Queries involving rare properties are processed by accessing the rare property table with the object name and the property name. Rrare_property object

property

label

value

book

second_title

(76:89,4)

An introduction to data mining

magazine

sale_region

(128:143,4)

Singapore

book

second_title

(282:299,4)

A first course

…

…

…

…

Fig. 10. Table for rare properties

[Vertical partitioning.] An object table is obtained by merging the object/ property tables for all the single-valued property types under the same object class. When there are many single-valued property types under a certain object class, the tuple size of the corresponding object table will be very large, which results in a high I/O cost in selection. A common way in RDBMS design to reduce such I/O cost is the vertical partitioning of a table ( [21]). We can refer to the query history to see which properties often appear together in the same

120

H. Wu et al.

query, and then split the original object table into several partitions according to such information. Since vertical partitioning is not a new technique, we do not discuss it any more. 5.3

Optimization 3: Relationship Table

Motivation: The hierarchical structure of an XML document cannot reﬂect the relationships between objects explicitly, however, such relationships do exist usually. Consider another design of the bookstore document as shown in Fig. 11, in which the books are also grouped by diﬀerent branches. In Fig. 11, the quantity is not a property of book, but a property of the relationship between branch and book. Putting a relationship property into the object table of the property’s nearest object does not aﬀect the accuracy of query processing. Consider a query to ﬁnd the code of the branch that has some computer book with a low quantity, i.e., less than 20, expressed in Fig. 12(a). If we have no idea on the relationship between branch and book, but store the property quantity in the object table for book, Optimization 2 will rewrite the query as in Fig. 12(b) for further matching. However, since we aim to ﬁnd qualiﬁed branches, matching the book node is redundant. If we know the predicate is on the relationship between branch and book, we may ignore book during pattern matching to improve eﬃciency. bookstore (1:2800,1) subject (2:197,2)

……

name (3:4,3)

branches (5:196,3)

“computer”

code (7:8,5)

branch (6:107,4) book (9:20,5)

……

book (21:34,5)

……

title author price quantity publisher title “001” publisher author author price quantity (10:11,6) (12:13,6) (14:15,6)(16:17,6)(18:19,6) (22:23,6) (24:25,6) (26:27,6) (28:29,6) (30:31,6) (32:33,6) “Hillman” “Network” “Green”

45

30

“Elco” “Database” “White” “Brown”

35

15

Fig. 11. Another design of the bookstore document

Optimization: If we have the semantic information about the property of relationship, we can introduce relationship tables. A relationship table store the property value and the label of the participating objects of each relationship instance. The example relationship table for the document in Fig. 11 is shown in Fig. 13(a). When a relationship involves more than two objects, the corresponding relationship table will include the labels of all the objects. Using the relationship table, the query in Fig. 12(a) can be rewritten as in Fig. 13(b).

Using Semantics in XML Twig Pattern Query Processing

subject

name

subjectname=“computer”

branch

“ computer”

121

code

branch

book

code

bookquantity<20

quantity

<20

(a) Example twig pattern query

(b) Rewritten query for Fig 12(a) under Optimization 2

Fig. 12. Example query with predicate on relationship property and its rewritten query in Optimization 2 subjectname=“computer”

Rbranch-book labelbranch

labelbook

quantity

(6:107,4)

(9:20,5)

30

(6:107,4)

(21:34,5)

15

…

…

…

(a) Relationship table

branchbranch-book.quantity<20

code

(b) Rewritten query for Fig 12(a) under Optimization 3

Fig. 13. Example relationship table and rewritten query in TwigTable Optimization 3

Compared to Optimization 2, the query is further simpliﬁed with the semantics of relationship, and the size of the inverted list of the branch node is reduced. Then the query processing performance will be further improved. Discussion: [Overlapping predicates.] A query node can be involved in both an object predicate and a relationship predicate. For example, if the query in Fig. 12(a) has an additional predicate on book price, the query node book will be involved in two predicates (i.e., the relationship predicate and the object predicate) overlapping on it. To handle overlapping predicates, we can perform the content search based on diﬀerent predicates separately, and then intersect the label results to construct the temporary inverted list for the involved object. [Merging object table and relationship table.] If the semantics of participation constraint between two object classes is known, we can merge the object table(s) and the relationship table when the constraint is many-to-one or one-to-one. This is similar to the translation from ER diagram to tables with the consideration of

122

H. Wu et al.

participation constraints in relational database design. However, similar to the vertical partitioning, how to physically maintain relational tables is generally bound to the performance analysis for practical queries. 5.4

A Summary

The optimization techniques are proposed around the semantics of object. This object-level attempt is motivated by the fact that most queries ask about the information of objects. Generally, as more semantic information is known, we can optimize TwigTable to diﬀerent levels, to get better performance. When the object information is not available but we still need it to manage data, we can roughly treat the parent node of each property as an object. As indicated in Section 2.3, this inference may not be correct in some cases, but it does not aﬀect the correctness of twig pattern query processing. However, we construct table index based on the semantics of object instead of the simple structural information is because that object is the information unit in real life queries. Optimizations in object level can minimize the number of structural joins without aﬀecting the processing of general queries. For example, if a person has a composite property name with firstName and lastName. Because most queries are issued to the object person, instead of to the property name only, if we construct table based on name, with the property of firstName and lastName, after the content search, we still have to join name with person. However, if we build the table based on the real object person, i.e., incorporating name/firstName and name/lastName into person table, any predicate on firstName or lastName will result in a label ﬁltering for person directly, to save a structural join.

6

Experiments

In this section, we conduct experiments to show the advantage of TwigTable. We ﬁrst compare our approach to a schema-aware relational approach [24], which is considered more eﬃcient than other relational approaches. Then we compare TwigTable and its two optimizations to TwigStack, a typical structural join based twig pattern matching algorithm. Note that in this experiment, our algorithms take TwigStack to perform structural search, thus we compare to TwigStack to show the beneﬁt gained. We can also take any other structural join algorithm to perform structural search. We do not compare with them because the comparison with TwigStack is suﬃcient to show the advantage of our approach. 6.1

Settings

We implemented all algorithms in Java. The experiments were performed on a dual-core 2.33GHz CPU and a 4GB RAM under Windows XP. We used three types of real-world and synthetic data sets to compare the performance of TwigStack and our approaches: NASA [1], DBLP and XMark [32]. NASA is a 25MB document with deep and complex schema. DBLP data set is a 127MB fragment of DBLP database. It is rather regular with a simple

Using Semantics in XML Twig Pattern Query Processing

Data Set

NASA

Query

Path Expression

NQ1

//dataset//source//other[date/year>1919 and year<2000]/ author/lastName

NQ2

//dataset/tableHead[//field/name=‘rah’]//tableLinks //title

NQ3

//dataset//history//ingest[date[year>1949 and year<2000] [month=‘Nov’][day>14 and day<21]]//creator/lastName

DQ1

/dblp/article[/author=‘Jim Gray’]/title

DQ2

/dblp/proceedings[year>1979]/isbn

DQ3

/dblp/inproceedings[title=‘A Flexible Modeling Approach for Software Reliability Growth’][year=‘1987’][author=‘Sergio Bittanti’]/booktitle

XQ1

//regions/africa/item[//mailbox//mail/from=‘Libero Rive’]// keyword

XQ2

//person[//profile/age>20]/name

XQ3

//open_auction[//bidder[time>18:00:00]/increase>5]/quantity

DBLP

XMark

123

Fig. 14. Experimental queries

DTD schema but a large amount of data values. We also used 10 sets of XMark benchmark data with sizes from 11MB to 110MB for our experiments. We selected three meaningful queries for each data set. All the queries contain predicates with value comparisons, as value predicates appear in most practical queries. Generally, there are three types of query predicates: predicates of equality comparison, predicates of range comparison and multiple predicates of diﬀerent comparisons under one object. The queries are shown in Fig. 14. In TwigTable, we use the Sybase SQL Anywhere [2] to manage relational tables, and inherit the default database parameters. 6.2

Comparison with Schema-Aware Relational Approach

In this section, we compare TwigTable with a Schema-aware Relational Approach proposed in [24]. We name it SRA for short. We also use the Sybase SQL Anywhere for SRA. Since this approach is weak in dealing with “//”-axis, we adopt the proposal in [12] to augment it. SRA is proven more eﬃcient than other schemaless relational approach to process XML queries. The execution time for both SRA and TwigTable to process the queries in NASA, DBLP and a 110MB XMark data is shown in Fig. 15. Note that the Y-axis is in logarithmic scale. From Fig. 15 we can see that for NASA and XMark data (NQ1-3, XQ1-3) TwigTable is more eﬃcient, but for DBLP data (DQ1-3) SRA is more eﬃcient. DBLP data is rather regular and ﬂat, thus the values in DBLP data can be perfectly shredded into relational tables. The maximum height of DBLP is 4, which means using SRA shredding method, there is at most one table join required for

124

H. Wu et al.

Executiontime(ms) tiontime(ms) (ms)

100000 10000 1000 100 10 1 NQ1

NQ2

NQ3

DQ1

DQ2

DQ3

XQ1

XQ2

XQ3

Queries SRA

TwigTable

Fig. 15. Comparison result between SRA and TwigTable

all queries and this join is between the table for root (containing only one tuple) and another table. For such a relational-like document, the relational approach is much more eﬃcient, because table selection dominates the overall performance and this operation can be performed very eﬃciently in all relational databases. However, most real life XML data is not as regular as DBLP, otherwise, it violates the advantage of the semi-structured format. As we see for NQ1-3 and XQ1-3, when the document is deeper and more complex, i.e., requires more table joins for SRA, the performance of SRA is badly aﬀected. 6.3

Comparison with TwigStack

Space management: We test the space issues in document parsing, including the number of labeled nodes in memory, and the number of inverted lists and the number of tables maintained on disk. We parse the NASA, the DBLP, and a 110MB XMark data using the two approaches. The result is shown in Fig. 16. Data

Number of Labeled Nodes

Number of Inverted Lists

Number of Tables

TwigStack

TwigTable

Saving

TwigStack

TwigTable

TwigStack

TwigTable

NASA

997,987

532,963

46.6%

121,833

68

N.A.

39

DBLP

6,771,148

3,736,406

44.8%

388,630

37

N.A.

26

XMark

3,221,925

2,048,193

36.4%

353,476

79

N.A.

43

Fig. 16. Number of labeled nodes, inverted lists and tables in TwigStack and TwigTable

This result validates our analysis in Section 4.4 about the reduction of labeled nodes in memory and the reduction of inverted lists. In TwigTable, the relational tables are built based on diﬀerent types of properties, so the number of tables is limited to the number of diﬀerent property types. We also use 10 sets of XMark data, whose sizes vary between 11MB and 110MB, to further demonstrate the superiority of TwigTable in space management. The experimental result is shown in Fig. 17. We can see that the number of labeled nodes is scaled to the document size for both approaches, and TwigTable always manages less labeled nodes. The

3.5 1 2 33 4 5 2.5 6 7 82 9 10 1.5 1

wigStack ERT 0.5

0.324273 0.650335 0.969617 1.305245 1.628549 1 1.9634 9634 2.294327 2.61591 2.943522 3.221925

0.072371 0.124502 0.164237 0.208709 0.239532 0 0.267772 267772 0.291727 0.311866 0.332839 0.353476

0.20613 0.413111 0.616229 0.820438 1.024073 1.233723 1 233723 1.440674 1.643495 1.849449 2.048193

7.9E-05 7.9E-05 7.9E-05 7.9E-05 7.9E-05 7.9E-05 7 9E 05 7.9E-05 7.9E-05 7.9E-05 7.9E-05

Q1 year Q2 name Q3 year Q3 month Q3 day Q4 71.58 874.656 71.58 29.436 29.436 85 60.876 17.22 60.732 2.148 14.196

No.ofinvertedlists vertedlists (Million) illion)

No. of labeled ed nodes (Million) llion)

Using Semantics in XML Twig Pattern Query Processing 0.4 1 2 0.35 3 4 5 0.3 6 7 0.25 8 9 0.2 10 0.15

0.324273 0.650335 0.969617 1.305245 1.628549 1 1.9634 9634 2.294327 2.61591 2.943522 3.221925

0.072371 0.124502 0.164237 0.208709 0.239532 0 0.267772 267772 0.291727 0.311866 0.332839 0.353476

0.20613 0.413111 0.616229 0.820438 1.024073 1 1.233723 233723 1.440674 1.643495 1.849449 2.048193

125

7.9E-05 7.9E-05 7.9E-05 7.9E-05 7.9E-05 7 7.9E-05 9E 05 7.9E-05 7.9E-05 7.9E-05 7.9E-05

Q3 year Q3 month Q3 day Q4 0.1 Q1 year Q2 name wigStack 71.58 874.656 71.58 29.436 29.436 85 ERT 0.05 60.876 17.22 60.732 2.148 14.196

0

0 11 3 22.8 22 8 11.3

34

45 3 56.2 56 2 68.2 68 2 79 7 90 7 102 45.3 79.7 90.7

111

11 3 22.8 22 8 11.3

34

File size (MB) TwigStack

45 3 56 2 68.2 68 2 79 7 90 7 102 45.3 56.2 79.7 90.7

111

File size (MB)

TwigTable

TwigStack

(a) Number of labeled nodes

TwigTable

(b) Number of inverted lists

Fig. 17. Space management comparisons Total size for indexes (MB) Data

TwigStack

TwigTable

Inverted Lists

Tables

Total

Inverted Lists

Tables

Total

NASA

462

0

462

9

31

40

DBLP

1290

0

1290

30

119

149

XMark

1040

0

1040

17

47

64

Fig. 18. Index sizes in TwigStack and TwigTable

number of inverted lists is scaled to the size of document in TwigStack, whereas this number is a constant in TwigTable. For a large data set it is not practical to handle the tremendous number of inverted lists using TwigStack. We further analyze the total space usage to store indexes of the two approaches, as shown in Fig. 18. We simply use a separate ﬁle to represent each inverted list on disk, for both approaches. The inverted lists are further indexed by a B+ tree automatically by the ﬁle system, so that during query processing the relevant inverted lists can be quickly addressed. From the result, we can see that TwigStack needs a large size of space to manage inverted lists. We further investigate it and ﬁnd that actually the total size of inverted lists is not very large, which is similar to the total size used by TwigTable. However, because of the tremendous number of inverted lists, the extra structures, e.g., the B+ tree, built by the ﬁle system to index these inverted lists are quite large in size. In contrast, although TwigTable maintains table indexes in addition to inverted lists, the total size is still much smaller. Query performance: We used the NASA, the DBLP and a 110MB XMark data set for query performance comparison. We compare TwigStack with our original TwigTable algorithm, as well as Optimization 1 and Optimization 2. As mentioned earlier, we can infer the object information in an XML document.

126

H. Wu et al.

Although the inference may not be semantically correct, it will not aﬀect the correctness of the result. In the two optimizations, we use such inference to build object/property tables and object tables. We do not test Optimization 3. The inverted lists and relational tables are stored on disk and loaded into memory as needed during query processing. As mentioned above, inverted lists are stored as separate ﬁles and indexed by a B+ tree to ensure the high performance in inverted list access. TwigTable actually adopts TwigStack (with the same implementation) to perform structural search, after content search with table selection. The execution time of TwigTable and its optimizations include the I/O and CPU costs to access relational tables to perform content search and the cost on structural search. The comparison result is shown in Fig. 19.

Executiontime(ms)

NQ1 NQ2 NQ3 DQ1 DQ2 DQ3 XQ1 XQ2 XQ3

NASA 18000

relational twigstack vert opt1 35495 640 485 4 //dataset//source/ 800 3813 812 641 3 //dataset/tableHea 2578 609 422 3 //dataset//history/ 600 /dblp/article[/auth 62 2922 2453 14 /dblp/proceedings 46 15547 3454 26 400 78 4132 3546 16 /dblp/inproceeding //regions/africa/ite 7421 789 713 64 //person[//profile/ 11730 1000 869 8 200 //open_auction[//b 8500 1716 1293 6

relational twigstack vert opt1 35495 640 485 4 NQ1 15000 //dataset//source/ 3813 812 641 3 NQ2 //dataset/tableHea 2578 609 422 3 NQ3 12000 //dataset//history/ DQ1 /dblp/article[/auth 62 2922 2453 14 DQ2 9000 /dblp/proceedings 46 15547 3454 26 78 4132 3546 16 DQ3 /dblp/inproceeding 6000 XQ1 //regions/africa/ite 7421 789 713 64 //person[//profile/ 11730 1000 869 8 XQ2 3000 XQ3 //open_auction[//b 8500 1716 1293 6

Executiontime(ms)

NASA 1000

0

0 NQ1

NQ2

NQ3

DQ1

Query TwigTable

TwigStack TwigTableOptimization1

DQ2

TwigTableOptimization2

TwigTableOptimization1

(a) NASA

DQ3

Query TwigTable

TwigStack

TwigTableOptimization2

(b) DBLP

NASA 2000

Executiontime(ms)

NQ1 NQ2 NQ3 DQ1 DQ2 DQ3 XQ1 XQ2 XQ3

relational twigstack vert opt1 35495 640 485 4 //dataset//source/ 3813 812 641 3 //dataset/tableHea 2578 609 422 3 //dataset//history/ 1200 /dblp/article[/auth 62 2922 2453 14 /dblp/proceedings 46 15547 3454 26 800 /dblp/inproceeding 78 4132 3546 16 //regions/africa/ite 7421 789 713 64 //person[//profile/ 11730 1000 869 8 400 //open_auction[//b 8500 1716 1293 6 1600

0 XQ1

XQ2

XQ3

Query TwigTable

TwigStack TwigTableOptimization1

TwigTableOptimization2

(c) XMark Fig. 19. Execution time by TwigStack and TwigTable without optimizations, with Optimization 1 and with Optimization 2 in the three XML documents

From the result we can see that for all queries, TwigTable outperforms TwigStack. The reason is that TwigTable performs content search ﬁrst to simplify the query pattern before structural joins, thus gaining better overall performance. The result also proves that the overhead on table selection will not aﬀect the beneﬁt gained from twig pattern simpliﬁcation. TwigStack performs very badly on DQ2. In the DBLP data, there are a lot of numeric values. To process DQ2, TwigStack has to combine the labels in all the inverted lists with a number name greater than 1979, based on document order. To load and merge these inverted lists is costly. However, in TwigTable this step is replaced by table

Using Semantics in XML Twig Pattern Query Processing

127

selection, thus it is more eﬃcient. We can see, though for small document (e.g., NASA), TwigStack is not very slow to perform range search, when the amount of labels in numeric inverted lists is large, TwigStack will be ineﬃcient to load and merge the labels. For all the queries, Optimization 1 works better than TwigTable. The reason is that Optimization 1 reduces one more level up to the original twig pattern query. Similarly, the query processing performance is further improved. Comparing Optimization 1 with Optimization 2, we can see that for singlepredicated queries there is no obvious diﬀerence. For some queries, Optimization 2 is even slightly worse than Optimization 1. The reason is that the combined object table is larger than object/property table, and then there may be more I/Os to load tuples for object table. However, for multi-predicated queries, e.g., the queries Q3, Q6 and Q9, Optimization 2 has better performance, because Optimization 2 performs content search for all the value comparisons on the same object at the same time. This again proves our analysis in Section 4.4.

7

Conclusion

In this paper, we propose a semantic approach TwigTable to solve diﬀerent kinds of content problems raised in existing approaches for twig pattern query processing. Unlike TwigStack and its subsequent algorithms, our approach uses semantic tables to store values in XML document and avoids the management of a tremendous number of inverted lists for diﬀerent values. During query processing, we perform content search ﬁrst to reduce the size of relevant inverted lists, and rewrite the query to reduce the number of structural nodes and the number of structural joins in it. Then, we match the simpliﬁed pattern with size reduced inverted lists to the document. We also show that the attempt of using hybrid indexes (inverted list and relational table) can easily and eﬃciently process queries across multiple twig patterns. Our approach is a semantic approach because the relational tables are initially built based on the semantics of property. With more semantics on objects and relationships, we propose three optimization techniques to further improve the tables and enhance eﬃciency of query processing. In particular, (1) if each property’s associated object is known, we can change property table to object/property table in Optimization 1; (2) if we know certain properties belong to the same object, we can combine the object/property tables to be object table in Optimization 2; (3) if the relationship between objects is known, we can introduce relationship tables to precisely store the property values of relationships in Optimization 3. To summarize, as more semantic information is known, we can further optimize tables and get better performance. Furthermore, when an output node of a query corresponds to a property type, after ﬁnding an occurrence of a twig pattern in the document, our algorithm can easily extract actual value to answer the query from relational tables, whereas existing approaches need more work to convert labels into values by accessing documents again. Our work brings a new perspective to incorporate relational approach into native approach to manage and query XML data. Processing queries involving

128

H. Wu et al.

ID references [31] or queries across multiple documents becomes possible in our approach, as our approach takes value-based join to link diﬀerent twig patterns involved in a complex query. We will further investigate how to generate optimal query plans for queries involving both structural joins and table joins.

References 1. http://www.cs.washington.edu/research/xmldatasets/data/nasa/nasa.xml 2. http://www.sybase.com/products/databasemanagement/sqlanywhere 3. Al-Khalifa, S., Jagadish, H.V., Patel, J.M., Wu, Y., Koudas, N., Srivastava, D.: Structural joins: A primitive for eﬃcient XML query pattern matching. In: Proc. of ICDE, pp. 141–154 (2002) 4. Berglund, A., Chamberlin, D., Fernandez, M.F., Kay, M., Robie, J., Simeon, J.: XML Path Language (XPath) 2.0. W3C Working Draft (2003) 5. Boag, S., Chamberlin, D., Fernandez, M.F., Florescu, D., Robie, J., Simeon, J.: XQuery 1.0: An XML Query. W3C Working Draft (2003) 6. Bohannon, P., Freire, J., Roy, P., Simeon, J.: From XML schema to relations: a cost-based approach to XML storage. In: Proc. of ICDE, pp. 64–75 (2002) 7. Bruno, N., Koudas, N., Srivastava, D.: Holistic twig joins: Optimal XML pattern matching. In: Proc. of SIGMOD, pp. 310–321 (2002) 8. Chen, T., Lu, J., Ling, T.W.: On boosting holism in XML twig pattern matching using structural indexing techniques. In: Proc. of SIGMOD, pp. 455–466 (2005) 9. Chen, Y., Davidson, S.B., Hara, C.S., Zheng, Y.: RRXS: redundancy reducing XML storage in relations. In: Proc. of VLDB, pp. 189–200 (2003) 10. Doan, A., Ramakrishnan, R., Chen, F., DeRose, P., Lee, Y., McCann, R., Sayyadian, M., Shen, W.: Community information management. IEEE Data Eng. Bull. 29(1), 64–72 (2006) 11. Florescu, D., Kossmann, D.: Storing and querying XML data using an RDMBS. IEEE Data Eng. Bull. 22(3), 27–34 (1999) 12. Gou, G., Chirkova, R.: Eﬃcienty querying large XML data repositories: a survey. IEEE Transactions on Knowledge and Data Engineering 19(10), 1381–1403 (2007) 13. Grust, T.: Accelerating XPath location steps. In: Proc. of SIGMOD, pp. 109–120 (2002) 14. Jiang, H., Lu, H., Wang, W.: Eﬃcient processing of XML twig queries with OR-predicates. In: Proc. of SIGMOD, pp. 59–70 (2004) 15. Jiang, H., Wang, W., Lu, H., Yu, J.: Holistic twig joins on indexed XML documents. In: Proc. of VLDB, pp. 273–284 (2003) 16. Li, C., Ling, T.W.: QED: a novel quaternary encoding to completely avoid relabeling in XML updates. In: Proc. of CIKM, pp. 501–508 (2005) 17. Ling, T.W., Lee, M.L., Dobbie, G.: Semistructured database design (web information systems engineering and Internet technologies series). Springer, Heidelberg (2004) 18. Liu, Z., Chen, Y.: Identifying meaningful return information for XML keyword search. In: Proc. of SIGMOD, pp. 329–340 (2007) 19. Lu, J., Chen, T., Ling, T.W.: Eﬃcient processing of XML twig patterns with parent child edges: a look-ahead approach. In: Proc. of CIKM, pp. 533–542 (2004) 20. Lu, J., Ling, T.W., Chan, C., Chen, T.: From region encoding to extended dewey: On eﬃcient processing of XML twig pattern matching. In: Proc. of VLDB, pp. 193–204 (2005)

Using Semantics in XML Twig Pattern Query Processing

129

21. Navathe, S., Ceri, S., Wiederhold, G., Dou, J.: Vertical partitioning algorithms for database design. ACM Transactions on Database Systems 9(4), 680–710 (1984) 22. Pal, S., Cseri, I., Seeliger, O., Schaller, G., Giakoumakis, L., Zolotov, V.: Indexing XML data stored in a relational database. In: Proc. of VLDB, pp. 1146–1157 (2004) 23. Rao, P.R., Moon, B.: PRIX: Indexing and Querying XML Using Prufer Sequences. In: Proc. of ICDE, p. 288 (2004) 24. Shanmugasundaram, J., Tufte, K., Zhang, C., He, G., DeWitt, D.J., Naughton, J.F.: Relational databases for querying XML documents: limitations and opportunities. In: Proc. of VLDB, pp. 302–314 (1999) 25. Spink, A.: A user-centered approach to evaluating human interaction with web search engines: an exploratory study. Information Processing & Management 38(3), 401–426 (2002) 26. Tatarinov, I., Viglas, S., Beyer, K.S., Shanmugasundaram, J., Shekita, E.J., Zhang, C.: Storing and Querying Ordered XML Using a Relational Database System. In: Proc. of SIGMOD, pp. 204–215 (2002) 27. Tian, F., DeWitt, D.J., Chen, J., Zhang, C.: The design and performance evaluation of alternative XML storage strategies. SIGMOD Record 31(1), 5–10 (2002) 28. TreeBank. Retrieved from University of Washington Database Group (2002) 29. Wang, H., Park, S., Fan, W., Yu, P.S.: ViST: A Dynamic index method for querying XML data by tree structures. In: Proc. of SIGMOD, pp. 110–121 (2003) 30. Wu, H., Ling, T.W., Chen, B.: VERT: A semantic approach for content search and content extraction in XML query processing. In: Parent, C., Schewe, K.-D., Storey, V.C., Thalheim, B. (eds.) ER 2007. LNCS, vol. 4801, pp. 534–549. Springer, Heidelberg (2007) 31. Wu, H., Ling, T.W., Dobbie, G., Bao, Z., Xu, L.: Reducing graph matching to tree matching for XML queries with ID references. In: Bringas, P.G., Hameurlain, A., Quirchmayr, G. (eds.) DEXA 2010. LNCS, vol. 6262, pp. 391–406. Springer, Heidelberg (2010) 32. XMark. An xml benchmark project, http://www.xml-benchmark.org 33. Xu, L., Ling, T.W., Wu, H., Bao, Z.: DDE: From Dewey to a Fully Dynamic XML Labeling Scheme. In: Proc. of SIGMOD, pp. 719–730 (2009) 34. Yoshikawa, M., Amagasa, T., Shimura, T., Uemura, S.: XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Trans. Internet Techn. 1(1), 110–141 (2001) 35. Yu, C., Jagadish, H.V.: Eﬃcient discovery of XML data redundancies. In: Proc. of VLDB, pp. 103–114 (2006) 36. Yu, T., Ling, T.W., Lu, J.: Twigstacklistnot: A holistic twig join algorithm for twig query with NOT-predicates on XML data. In: Li Lee, M., Tan, K.-L., Wuwongse, V. (eds.) DASFAA 2006. LNCS, vol. 3882, pp. 249–263. Springer, Heidelberg (2006) 37. Zhang, C., Naughton, J., Dewitt, D., Luo, Q., Lohman, G.: On supporting containment queries in relational database management systems. In: Proc. of SIGMOD, pp. 425–436 (2001)

Database Semantics Recovery through Analysis of Dynamic SQL Statements Anthony Cleve1,3 , Jean-Roch Meurisse2 , and Jean-Luc Hainaut1 1

3

Faculty of Computer Science, PReCISE Research Center University of Namur, Belgium {acl,jlh}@info.fundp.ac.be 2 Department of Education and Technology University of Namur, Belgium [email protected] Department of Information and Communication Sciences Universit´e Libre de Bruxelles, Belgium

Abstract. The documentation of a database includes its conceptual schema, that formalizes the semantics of the data, and its logical schema that translates the former according to an operational database model. Important engineering processes such as database and program evolution rely on a complete and accurate database documentation. In many cases, however, these schemas are missing, or, at best, incomplete and outdated. Their reconstruction, a process called database reverse engineering, requires DDL code analysis but, more important, the elicitation of implicit constructs (data structures and constraints), that is, constructs that have been incompletely translated into the operational database schema. The most powerful discovery technique of these implicit constructs is the static analysis of application program source code, and, in particular of embedded SQL statements. Unfortunately, the increasing use of dynamic SQL in business applications development makes such techniques helpless, so that modern web software systems can no longer be correctly redocumented through current reverse engineering techniques. This paper introduces and evaluates dynamic SQL capturing and analysis techniques and shows that they can replace and even improve upon pure static analysis. A real world case study is presented and discussed.

1 Introduction According to standard database design methodologies, any decently developed database should be accompanied by a complete, acurate and up to date documentation, comprising, among others, its conceptual and logical schemas, and, more specially for the database administrator, its physical schema and the translation of the latter in the DDL of the DBMS. 1.1 Database Schemas The conceptual schema describes the semantics of the data in terms of the subject application domain while the logical schema is its lossless transcription in the data structures S. Spaccapietra (Ed.): Journal on Data Semantics XV, LNCS 6720, pp. 130–157, 2011. c Springer-Verlag Berlin Heidelberg 2011

Database Semantics Recovery through Analysis of Dynamic SQL Statements

131

of the DBMS model. Both express the same semantics but in different languages, the former providing a high level of abstraction and semantic expressivity while the latter focussing on programmability and technology efficiency. These schemas are not mere byproducts of the design process but they also are indispensable for the exploitation of the database, in particular for application program development, and for maintenance and evolution of the database. These complex processes cannot be carried out efficiently and in a reliable way without the precise knowledge of the meaning and of the properties of the data. In many situations, however, particularly for ageing databases, these documents are obsolete, partial, inacurate or simply lost, if there ever was any. 1.2 Database Reverse Enginering The need for precise methods to reconstruct the documentation of a database has been widely recognized in the eighties1 under the name database reverse engineering. The first approaches were based on simple rules, that work nicely with databases designed in a systematic way [1,2,3,4]. A second generation of methodologies coped with physical schemas resulting from empirical design in which practitioners tend to apply non standard and undisciplined techniques. More complex design rules were identified and interpreted [5], and, based on them, structured and comprehensive approaches were developed [6] while the first industrial tools appeared (e.g., Bachman’s Reengineering Tool). Many contributions were published in the nineties, addressing practically all the legacy technologies and exploiting such sources of information as application source code and database contents. Among synthesis publications, we mention [7], the first tentative history of this discipline. These second generation approaches were faced with two kinds of problems induced by empirical design. The first problem is the recovery of implicit constructs, that is, structures and constraints that have not been explicitly declared in the DDL code. Their elicitation requires the analysis of such complex information sources as the source code of the application programs (and particularly the DML statements), the contents of the database and the user interface. The second problem is that of the semantic interpretation of logical schemas that may include non standard data structures. 1.3 New Challenges The process of legacy database reverse engineering has been studied for several decades and solid methodologies and tools are now available. One of the major supporting techniques that have been designed and implemented is program code exploration, through which essential implicit constructs of the database schemas can be discovered through understanding how programs access and process the data. In particular, DML statements (in most case SQL statements) are a particularly rich information source on the existence and the properties of implicit constructs. We observe however that new forms of information systems are emerging, that pose new problems that cannot be 1

In fact, the term reverse engineering was already coined in the documents of the ISDOS system, in the seventies. See http://www.pslpsa.com/HistoryTop.htm for example.

132

A. Cleve, J.-R. Meurisse, and J.-L. Hainaut

adequately solved by these resources [8]. The proliferation of web-based applications is one of them. Often developed in an undisciplined way through rapid development tools, these applications generate an increasing number of poorly designed and undocumented databases, making the reverse engineering process even more critical. Another challenge we currently face is the increasing use of dynamic languages, and in particular dynamic SQL, notably through ODBC and JDBC interfaces. Since actual statements only exist at runtime, standard program analysis techniques generally fail, and new, more complex, dynamic analysis techniques must be designed. 1.4 About This Paper The objective of this paper is to study the characteristics of dynamic SQL in dataintensive application programs and to propose techniques to extract data semantics from these statements. The paper is structured as follows. Section 2 provides a brief introduction to database reverse engineering (DBRE). Section 3 illustrates the difference between static and dynamic SQL. In Section 4, we elaborate on the analysis of (dynamic) SQL statements in support to DBRE. Section 5 identifies and compares several techniques for capturing the trace of SQL statements that are executed by an application program at runtime. In Section 6 we show how such an SQL execution trace can be analyzed in order to identify implicit schema constructs, in particular undeclared foreign keys. Section 7 reports on a case study evaluating the proposed approach on a real web application. Related work is discussed in Section 8 and concluding remarks are given in Section 9.

2 Background: Database Reverse Engineering Basically, database reverse engineering comprises three processes, namely physical schema extraction, logical schema reconstruction and schema conceptualization. Each of them has distinct objectives and relies on specific techniques, in such a way that they can, to a large extent, be studied independently2. 2.1 Physical Schema Extraction This process recovers the physical schema of a database by parsing its DDL code or, equivalently, by analyzing the contents of its active data dictionary, such as the system tables in most relational systems. This extraction makes visible the explicit constructs of the schema, that is, the data structures and constraints that have been explicitly declared through DDL statements. Such is the case for primary keys, unique constraints, foreign keys and mandatory fields. Generally, this process is fairly straightforward. However, the analysis of sub-schemas (e.g., relational views or IMS PCBs) can be more intricate since each of them brings a partial, and often refined view of the global schema. Since the handling of the database by the DBMS relies on the physical schema, the latter can be considered exact. 2

Some of the material of this section is drawn from [9].

Database Semantics Recovery through Analysis of Dynamic SQL Statements

133

2.2 Logical Schema Reconstruction This major process addresses the discovery of the implicit constructs of the schema, that is, those that have not been declared by explicit DDL statements and clauses. In some favorable situations, they have been translated into programmed database components such as SQL checks, triggers and stored procedures. However, most of them have been coped with into application program fragments (nearly) duplicated and scattered throughout millions of lines of code. For instance, a popular way to check a referential constraint consists in accessing the target record before storing the source record in its file. Recovering these implicit constructs, in constrast with explicit constructs, requires a precise analysis of various pieces of procedural code, and particularly DML statements. Though program source code is the richest information source, the database contents (the data), screen layout, report structure, program execution, users interview and, of course, (possibly obsolete) documentation will be analyzed as well. Contrary to the physical schema, the logical schema produced by this process can only be qualified reasonably exact inasmuch as some implicit constructs may have not been discovered (silence) while some constructs introduced in the schema simply do not exist (noise). There is no formal way to prove that all the implicit constructs, and only them, have been discovered. 2.3 Schema Conceptualization The goal of this process is to interpret the logical schema semantically by extracting a conceptual schema that represents its intended meaning. It mainly relies on transformational techniques that undo the effect of the logical design process [10]. This complex process is decomposed in three subprocesses, namely untranslation, de-optimization and conceptual normalization. The untranslation process consists in reversing the transformations that have been used to draw the logical schema from the conceptual schema. For instance, each foreign key is interpreted as the implementation of a many-to-one relationship type. This process relies on a solid knowledge of the rules and heuristics that have been used to design the database. Those rules can be standard, which makes the process fairly straightforward, but they can also be specific to the company or even to the developer in charge of the database (who may have left the company), in which case reverse engineering can be quite tricky. The main constructs that have to be recovered from their logical implementation are relationship types, super-type/subtype hierarchies, multivalued attributes, compound attributes and optional attributes. The de-optimization process removes the trace of the optimization techniques that have been used to improve the performance of the database. Redundancies must be identified and discarded, unnormalized data structures must be decomposed and horizontal and vertical partitioning must be identified and undone. Finally, conceptual normalization improves the expressiveness, the simplicity, the readability and the extendability of the conceptual schema. Though the experience and skill of the (reverse) engineer are essential in schema conceptualization, the quality of the conceptual schema mainly depends on the completeness and soundness of the logical schema.

134

A. Cleve, J.-R. Meurisse, and J.-L. Hainaut

2.4 Implicit Constructs All the structures and constraints of a logical schema that cannot be expressed in the DDL are implicit by nature. However, many databases include implicit constructs that could have been declared at design time but that were not, for such reasons as convenience, standardization, inheritance from older technology or simply by ignorance or bad design. For example, in many legacy relational databases, foreign keys are not declared through foreign key clauses, but are, at best, managed by an appropriate set of triggers. We briefly describe some of the most important implicit constructs that have been observed in relational databases [11]. – Exact column structure. Compound and multivalued columns (that would have been translated into row and array complex types in SQL3) are often anonymously represented by the concatenation of their component values. – Unique keys of tables and multivalued fields. This property is particularly important in implicitly strongly structured columns. – Foreign keys. Each value of a column, or of a set of columns, is processed as a reference to a record in another table. – Functional dependencies. The values of a column can depend on the values of other columns that have not been declared or elicited as a candidate key. This pattern has often been used in older databases for performance reasons. – Value domains. A more precise definition of the domain of a field can be discovered by data and program analysis. Identifying enumerated domains is particularly important. 2.5 Information Sources and Discovery Techniques Discovering implicit constructs is based on ad hoc techniques depending on the nature and reliability of the information sources. None of them can guarantee in an absolute way the presence or the absence of such constructs but they can all contribute to a better knowledge of the hidden components and properties of a database schema. We describe four of them. – Schema analysis [3,12,13]. Spotting similarities in names, value domains and representative patterns may help identify hidden constructs such as foreign keys. – Data analysis [14,15,16,17]. Mining the database contents can be used in two ways. Firstly, to discover implicit properties, such as functional dependencies and foreign keys. Secondly, to check hypothetic constructs that have been suggested by the other techniques. Considering the combinatorial explosion that threaten the first approach, data analysis is most often applied to check the existence of formely identified patterns. – Program analysis [18,19,20,21]. Understanding how programs use the data provides crucial information on properties of these data. Even simple analysis, such as dataflow graph exploration, can bring valuable information on field structure and meaningful names. More sophisticated techniques such as dependency analysis and program slicing can be used to identify complex constraint checking or

Database Semantics Recovery through Analysis of Dynamic SQL Statements

135

foreign keys. SQL statements examination, that will be detailed in this paper, is one of the most powerful variants of program analysis. – Screen/report layout analysis [22,23,24]. Forms, reports and dialog boxes are useroriented views on the database. They exhibit spatial structures (e.g., data agregates), meaningful names, explicit usage guidelines and, at runtime, data population and error messages that, combined with dataflow analysis, provide much information on hidden data structures and properties. 2.6 SQL Statement Analysis When, as has been common for more than two decades, data are managed by relational DBMSs, the database/program interactions are performed through the SQL language and protocols. Based on the relational algebra and the relational calculus, SQL is a high-level language that allows programmers to describe in a declarative way the properties of the data they instruct the DBMS to provide them with. By contrast, navigational DMLs (also called one-record-at-a-time DML) access the data through procedural code that specifies the successive operations necessary to get these data. Therefore, a single SQL statement can be the declarative equivalent of a procedural section of several hundreds of lines of code. Understanding the semantics of an SQL statement is often much easier than that of this procedural fragment. The analysis of SQL statements in application programs has long been used as a major program understanding technique in database reverse engineering [18,25,26,27,21]. We illustrate its importance on a small but representative example based on the schema of Figure 1 made up of two tables describing customers and orders. This physical schema graphically translates the constructs declared in the DDL code.

CUSTOMER CustNum: num (8) CustName: varchar (30) CustAddress: char (120) id: CustNum

ORDERS OrdNum: num (10) OrdDate: date (1) Reference: char (12) Sender: num (8) id: OrdNum

Fig. 1. A two-table schema including implicit constructs

A careful examination of Query 1 shows that it provides the customer city and the ordered product for a definite order. It asks the DBMS to extract these data from a row built by joining tables CUSTOMER and ORDERS for that order. It exhibits the main features of the program/query interface: the program transmits an input value through host variable ORDID and the query returns result values in host variables CITY and PRODUCT. The analysis of this query brings to light some important hidden informations, two of which are essential for the understanding of the database schema. 1. The join is performed on columns CustNum and Sender. The former is the primary key of table CUSTOMER while the second one plays no role so far. Knowing that most joins are based on the matching of a foreign key and a primary

136

A. Cleve, J.-R. Meurisse, and J.-L. Hainaut

select substring(CustAddress from 61 for 30), Reference into :CITY, :PRODUCT from CUSTOMER C, ORDERS O where C.CustNum = O.Sender and OrdNum = :ORDID

Query 1. Extracting hidden City and Product information (predicative join) key, Query 1 strongly suggests that column Sender is an implicit foreign key to CUSTOMER. Further analysis will confirm or reject this hypothesis. 2. Atomic column CustAddress appears to actually be a compound field, since its substring at positions 61 to 90 is extracted and stored into variable CITY. Translating this new knowledge in the original schema leads to the more precise schema of Figure 2. CUSTOMER CustNum: num (8) CustName: varchar (30) CustAddress: compound (120) Data1: char (60) City: char (30) Data2: char (30) id: CustNum

ORDERS OrdNum: num (10) OrdDate: date (1) Reference: char (12) Sender: num (8) id: OrdNum ref: Sender

Fig. 2. Two implicit constructs revealed by the analysis of Query 1

Analyzing independent statements may prove insufficient in that it provides incomplete information for some frequent programming patterns. An SQL statement can be decomposed into several fragments in such a way that its actual global structure is no longer explicit. The code of Query 2 is an example of a procedural join, functionally similar to that of Query 1. One can reconstruct the intended join by examining the usage of shared variables that define inter-query dependencies.

select Sender into :CUST from ORDERS where OrdNum = :ORDID select CustName, CustAddress into :CNAME, :CADDRESS from CUSTOMER where CustNum = :CUST

-- Q21

-- Q22

Query 2. Extracting hidden City and Product information (procedural join)

3 Static vs Dynamic SQL The introductory examples of Section 2.6 are expressed in static SQL, a variant of the language in which the SQL statements are hard-coded in the source program. Static SQL, in which Query 1 and Query 2 are written explicitly shows the architecture of the

Database Semantics Recovery through Analysis of Dynamic SQL Statements

137

query according to the SQL native syntax. It mentions the schema constructs concerned and identifies the input and output variables through which the host statements interact with the query. There is another family of SQL interfaces, called dynamic SQL, with which the SQL statements are built at runtime and sent to the database server through a specific API. Typically, the programs build each query as a character string, then ask the DBMS to prepare the query (i.e., to compile it) and finally execute it. The only point in the program at which the SQL query actually exists is at runtime, when, or just before, the query string is sent to the DBMS for compilation and/or execution. Query 3 is a simple example of the use of dynamic SQL to extract the name of customer ’C400’. Dynamic SQL is now widely used in data-intensive application programs, particularly in web business systems. Its analysis is therefore particularly important for the reverse engineering of poorly documented databases. Dynamic SQL provides a high

CNUM = "C400"; QUERY = "select CustName from CUSTOMER where CustNum = ’" + CNUM + "’"; exec SQL prepare Q from :QUERY; exec SQL execute Q into :NAME;

Query 3. First example of dynamic SQL level of flexibility but the application programs that use it generally are difficult to analyse and to understand. Indeed, dynamically generated queries must be reconstructed before they can be properly analyzed. This reconstruction process is not always possible through static program analysis [28]. For instance, such techniques as dataflow analysis or program slicing fall short in this context: the high-level of dynamicity causes a lot of noise in the data dependency graph [21], with a negative impact on the precision of the results. Most major DBMSs, such as Oracle and DB2, include interfaces for both static and dynamic SQL. Though dynamic SQL has been standardized in the eighties and implemented by most relational DBMS vendors, as illustrated in Queries 3 and 4, all of them provide drivers for the popular DBMS-independent APIs ODBC from Microsoft and JDBC from SUN. QUERY = "select CustName from CUSTOMER where CustNum = :v1"; exec SQL prepare Q from :QUERY; exec SQL execute Q into :NAME using :CNUM;

Query 4. Another common usage pattern of dynamic SQL The examples in Queries 3 and 4 are written in dynamic SQL for C/C++. The first one shows the build time injection of constants from variable CNUM, through mere string concatenation, while Query 4 illustrates the binding of formal variable v1 with the actual host variable CNUM through clauses into and from of the prepare and execute statements.

138

A. Cleve, J.-R. Meurisse, and J.-L. Hainaut

The ODBC and JDBC interfaces provide several query processing protocols, differing notably on the binding technique, that we refer to in this study. The most general form is illustrated in Query 5, where the iteration structure for obtaining successive results has been ignored for simplicity. Line 1 creates database connection con. Line 2 builds the SQL query in host variable QUERY. This statement includes input placeholder ? which will be bound to an actual value before execution. Line 3 creates and prepares (i.e., compiles) statement SQLstmt from string QUERY. This statement in completed in Line 4, in which the first (and unique) placeholder is replaced with value C400. The statement can then be executed (Line 5), hence creating the resulting set of rows rset. Method next of rset positions its cursor on the first row of this result (Line 6) while Line 7 extracts the first (and unique) output value specified in the query select list and stores it in host variable Num. Line 8 closes the result set.

1 2 3 4 5 6 7 8

CNum = "C400"; Connection con = DriverManager.getConnection(DBurl, Login, Password); String QUERY = "select OrdNum from ORDERS where Sender = ?"; PreparedStatement SQLstmt = con.prepareStatement(QUERY); SQLstmt.setString(1, CNum); ResultSet rset = SQLstmt.executeQuery(); rset.next(); Num = rset.getInt(1); rset.close();

Query 5. Standard JDBC database interaction

4 SQL Statement Analysis We can distinguish two main approaches to SQL statement analysis, namely static analysis and dynamic analysis. Static analysis consists in analyzing the SQL statements occuring in the source code of the application programs. In contrast, dynamic analysis aims at capturing and analyzing the SQL statements that are executed by a running program. It first captures an SQL execution trace, that (at least) contains all the SQL queries sent by the application program to the DBMS. Then it analyzes this trace in order to extract useful information from it. Making the choice between static and dynamic analysis techniques in a given context depends on such criteria as (1) the availability of the source code, (2) the nature of the SQL statements to analyze (static or dynamic SQL) and (3) the final objectives of the analysis. 4.1 Static Analysis of Static SQL Statements The analysis of static SQL statements has been introduced in Section 2.6. Though this paper is mainly devoted to dynamic techniques, it is worth developing further the way sequences of statements can be analyzed. The analysis of SQL statements in a program can address each statement independently (Query 1) or can extend the understanding to chains of dependent statements (Query 2).

Database Semantics Recovery through Analysis of Dynamic SQL Statements

139

The SQL code of Query 1 explicitly exhibits the input and output variables through which the host statements interact with the query. These variables are important since, on the one hand, input variables define the variable parts of the query and, on the other hand, input and output variables potentially weave dependency links with other SQL queries. The code fragment Query 2 (procedural join) illustrates such an inter-query dependency by developing the join into two SQL queries. Naming Q21 and Q22 the two SQL queries, we observe that the host variable CUST is the output variable of query Q21 and the input variable of the query Q22. If the intermediate host statements do not change the value of CUST (this invariance can be checked through dependency analysis), then these queries can be considered as fragments of a distributed global query. Inter-query analysis has been suggested by Petit et al. [18] for example. Remark. Basically, there is no need to apply dynamic techniques to static SQL, but in two objectives, namely capturing result sets of rows and identifying SQL statements execution sequences. In addition, the capturing techniques are limited to program instrumentation and DBMS logging (see Section 5). 4.2 Static Analysis of Dynamic SQL Statements In some disciplined programming styles, the query building step is written as a neat sequence of explicit substring concatenations just before the prepare/execute section, so that significant fragments of the SQL query, if not the whole query itself, can be recovered through careful static analysis [29] or symbolic execution [30]. However, some fragments of the query may be initialized long before, and far away from the execution point, or computed, or extracted from external sources in such a way that discovering the intended query by code examination alone may prove impossible. 4.3 Dynamic Analysis of Dynamic SQL Statements While static analysis of SQL statements is mainly based on the dependencies that exist between host variables, dynamic analysis reasons about column values. When static analysis fails, dynamic analysis can detect constant equality in the traces from different program points. If the first constant belongs to the output of query Q1, if the second constant is an input value of query Q2 and if both constants are equal throughout the traces, then the probability that both constants come from the same variable is high. Studying input and output constants may also reveal another kind of inter-query dependency, according to which the result of a query Q1 influences the fact that another query Q2 is executed or not. This is typically the case when a program first verifies an implicit integrity constraint (as a foreign key) before inserting or updating an SQL row. Similarly, before inserting a new row in a table, the program may check that the provided identifier value is unique. In both cases, there will be equality between an input value of a validation query Q1 and an input value of insertion query Q2. 4.4 Capturing Results of SQL Statement Execution Dynamic analysis is a privileged means to capture actual SQL queries at execution time in order to examine them, but this technique can also be used to examine the results

140

A. Cleve, J.-R. Meurisse, and J.-L. Hainaut

of SQL query executions. The format of the output data is defined by the form of the SQL query and the information of the logical schema. As just explained, result analysis often is the only technique to trace the link between two SQL queries through shared host variables when the intermediate host statements cannot be analysed statically, as in Query 2 for example. In addition, query results also provide interesting information such as the number of rows that a query usually returns. For instance, an SQL query that always returns a single row probably includes a condition on a column that may be an implicit candidate key.

5 SQL Statement Capturing Techniques In this section, we identify and describe six different techniques to capture the SQL statements executed by an application program. These techniques, summarized in Figure 3 are intended to capture the behaviour of the client/server system at runtime. More detail about them can be found in [31].

CLIENT side

Programs

API

SERVER side

DBMS

Database

API overloading API substitution Program instrumentation Aspect-based tracing

DBMS logs Tracing stored procedures

Fig. 3. Six capturing techniques for SQL statement executions

Program instrumentation. The capture of a dynamic SQL statement is performed by a dedicated code section inserted before the program point of this statement. Similarly, the result of an SQL statement will be captured by a code section inserted after it. This technique requires code analysis to identify and decode database API statements and entails source code modification and recompilation. It provides a temporal list of statement instances. In the example of Figure 4, the tracing statement writes in the log file the program point id (132), the event type (SQLexec), the statement object id (hashCode()) followed by the SQL statement (build time constant injection is assumed) or the output variable contents. According to the information that needs to be extracted from the trace, program id, process id and/or timestamp can be output as well.

Database Semantics Recovery through Analysis of Dynamic SQL Statements

141

Statement stmt = connection.createStatement(); ResultSet rs = stmt.executeQuery(SQLstmt); SQLlog.write(”132;SQLexec;”+stmt.hashCode()+”;”+SQLstmt); rs.next(); vName = rs.getString(1); SQLlog.write(”133;SQLgetS1;”+rs.getStatement().hashCode()+”;”+vName); vSalary = rs.getInt(2); SQLlog.write(”134;SQLgetI2;”+rs.getStatement().hashCode()+”;”+vSalary);

Fig. 4. Logging SQL operations by program instrumentation

Aspect-based tracing. Aspect technology [32], if available for the host programming language, allows triggers to be added to an existing program without source code modification. The introduction of tracing aspects simply requires the programs to be recompiled. In the case of Java/JDBC applications, AspectJ [33] pointcuts and advices can be defined in order to capture the successive events involved in dynamic database manipulation like query preparation, input value definition, query execution and result extraction [34]. Figure 5 shows a simple tracing aspect that captures and records the execution of an SQL query (in absence of query preparation). The declared pointcut (lines 5-6) refers to the invocation of method executeQuery of class Statement. Before each occurence of query execution, the advice (lines 8-13) writes a log entry indicating (1) the class name, (2) the line of code, (3) the object id of the statement and (4) the query string itself. 1 public aspect SQLTracing { 2 3 private MySQLLog log = new MySQLLog(); 4 5 pointcut queryExecution(String query): 6 call(ResultSet Statement.executeQuery(String)) && args(query); 7 8 before(String query): queryExecution(query){ 9 String file = thisJoinPoint.getSourceLocation().getFileName(); 10 int LoC = thisJoinPoint.getSourceLocation().getLine(); 11 Statement statement = (Statement) thisJoinPoint.getTarget(); 12 log.traceQuery(file, LoC, statement.hashCode(), query); 13 } 14 }

Fig. 5. Aspect-based tracing of SQL query executions

API overloading. The API overloading technique consists in encapsulating (part of) the client side API within dedicated classes which provide similar public methods but produce, in addition, the required SQL execution trace. For instance, we could write our own Statement, PreparedStatement and ResultSet classes which, in turn, make use of the JDBC corresponding classes, as shown in Figure 6. This technique requires minimal program adaptation (illustrated in Figure 7) and recompilation.

142

A. Cleve, J.-R. Meurisse, and J.-L. Hainaut

package myAPI; ... public class Statement{ java.sql.Statement stat; public Statement(java.sql.Connection con){ stat = con.createStatement(); } public ResultSet executeQuery(String sql){ log.traceQuery(sql); return new ResultSet(stat.executeQuery(sql)); } ... }

Fig. 6. Illustration of API overloading

import java.sql.Statement; import myAPI.Statement; import java.sql.ResultSet; import myAPI.ResultSet; ... ==> ... Statement sta = con.createStatement(); Statement sta = new Statement(con); ResultSet rsl = sta.executeQuery(q); ResultSet rsl = sta.executeQuery(q);

Fig. 7. Program adaptation for API overloading

API substitution. If the source code of the client side API is available, which is the case for ODBC and JDBC drivers of Open Source RDBMSs, tracing statements can be inserted directly in this API. The latter is then recompiled and bound to the client applications. The client program need not be modified. This technique records the statement instances but ignores the program points. DBMS logs Most database servers store the requests received from the client application programs in a specific file or table. For example, MySQL writes, in the order it received them, all the client queries in its general query log. Each record comprises the client process id, the timestamp of query reception and the text of the query as it was executed, the input variables being replaced with their instant values. As compared to the program instrumentation technique, DBMS log does not provide program points and can be processed off-line only. This technique does not require source code modification. Tracing stored procedures. SQL procedures are centrally stored in the database and can be invoked by any client program. By building SQL procedures equivalent to the most used SQL queries, and by replacing some or all SQL statements in programs by the invocation of these SQL procedures, we provide an ad hoc API that can be augmented with tracing instructions that log SQL statement executions. This technique can be considered in architectures that already rely on SQL procedures. When client programs include explicit SQL statements, it entails in-depth and complex code modification. However, since it replaces complex input and output variable binding with mere procedure arguments, this reengineering can lead to a better code that will be easier to maintain and evolve.

Database Semantics Recovery through Analysis of Dynamic SQL Statements

143

Techniques comparison. Table 1 provides a comparison of the SQL trace capturing techniques described above. Two main dimensions are considered: – the richness of the traces that can be obtained: is the technique able to capture such information as the executed queries, the query results and the program points that correspond to the query execution steps? – the impact on the application programs: does the technique necessitate program modification and/or recompilation? The choice of a particular tracing technique obviously depends on such criteria as the objective of the dynamic analysis (which influences the information to capture) and the nature of the system of interest (programming language, database platform, architecture, availability of the source code). Table 1. Comparison of SQL trace capturing techniques Techniques Instrumentation Aspects API overloading API substitution DBMS logs Stored procedures

Richness of the SQL trace Impact on programs Executed queries Query results Program points Modification Recompilation X X X X X X

X X X X X

X X

X X

(X)

X X X X X

6 SQL Trace Analysis for Database Reverse Engineering Examination of static SQL queries is one of the most powerful techniques to elicit implicit database schema constructs and constraints among which undeclared foreign keys, identifiers and functional dependencies [35,36,37,38]. In this section, we discuss the use of dynamic analysis of dynamic SQL queries for such database reverse engineering task. In particular, we show how SQL execution traces can serve as a basis to formulate hypotheses on the existence of undeclared foreign keys3 . These hypotheses will still need to be validated afterwards (e.g., via data analysis or user/programmer interviews). We distinguish two different approaches to detect implicit referential constraints between columns of distinct or identical tables. One can observe either the way such referential constraints are used or the way they are managed. – Referential constraint usage consists in exploiting the referential constraint. For instance, within the same execution, an output value o1 of an SQL statement s1 querying table T1 is used as an input value of another SQL statement s2 accessing another table T2 (see Query 2). A more direct usage of a foreign key consists of a join of T1 and T2 within a single query (see Query 1). In both cases, this suggests the existence of an implicit foreign key between tables T1 and T2 . 3

However, a similar discussion can be developed for the other implicit constructs.

144

A. Cleve, J.-R. Meurisse, and J.-L. Hainaut

– Referential constraint management aims at verifying that the referential constraint keeps being respected when updating the database. For instance, before modifying the contents of a table T2 (by an insert or update statement s2 ), the program executes a verification query q1 on table T1 . According to the result of q1 , s2 is executed or not. When both q1 and s2 are executed, they include at least one common input value. Similarly, when deleting a row of a table T1 using a delete statement d2 , one observes that the program also deletes a possibly empty set of rows of another table T2 via a another delete statement d1 (procedural delete cascade). We have identified three main heuristics for implicit foreign key constraint detection from SQL execution traces, namely joins, output-input dependency and input-input dependency. Below, we further define and illustrate those three heuristics. Notations. Let q be an SQL query occuring in an execution trace. Let t be a table of a relational database. – q.match4 denotes the set of couples of columns (c1 , c2 ) whose values are matched in an equality condition of q; – q.in denotes the set of input values of q; – q.out denotes the set of output values of q; – q.seq denotes the sequence number of q in its trace; – t.cols denotes the set of columns of t. 6.1 Joins As discussed above, most SQL joins rely on the matching of a foreign key with its target primary key. Query 1 corresponds to a standard join, where several table names appear in the from clause of the query. It combines the rows of those tables, typically based on a join condition which expresses the equality of the primary and foreign key values. When primary keys have been recovered, either from the physical schema or as implicit constructs, then the join conditions provide strong evidence for implicit foreign keys. Definition 1. An SQL query q contains a join of two tables t1 and t2 iff ∃(c1 , c2 ) ∈ q.match such that c1 ∈ t1 .cols ∧ c2 ∈ t2 .cols. It is important to mention that all SQL joins do not necessarily correspond to the matching of a foreign key and a primary key value. Several counter-examples can be observed. A typical case consists in joining two tables on their foreign keys (none of them being a candidate key) to a third table, a dubious pattern known as connection trap [39]. 6.2 Output-Input Dependency An output-input dependency occurs when an SQL query uses as input some of the results of a previous SQL query of the same program execution. Definition 2. An SQL query q2 is output-input dependent on another SQL query q1 iff q2 .in ∩ q1 .out = ∅ ∧ q2 .seq > q1 .seq. 4

The q.match relationship is symmetric.

Database Semantics Recovery through Analysis of Dynamic SQL Statements

145

... select Sender from ORDERS where OrdNum = 5789 getString(1) = C400 select Name, Address from CUSTOMERS where Num = ’C400’ ...

Fig. 8. An SQL trace fragment with an output-input dependency that may reveal the existence of the implicit foreign key of Figure 1

In the presence of foreign keys, be they implicit or not, output-input dependencies can be observed in several navigational programming patterns. For instance, in a procedural join between the source and target tables of a foreign key (see Query 2), the value of the foreign key column(s) is used to retrieve the target row using a subsequent query. Conversely, the value of the identifier of a given target row can be used to extract all the rows referencing it. For instance, the program retrieves a customer, before searching for all her recent orders. Figure 8 shows an example of input-output dependency between two succesive SQL queries. In this example, the program retrieves the name and address of the customer who placed a particular order it has just retrieved. We see that the input value of column ORDERS.Sender of the first query is the same as the input value of column CUSTOMER.Num in the second query. 6.3 Input-Input Dependency In an SQL execution trace, an input-input dependency holds between two successive SQL queries that share common input values. Definition 3. An SQL query q1 is input-input dependent on another SQL query q2 iff q1 .in ∩ q2 .in = ∅. The presence of input-input dependencies in SQL execution traces constitutes another strong indicator of the presence of foreign keys. Several data manipulation patterns for referential constraint management make an intensive use of input-input dependent queries. Among the most popular examples, the delete cascade mecanism, that consists in deleting all referencing rows before deleting a target row, makes use of delete queries that share a common input value: the primary/foreign key value of the target rows to be deleted. A second example is the check-before-insert pattern, that aims at preserving a referential integrity constraint when inserting rows in the database. When inserting a row in a referencing table, the program first checks that the provided foreign key value is valid, i.e., it corresponds to the primary key value of an existing row in the target table. Similar patterns can be observed in delete and update procedures. As an illustration, we consider the execution trace given in Figure 9. This trace strongly suggests the existence of an implicit foreign key between column Sender of table ORDERS and column Num of table CUSTOMER. Indeed, a row insertion in table ORDERS is preceded by the execution of a validation query that (1) counts the number

146

A. Cleve, J.-R. Meurisse, and J.-L. Hainaut

... select count(*) from CUSTOMER where Num = ’C400’ getInt(1) = 1 insert into ORDERS(_,_,_,Sender) values (_,_,_,’C400’) ...

Fig. 9. An SQL trace fragment with an input-input dependency that may reveal the existence of the implicit foreign of Figure 1

of rows of table CUSTOMER having c as value of column Num – where c corresponds to the value of column Sender of the inserted row of ORDERS – and (2) returns 1 as a result. In other words, the program checks that the provided value of column Sender does correspond to the primary key (Num) value of an existing customer. This SQL trace fragment is actually an instance of the check-before-insert pattern described above.

7 Case Study In order to validate the approach we propose in this paper, we conducted an experiment based on an e-learning application, called WebCampus5 , that is used at the University of Namur. WebCampus is an instantiation of Claroline6 , an open-source Learning Management System (LMS) allowing teachers to offer online courses and to manage learning and collaborative activities on the web. The platform is written in PHP and uses a MySQL database. WebCampus consists of more than a thousand source code files, amounting to 460 000 lines of code. The main database manipulated by WebCampus is made up of 33 tables and 198 columns, representing the data on available online courses, course users, university departments, etc. In addition, the system makes use of one separate database per course, each comprising about 50 tables. The MySQL DDL code of the database does not explicitly declare any foreign key. Indeed, the database makes use of the MyISAM storage engine, which does not support foreign key management. However, the Claroline developers community7 is aware of all the implicit constructs. We particularly focused on the main database of WebCampus, that actually hosts 35 implicit foreign keys (according to the WebCampus developpers). This case study is outstanding since the target of the reverse enginering process is (alledgely8) known, in such a way that we can rigorously evaluate the effectiveness of a given reverse engineering technique as a basis for the detection of implicit foreign keys. WebCampus can be considered as a representative application of the class of applications we target in this paper. First, both the size of the source code and the size of the database schema fully justify the use of automated foreign key detection techniques. Second, we observe a high level of dynamicity in the construction of the database 5 6 7 8

see http://webcampus.fundp.ac.be see http://www.claroline.net The second author is co-developper of WebCampus and a member of the Claroline community. As we will see below...

Database Semantics Recovery through Analysis of Dynamic SQL Statements

147

queries (by string concatenation), combined with a high concentration of database query execution commands in only a few source code files (data access modules). This makes such static analysis as dataflow analysis, pattern mining and program slicing inapplicable. Three other techniques can be considered, namely data analysis, schema analysis and dynamic analysis of SQL statements. Data analysis is generally used as a confirmation technique for other detection techniques like schema or program analysis. Indeed, when a foreign key is implicit, by definition, it is not declared in the DDL code of the DBMS. Consequently, there is no guarantee that the underlying referential constraint is respected by all the rows of the referencing table. We learned from our experience with large databases that a significant proportion of the data actually violate implicit integrity constraints like undeclared foreign keys [40]. Schema analysis constitutes, a priori, a promising technique for implicit foreign detection. It seems indeed reasonnable to assume that an implicit foreign key colum is likely to have a similar name to the primary key column it references. We can also assume that there is a high probability that both the source and target columns are of compatible types. Thus, column name and type similarity appear as adequate heuristics for schema-based implicit foreign key detection. Below, we will verify this by analyzing the WebCampus database schema. Dynamic analysis of SQL statements is presented above as a promising alternative technique to overcome the limitations of static program analysis in the presence of dynamic database query construction, as in such programming environments as PhP/MySQL or Java/JDBC. This case study is intended to support this claim. Concretely, the case study aims to show that one can detect the implicit foreign keys of the WebCampus database using dynamic SQL analysis. Furthermore, we will compare the results obtained using dynamic SQL analysis with respect to the results obtained by means of schema analysis. The experiment involved the following two steps: (1) collecting SQL execution traces corresponding to typical interaction scenarios within WebCampus; and (2) analyzing these traces in order to detect implicit foreign keys. 7.1 Trace Collection The SQL traces collected reports on 14 execution scenarios, which translate the most typical operations carried out by WebCampus users on a regular basis. Trace collection was carried out through tool-supported source code instrumentation by the second author. This proved straightforward since only a few source code modules are in charge of querying the WebCampus database (but those data access modules are called from almost all modules of the application). The output of the tracing process was stored in a MySQL database composed of two tables: (1) sql_trace, each row of which describes an executed SQL query; and (2) sql_trace_results, that contains information on the results of these queries. Table 2 provides size metrics about the trace obtained by indicating, for each execution scenario, the number and the nature of the corresponding queries and query results.

148

A. Cleve, J.-R. Meurisse, and J.-L. Hainaut

# of results from main DB

total # of results

# of insert on main DB # of delete on main DB # of update on main DB

# of select on main DB

# of queries on main DB

Execution scenario

total # of queries

Table 2. Some metrics about the SQL trace obtained, classified by execution scenario

register_user 27 27 24 3 0 0 163 163 add_course_manager 364 194 190 4 0 0 2 643 2 391 add_course_user 289 155 151 4 0 0 2 112 1 908 create_course 70 29 20 9 0 0 319 299 delete_course 329 132 123 1 7 0 1 865 1 700 delete_course_user 159 84 83 0 1 0 1 110 996 delete_dpt_attempt 37 37 37 0 0 0 423 419 install_applet 92 88 82 4 0 2 729 721 install_tool 4 894 2 169 2 039 126 4 0 26 002 24 180 uninstall_applet 82 78 68 0 9 1 581 573 uninstall_tool 3 713 1 896 1 888 0 8 0 23 333 22 419 user_register_to_course 64 64 63 1 0 0 721 708 user_register_to_webcampus 35 32 30 2 0 0 188 184 user_unregister_from_course 24 19 17 1 1 0 169 155 Total

10 179 5 004 4 815 155 30 3 60 358 56 816

7.2 Trace Analysis The goal of the trace analysis process was to find indications of undeclared foreign keys in the SQL execution traces, with a focus on joins and output-input dependencies. This process took the physical schema of the main WebCampus database (without any foreign key) as a basis to systematically analyze the contents of the tracing tables. This analysis process was supported by a dedicated trace analyzer, implemented as a Java plugin of DB-MAIN CASE environment [41]. This plugin takes as input (1) a relational database schema and (2) a set of SQL traces stored in a relational database in the format described above. The analyzer returns a set of potential implicit foreign keys together with the number of corresponding joins and output-input dependencies occuring in the input SQL traces. The analyzer behaves as follows. For each couple of non necessarily distinct tables (t1 ,t2 ) of the input schema, it generates and executes a set of SQL queries that analyze the SQL execution traces. The input schema is used to determine the primary key (or unique column) pk of table t2 , while any column of table t1 is considered by the analyzer as a potential foreign key f k. 7.3 Results In order to evaluate the recall of our dynamic analysis technique, we compared the results of the trace analysis process with the set of known implicit foreign keys of the main Webcampus schema. In order to better interpret this recall value, we first needed to evaluate the richness of the SQL traces we analyzed. The left part of Table 3 indicates, for each implicit foreign key f k from table t1 to table t2 , (1) the number of

Database Semantics Recovery through Analysis of Dynamic SQL Statements

149

queries referencing t1 , (2) the number of queries referencing t2 , and (3) the number of distinct scenarios where both t1 and t2 are accessed. From (3), we can derive that only 27 implicit foreign keys of the schema were potentially detectable in the SQL trace we obtained. Indeed, the minimal requirement for detecting an undeclared foreign key t1 → t2 in an SQL trace is that both t1 and t2 must be involved in at least one execution scenario considered. If this is the case, then the SQL trace obtained can contain indications of the foreign key. The right part of Table 3 summarizes the indications of implicit foreign key that have been found in the SQL trace. For each undeclared foreign key (t1 → t2 ), we provide (1) the number of SQL joins between t1 and t2 , and (2) the number of output-input dependencies between a query q1 accessing t2 and a subsequent query q2 accessing t1 , further subdivided according to the nature of q2 . From these statistics, we observe that we found evidence for 23 implicit foreign keys (those in light grey), which represents a recall of: – about 65% of the total number of implicit foreign keys in the database; – about 85% of the foreign keys identified as potentially detectable in the trace we obtained. Let us now discuss the precision that we reached in this experiment. We observed three unexpected foreign keys hypotheses, summarized in Table 4, that are all based on the presence of SQL joins. It turned out that two of them correspond to actual implicit foreign keys that were not part of the list initially provided by the WebCampus developpers. The third hypothesis is erroneous and therefore constitutes a noise in our experiment. Several joins are made between tables notify and course user based on their respective column user id. Both notify.user id and course.user id reference a third table user, as detected by our analyzer (see Table 3). This case actually corresponds to an instance of the connection trap pattern described above. In summary, our experiment allowed us to correctly detect 25 implicit foreign keys, which corresponds to a recall of 71% (25/37). Considering that only 29 implicit foreign keys were potentially detectable (covered by our execution scenarios), we would reach a recall of 86% (25/29). In terms of precision, only one hypothesis revealed to be erroneous (false-positive), which results in a precision of 96% (25/26). Such false foreign keys do not pose a critical problem in practice since they would most probably be invalidated by other techniques, such as schema and data analysis. 7.4 Comparison with Schema-Based Foreign Key Detection As a further evaluation, we have conducted a second experiment where we used schema analysis techniques to detect the 37 undeclared foreign keys of the WebCampus database. The goal of this second experiment is to compare, on a common case, the precision and recall that can be reached using schema analysis vs dynamic SQL analysis. The experiment was supported by a schema analysis plugin of DB-MAIN, that takes as input a relational database schema (with no foreign key) and returns as output a set of implicit foreign key suggestions. A foreign key from column a to column b is suggested if the following conditions hold:

150

A. Cleve, J.-R. Meurisse, and J.-L. Hainaut

Conventions : query(t) = query on table t; select(t) = select query on table t; insert(t) = insert query on table t; delete(t) = delete query on table t.

0 1927 1927 55 55 55 0 244 1840 1838 0 11 11 3 4 4 4 4 3 38 17 9 9 9 0 0 0 1 1 161 161 161 0 3 1

0 1840 51 1927 51 41 41 496 1840 1927 1927 41 5 1927 41 3 41 3 41 496 496 1927 41 452 1 41 0 0 1927 1927 51 178 41 41 41

# of output-input dependencies select(t2 ) → insert(t1 ) # of output-input dependencies select(t2 ) → delete(t1 )

# of output-input dependencies select(t2 ) → select(t1 )

# of output-input dependencies query(t2 ) → query(t1 )

# of joins between t1 and t2

# of queries accessing t2 # of execution scenarios involving both t1 and t2

Implicit foreign key (t1 → t2 ) class → class course → faculty course → right_profile course_user → course course_user → right_profile course_user → user desktop_portlet_data → user dock → module faculty → faculty course_addons → course course_program → course user_addons → user user_addons → program im_message → course im_message_status → user im_message_status → im_message im_recipient → user im_recipient → im_message log → user module_contexts → module module_info → module notify → course notify → user notify → course_tool property_definition → user_property rel_class_user → user rel_class_user → class rel_course_class → class rel_course_class → course right_rel_profile_action → course right_rel_profile_action → right_profile right_rel_profile_action → right_action sso → user tracking_event → user user_property → user

# of queries accessing t1 (q(t1 ))

Table 3. SQL trace analysis results (recall)

0 0 0 0 0 0 12 1 832 87 86 1 0 12 0 1 0 1 0 9 9 47 37 7 3 9 0 9 0 9 0 9 7 50 44 4 2 0 0 0 0 0 0 14 58 297 291 3 3 12 3 0 0 0 0 9 0 3 842 3 839 0 3 0 0 0 0 0 0 3 0 15 15 0 0 3 0 0 0 0 0 3 0 5 0 5 0 3 0 2 0 2 0 3 0 0 0 0 0 3 0 2 0 2 0 3 0 0 0 0 0 3 0 1 0 1 0 14 34 11 0 6 5 4 13 11 0 6 5 6 0 3 0 0 3 6 0 11 11 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3 0 0 3 6 0 22 19 0 3 7 0 711 91 600 20 7 32 810 0 810 0 0 0 0 0 0 0 2 0 2 0 2 0 1 1 0 0 0 0

Database Semantics Recovery through Analysis of Dynamic SQL Statements

151

Table 4. Unexpected foreign key hypotheses, due to SQL joins Unexpected foreign key (t1 → t2 )

# of joins Explanation

course_tool → module right_action → course_tool notify → course_user

1. 2. 3. 4.

372 actual implicit foreign key 18 actual implicit foreign key 6 t1 and t2 reference the same table t3

a and b are distinct; b is declared as primary key of its table; a and b are of the same type; a and b have similar names.

As far as name similarity is concerned, the tool relies on the Jaro-Winkler string proximity metrics [42]. If the proximity between two column names c1 , c2 is not less than a given treshold t, then c1 and c2 are considered similar. We run our schema analysis tool for 11 different treshold values between 0 and 1. For each threshold value, we measured: – – – – – –

D: the total number of detected foreign keys; T P : the number of true positives, i.e., correctly detected foreign keys ; F P : the number of false positives, i.e., wrongly detected foreign keys; F N : the number of false negatives, i.e., undetected foreign keys; P = T P/(T P + F P ): the precision of the schema-based detection technique; R = T P/(T P + F N ): the recall of the schema-based detection technique.

Table 5 summarizes the results obtained when using schema analysis as implicit foreign key detection technique for the WebCampus database. Figure 10 graphically shows the level of precision and recall reached for each name similarity treshold value considered. We observe that a high level of recall can be reached when the name similarity treshold is set to a low value. This is not surprising, since in this case almost all couple of column names are considered similar, thus almost all columns having the same type as a primary key column is considered as a foreign key. We can also observe that for 2 implicit foreign keys, the type of the foreign key column differs from the type of the target primary key column. The most important observation concerns the high proportion of false positives, whatever the treshold chosen. This makes the precision Table 5. Results of implicit foreign key detection using schema analysis Treshold # FK detected # True positives # False positives # False Negatives Recall (%) Precision (%) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

1329 1026 1026 1025 968 655 296 150 105 75 73

35 30 30 30 30 29 24 20 17 14 14

1294 996 996 995 938 626 272 130 88 61 59

2 7 7 7 7 8 13 17 20 23 23

94.6 81.1 81.1 81.1 81.1 78.4 64.9 54.1 45.9 37.8 37.8

2.6 2.9 2.9 2.9 3.1 4.4 8.1 13.3 16.2 18.7 19.2

152

A. Cleve, J.-R. Meurisse, and J.-L. Hainaut

Fig. 10. Precision and recall values of the schema analysis technique, depending on the name similarity treshold used

of the schema-based detection technique too low to be satisfying in this experiment, especially when compared to the precision of the dynamic analysis technique used in the above experiment. 7.5 Discussion The experiments presented above clearly validate our approach by confirming that (1) SQL execution traces contain essential information about implicit schema constraints, in particular implicit foreign keys; (2) dynamic analysis of SQL statements allows the detection of implicit foreign keys with a satisfying level of recall and precision and (3) dynamic analysis of SQL statements can lead to significantly more accurate results than other data reverse engineering techniques like schema analysis. More precisely, our results allow us to say that if there is an implicit foreign key, then SQL execution traces most probably contain indications of this foreign key if both involved tables are used by the execution scenario considered. As usual in dynamic program analysis, the main difficulty is to choose relevant execution scenarios based on the knowledge of the analyzed system. In our case, we considered a large set of scenarios in order to reach a good coverage of the database schema. However, the set of selected execution scenarios was not sufficient to reach a detection recall of 100%. This can be explained by several non-exclusive reasons. First, nothing guarantees that the WebCampus application actually exploits all the structures of the database. Second, we did not consider all possible execution scenarios of WebCampus. Third, it is likely that we did not cover all possible interesting execution paths of each execution scenario considered. In the context of this experiment, an execution path is considered interesting if it involves the execution of successive inter-dependent database queries accessing different tables. Precisely evaluating the coverage of execution scenarios and input data is a non-trivial problem that has been extensively studied [43]. Several mature techniques

Database Semantics Recovery through Analysis of Dynamic SQL Statements

153

have been proposed to support this evaluation in the particular case of data-intensive applications [44]. Last, even for each execution path we followed, it is obvious that we did not consider all possible combination of input data and database state. The techniques and tools that we used in our experiment can actually been extended and reused in other application contexts. For instance, while the initial goal of the SQL trace analysis tool we developed for the WebCampus case study is to detect indications of implicit foreign keys in SQL execution traces, a similar tool may also be used for confirming candidate implicit foreign keys elicited using other reverse engineering techniques including schema or user interface analysis. Another interesting application of SQL trace analysis concerns the identification of unsafe data access paths [40], i.e., program fragments where an implicit constraint is not correcly managed. For instance, based on the list of implicit foreign keys, one could detect in the SQL trace that an insert or an update statement is performed without prior verification of the referential constraints. In this case, the analysis would be based on the absence of output-input and input-input dependency under particular scenarios.

8 Related Work Most of the previous approaches on SQL statement analysis [18,27,21,29] rely on static program analysis techniques. Petit et al. [18] present a technique for extracting an Entity-Relationship schema from an operational relational database. The enrichment of the raw schema takes benefit from the analysis of the SQL queries available in the application programs. In particular, joins are seen as heuristics for the detection of implicit dependencies between the columns of distinct tables. Willmor et al. [27] propose an approach to program slicing in the presence of database states. In particular, they introduce two forms of data dependencies related to database queries. The first category, called program-database dependencies, accounts for interactions between program statements and database statements. The databasedatabase dependencies capture the situation in which the execution of a database statement affects the behaviour of another database statement. We also consider the latter kind of dependencies, but they are extracted from SQL execution traces rather than from the source code. van den Brink et al. [29] present a tool-supported method to quality assessment for SQL statements. The initial phase of the method consists in extracting the SQL statements from the source code using control and dataflow analysis techniques. Similarly, Ngo and Tan [30] make use of symbolic execution to extract database interaction points from web applications. Based on a case study, they showed that their method is able to extract about 80% of such interactions. Dynamic analysis of SQL statements has already been used by other authors, but for different purposes than database reverse engineering. Debusmann and Geihs [45] present an aspect-based method for the intrumentation of application components. This method is used in the context of runtime system monitoring. They measure, in particular, the response time of database queries. Yang et al. [46] make use of the aspect-based tracing method that we introduced in [34] for feature model recovery. Their experiments revealed that static analysis techniques would have been inapplicable in this context.

154

A. Cleve, J.-R. Meurisse, and J.-L. Hainaut

The WAFA approach [47], by Alalfi et al., is dedicated to program comprehension. It combines static and dynamic program analysis techniques for achieving a fine-grained analysis of database interactions in web applications. The key idea is to automatically recover the link between SQL execution instances and the original statement source. A similar method was proposed in previous work [34] in the case of Java systems.

9 Conclusions and Perspectives SQL statements appear to be a particularly rich source of information in program and data structure understanding and therefore in legacy database semantics recovery, a strict prerequisite to such essential processes as program and database maintenance, evolution and reengineering. Although various static analysis approaches have been proposed in the last two decades, the increasing use of dynamic SQL in business applications makes such techniques inapplicable, so that web software systems can no longer be correctly redocumented and understood. Dynamic analysis of SQL queries constitutes a promising alternative, although it is still a largely unexplored research and technical domain. The goal of this paper was to mark this engineering domain out by identifying and discussing its basic concepts and specific techniques. We first identified, illustrated and compared a set of techniques for capturing the SQL queries executed by a system at runtime. Then, we elaborated on the analysis of SQL traces in the context of database reverse engineering. We identified possible heuristics for the detection of implicit foreign key from SQL execution traces. Those heuristics combine both intra-query dependencies (SQL joins) and inter-query dependencies (input-input and output-input dependencies). A comprehensive experiment, based on a real-life web application, allowed us to establish the analysis of dynamic SQL execution traces as a very effective technique for relational database reverse engineering. It proved convincing enough to let ReveR (a university spin-off devoted to information system reverse engineering) include our techniques and tools to its database reverse engineeering tool set. In the near future, we plan to investigate the benefits of hybrid approaches to database reverse engineering, combining schema analysis, data analysis, static and dynamic program analysis techniques. We also aim at exploring the use of dynamic analysis of SQL statements in other application contexts than database reverse engineering, including quality assessment, program comprehension, intrusion detection and consistency management. Acknowledgments. This research was carried out during the tenure of an ERCIM “Alain Bensoussan” Fellowship by the first author. Partial support was also received from the IAP Programme of the Belgian State, Belgian Science Policy (MoVES project).

References 1. Casanova, M.A., De Sa, J.E.A.: Mapping uninterpreted schemes into entity-relationship diagrams: two applications to conceptual schema design. IBM J. Res. Dev. 28(1), 82–94 (1984) 2. Davis, K.H., Arora, A.K.: A methodology for translating a conventional file system into an entity-relationship model. In: Proceedings of the Fourth International Conference on EntityRelationship Approach, pp. 148–159. IEEE Computer Society, Washington, DC, USA (1985)

Database Semantics Recovery through Analysis of Dynamic SQL Statements

155

3. Navathe, S.B., Awong, A.M.: Abstracting relational and hierarchical data with a semantic data model. In: Proceedings of the Sixth International Conference on Entity-Relationship Approach (ER 1987), pp. 305–333. North-Holland Publishing Co., Amsterdam (1988) 4. Johannesson, P.: A method for transforming relational schemas into conceptual schemas. In: Proceedings of the Tenth International Conference on Data Engineering (ICDE 2004), pp. 190–201. IEEE Computer Society, Washington, DC, USA (1994) 5. Blaha, M.R., Premerlani, W.J.: Observed idiosyncracies of relational database designs. In: Proceedings of the Second Working Conference on Reverse Engineering (WCRE 1995), p. 116. IEEE Computer Society, Washington, DC, USA (1995) 6. Hainaut, J.L., Englebert, V., Henrard, J., Hick, J.M., Roland, D.: Database reverse engineering: From requirements to care tools. Automated Software Engineering 3, 9–45 (1996) 7. Davis, K.H., Aiken, P.H.: Data reverse engineering: A historical survey. In: Proceedings of the Seventh Working Conference on Reverse Engineering (WCRE 2000), p. 70. IEEE Computer Society, Washington, DC, USA (2000) 8. Hainaut, J.L.: Legacy and future of data reverse engineering. In: Proceedings of the 16th Working Conference on Reverse Engineering (WCRE 2009), p. 4. IEEE Computer Society, Los Alamitos (2009) 9. Hainaut, J.L., Henrard, J., Englebert, V., Roland, D., Hick, J.M.: Database reverse engineer¨ ing. In: Liu, L., Ozsu, M.T. (eds.) Encyclopedia of Database Systems, pp. 723–728. Springer, US (2009) 10. Hainaut, J.-L.: The transformational approach to database engineering. In: L¨ammel, R., Saraiva, J., Visser, J. (eds.) GTTSE 2005. LNCS, vol. 4143, pp. 95–143. Springer, Heidelberg (2006) 11. Hainaut, J.L.: Introduction to database reverse engineering. LIBD Publish (2002), http://www.info.fundp.ac.be/ dbm/publication/2002/ DBRE-2002.pdf 12. Markowitz, V.M., Makowsky, J.A.: Identifying extended entity-relationship object structures in relational schemas. IEEE Trans. Softw. Eng. 16, 777–790 (1990) 13. Premerlani, W.J., Blaha, M.R.: An approach for reverse engineering of relational databases. Commun. ACM 37(5), 42 (1994) 14. Chiang, R.H.L., Barron, T.M., Storey, V.C.: Reverse engineering of relational databases: extraction of an eer model from a relational database. Data Knowl. Eng. 12(2), 107–142 (1994) 15. Lopes, S., Petit, J.M., Toumani, F.: Discovering interesting inclusion dependencies: application to logical database tuning. Inf. Syst. 27, 1–19 (2002) 16. Yao, H., Hamilton, H.J.: Mining functional dependencies from data. Data Min. Knowl. Discov. 16(2), 197–219 (2008) 17. Pannurat, N., Kerdprasop, N., Kerdprasop, K.: Database reverse engineering based on association rule mining. CoRR abs/1004.3272 (2010) 18. Petit, J.M., Kouloumdjian, J., Boulicaut, J.F., Toumani, F.: Using queries to improve database reverse engineering. In: Loucopoulos, P. (ed.) ER 1994. LNCS, vol. 881, pp. 369–386. Springer, Heidelberg (1994) 19. Di Lucca, G.A., Fasolino, A.R., de Carlini, U.: Recovering class diagrams from dataintensive legacy systems. In: Proceedings of the 16th IEEE International Conference on Software Maintenance (ICSM 2000), p. 52. IEEE Computer Society, Los Alamitos (2000) 20. Henrard, J.: Program Understanding in Database Reverse Engineering. PhD thesis, University of Namur (2003) 21. Cleve, A., Henrard, J., Hainaut, J.L.: Data reverse engineering using system dependency graphs. In: Proceedings of the 13th Working Conference on Reverse Engineering (WCRE 2006), pp. 157–166. IEEE Computer Society, Washington, DC, USA (2006) 22. Choobineh, J., Mannino, M.V., Tseng, V.P.: A form-based approach for database analysis and design. Communications of the ACM 35(2), 108–120 (1992)

156

A. Cleve, J.-R. Meurisse, and J.-L. Hainaut

23. Terwilliger, J.F., Delcambre, L.M.L., Logan, J.: The user interface is the conceptual model. In: Embley, D.W., Oliv´e, A., Ram, S. (eds.) ER 2006. LNCS, vol. 4215, pp. 424–436. Springer, Heidelberg (2006) 24. Ramdoyal, R., Cleve, A., Hainaut, J.-L.: Reverse engineering user interfaces for interactive database conceptual analysis. In: Pernici, B. (ed.) CAiSE 2010. LNCS, vol. 6051, pp. 332–347. Springer, Heidelberg (2010) 25. Andersson, M.: Searching for semantics in cobol legacy applications. In: Data Mining and Reverse Engineering: Searching for Semantics, IFIP TC2/WG2.6 Seventh Conference on Database Semantics (DS-7). IFIP Conference Proceedings, vol. 124, pp. 162–183. Chapman & Hall, Boca Raton (1998) 26. Embury, S.M., Shao, J.: Assisting the comprehension of legacy transactions. In: Proceedings of the 8th Working Conference on Reverse Engineering (WCRE 2001), p. 345. IEEE Computer Society, Washington, DC, USA (2001) 27. Willmor, D., Embury, S.M., Shao, J.: Program slicing in the presence of a database state. In: ICSM 2004: Proceedings of the 20th IEEE International Conference on Software Maintenance, pp. 448–452. IEEE Computer Society, Washington, DC, USA (2004) 28. Maule, A., Emmerich, W., Rosenblum, D.S.: Impact analysis of database schema changes. In: Proceedings of the 30th international conference on Software engineering (ICSE 2008), pp. 451–460. ACM Press, New York (2008) 29. van den Brink, H., van der Leek, R., Visser, J.: Quality assessment for embedded sql. In: Proceedings of the 7th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM 2007), pp. 163–170. IEEE Computer Society, Los Alamitos (2007) 30. Ngo, M.N., Tan, H.B.K.: Applying static analysis for automated extraction of database interactions in web applications. Inf. Softw. Technol. 50(3), 160–175 (2008) 31. Cleve, A.: Program Analysis and Transformation for Data-Intensive System Evolution. PhD thesis, University of Namur (October 2009) 32. Kiczales, G., Hilsdale, E.: Aspect-oriented programming. In: ESEC/FSE-9: Proceedings of the 8th European software engineering conference held jointly with 9th ACM SIGSOFT international symposium on Foundations of software engineering, p. 313. ACM, New York (2001) 33. Kiczales, G., Hilsdale, E., Hugunin, J., Kersten, M., Palm, J., Griswold, W.G.: An overview of aspectj. In: Lee, S.H. (ed.) ECOOP 2001. LNCS, vol. 2072, pp. 327–353. Springer, Heidelberg (2001) 34. Cleve, A., Hainaut, J.L.: Dynamic analysis of sql statements for data-intensive applications reverse engineering. In: Proceedings of the 15th Working Conference on Reverse Engineering, pp. 192–196. IEEE Computer Society, Los Alamitos (2008) 35. Petit, J.M., Toumani, F., Kouloumdjian, J.: Relational database reverse engineering: A method based on query analysis. Int. J. Cooperative Inf. Syst. 4, 287–316 (1995) 36. Lopes, S., Petit, J.M., Toumani, F.: Discovery of “Interesting” data dependencies from a ˙ workload of SQL statements. In: Zytkow, J.M., Rauch, J. (eds.) PKDD 1999. LNCS (LNAI), vol. 1704, pp. 430–435. Springer, Heidelberg (1999) 37. Tan, H.B.K., Ling, T.W., Goh, C.H.: Exploring into programs for the recovery of data dependencies designed. IEEE Trans. Knowl. Data Eng. 14(4), 825–835 (2002) 38. Tan, H.B.K., Zhao, Y.: Automated elicitation of inclusion dependencies from the source code for database transactions. Journal of Software Maintenance 15(6), 379–392 (2003) 39. Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970) 40. Cleve, A., Lemaitre, J., Hainaut, J.L., Mouchet, C., Henrard, J.: The role of implicit schema constructs in data quality. In: Proceedings of the 6th International Workshop on Quality in Databases (QDB 2008), pp. 33–40 (2008)

Database Semantics Recovery through Analysis of Dynamic SQL Statements

157

41. DB-MAIN: The DB-MAIN official website (2011), http://www.db-main.be 42. Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. In: Proceedings of IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-2003), pp. 73–78 (2003) 43. Zhu, H., Hall, P.A.V., May, J.H.R.: Software unit test coverage and adequacy. ACM Comput. Surv. 29, 366–427 (1997) 44. Kapfhammer, G.M., Soffa, M.L.: A family of test adequacy criteria for database-driven applications. In: Proceedings of the 9th European software engineering conference held jointly with 11th ACM SIGSOFT international symposium on Foundations of software engineering. ESEC/FSE-11, pp. 98–107. ACM, New York (2003) 45. Debusmann, M., Geihs, K.: Efficient and transparent instrumentation of application components using an aspect-oriented approach. In: Brunner, M., Keller, A. (eds.) DSOM 2003. LNCS, vol. 2867, pp. 209–220. Springer, Heidelberg (2003) 46. Yang, Y., Peng, X., Zhao, W.: Domain feature model recovery from multiple applications using data access semantics and formal concept analysis. In: Proceedings of the 16th International Working Conference on Reverse Engineering (WCRE 2009), pp. 215–224. IEEE Computer Society, Los Alamitos (2009) 47. Alalfi, M., Cordy, J., Dean, T.: WAFA: Fine-grained dynamic analysis of web applications. In: Proceedings of the 11th International Symposium on Web Systems Evolution (WSE 2009), pp. 41–50. IEEE Computer Society, Los Alamitos (2009)

Ontology Alignment Evaluation Initiative: Six Years of Experience Jérôme Euzenat1, Christian Meilicke2 , Heiner Stuckenschmidt2, Pavel Shvaiko3, and Cássia Trojahn1 1

INRIA & LIG, Grenoble, France {jerome.euzenat,cassia.trojahn}@inria.fr 2 University of Mannheim, Germany {christian,heiner}@informatik.uni-mannheim.de 3 Informatica Trentina S.p.A., Trento, Italy [email protected] Abstract. In the area of semantic technologies, benchmarking and systematic evaluation is not yet as established as in other areas of computer science, e.g., information retrieval. In spite of successful attempts, more effort and experience are required in order to achieve such a level of maturity. In this paper, we report results and lessons learned from the Ontology Alignment Evaluation Initiative (OAEI), a benchmarking initiative for ontology matching. The goal of this work is twofold: on the one hand, we document the state of the art in evaluating ontology matching methods and provide potential participants of the initiative with a better understanding of the design and the underlying principles of the OAEI campaigns. On the other hand, we report experiences gained in this particular area of semantic technologies to potential developers of benchmarking for other kinds of systems. For this purpose, we describe the evaluation design used in the OAEI campaigns in terms of datasets, evaluation criteria and workflows, provide a global view on the results of the campaigns carried out from 2005 to 2010 and discuss upcoming trends, both specific to ontology matching and generally relevant for the evaluation of semantic technologies. Finally, we argue that there is a need for a further automation of benchmarking to shorten the feedback cycle for tool developers. Keywords: Evaluation, experimentation, benchmarking, ontology matching, ontology alignment, schema matching, semantic technologies.

1 Introduction The past ten years have witnessed impressive development in the area of semantic technologies, mostly driven by the idea of creating a semantic web [4] as a source of information that is accessible by machines. This development has been enabled by the standardization of representation languages for knowledge on the web, in particular RDF and OWL. Based on these languages, many tools have been developed to perform various tasks on the semantic web, such as searching, querying, integrating and reasoning about semi-structured information. Standards were an important factor for the development of software tools supporting semantic web applications. However, a crucial step in their large scale adoption in real world applications will be the ability to S. Spaccapietra (Ed.): Journal on Data Semantics XV, LNCS 6720, pp. 158–192, 2011. c Springer-Verlag Berlin Heidelberg 2011

Ontology Alignment Evaluation Initiative: Six Years of Experience

159

determine the quality of a system in terms of its expected performance on realistic data. This means that systematic evaluation of semantic technologies is an important topic. A major and long term goal of evaluation is to help developers of such systems to improve them and to help users evaluating the suitability of the proposed systems to their needs. The evaluation should thus be run over several years in order to allow for adequate measurement of the evolution of the field. Evaluation should also help assessing absolute results, i.e., what are the properties achieved by a system, and relative results, i.e., how these results compare to the results of other systems. One particular kind of evaluation is benchmarking. A benchmark is a well-defined set of tests on which the results of a system or a subsystem can be measured [9]. It should enable to measure the degree of achievement of proposed tasks on a well-defined scale (that can be achieved or not). It should be reproducible and stable, so that it can be used repeatedly for: (i) testing the improvement or degradation of a system with certainty and (ii) situating a system among others. A medium term goal for evaluation efforts is to set up a collection of reference sets of tests, or benchmark suites for assessing the strengths and weaknesses of the available tools and to compare their evolution with regard to these references. Building benchmark suites is valuable not just for groups of people who participate in planned evaluations but for all the community, since system designers can make use of them at any time and compare their results with those of the other systems. In this paper, we focus on the Ontology Alignment Evaluation Initiative (OAEI)1 which carries out annual campaigns for the evaluation of ontology matching tools. Ontology matching is an important functionality in many applications as it is the basis for linking information, e.g., from heterogeneous sources into a common model that can be queried and reasoned upon. Initially, the focus of OAEI was on the task of matching different ontologies rather than on the data itself. More recently, however, the focus is being extended to include data matching algorithms as well. The main goal of OAEI is to compare systems and algorithms on the same basis and to allow anyone for drawing conclusions about the best matching strategies. The OAEI ambition is that from such evaluations, tool developers can learn and improve their systems, thus extending the state of the art in ontology matching. The goal of this paper is to present the state of the art in evaluating ontology matching. For this purpose, we draw lessons from the six first years of carrying out OAEI focusing on trends we have observed and implications for the further improvement of the OAEI campaigns and the evaluation of semantic technologies in general. Annual OAEI reports [28; 26; 25; 8; 23; 24] present the individual datasets and the results of the different campaigns in detail. In this paper, we take a global view on outcomes of the evaluation campaigns over the years and identify interesting developments, fundamental decisions as well as solved and open problems. Thus, the contributions of the paper are: – A comprehensive overview of the six years of ontology matching benchmarking in the context of the OAEI initiative accompanied with a rationale for the choice of the datasets used; 1

http://oaei.ontologymatching.org/

160

J. Euzenat et al.

– The identification and discussion of problems in designing experiments for evaluating matching technologies; – An analysis of the development of the field of ontology matching on the basis of the results obtained in the different evaluation campaigns; – Current trends and future challenges of ontology matching evaluation based on our observations and experiences from the OAEI campaigns. In a nutshell, the lessons learned from the evaluation campaigns can be summarized as follows: – Systematic ontology matching evaluation indeed allows for measuring the progress of the field in terms of participation to the evaluation campaigns, quality of the matching results and runtime performance; – It is necessary to be reactive to propose improvements in data sets and evaluation modalities in order to keep or increase the interest in the field; – Automation is prone to improve the situation on many fronts of ontology matching evaluation, including scalability, variability, and hardness of tests. The remainder of the paper is structured as follows. In Section 2, we provide an overview of the related work. In Section 3, we introduce the ontology matching problem. Section 4 addresses the problem of designing evaluations for the ontology matching problem and provides some guidelines for the design of future evaluations. Results of the different evaluation campaigns are discussed in Section 5. We first provide background on OAEI, its organization and its development over the years. Then we focus on the progress that has been achieved and how it was measured. In Sections 6 and 7, we summarize our experiences and discuss implications for future evaluation campaigns.

2 Related Work on Evaluations Currently, the systematic evaluation of semantic technologies in general still falls behind other fields, such as theorem proving and information retrieval, where benchmarking against standardized datasets is a common practice. Standardized evaluations also provide the basis for a fair comparison of systems according to scientific standards and make it harder to tune results in favor of one or another system. Evaluation initiatives like TPTP (Thousand Problems in Theorem Proving) or TREC (Text Retrieval Conference) that have been carried out on a regular basis for many years have shown that besides the practical benefits of supporting the uptake of technology, systematic and continuous evaluations also lead to a continuous improvement of the field because fundamental problems are better understood and can be addressed more efficiently due to the direct feedback from the frequent evaluation campaigns. OAEI, presented in this paper, took inspiration from TREC. Indeed, ontology matching is closer to information retrieval than to theorem proving or standard conformance, since there are, in general, no algorithms for providing the solution to the problem to be solved. Thus, establishing an evaluation in such a setting is less direct. For what concerns ontology matching evalutation most of the available works conveged towards contributing to the OAEI campaigns. Thus, below, we discuss the related

Ontology Alignment Evaluation Initiative: Six Years of Experience

161

work on evaluation only in two relevant areas, namely semantic technologies in general and specifically database schema matching. Evaluation of semantic technologies. While systematic evaluation of semantic technologies is not yet as established as in related areas, such as databases or information retrieval, several initiatives started to investigate this problem by focussing on different types of methods and tools. For example, early efforts have considered the evaluation of semantic web systems with respect to their ability of exchanging semantic data without loss of information [63]. Although in theory, interoperability should be granted by the use of standardized languages, such as RDF and OWL, evaluations have shown that this is not always the case. As a response to this problem, interoperability benchmarks for semantic web tools were defined and implemented for testing existing implementations [29]. So far, interoperability has mostly been tested for ontology development tools. More recent efforts also included the evaluation of APIs for ontology management and API-based interfaces [43]. The efficiency of accessing semantic data is another subject of existing evaluation efforts that stands in the tradition of database systems benchmarking, where the main focus has always been on efficiency. To this end, a number of benchmark datasets for evaluating the performance of RDF databases was defined in terms of generators that can be used to generate arbitrarily large RDF datasets based on a predefined schema [33; 6; 55]. The corresponding experiments typically focus on upload and query execution times. Compared to the existing benchmarking activities in the database area, a special characteristic of semantic data access is the need to perform logical reasoning for answering queries. This means that besides the efficiency, completeness and correctness of the underlying reasoning procedures are of a major importance and were also considered in the respective benchmarks, see e.g., [33; 44]. More recently, algorithms for generating test data that allows for measuring completeness of a reasoning system independent of a certain schema were investigated as well [61]. Another aspect of semantic technologies that was the subject of evaluation activities is the ability to find and combine relevant information in a useful way. Here, the main criterion is the quality of the resulting information. This task comes in different forms, depending on the kind of information that is concerned. While the use of semantic technologies for enhancing classical information retrieval tasks has not been the subject of systematic evaluation, there is some work from the area of web service discovery and composition, see, e.g., [66]. In particular, the task of selecting appropriate web services based on a user request and semantic annotations was investigated in detail and a comprehensive benchmarking suite is available [41]. Other benchmarking activities are concerned with the integration of different web services into a coherent workflow, although based on a qualitative evaluation rather than concrete quality measures [51]. Different communities have recognized the benefits of providing an automatic evaluation framework where system developers can test their tools against a predefined set of benchmark datasets and receive an evaluation result online. Examples are the SMT-Exec initiative2 for satistisfiability testing and the S3 contest for web service matching3. The Ontology Alignment Evaluation Initiative described in this paper is a related activity 2 3

http://www.smtexec.org http://www-ags.dfki.uni-sb.de/~klusch/s3/index.html

162

J. Euzenat et al.

in the context of evaluating semantic technologies for finding and combining relevant information that focusses on the task of matching between knowledge models. It thus supplements, or has inspired, the activities mentioned above by focussing on a different technology. Evaluation of schema matching. Untill recently there were no comparative evaluations and it was quite difficult to find two database schema matching systems evaluated on the same dataset. For example, an early evaluation effort of [16] focused mostly on comparison criteria from four areas, such as input (test cases), output (match results), quality measures (precision, recall, f-measure, overall) and savings of manual efforts (pre-match, post-match). It also provided a summary on several matching tools using those criteria. However, even at present in the database community there are no well-established benchmarks for comparing schema matching tools. Instead, the activities were somewhat fragmented, such as those of Cupid [45] and iMAP [15]. Several later works built up on the past results in terms of using the same datasets and quality measures for evaluations, such as COMA++ [3], S-Match [31], SMB [47] and YAM [19] to name a few. In turn, the work on STBenchmark [2; 1] focused on evaluation of mappings, namely on the transformation from source instances into target instances, what finds its parallels with the instance matching track of OAEI. The closest to OAEI works on benchmarking of database schema matching systems are those of [16] and more recently of XBenchMatch [18; 17]; though these initiatives have not led to well-established recurrent evaluation campaigns.

3 Ontology Matching Designing and running evaluation campaigns for a certain kind of tools require a solid understanding of the problem the respective tools try to solve. There have been different formalizations of the matching process and the results generated by this process [5; 42; 38; 59; 70]. We follow the framework presented in [27]. In order to illustrate the matching problem, let us consider two simple ontologies depicted in Figure 3. These ontologies contain subsumption statements, property specifications and instance descriptions. On an abstract level, ontology matching is the task of finding correspondences between ontologies. Correspondences express relationships supposed to hold between entities in ontologies, for instance, that a SubjectArea in one ontology is the same as a Topic in another one or that Regular author in an ontology is a subclass of Author in another one. In the example above, one of the correspondences expresses an equivalence, while the other one is a subsumption correspondence. In a further step, one may generate query expressions that automatically translate instances of these ontologies under an integrated ontology. Matching is the process that determines an alignment A for a pair of ontologies o and o . There are some other parameters that can extend the definition of the matching process, namely: (i) the use of an input alignment A, which is to be completed by the process; (ii) the matching parameters, for instance, weights and thresholds; and (iii) external resources used by the matching process, for instance, common knowledge and domain specific thesauri.

Ontology Alignment Evaluation Initiative: Six Years of Experience

163

o

o ≡

Person

Human

string

≡

email

{male,female}

Chairman

≡

has email has gender

Committee member

ConferenceMember Chair Author Reviewer

Conference contributor

Active participant Paper

assignedTo Ontology matching

SubjectArea

Regular author

≡ Topic

Fig. 1. Two simple ontologies. Classes are shown in rectangles with rounded corners, e.g., in o, Chairman being a specialization (subclass) of Person, while relations are shown without the latter, such as email being an attribute (defined on a domain string) and assignTo being a property. Ontology matching is a shared instance. Correspondences are shown as arrows that connect an entity from o with an entity from o . They are annotated with the relation that is expressed by the correspondence.

Each of the elements featured in this definition can have specific characteristics which influence the difficulty of the matching task. It is thus necessary to know and control these characteristics (called dimensions because they define a space of possible tests). The purpose of the dimensions is the definition of the parameters and characteristics of the expected behavior in a benchmark experiment. As depicted in Figure 2, the matching process receives as input three main parameters: the two ontologies to be matched (o and o ) and, eventually, an input alignment (A). The input ontologies can be characterized by the input languages they are described (e.g., OWL-Lite, OWL-DL, OWL-Full), their size (number of concepts, properties and instances) and complexity, which indicates how deep is the hierarchy structured and how dense is the interconnection between the ontological entities. Other properties, such as consistency, correctness and completeness are also used for characterizing the input ontologies. The input alignment (A) is mainly characterized by its multiplicity (or cardinality, e.g., how many entities of one ontology can correspond to one entity of another one) and coverage in relation to the ontologies to be matched. In a simple scenario, which is the case for most of the OAEI test cases, the input alignment is empty. Regarding the parameters, some systems take advantage of external resources, such as WordNet, sets of morphological rules or previous alignments among general purpose resources, e.g., Yahoo and Google directories.

164

J. Euzenat et al.

o

parameters

A

matching

o

resources

A

Fig. 2. The ontology matching process (from [27])

The output alignment A is a set of correspondences between o and o : Definition 1 (Correspondence). Given two ontologies, o and o , a correspondence is a quintuple: id, e, e , r, n, such that: – id is an identifier of the given correspondence; – e and e are entities, e.g., classes and properties of the first and the second ontology, respectively; – r is a relation, e.g., equivalence (≡), more general (), disjointness (⊥), holding between e and e ; – n is a confidence measure (typically in the [0, 1] range) holding for the correspondence between e and e . Alignments are sets of correspondences between entities belonging to the matched ontologies. The correspondence id, e, e , r, n asserts that the relation r holds between the ontology entities e and e with confidence n. The higher the confidence, the higher the likelihood that the relation holds. For example, an alignment A, which contains only equivalence correspondences, is a 1:1 alignment, if for all id1 , e1 , e1 , r1 , n1 ∈ A there exists no id2 , e2 , e2 , r2 , n2 ∈ A with (e1 = e2 ∧ e1 = e2 ) ∨ (e1 = e2 ∧ e1 = e2 ). For example, in Figure 3 according to some matching algorithm based on linguistic and structure analysis, the confidence measure between entities with labels Chairman in o and Chair in o is 0.75. Suppose that this matching algorithm uses a threshold of 0.55 for determining the resulting alignment, i.e., the algorithm considers all pairs of entities with a confidence measure higher than 0.55 as correct correspondences. Thus, our hypothetical matching algorithm should return to the user the following correspondence id2,4 , Chairman, Chair, , 0.75. Different approaches to the problem of ontology matching have emerged from the literature [27]. The main distinction among them is due to the type of knowledge encoded within each ontology, and the way it is utilized when identifying correspondences between features or structures within the ontologies. Terminological methods lexically compare strings (tokens or n-grams) used in naming entities (or in the labels and comments concerning entities), whereas semantic methods utilize model-theoretic semantics to determine whether or not a correspondence exists between two entities. Some approaches may consider the internal ontological structure, such as the range of the

Ontology Alignment Evaluation Initiative: Six Years of Experience

165

properties (attributes and relations), cardinality, and the transitivity and/or symmetry of the properties, or alternatively the external ontological structure, such as the position of the two entities within the ontological hierarchy. The instances (or extensions) of classes could also be compared using extension-based approaches (e.g., based on frequency distributions). In addition, many ontology matching systems rely not on a single matching method (matcher), but combine several matchers.

4 Evaluation Design The design of the evaluations is at the heart of an evaluation campaign, and the design of a good evaluation is a task that should not be underestimated. Setting new challenges for participants in terms of well-designed tests requires a good understanding of the problem domain, in our case ontology matching. In fact the evaluation initiative only really took off after a theoretical framework for ontology alignment was developed within the KnowledgeWeb network of excellence [7]. Over the years, the theoretical understanding of the problem has been further improved and led to the development of further datasets. Designing an evaluation is difficult, because it has to balance several partially conflicting desiderata: D1: The evaluation criteria and tests should cover all relevant aspects of the problem and the results of an evaluation should provide a good estimation of the expected performance of the tested system in a real application. D2: The evaluation has to be fair in the sense that it does not favor a certain approach or systems that make a certain assumption on the nature of the data or the result. D3: The results have to be informative in the sense that they allow the developers of the tested system as well as potential users to learn about the strengths and the weaknesses of a tool and also to decide which tool shows a better performance. D4: The evaluation should allow for quick feedback cycles to foster advances of the state of the art. This requires that the effort of conducting the campaign should not be too high neither for the participants nor for the organizers. In the development of the Ontology Alignment Evaluation Initiative we have worked with these desiderata and came up with different methods for improving the evaluations to better meet them. These and further necessary developments are discussed in this section. We start with a basic evaluation design and then discuss its variations. Figure 3 shows a basic evaluation process for ontology matching tools. The main component in this process is the matching component, which represents the system to be evaluated. The system takes two ontologies as input and generates an alignment. The second component is an evaluation script (evaluator) that takes the produced alignment and compares it with a reference alignment representing the expected outcome of the matching process. The evaluator compares the two alignments and computes a measure of the quality of the alignment produced by the matching component. This basic process is simplistic and has to be concretized in many respects. First of all, the input data in terms of the ontologies to be matched has to be defined. No single pair of ontologies can test all aspects of ontology matching. We also experienced

166

J. Euzenat et al.

R o

parameters matching

o

evaluator

m

A

resources

Fig. 3. Basic evaluation design: a matcher receives two ontologies o and o as input and generates an alignment A using a certain set of resources and parameters. An evaluation component receives this alignment and computes a (set of) quality measure(s) m – typically precision and recall – by comparing it to the reference alignment R.

that there is a need for different types of datasets: for systematic evaluations and for competitive evaluations. Another insight gained was that standard quality measures, in particular precision and recall, are not always suited for the purpose of ontology matching as they fail to completely capture the semantics of ontology alignments and different measures are needed for evaluating different aspects. Finally, we found out that more complex approaches are sometimes needed in certain situations, for instance, if a partial alignment exists or if no reference alignment is available. It is possible to use external resources as long as they have not been tuned to the current evaluation experiment (for instance, using a sub-lexicon, which is dedicated to the domain considered by the tests). It is acceptable that the algorithm prunes or adapts these resources to the actual ontologies as long as this is in the normal process of the algorithm. Moreover, some parameters can be provided to the methods participating in an evaluation. However, these parameters must be the same for all the tests. It can be the case that some methods are able to tune their parameters depending on the presented ontologies. In such a case, the tuning process is considered to be part of the method. In the following, we elaborate these insights with respect to datasets, quality measures and evaluation processes used in the context of OAEI. Specifically, in §4.1, we discuss properties of ontologies and alignments that determine the hardness of a test. The datasets used in the OAEI initiative are presented in §4.2. In turn, §4.3 discusses evaluation measures and processes that were developed and used in OAEI. Finally, typical evaluation processes are discussed in §4.4. 4.1 Dataset Characteristics Good datasets are a prerequisite for a good evaluation. The nature of the datasets determines how far the evaluation design meets our first two desiderata: the coverage of relevant aspects and the fairness of the evaluation. In the case of ontology matching, a dataset typically consists of at least two ontologies and a reference alignment between these ontologies. In the following, we call the combination of exactly two ontologies and, if present, a reference alignment between these ontologies a test. A dataset consists

Ontology Alignment Evaluation Initiative: Six Years of Experience

167

of several tests. If not defined otherwise, we assume that each combination of ontologies plus the respective reference alignment is a test in the dataset. The work in [30] proposed the following criteria for designing or selecting datasets for ontology matching evaluation: – Complexity, i.e., that the dataset is hard for state of the art matching systems. – Discrimination ability, i.e., that the dataset can discriminate sufficiently among various matching approaches. – Incrementality, i.e., that the dataset allows for incrementally discovering the weaknesses of the tested systems. – Monotonicity, i.e., that the matching quality measures calculated on subsets of gradually increasing size converge to the values obtained on the whole dataset. – Correctness, i.e., that a reference alignment is available for the dataset, which allows to divide generated correspondences into correct and incorrect ones. There are two basic properties that determine the nature of a dataset, and thus, how well it meets the quality criteria mentioned above: the properties of the ontologies to be matched and the properties of the reference alignment, that are expected to be reproduced by the matching systems. Ontologies. There are two major aspects of an ontology that have an influence on the matching process: the complexity of labels used to describe classes, relations and instances in the ontology, that has an influence on the initial determination of candidate correspondences, and the complexity of the structures used to define these elements that is often used to improve and validate the initial hypotheses. Complexity of labels. Many matching systems use a combination of heuristics for comparing the labels of entities in ontologies in order to compute correspondences between these entities. Hence, the kind of labels found in an ontology influences heavily the performance of a particular matching system. Specifically, we distinguish between simple labels vs. sentence-like labels, monolingual vs. multilingual labels. It also often makes a large difference whether labels used in an ontology can be anchored to common background knowledge sources, such as WordNet, that helps interpreting those labels. Further complexity is added if the ontologies to be matched use specific vocabularies, e.g., from the biomedical or geo-spatial applications, that are outside common language. Complexity of structures. Almost all matching systems use the structure of definitions in the ontologies to be matched in the later stages of the matching process to propagate similarity estimations and to validate hypotheses on correct correspondences. Therefore, structures found in ontologies are also an important issue in the design of benchmark datasets. Fortunately, the standardization of the semantic web languages RDF and OWL provide a common syntax for comparing ontologies, but still the way and intensity this common syntax is used varies a lot. Directories and thesauri only use the hierarchical structure given by subsumption, while more expressive ontologies use relations between classes that may be constrained by various kinds of axioms. This additional knowledge can be used by matchers for matching as well as for checking the coherence of their alignments [48].

168

J. Euzenat et al.

On the level of instances, we can also have different levels of complexity. In particular, instances can either be described in detail using attributes and relations to other instances or can be atomic entities with no further explicit definitions or property specifications. Often instances represent links to external sources, e.g., web pages or images, that can be used as a basis for matching. In this case, the nature of the external resource can also make a significant difference. For example, web pages often provide a good basis for extracting additional information about the described object that makes matching easier, an image is harder to interpret and to compare with other resources. Reference alignments. A reference alignment is another important aspect to consider: characteristics, such as the types of semantic relations used in the alignment or the coverage of the alignment, have a significant impact not only on the hardness of the task but also puts restrictions on evaluation measures that are discussed later. Types of semantic relations. As mentioned in §3, an alignment consists of a set of correspondences defined by elements from the two ontologies and a semantic relation between them. The kind of semantic relations found in the reference alignment also determine what kind of relations the matching systems should be able to produce. The most commonly used relation is equivalence of elements (in most cases classes and relations). The majority of available matching systems are designed to generate equivalence statements. There are exceptions to this rule, however, that should be taken into account. Other kinds of relations that were investigated are subclass [67; 32] and disjointness relations [54; 32]. Formal properties of the alignment. Besides the type of a relation, its semantics is another relevant aspect. In particular, we have to distinguish between more and less rigorous interpretations of relations. The equivalence relation, for example, can be interpreted as logical equivalence or more informally as a high level of similarity or exchangeability. Using a rigorous formal interpretation of the semantic relations has the advantage that we can enforce formal properties on the reference alignment. For example, we can claim that the merged model consisting of the two ontologies and the alignment should be coherent, i.e., it should not contain unsatisfiable classes. Enforcing such consistency conditions is not possible for less formal interpretations. Cardinality and coverage. A less obvious property with a significant influence on the evaluation results is the cardinality of the reference alignment. In principle, there is no restriction on the alignment, so the relation between elements from the different ontologies can be an n-to-m relation. In practice, however, it turns out that the alignment relation is one-to-one in most cases. Therefore, matching systems often generate one-to-one alignments. Along the same lines, the degree of overlap between the ontologies to be matched is not restricted and a dataset could consist of two ontologies with little or no overlap. Typically, however, it is assumed that the two ontologies to be matched describe the same domain. As a consequence, matching systems normally try to find a correspondence for every element in the two ontologies rather than ignoring elements.

Ontology Alignment Evaluation Initiative: Six Years of Experience

169

4.2 OAEI Datasets From 2005 on, different datasets have been used in the OAEI evaluation campaigns. The aim of using these different sets is to cover as much as possible the relevant aspects of the matching problem, i.e., the desideratum D1 discussed above. Initially, the goal of the initiative was to achieve this coverage within a single dataset, the benchmark dataset. The benchmark dataset deals with the topic of scientific publications. It consists of a large set of artificial tests. These tests alter an initial ontology and the task is to match it to the modified ontology. Modifications concern both the element labels, e.g., replacing them by random labels, and the structure, e.g., deleting or inserting classes in the hierarchy. In addition, the dataset comprises four other real ontologies that have to be matched to the reference ontology. Details about the different tests can be found on the OAEI website4 . The declared goal of the benchmark dataset is the analysis of matching systems to identify their strengths and weaknesses with respect to the absence or the presence of certain structures in the ontologies to be matched. While the benchmark dataset serves this purpose quite well, it turned out to be less useful for other purposes. In particular, the benchmark dataset is not really suited for comparing the overall performance of systems. Obviously, comparing the performance of systems on the artificial tests is not useful for assessing system behavior in reality as each of the tests focuses on a specific situation that is not likely to occur in practice and the tests did not reflect any realistic situation. In consequence, we recognized that we needed other, more realistic tests to actually compare the performance of matching systems in realistic situations and that the benchmark dataset is not a suitable means for assessing matcher behavior on real tasks. However, it can be still used as an immediate first check-up of the newly proposed system in terms of its weaknesses, strengths and its presumable position with respect to the state of the art. Based on these experiences, the benchmark dataset was complemented by a number of other datasets that try to cover those aspects not addressed by the benchmark dataset. These datasets fall in different categories; see Table 1 for an overview of the datasets that are currently used in OAEI. Expressive ontologies. For addressing the issues of realism and difficulty identified on the benchmark dataset, we have introduced two datasets that are more challenging in the sense that they are much larger, more heterogeneous and feature more complex definitions of classes that have to be taken into account during matching. The datasets in this category are the OntoFarm5 dataset [69] also referred to as the conference dataset in the context of the OAEI campaigns and the anatomy dataset. The conference dataset consists of a set of fifteen OWL ontologies describing scientific conferences using complex definitions. The anatomy dataset consists of two ontologies describing the human and the mouse anatomy that are actually used in the medical community and have been manually matched by medical experts. For both datasets, reference alignments exist, but we have decided not to publish these reference alignments completely to avoid the effect we have observed for the benchmark dataset. Thus, it is possible to conduct a blind 4 5

http://oaei.ontologymatching.org/2009/benchmarks/ http://nb.vse.cz/~svatek/ontofarm.html

170

J. Euzenat et al.

Table 1. Characteristics of test cases (‘open’ evaluation is made with already published reference alignments, ‘blind’ evaluation is made by organizers from reference alignments unknown to the participants and ‘expert’ evaluation involves manual analysis of results, by an expert user) Dataset

Formalism

Relations

Confidence

Modalities

Language

benchmarks anatomy conference directory library

OWL OWL OWL-DL OWL SKOS +OWL OWL RDF RDF RDF SKOS +OWL

= = =, <= =, <, >, ⊥ exact-,narrow-, broadMatch =,<,> = = = exact-, closeMatch

[0 1] [0 1] [0 1] 1 1

open blind blind+open blind+open blind

EN EN EN EN EN+NL+FR

[0 1] [0 1] [0 1] [0 1] [0 1]

open open open open blind expert

EN EN EN EN NL+EN

benchmarksubs ars tap iimb vlcr

evaluation, where the correct answers are not given to the participants. Both datasets have become an integral part of the OAEI campaigns. Directories and thesauri. These datasets consist of large weakly structured ontologies, as they are already in use on the web and in digital libraries. The lack of a sophisticated structure puts the element labels in a much more prominent position. Besides the analysis of labels, the size of the datasets in this category is a major challenge for many matching systems as the structures to be matched contain up to hundreds of thousands of classes. A problem connected to these more realistic datasets, e.g., library in Table 1, is lack of complete reference alignments. Due to the size of the models creating such an alignment manually is not an option, therefore other means of evaluation had to be found [36; 30]. Instance matching. With the increasing interest in linked open data, it turns out that typical matching problems on the web consist of finding instances representing the same individual rather than finding equivalent classes in different ontologies. While instance matching is covered by the theory of ontology matching outlined in §3, it has not been represented in the OAEI campaigns until recently. Since 2009 a number of instance matching datasets have been included in the campaigns. These datasets are the iimb, the ars, and tap. These datasets comprise automatically generated benchmarks, in which one dataset is modified according to various criteria, as well as real data from the domain of scientific publications. Beyond equivalence. Finally, there are first attempts to move ahead from equivalence as the only semantic relation considered in the OAEI tests. There are tests now that ask for close matches as well using the relations ‘exactMatch’ and ‘closeMatch’.

Ontology Alignment Evaluation Initiative: Six Years of Experience

171

4.3 Evaluation Measures The diverse nature of OAEI datasets, mainly in terms of the complexity of test cases and presence/absence of (complete) reference alignments, has required to use different evaluation measures. Furthermore, evaluating a matching systems from different perspectives allows for avoiding to favor a certain approach or system, when evaluation is made under a same dataset. This is one of the criterion to meet the desideratum D2 presented above. Organizers have as well the important role of conducting a fair evaluation. Table 2 provides an overview of the evaluation criteria used in the OAEI evaluations. Table 2. OAEI evaluation criteria (compliance is usually measured with variations of precision and recall against available references) Type Compliance Other Measure / Dataset Manual Partial Complete Efficiency Data Logical Application labelling reference reference mining Reasoning oriented √ benchmarks √ √ anatomy √ √ √ √ conference √ directory √ √ library √ benchmarksubs √ √ ars √ √ tap √ √ iimb √ vlcr

The most commonly used and understood criterion for evaluation of ontology alignments is the compliance of matcher alignments with respect to the reference alignments. Measures, such as precision (true positive/retrieved), recall (true positive/expected) and f–measure (aggregation of precision and recall) have been used as basis in the OAEI campaigns for measuring compliance. For a subset of datasets, namely conference, library and vlcr, the complete reference alignment is not available and then compliance is measured on a partial reference alignment. Although precision and recall are standard measures for evaluating compliance of alignments, alternative measures addressing some limitations of these measures have been used. For example, it may happen that an alignment is very close to the expected result (reference alignment) and another one is quite remote from it, although both share the same precision and recall. The reason for this is that standard metrics only compare two sets of correspondences without considering if these are close or remote to each other. In order to better discriminate such systems a relaxed precision and recall measures were defined which replace the set intersection by a distance [20]. To solve another problem, that two alignments may score differently while being semantically equivalent, semantic precision and recall were defined based on entailment instead of inclusion [22].

172

J. Euzenat et al.

Specially in the cases where only partial reference is available, alternative evaluation approaches have been applied. For instance, in the conference track, manual labeling, data mining and logical reasoning techniques were considered: – For manual labeling, for each matcher the most highly rated correspondences were considered as population. n correspondences per matcher were randomly sampled from the population. These correspondences were then evaluated as correct or incorrect. As a result, a score for precision was estimated. – For supporting the discovery of non-trivial findings about matchers, data mining techniques and correspondence patterns were exploited as well. The aim is to find explanations on the so-called analytic questions, such as: (i) which systems give higher/lower validity than others to the correspondences that are deemed ‘in/correct’?; (ii) which systems produce certain matching patterns/correspondence patterns more often than others?; and (iii) which systems are more successful on certain types of ontologies? – Logical reasoning was used to measure the degree of incoherence that is caused by an alignment. The underlying idea is that a correct alignment should not result in unsatisfiable classes. Measuring the degree of (in)coherence of an alignment was first proposed in [48]. The approach adopted by the library track organizers, for compensating the lack of complete reference alignments, was based on application relevance. They considered the provided alignment in the context of an annotation translation process supporting the re-indexing of books indexed with one vocabulary A, using concepts from the aligned vocabulary B [36]. For each pair of vocabularies A − B, this scenario interprets the correspondences as rules to translate existing book annotations with A into equivalent annotations with B. Based on the quality of the results for those books for which the correct annotations are known, the quality of the initial correspondences can be assessed. The criteria above are about alignment quality. However, another useful comparison between systems refers to their efficiency. The best way to measure efficiency is running all the systems under the same controlled evaluation environment. However, in the previous OAEI campaigns, participants were asked to run their systems on their own machines and to send the resulting alignments to be evaluated. So, the information about the time each system takes to execute the matching was gathered directly from the participants and could not be directly compared. 4.4 Evaluation Processes An evaluation process represents the interaction between several components in an evaluation experiment (matchers, test providers, evaluators, etc.). A simple process restricts the experiment to the evaluation of one matcher using a set of test cases. Usually, several matchers are evaluated in one evaluation experiment. Figure 4 illustrates the evaluation process that extends the process presented at the beginning of this section (Figure 3). The first step is to retrieve, from a database of tests containing the ontologies to be matched and the corresponding reference alignments, the tests to be

Ontology Alignment Evaluation Initiative: Six Years of Experience

173

considered in such an evaluation. Next, each available matcher performs the matching task, taking as input parameters the two ontologies. Then, the resulting alignment is evaluated against the reference alignment, by an evaluator. Finally, each result interpretation is stored into the result database (for instance, precision and recall).

R o

evaluator matching

test

m

result

A

o

matcher

Fig. 4. Basic evaluation process

Due to the variability of the alignment evaluation, different scenarios can be specified, by adding new components to the process presented in Figure 4: Test generator. Test cases can be generated from a description of the kind of evaluation to be executed (for example, removing n% of the properties of the ontologies). A description of the desired test case must be provided and the output of the test generator service is then used as input to the matching process. Lack of reference alignment. It is not the case that all test cases have a complete reference alignment. Thus, alternative evaluation metrics must be provided, such as measuring the consensus between several matchers, intersection or union of the results, etc. User in the loop. Sometimes, matching systems are considered as semi-automatic and the user has control over the matching process. On the other hand, manual labeling can be required in the cases where the reference alignments are not available. Task-specific evaluation. It can be useful to set up experiments which do not stop at the delivery of alignments, but carry on with the particular task. This is especially true when there is a clear measure of the success of the overall task; see §4.3. The components described above can be combined together in different ways. Figure 5 illustrates a more elaborated process where tests are generated by a test generator, according to the description provided by the user. This generation process may create a set of alternative ontologies, from a reference ontology, by removing its properties or

174

J. Euzenat et al.

user

test description

matcher test generator o matching

test

A

result

o

Fig. 5. Advanced evaluation process

individuals. Moreover, one can imagine that no reference alignments are provided by the test generator. In such a scenario, the user has the role of an evaluator. For each generated test, the available matchers are executed and their resulting alignments are stored into a database, whose content will be used later for user evaluation.

5 Analysis of the Past OAEIs The evaluation design presented in the previous section was chosen to provide tool developers and potential users with feedback on the state of the art in ontology matching and to foster developments in the field, meeting the two last desiderata presented in §4. Therefore, a crucial question that needs to be answered is whether the initiative indeed supported an improvement of the field. In the following, we try to answer this question by providing an abstract view on the results of the evaluation campaigns. This overview shows that OAEI was a success in many respects. First of all, a large and vivid community was established around the OAEI campaigns, which is shown by an increasing number of participants and test cases provided by the community. Further, we will show that there actually has been an improvement of matching systems that frequently participated in the benchmarking campaigns both in terms of runtime and quality of matching results. Finally, and probably most importantly, we have gained insights in the nature of the ontology matching tasks and the functioning of matching systems. We first provide an overview of the evaluation campaigns that have been carried out from 2004 to 2010 (§5.1). We then summarize our observations with respect to the evolution of result quality (§5.2), the efficiency of matching systems (§5.3) and with respect to the impact of matching system configurations on the results (§5.4).

Ontology Alignment Evaluation Initiative: Six Years of Experience

175

5.1 Campaigns and Participation The first ontology matching evaluations were carried out in 2004 as part of the Information Interpretation and Integration Conference (I3CON)6 held at the NIST Performance Metrics for Intelligent Systems (PerMIS) workshop and the Evaluation of Ontologybased Tools (EON) workshop of the annual International Semantic Web Conference (ISWC) [62]. The workshops were organized in a joint but complementary way by different organizers. This parallel development emphasized the importance of the topic and indicated that a joint initiative would be of advantage. From 2005 on, joint activities are carried out under the heading of the Ontology Alignment Evaluation Initiative. The first official evaluation campaign of the initiative was carried out at the Workshop on Integrating Ontologies at the 3rd International Conference on Knowledge Capture (K-CAP 2005) in Banff, Canada. Since 2006, the annual evaluation campaigns are carried out at the International Semantic Web Conference in the context of the Ontology Matching workshop. Since the early beginning the workshop has a constant attendance of more than 50 participants working on the topic. Over the years, the number of systems participating has increased from 4 systems in 2004 to 15 in 2010. A detailed look shows that there was a significant increase from 4 in 2004 up to 17 in 2007, while from 2007 to 2010 the participation rate is with ≈ 15 participants relatively stable and fluctuates around this point. In future it is required to extend OAEI with new datasets and evaluation modalities according to the trends in the field (see §6) in order to maintain or increase the participation rate. Table 3 provides an overview of the campaigns carried out so far. More information on the individual campaigns can be found on the OAEI web site4 . Table 3. Overview of the evaluation campaigns year I3CON OAC OAEI OAEI OAEI OAEI OAEI OAEI

2004 2004 2005 2006 2007 2008 2009 2010

location

#tests #participants reference Gaithersburg, US 10 5 -6 [62] Hiroshima, JP 1 4 [28] Banff, CA 3 7 [26] Athens, US 6 10 [25] Busan, KR 7 17 [8] Karlsruhe, DE 8 13 [23] Chantilly, US 9 16 [24] Shanghai, CN 6 15

5.2 Quality Improvement The main goal of OAEI is to support the enhancement of the ontology matching field. In the following, we report on results that show in how far this goal has been reached. First, we present summative results. In particular, we show how the average f-measure developed from 2005 to 2010 analyzing those datasets which have been run several 6

http://www.atl.external.lmco.com/projects/ontology/i3con.html

176

J. Euzenat et al.

years in succession. Then we analyze those systems that have been participating continuously from 2007 to 2010 in detail. The presented results allow to discuss the effects of a continuous participation. Summative results. Our analysis required to recompute some of the values presented as results of the annual campaigns. In particular, the benchmark data was rendered more difficult in 2008. Since these changes affected both ontologies and resulting reference alignments, we did not recompute these values. The reference alignments of the conference track have been extended year by year. We recomputed the average f-measure based on the current, most comprehensive corpus of reference alignments. This has to be taken into account when analyzing the results. We have compared the average fmeasure in terms of the arithmetic mean over all participants per year and track. This gives a rough representation on the main tendency and allows for abstracting from interdependencies between precision and recall. 1. benchmark

f measure

anatomy directory conference

0.

2005

2006

2007

2008

2009

2010

year

Fig. 6. Evolution of the average f-measure in different datasets

The results of Figure 6 are heterogeneous for the different datasets. The results for the conference and directory dataset range from an f-measure of 0.3 to 0.5. Both datasets leave room for further improvements. A detailed look reveals that there is a high variance between participants. The top performers of the last years reached an f-measure in the range of 0.5 to 0.6 in the conference track and an f-measure of ≈ 0.6 in the directory track. The average f-measure for the benchmark and anatomy datasets ranges from 0.6 to 0.8, even though both datasets describe different domains and vary in size and expressivity. In both cases good results in f-measure are based on a high precision. The challenge with regard to these datasets is to increase recall of the results without decreasing the high precision scores. We observe a moderate increase in benchmark and conference with some exceptions. Results for anatomy and benchmark will be analyzed in more detail later on. The quality of the alignments submitted to the conference track increases with each year, with the exception of 2008. However, Figure 6 does not show that the average f-measure is

Ontology Alignment Evaluation Initiative: Six Years of Experience

177

based on very different results generated by each of the participating matching systems. It seems to be hard to find an appropriate configuration for matching ontologies of the conference dataset that is also appropriate for the other datasets. We discuss this issue in detail in §5.4. The improvements of the anatomy results are the most significant. In particular, we measured for each year a continous improvement. Remember that the reference alignment of the anatomy track was not available for participants (blind modality) until 2010 and it is hardly reconstructable without biomedical expertise. For what concerns the directory track (where the reference alignments were partially available), the overall trend from 2006 to 2010 is positive, though with a substantial drop in 2008. There are several explanations for this: (i) OLA2 and Prior+ never participated again after 2007 and those were the two systems showed top results, (ii) the set of participating systems in 2008 was almost completely different compared to 2007; it performed worse than the set of participating systems in 2007, but better than those participating in 2006. Overall we conclude that the field as a whole improved over the years. We have also claimed that systems entering the campaign for several times tend to improve over years. By providing matching system developers with measurable feedback on their developments, it seems reasonable to think that they will be able to analyze the reasons for these results in order to improve their systems. We consider this claim in the following. Very few of these systems have participated in most of the tests and also only a few systems have participated more than three years in a row, thus allowing a judgement of their individual improvement over time. We therefore have to base our discussion on quality improvement on a limited set of systems and datasets. From among the datasets that were systematically evaluated against a reference alignment from 2007 to 2010 we have chosen the benchmark and the anatomy datasets. We have selected these tracks because several systems participated at least three of four times in these tracks from 2007 to 2010. For the other tracks, we found a lower number of (more or less) continuous participation.

1.

asmov aroma

f measure

dssim lily mappso

0.

rimon taxomap

2007

2008

2009

2010

year

Fig. 7. Evolution of results on the benchmark dataset

178

J. Euzenat et al.

Results on the benchmark dataset. Figure 7 shows f-measure of the systems under consideration on the benchmark dataset in 2007 through 2010. These systems achieve a similar level of precision, between 80% and 95%, which is quite a high value for a matching task. Only recall differs and impacts f-measure. However, for each system there is little variation, and not necessary towards an increase. This is confirmed by the results of ASMOV and RiMOM which have participated for four years in a row, respectively .92 and .9 in f-measure. Figure 8 shows that the best systems are overall very safe because their precision is very close to 100% until 50% of recall and it is still over 95% until 90% of recall. They are also able to stop providing correspondences when their quality goes down (compared to the edna baseline). Figure 8 also shows the yearly progress of these systems by preserving better precision when looking for more recall. The reasons for this behavior is that benchmark is made of a myriad of tasks, some of which are very difficult, but most of which are relatively easy. In consequence, the overall results are higher than for other tasks which were designed to be realistic or hard. This means that the results (precision, recall, f-measure) cannot be transposed from benchmarks to other situations. This also explains why gaining the last points of precision and recall is difficult for each system individually. Due to this feature, the benchmark dataset has lost its discrimination power over the years: a large difference between systems on the benchmarks still reflects a difference in the versatility of systems in practice, but small differences are not relevant. In addition, benchmarks are often used by system developers to tune their systems both because they cover a variety of situations and because reference alignments are available. Hence, even systems which participate for the first time achieve relatively high performances. This may lead systems to be overfitted to the benchmark dataset, i.e., tuned to solve this particular kind 1.

edna 2010 ASMOV 2009 Lily

precision

2008 Lily 2007 ASMOV 2006 RiMOM 2005 Falcon

0. 0.

recall

1.

Fig. 8. Precision/recall curves for the yearly best results on the benchmark dataset (edna is a simple matcher based on edit distances on names of entities)

Ontology Alignment Evaluation Initiative: Six Years of Experience

179

of tasks. Only systems which are especially designed for another purpose and whose designers do not want to twist to address benchmarks achieve low performances. In summary, although artificially designed, benchmarks can be used as a starting point for developers to test which kinds of ontology features their systems handle better. This feedback can be then exploited for further improvements in their implementations. However, it is not relevant for predicting the behavior of a system in a particular situation. Results on the anatomy dataset. If our explanation of the results on the benchmark dataset is correct, and there is still an improvement of the overall quality of individual matchers, this improvement will have to be visible in the results on the blind datasets. We chose the anatomy dataset as a basis for checking this as it has been evaluated against a complete reference alignment from 2007 on. Further, we present the results of those systems that particpated at least three times in these four years.

1.

agrmaker

f measure

dssim rimon asmov lily taxomap 0.

2007

2008

2009

2010

year

Fig. 9. Evolution of results on the anatomy dataset

Figure 9 shows the development of f-measure of the systems from 2007 to 2010. This time, we can clearly see an upward trend in f-measure, which reflects both a significant increase in precision and a moderate increase in recall. This trend is more significant for the second time a system participates, reinforcing the analysis above that once participants know the results of the evaluation they can better improve it, but the next time the increase will be smaller. This pleads for more tests and more reference alignments given to participants because this can be a sign of over fitting to the OAEI datasets in general. Hence, we conclude that there has been a significant increase in the quality at least of those matching systems on real world datasets that participated in the evaluation on a regular basis. This supports our claim that OAEI observes a measurable quality improvement in the ontology matching field.

180

J. Euzenat et al.

5.3 Runtime

Median

Average

X-SOM

TaxoMap

SOBOM

SAMBO

RiMOM

Prior+

Lily

kosimap

Falcon-AO

DSSim

ASMOV

Aroma

AgreementMaker

Anchorflood

System / Year

Besides the quality of generated alignments, other criteria are important for practical applications. With the increase of the number and the size of existing ontologies, the runtime of systems becomes also crucial. We therefore made first attempts of measuring the runtime of matching systems as well. This was done in a systematic way for the first time in 2007 evaluation of the anatomy track. Runtime was not a topic of investigations for the other OAEI tracks, thus we have to focus on the results for the anatomy track. Due to the setting of previous OAEI campaigns, where system developers run their matchers on the test sets locally and send the results for inspection, it was not possible to measure comparable runtimes on a fair ground. For that purpose, it would have been necessary to execute all systems on the same machine ensuring a fair comparison. As an alternative, we collected statements about runtimes that have been measured by the participants themselves. This information had to be delivered together with a description of CPU and memory capabilities of the computer on which the matching process was executed. According to these descriptions, in 2009 most of the systems were run on a CPU with two cores in the range from 2.0 to 3.1 GHz, using 2 or 4 GB RAM memory. In 2007, this survey was conducted by OAEI organizers mainly for interest. However, the huge variability in the reported runtimes together with the fact that all systems were executed on machines of similar strength, encouraged us to publish runtimes as part of the results in 2007, following the same strategy in 2008 and 2009. In 2010 it was originally planned to conduct the track in a completely automized evaluation setting. Finally, the evaluation was conducted in semi-automized way. Due to this, runtime measurements are unfortunately missing for 2010. Table 4 gives an overview on the runtimes that were reported by the matching tool developers. These results show that the community has made clear improvements regarding runtime issues. In 2007, one of the systems required four days for matching the ontologies, while the slowest system in 2009 finished the matching process in under two hours. This trend is also reflected by the average and median values that significantly decreased from 2007 to 2009. It is also interesting to see that the median and average seem to converge in 2009 because there is no longer negative outliers that require an enormous amount of time. Note also that the decreased runtime is in most cases not related to a decreased quality of the generated alignment.

2007 - 30 - 900 75 12 - 5760 23 240 360 - 300 600 830 270 2008 1 - 4 230 17 - - 200 - 24 720 - 25 - 152.6 24.5 2009 0.25 23 1 5 12 - 5 99 - 10 - 19 12 18.6 11 Table 4. Runtimes in minutes reported by the participants

Ontology Alignment Evaluation Initiative: Six Years of Experience

181

More important than the trend line of faster matching systems, is the positive acceptance of presenting and discussing runtimes as part of the evaluation. Although we are aware that this way of gathering reported runtimes is a subject to criticism, the approach has nevertheless pointed to an important aspect in evaluating matching systems that would have been neglected otherwise. In §6.3, we describe an infrastructure that will finally allow to measure runtimes by the use of an evaluation runtime environment. 5.4 System Configuration and Optimisation We study here the configuration of participating systems and the possible influence of datasets and evaluation settings on the performance of systems. For that purpose, we examine the blind results obtained on the conference dataset with the results obtained by the same systems in the benchmark dataset. Our analysis is based on a discussion of the results from the OAEI 2009 conference track. All of the participants of the conference track have also participated in the benchmark track, most of them with good results. These systems are AFlood [56], AgreementMaker [11] as well as an extension of the system, AROMA [13], Asmov [37], Kosimap [52] and DSSim [50]. Two systems, namely DSSim and AFlood, did not annotate correspondences with confidence values. Since our approach requires confidence values, we omitted these systems. The conference submissions of 2009 are well suited for our analysis, because in 2009 the evaluation of the submitted results was based for the first time on the use of a substantial set of reference alignments. Only a small subset of these alignments has been available prior to the evaluation. Similar to the ontologies of the benchmark track, the ontologies of the conference track are of moderate size (between 32 and 140 classes). They cover the domain of conference organization. This domain is partially overlapping with the domain of bibliography. In addition, in both cases, ontologies are labeled in natural language terms (as opposed to the ontologies of the anatomy track, for instance). Thus, we would expect that a system that obtains good results for the benchmark track, obtains similar results for the conference track. In particular, we would expect that the configuration of such a system is also well suited for the conference track. However, the results do not fit with this hypothesis. In Figure 10, the dependency between f-measure and threshold is shown for each system that participated in this track. Figure 10 is generated, for each submitted alignment featuring confidences different than 1, by applying a posteriori a threshold that is increased step by step. For each setting the resulting f-measure is measured and its average for all test cases is depicted in Figure 10. For each of the other systems, we can distinguish between two interesting points. The first one is the threshold t where the f-measure increases for the first time (e.g., for kosimap t = 0.1, ASMOV did not use a threshold). This point refers to the threshold that was chosen by the tool developer when running his matching tool to generate the alignments. Since none of the correspondences has a confidence value below this threshold, we observe a horizontal line for, e.g., t < 0.1 with regard to kosimap. The second interesting point is the threshold t where the system reaches its maximum fmeasure (e.g., for kosimap t = 0.52).

182

J. Euzenat et al.

f measure

.6

0. threshold

0.

aroma 0.40

amaker 0.55 asmov 0.54

1. amext 0.49 kosimap 0.45

Fig. 10. F-measures of matching systems for different thresholds (the values under the legends are the absolute optimal f-measure for each system)

These curves are all nearly monotonically increasing until the optimum and then monotonically decreasing. This could be the sign of a robust way from all systems to rank correspondences (the increasing phase corresponds to less than optimally ranked false positives and the decreasing phase, more than optimally ranked true positives). On the other side, if these systems are good at ranking correspondences, they are not very good at finding the optimal threshold. Moreover, they are all lower than optimal. However, the f-measure of these systems is far lower than the one they obtain in the benchmark track. How can this be explained? We cannot exclude that systems are overfitting on benchmarks, i.e., they are optimized for performing well at benchmark. Alternatively, the benchmark dataset has a particular feature that can favor those systems which try to maximize pairing of entities, e.g., by considering a similarity between two entities as a gain and to maximize the gain. Given one of the benchmark test cases, let os be the smaller ontology and let ol be the larger ontologies. Then each matchable entity in os has a counterpart in ol . Thus maximizing pairing is a good strategy. We call this over-matching. In fact, overfitting or over-matching would have the same results. This characteristic occurs in most of the benchmark tests and in none of the conference tests. Moreover, it is only likely to occur in specific scenarios such as version matching. So, it introduces a bias in evaluation results towards a particular strategy. The lessons of this analysis from the standpoint of evaluation are three fold: – The benchmark test case should be modified so that this bias is suppressed;

Ontology Alignment Evaluation Initiative: Six Years of Experience

183

– Delivering more datasets with reference alignments would help developers avoiding overfitting; – Multiplying datasets and studying divergence is a good direction because it allows to test matchers in different situations and cross-compare the results from multiple datasets. Further analysis that goes beyond the scope of a single track is required to understand the effects and appropriateness of specific system configurations.

6 Trends and challenges From the last years of OAEI (2005-2010), we can identify medium term trends and challenges for ontology matching evaluation. We distinguish between two types of trends: some trends are relevant for the evaluation of systems in general (§6.1) while others are more specific to the matching problem (§6.2). The first ones are independent of the concrete problem that is solved by a system to be evaluated. Typical examples are evaluation metrics related to the effectiveness of user interaction or issues related to the hardness of a test. In addition to these two groups of trends, we finally conclude in §6.3 with a challenge that is tangent to many of the issues. We discuss the automation of ontology evaluation and the infrastructure required for that purpose. In particular, we present the infrastructure developed in the SEALS project7 (Semantic Evaluation at Large Scale), which represents a partial continuation of OAEIs, as key to solve many open issues. The work in [60] described ten challenges for ontology matching. Amongst others, these include large-scale evaluation, performance of ontology matching techniques and reasoning with alignments, which are directly related to trends identified here. 6.1 General Issues User interaction. Automatic ontology matching can only be a first step in generating a final alignment. Therefore, systems that (i) automatically generate an initial set of matching hypothesis, (ii) support the user in the refinement of the generated alignment, (iii) propagate the user input to semi-automatic filter and/or extend the alignment are in advance and will finally be used to solve concrete matching problems. An example for such a system is AgreementMaker [10]. However, current evaluation techniques do not take into account the quality and effectiveness of user interventions. Currently, only a subtask of the anatomy track in OAEI deals with the third aspect marginally, while the second point is not at all considered. This is one of the most important drawbacks of current evaluation practices that has to be tackled in future. In situ evaluation. The overall approach underlying most OAEI evaluations is based on the implicit assumption that there exists a unique reference alignment that correctly describes how ontologies have to be matched. Although this reference alignment is not always available, correspondences can in principle be divided in correct and incorrect ones. However, the relative quality or usefulness of a generated alignment also depends 7

http://about.seals-project.eu/

184

J. Euzenat et al.

on its intended use. The difference between these approaches was emphasized in [68] by a comparison of relevance and correctness. In [34], an evaluation method is described that takes into account some characteristics of a usage scenario and reports the respective evaluation results. Large scale analysis. OAEI campaigns gave only some preliminary evidence of the scalability characteristics of the ontology matching technology. We reported about these attempts in §5. Therefore, larger tests involving 10.000, 100.000 and 1.000.000 entities per ontology (e.g., UMLS has about 200.000 entities) are to be designed and conducted. In turn, this raises the issues of a wider automation for acquisition of reference alignments, e.g., by minimizing the human effort while increasing an evaluation dataset size [60; 46]. Notice also that scalability involves not only the consideration of runtime, but has to focus also on aspects as memory consumption and required disk capacity. Defining and measuring test hardness. There is a need for evaluation methods grounded on a deep analysis of the matching problem space. Semi-automatic test generation methods require such an analysis as basis. These methods will allow for the construction of tests of desired hardness by addressing a particular point in the matching problem space. We have already argued that additional tests are required. Initial steps towards this line were already discussed in [30]. 6.2 Specific Issues In the following, we present several specific issues that we believe will become more important to OAEI: complex matching, instance matching and database schema matching. Complex matching refers to a matching process in which correspondences are not restricted to link named entities, but can also link complex descriptions. Instance matching is not concerned with matching terminological entities but focuses on matching individuals. Finally, schema matching has received decades of attention in the database community. Database schemas are different from ontologies, e.g., by not providing explicit semantics for their data. However, these are also similar in the sense that both schemas and ontologies provide a vocabulary of terms and constrain the meaning of terms used in the vocabulary. Moreover, in real life situations schemas and ontologies have both well defined and obscure labels and structures, thus, these often share similar solutions, which need to be evaluated. Complex matching. State of the art ontology matching techniques are often limited to detect correspondences between atomic concepts and properties. Nevertheless, for many concepts and properties atomic counterparts will not exist, while it is possible to construct equivalent complex concept and property descriptions [58]. A typical example, presented in [53], is the correspondence Researcher ≡ Person ∃researchedBy−1 .. The expressivity supported by the available Alignment API [21] implementation was in the past restricted to non-complex correspondences and has recently been extended to a more expressive language referred to as EDOAL (Expressive and Declarative Ontology Alignment Language) [14]. Even though the infrastructure for expressing complex correspondences is now available and several approaches for complex matching techniques have been proposed (see, for example, [15; 53]).

Ontology Alignment Evaluation Initiative: Six Years of Experience

185

Instance matching and linked data. While rich ontologies were promoted as an integral part of every semantic web application [35], it is increasingly argued that the real value of the semantic web is based on its ability to create and maintain linked open data which provides effective access to semantically enhanced information on the web [65]. In 2009, OAEI comprised for the first time a track explicitly concerned with instance matching. In 2009 six matching systems participated, in 2010 five systems participated. It can be expected that this track will be an important component of the OAEI in the following years with an increasing number of participants. Database schema matching. As was mentioned in §2, at present in the database community there are no well-established benchmarks for comparing schema matching tools. However, there are many recent schema matching tools and more generally model management infrastructures, e.g., COMA++ [3], AgreementMaker [12], GeRoMe [40; 39], Harmony [49; 57], that are able also to process ontologies, and hence, might be interested to test them within OAEI, as actually already happens, though modestly. On the other hand, OAEI has to consider including explicit schema matching tasks involving XML and relational schemas in order boost the cross-fertilization between these communities. 6.3 Automation Although OAEI campaigns have created a basis for evaluation that did not exist before, the progress in leveraging increased evaluation efforts has to be made in order to continue the growth of ontology matching technology. Further progress is highly dependent on the automation of many parts of the evaluation process. This would reduce the effort necessary for carrying out evaluation, but above all, this would allow to handle more complex evaluation processes as well as measurements of runtime and memory consumption. Reducing the evaluation effort will allow for better meeting the fourth desideratum discussed in §4. The SEALS project aims at establishing systematic evaluation methods for semantic technologies, including ontology matching, by providing standardized datasets, evaluation campaigns for typical semantic web tools and, in particular, a software infrastructure – the SEALS platform – for automatically executing evaluations. This platform will allow matcher developers to run their tools on the execution environment of the platform in both the context of an evaluation campaign and on their own for a formative evaluation of the current version of the tool. The results can be published both in the context of the evaluation campaign or in the context of evaluating a tool on its own. In both cases, results are reproducible, since the matching system, the test dataset and the results themselves are stored and archived in the repositories of the SEALS platform. This approach differs from the approach conducted in the OAEI campaigns where participants send their results (and their systems) to the OAEI organizers in the Alignment API format [21]. These submissions are accepted by the organizers as official results of the matching system. After a phase of validating, e.g., the format of the submissions, evaluation experiments are conducted by the organizers and the results are

186

J. Euzenat et al.

prepared and finally presented on a webpage8 and in the annual result reports [28; 26; 25; 8; 23; 24]. This process requires several weeks before first results are published. The SEALS platform aims at automating most of the evaluation process. This allows tool developer to receive a direct feedback. OAEI will in particular benefit from both the reduced amount of effort required by the organizers and from the controlled execution environment. This environment ensures that the matching systems generate the alignments with a fixed setting for each track and test case. In particular, it allows to execute all evaluated matching systems in the same controllable context. Thus, it is possible to conduct precise runtime measurements that will replace the report-based approach used from 2007 to 2009. OAEI and SEALS are closely coordinated and the SEALS platform will be progressively integrated within the OAEI evaluations. In a first phase, the participants of three OAEI 2010 tracks (benchmarks, anatomy, conference) were asked to make their tools available as web services. Implementing the required interface allowed participants in 2010 to debug their system from their own site. This approach substitutes the phase of preliminary testing as well as the submission of the final results. The alignments are generated on the machine of the tool developer and sent to the SEALS platform in the context of an evaluation process. On the one hand, evaluation results are immediately available in this setting. On the other hand, runtime and memory consumption cannot be correctly measured due the fact that the controlled execution environment is missing. Details on this approach can be found in [64]. In the second phase, which is planned already for OAEI 2011, the tools will be deployed in the SEALS platform. This allows organisers to compare systems on the same basis, in particular in terms of runtime. This is also a test of the deployability of tools. The successful deployment relies on the Alignment API and requires additional information about how the tool can be executed in the platform and its dependencies in terms of resources (e.g., installed databases or resources like WordNet). For that reason, the challenging goal of the SEALS project can only be reached with the support of the matching community and it highly depends on the acceptance by the tool developers.

7 Conclusions The OAEI campaigns of the last years have provided extensive experience in ontology matching and evaluation of semantic technologies in general. This experience was reported in this paper. We summarize lessons learned that are worth emphasizing because they are relevant not only to OAEI, but also to the evaluation activities in other areas of semantic technologies and beyond. As the reported experience indicates, foremost, there is a real need for systematic evaluation. Researchers and practitioners of the ontology matching tools have eagerly taken up the challenges offered by OAEI and actively participated from the beginning on. In general, systems have improved their performances over the campaigns for most of the tracks. This is specially corroborated by the results for the anatomy track, but this is a general trend. 8

See, for example, http://oaei.ontologymatching.org/2010/results/

Ontology Alignment Evaluation Initiative: Six Years of Experience

187

We observed that it was necessary to evolve with the field, involving our understanding of it and the reaction of developers to the proposed datasets. For example, most of the participants focused on the benchmark dataset, followed by anatomy and conference. There are only few systems that did not submit their results for benchmark. It can be due to the fact that benchmark offers relatively easy tasks and full reference alignments. Developers naturally use this available information (evaluation results) for improving their results. However, this overfitting has a potential influence on the performance of the systems. In turn, this requires to be reactive in proposing new datasets, new measures and new evaluation settings. We have pointed out areas in which improvements are necessary: more varied benchmarks (from various vertical domains as well as transversal ones), instance-based and user-assisted matching to name a few. Also we made the case for automation and reported about first steps made in that direction. Increased automation does not only mean less work for evaluation organizers and better reproducibility. It offers the opportunity to generate datasets and thus to test scalability, variability and to understand test hardness. This allows for performing runtime, space and deployability measurements. Ultimately, it turned out that a rather minimal common infrastructure was sufficient to start the initiative. Finally, setting up such an evaluation is a great chance, and a great responsibility: it has an influence not only on the improvement of systems but also on research directions being followed. This chance, however, comes at a price, since the successful evaluation initiative requires a deep understanding of the problem domain and substantial resources to be dedicated to creating datasets, designing protocols and processing evaluations.

Acknowledgements Jérôme Euzenat, Christian Meilicke, Heiner Stuckenschmidt and Cássia Trojahn dos Santos are partially supported by the European project SEALS (IST-2009-238975). Pavel Shvaiko was supported by the Trentino as a Lab initiative of the European Network of the Living Labs at Informatica Trentina. We are grateful to all our colleagues who contributed to the OAEI campaigns: Caterina Caraciolo, Alfio Ferrara, Ryutaro Ichise, Antoine Isaac, Fausto Giunchiglia, Willem Robert van Hage, Laura Hollink, Cliff Joslyn, Véronique Malaisé, Malgorzata Mochol, Andriy Nikolov, Natasha Noy, Juan Pane, Marta Sabou, François Scharffe, Vassilis Spiliopoulos, Ondˇrej Šváb-Zamazal, Vojtech Svatek, George Vouros, Shenghui Wang and Mikalai Yatskevich. Finally, we also thank all the OAEI participants who made these evaluations worthwhile: Masaki Aono, John Beecher-Deighan, Boutheina Ben Yaghlane, Sadok Ben Yahia, Wayne Bethea, Jürgen Bock, Olivier Bodenreider, Gosse Bouma, Silvana Castano, Gong Cheng, Isabel F. Cruz, Carlo Curino, Jérôme David, Jean-François DjoufakKengue, Marc Ehrig, Daniel Engmann, Alfio Ferrara, Clayton Fink, Sandra Geisler, Jorge Gracia, Philippe Guégan, Fayçal Hamdi, Jan Hettenhausen, Bo Hu, Wei Hu, Yves R. Jean-Mary, Ningsheng Jian, Mansur R. Kabuka, Yannis Kalfoglou, Vangelis Karkaletsis, Ulas C. Keles, David Kensche, Ching-Chieh Kiu, Konstantinos Kotis, Najoua Laamari, Patrick Lambrix, Chien-Sing Lee, Dan Li, Juanzi Li, Xiang Li, Yi Li, Peng Liu, Qiang Liu, Angela Maduko, Ming Mao, Sabine Massmann, Eduardo Mena,

188

J. Euzenat et al.

Engelbert Mephu-Nguifo, Gianpaolo Messa, Enrico Motta, Miklos Nagy, Slawomir Niedbala, Nobal Niraula, Giorgio Orsi, Roelant Ossewaarde, Flavio Palandri-Antonelli, Jeff Z. Pan, Yefei Peng, Yuzhong Qu, Christoph Quix, Erhard Rahm, Nataliya Rassadko, Quentin Reul, Chantal Reynaud, Marta Sabou, Brigitte Safar, Hanif Seddiqui, Feng Shi, E. Patrick Shironoshita, Yahya Slimani, Vassilis Spiliopoulos, Piotr Stolarski, Umberto Straccia, Cosmin Stroe, Heiko Stoermer, William Sunna, York Sure, He Tan, Letizia Tanca, Jie Tang, Haijun Tao, Raphaël Troncy, Alexandros G. Valarakos, Petko Valtchev, Maria Vargas-Vera, George A. Vouros, Peng Wang, Yadong Wang, Honghan Wu, Baowen Xu, Peigang Xu, Tianyi Zang, Haifa Zargayouna, Sami Zghal, Yuanyuan Zhao, Duo Zhang, Songmao Zhang, Xiao Zhang, Dongdong Zheng, Qian Zhong and Xinyu Zhong.

References 1. Alexe, B., Tan, W.C., Velegrakis, Y.: Comparing and evaluating mapping systems with STBenchmark. VLDB Endowment (PVLDB) 1(2), 1468–1471 (2008) 2. Alexe, B., Tan, W.C., Velegrakis, Y.: Stbenchmark: towards a benchmark for mapping systems. VLDB Endowment (PVLDB) 1(1), 230–244 (2008) 3. Aumueller, D., Do, H.-H., Massmann, S., Rahm, E.: Schema and ontology matching with COMA++. In: Proceedings of the 24th International Conference on Management of Data (SIGMOD), Software Demonstration, Baltimore, MD US, pp. 906–908 (June 2005) 4. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 284(5), 34–43 (2001) 5. Bernstein, P., Halevy, A., Pottinger, R.: A vision of management of complex models. ACM SIGMOD Record 29(4), 55–63 (2000) 6. Bizer, C., Schultz, A.: The berlin SPARQL benchmark. International Journal of Semantic Web and Information Systems 5(2), 1–24 (2009) 7. Bouquet, P., Ehrig, M., Euzenat, J., Franconi, E., Hitzler, P., Krotzsch, M., Serafini, L., Stamou, G., Sure, Y., Tessaris, S.: Specification of a common framework for characterizing alignment. Deliverable D2.2.1v2, Knowledge web NoE (December 2004) 8. Caracciolo, C., Euzenat, J., Hollink, L., Ichise, R., Isaac, A., Malaisé, V., Meilicke, C., Pane, J., Shvaiko, P., Stuckenschmidt, H., Zamazal, O.Š., Svatek, V.: Results of the ontology alignment evaluation initiative 2008. In: ISWC 2008, Karlsruhe, DE, pp. 73–119 (October 2007) 9. Castro, R.G., Maynard, D., Foxvog, D., Wache, H., González-Cabero, R.: Specification of a methodology, general criteria, and benchmark suites for benchmarking ontology tools. Deliverable D2.1.4, Knowledge web NoE (February 2004) 10. Cruz, I., Antonelli, F.P., Stroe, C.: Agreementmaker: efficient matching for large real-world schemas and ontologies. VLDB Endowment 2(2), 1586–1589 (2009) 11. Cruz, I., Antonelli, F.P., Stroe, C., Keles, U.C., Maduko, A.: Using AgreementMaker to align ontologies for OAEI 2009: Overview, results, and outlook. In: Proceedings of the ISWC 2009 Workshop on Ontology Matching, pp. 135–146 (October 2009) 12. Cruz, I.F., Antonelli, F.P., Stroe, C.: Agreementmaker: Efficient matching for large real-world schemas and ontologies. VLDB Endowment (PVLDB) 2(2), 1586–1589 (2009) 13. David, J.: AROMA results for OAEI 2009. In: Proceedings of the ISWC 2009 Workshop on Ontology Matching, pp. 147–151 (October 2009) 14. David, J., Euzenat, J., Scharffe, F., Trojahn, C.: The Alignment API 4.0. Semantic web journal 2(1) (2011)

Ontology Alignment Evaluation Initiative: Six Years of Experience

189

15. Dhamankar, R., Lee, Y., Doan, A.-H., Halevy, A., Domingos, P.: iMAP: Discovering complex semantic matches between database schemas. In: Proceedings of the 23rd International Conference on Management of Data (SIGMOD), Paris, FR, pp. 383–394 (June 2004) 16. Do, H.-H., Melnik, S., Rahm, E.: Comparison of schema matching evaluations. In: Chaudhri, A.B., Jeckle, M., Rahm, E., Unland, R. (eds.) NODe-WS 2002. LNCS, vol. 2593, pp. 221–237. Springer, Heidelberg (2003) 17. Duchateau, F., Bellahsene, Z.: Measuring the quality of an integrated schema. In: Parsons, J., Saeki, M., Shoval, P., Woo, C., Wand, Y. (eds.) ER 2010. LNCS, vol. 6412, pp. 261–273. Springer, Heidelberg (2010) 18. Duchateau, F., Bellahsene, Z., Hunt, E.: XBenchMatch: a benchmark for xml schema matching tools. In: Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), Vienna, AT, pp. 1318–1321 (September 2007) 19. Duchateau, F., Coletta, R., Bellahsene, Z., Miller, R.J.: (not) yet another matcher. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management CIKM, Hong Kong, CN, pp. 1537–1540 (November 2009) 20. Ehrig, M., Euzenat, J.: Relaxed precision and recall for ontology matching. In: Ashpole, B., Ehrig, M., Euzenat, J., Stuckenschmidt, H. (eds.) Proceedings of the Workshop on Integrating Ontologies, vol. 156, p. 8 (August 2005) 21. Euzenat, J.: An API for ontology alignment. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 698–712. Springer, Heidelberg (2004) 22. Euzenat, J.: Semantic precision and recall for ontology alignment evaluation. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, IN, pp. 348–353 (January 2007) 23. Euzenat, J., Ferrara, A., Hollink, L., Isaac, A., Joslyn, C., Malaisé, V., Meilicke, C., Nikolov, A., Pane, J., Sabou, M., Scharffe, F., Shvaiko, P., Spiliopoulos, V., Stuckenschmidt, H., Zamazal, O.Š., Svatek, V., Trojahn, C., Vouros, G., Wang, S.: Results of the ontology alignment evaluation initiative 2009. In: Proceedings of the ISWC 2009 Workshop on Ontology Matching, Washington (DC US), pp. 73–126 (October 2009) 24. Euzenat, J., Ferrara, A., Meilicke, C., Pane, J., Scharffe, F., Shvaiko, P., Stuckenschmidt, H., Zamazal, O.Š., Svatek, V., Trojahn, C.: Results of the ontology alignment evaluation initiative 2010. In: Proceedings of the ISWC 2010 Workshop on Ontology Matching, Shanghai, China, pp. 85–125 (2010) 25. Euzenat, J., Isaac, A., Meilicke, C., Shvaiko, P., Stuckenschmidt, H., Šváb, O., Svatek, V., Hage, W.R., Yatskevich, M.: Results of the ontology alignment evaluation initiative 2007. In: Proceedings of the ISWC 2007 Workshop on Ontology Matching, Busan (KR), pp. 96–132 (November 2007) 26. Euzenat, J., Mochol, M., Shvaiko, P., Stuckenschmidt, H., Šváb, O., Svatek, V., van Hage Robert, W., Yatskevich, M.: Results of the ontology alignment evaluation initiative 2006. In: Proceedings of the ISWC 2006 Workshop on Ontology Matching, Athens (GA US), pp. 73–95 (November 2006) 27. Euzenat, J., Shvaiko, P.: Ontology matching. Springer, Heidelberg (2007) 28. Euzenat, J., Stuckenschmidt, H., Yatskevich, M.: Introduction to the ontology alignment evaluation 2005. In: Proceedings of the K-CAP 2005 Workshop on Integrating Ontologies, Banff, CA (October 2005) 29. Garcia-Castro, R., Gómez-Pérez, A., Prieto-Gonzalez, J.: IBSE: An OWL interoperability evaluation infrastructure. In: Golbreich, C., Kalyanpur, A., Parsia, B. (eds.) Proceedings of the Workshop on OWL: Experiences and Directions (OWLED), Innsbruck, Austria. CEUR Workshop Proceedings, vol. 258, pp. 1–10 (June 2007) 30. Giunchiglia, F., Paolo, M.Y., Avesani, Shvaiko, P.: A large scale dataset for the evaluation of ontology matching systems. The Knowledge Engineering Review Journal (KER) 24(2), 137–157 (2009)

190

J. Euzenat et al.

31. Giunchiglia, F., Shvaiko, P., Yatskevich, M.: Semantic schema matching. In: Chung, S. (ed.) OTM 2005. LNCS, vol. 3760, pp. 347–365. Springer, Heidelberg (2005) 32. Giunchiglia, F., Yatskevich, M., Shvaiko, P.: Semantic matching: Algorithms and implementation. In: Spaccapietra, S., Atzeni, P., Fages, F., Hacid, M.-S., Kifer, M., Mylopoulos, J., Pernici, B., Shvaiko, P., Trujillo, J., Zaihrayeu, I. (eds.) Journal on Data Semantics IX. LNCS, vol. 4601, pp. 1–38. Springer, Heidelberg (2007) 33. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics 3(2), 158–182 (2005) 34. Hollink, L., van Assem, M., Wang, S., Isaac, A., Schreiber, G.: Two variations on ontology alignment evaluation: Methodological issues. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 388–401. Springer, Heidelberg (2008) 35. Horrocks, I.: Ontologies and the semantic web. Communications of the ACM 51(11), 58–67 (2008) 36. Isaac, A., Matthezing, H., van der Meij, L., Schlobach, S., Wang, S., Zinn, C.: Putting ontology alignment in context: Usage scenarios, deployment and evaluation in a library case. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 402–417. Springer, Heidelberg (2008) 37. Jean-Mary, Y.R., Shironoshita, E.P., Kabuka, M.R.: Ontology matching with semantic verification. Journal of Web Semantics 7(3), 235–251 (2009) 38. Kalfoglou, Y., Schorlemmer, M.: Ontology mapping: the state of the art. The Knowledge Engineering Review 18(1), 1–31 (2003) 39. Kensche, D., Quix, C., Li, X., Li, Y., Jarke, M.: Generic schema mappings for composition and query answering. Data and Knowledge Engineering 68(7), 599–621 (2009) 40. Kensche, D., Quix, C., Chatti, M.A., Jarke, M.: Gerome: A generic role based metamodel for model management. In: Spaccapietra, S., Atzeni, P., Fages, F., Hacid, M.-S., Kifer, M., Mylopoulos, J., Pernici, B., Shvaiko, P., Trujillo, J., Zaihrayeu, I. (eds.) Journal on Data Semantics VIII. LNCS, vol. 4380, pp. 82–117. Springer, Heidelberg (2007) 41. Kuster, U., Konig-Ries, B.: Towards standard test collections for the empirical evaluation of semantic web service approaches. International Journal of Semantic Computing 2(3), 381–402 (2008) 42. Lenzerini, M.: Data integration: A theoretical perspective. In: Proceedings of the 21st Symposium on Principles of Database Systems (PODS), Madison, WI, US, pp. 233–246 (June 2002) 43. Luther, M., Liebig, T., Böhm, S., Noppens, O.: Who the heck is the father of bob? In: Aroyo, L., Traverso, P., Ciravegna, F., Cimiano, P., Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E., Sabou, M., Simperl, E. (eds.) ESWC 2009. LNCS, vol. 5554, pp. 66–80. Springer, Heidelberg (2009) 44. Ma, L., Yang, Y., Qiu, Z., Xie, G., Pan, Y., Liu, S.: Towards a complete OWL ontology benchmark. In: Sure, Y., Domingue, J. (eds.) ESWC 2006. LNCS, vol. 4011, pp. 125–139. Springer, Heidelberg (2006) 45. Madhavan, J., Bernstein, P., Rahm, E.: Generic schema matching with Cupid. In: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, IT, pp. 48–58 (September 2001) 46. Maltese, V., Giunchiglia, F., Autayeu, A.: Save up to 99% of your time in mapping validation. In: Meersman, R., Dillon, T., Herrero, P. (eds.) OTM 2010. LNCS, vol. 6427, pp. 1044–1060. Springer, Heidelberg (2010) 47. Marie, A., Gal, A.: Boosting schema matchers. In: Chung, S. (ed.) OTM 2008, Part I. LNCS, vol. 5331, pp. 283–300. Springer, Heidelberg (2008)

Ontology Alignment Evaluation Initiative: Six Years of Experience

191

48. Meilicke, C., Stuckenschmidt, H.: Incoherence as a basis for measuring the quality of ontology mappings. In: Proceedings of the ISWC 2008 Workshop on Ontology Matching, Karlsruhe, DE, pp. 1–12 (October 2008) 49. Mork, P., Seligman, L., Rosenthal, A., Korb, J., Wolf, C.: The Harmony Integration Workbench. In: Spaccapietra, S., Pan, J.Z., Thiran, P., Halpin, T., Staab, S., Svatek, V., Shvaiko, P., Roddick, J. (eds.) Journal on Data Semantics XI. LNCS, vol. 5383, pp. 65–93. Springer, Heidelberg (2008) 50. Nagy, M., Vargas-Vera, M., Motta, E.: DSSim-ontology mapping with uncertainty. In: Proceedings of the ISWC 2006 Workshop on Ontology Matching, pp. 115–123 (November 2006) 51. Petrie, C., Margaria, T., Lausen, H., Zaremba, M.: Semantic Web Services Challenge - Results from the First Year. Semantic Web and Beyond, vol. 8. Springer, Heidelberg (2009) 52. Reul, Q., Pan, J.Z.: KOSIMap: ontology alignments results for OAEI 2009. In: Proceedings of the ISWC 2009 Workshop on Ontology Matching, pp. 177–185 (October 2009) 53. Ritze, D., Meilicke, C., Zamazal, O.Š., Stuckenschmidt, H.: A pattern-based ontology matching approach for detecting complex correspondences. In: Proceedings of the ISWC 2009 Workshop on Ontology Matching, Washington, DC, USA (October 2009) 54. Sabou, M., Gracia, J.: Spider: bringing non-equivalence mappings to OAEI. In: Proceedings of the ISWC 2008 Workshop on Ontology Matching, Karlsruhe, (DE), pp. 199–205 (October 2008) 55. Schmidt, M., Hornung, T., Lausen, G., Pinkel, C.: Sp2 bench: A SPARQL performance benchmark. In: Proceedings of the 25th International Conference on Data Engineering ICDE, Shanghai, China, pp. 222–233. IEEE, Los Alamitos (2009) 56. Seddiqui, M.H., Aono, M.: Anchor-Flood: results for OAEI 2009. In: Proceedings of the ISWC 2009 Workshop on Ontology Matching, pp. 127–134 (October 2009) 57. Seligman, L., Mork, P., Halevy, A.Y., Smith, K., Carey, M.J., Chen, K., Wolf, C., Madhavan, J., Kannan, A., Burdick, D.: Openii: an open source information integration toolkit. In: Proceedings of the 29th International Conference on Management of Data (SIGMOD), Indianapolis, IN, US, pp. 1057–1060 (2010) 58. Seylan, I., Franconi, E., de Bruijn, J.: Effective query rewriting with ontologies over dboxes. In: Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), Pasadena (USA), pp. 923–925 (July 2009) 59. Shvaiko, P., Euzenat, J.: A survey of schema-based matching approaches. Journal on Data Semantics IV, 146–171 (2005) 60. Shvaiko, P., Euzenat, J.: Ten challenges for ontology matching. In: Proceedings of the 7th International Conference on Ontologies, Databases, and Applications of Semantics, Monterrey, MX, pp. 1163–1181 (November 2008) 61. Stoilos, G., Grau, B.C., Horrocks, I.: How incomplete is your semantic web reasoner? In: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2010), Atlanta, USA, pp. 11–15. AAAI Press, Menlo Park (2010) 62. Sure, Y., Corcho, O., Euzenat, J., Hughes, T. (eds.): Proceedings of the 3rd ISWC Workshop on Evaluation of Ontology-based tools EON, Hiroshima, JP (November 2004) 63. Sure, Y., Gómez-Pérez, A., Daelemans, W., Reinberger, M.-L., Guarino, N., Noy, N.: Why evaluate ontology technologies? because it works! IEEE Intelligent Systems 19(4), 74–81 (2004) 64. Trojahn, C., Meilicke, C., Euzenat, J., Stuckenschmidt, H.: Automating oaei campaigns (first report). In: Proceedings of the International Workshop on Evaluation of Semantic Technologies, IWEST (2010)

192

J. Euzenat et al.

65. Tummarello, G., Delbru, R., Oren, E.: Sindice.com: Weaving the open linked data. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 552–565. Springer, Heidelberg (2007) 66. Vaccari, L., Shvaiko, P., Pane, J., Besana, P., Marchese, M.: An evaluation of ontology matching in geo-service applications. GeoInformatica (2011) (in press) 67. van Hage, W.R., Katrenko, S., Schreiber, G.: A method to combine linguistic ontologymapping techniques. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 732–744. Springer, Heidelberg (2005) 68. van Hage Robert, W., Kolb, H., Schreiber, G.: Relevance-based Evaluation of Alignment Approaches: The OAEI 2007 Food Task Revisited. In: Proceedings of the ISWC 2008 Workshop on Ontology Matching, pp. 234–238 (October 2008) 69. Šváb, O., Svatek, V., Berka, P., Rak, D., Tomášek, P.: Ontofarm: Towards an experimental collection of parallel ontologies. In: Proceedings of the 4th International Semantic Web Conference (ISWC) – Poster Track, Galway, IE, pp. 1–3 (November 2005) 70. Zimmermann, A., Euzenat, J.: Three semantics for distributed systems and their relations with alignment composition. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 16–29. Springer, Heidelberg (2006)

Author Index

Chen, Bo 102 Cleve, Anthony

Petit, Jean-Marc

66

130 Rousset, Marie-Christine

Decker, Stefan Euzenat, J´erˆ ome Groza, Tudor

Santoro, Nicola 37 Shvaiko, Pavel 158 Stuckenschmidt, Heiner

158

1

Termier, Alexandre 66 Tournaire, R´emi 66 Trojahn, C´ assia 158

Hainaut, Jean-Luc 130 Handschuh, Siegfried 1 Ling, Tok Wang

66

1

102

Mawlood-Yunis, Abdul-Rahman Meilicke, Christian 158 Meurisse, Jean-Roch 130

37

Weiss, Michael 37 Wu, Huayu 102 Xu, Liang

102

158