Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6918
Ladjel Bellatreche Filipe Mota Pinto (Eds.)
Model and Data Engineering First International Conference, MEDI 2011 Óbidos, Portugal, September 28-30, 2011 Proceedings
Volume Editors Ladjel Bellatreche Ecole Nationale Supérieure de Mécanique et d’Aérotechnique Laboratoire d’Informatique Scientifique et Industrielle Téléport 2 - avenue Clément Ader 86961 Futuroscope Chasseneuil Cedex, France E-mail:
[email protected] Filipe Mota Pinto Instituto Politécnico de Leiria Escola Superior Tecnologia e Gestão de Leiria Departamento Engenharia Informática Rua General Norton de Matos Leiria 2411-901, Portugal E-mail:
[email protected]
ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-24442-1 e-ISBN 978-3-642-24443-8 DOI 10.1007/978-3-642-24443-8 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011936866 CR Subject Classification (1998): H.3, H.4, D.2, D.3, I.2, I.6, F.1, H.5 LNCS Sublibrary: SL 2 – Programming and Software Engineering
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
The First International Conference on Model and Data Engineering (MEDI 2011) was held in Óbidos, Portugal, during September 28–30. MEDI 2011 was a forum for the dissemination of research accomplishments and for promoting the interaction and collaboration between the models and data research communities. MEDI 2011 provided an international platform for the presentation of research on models and data theory, development of advanced technologies related to models and data, and their advanced applications. This international scientific event, initiated by researchers from Euro-Mediterranean countries, also aimed at promoting the creation of north–south scientific networks, projects and faculty/student exchanges.

The conference focused on model engineering and data engineering. The scope of the papers covered the most recent and relevant topics in the areas of advanced information systems, Web services, security, mining complex databases, ontology engineering, model engineering, and formal modeling. These proceedings contain the technical papers selected for presentation at the conference. We received more than 67 papers from over 18 countries and the Program Committee finally selected 18 long papers and 8 short papers. The conference program included three invited talks, namely, "Personalization in Web Search and Data Management" by Timos Sellis, Research Center "Athena" and National Technical University of Athens, Greece; "Challenges in the Digital Information Management Space" by Girish Venkatachaliah, IBM India; and "Formal Modelling of Service-Oriented Systems" by Antónia Lopes, Faculty of Sciences, University of Lisbon, Portugal.

We would like to thank the MEDI 2011 Organizing Committee for their support and cooperation. Many thanks are due to Selma Khouri for providing a great deal of help and assistance. We are very indebted to all Program Committee members and outside reviewers, who reviewed the papers carefully and in a timely manner. We would also like to thank all the authors who submitted their papers to MEDI 2011; they provided us with an excellent technical program.

September 2011
Ladjel Bellatreche Filipe Mota Pinto
Organization
Program Chairs
Ladjel Bellatreche (LISI-ENSMA, France)
Filipe Mota Pinto (Polytechnic Institute of Leiria, Portugal)
Program Committee
El Hassan Abdelwahed (Cadi Ayyad University, Morocco)
Yamine Aït Ameur (ENSMA, France)
Reda Alhajj (Calgary University, Canada)
Franck Barbier (Pau University, France)
Maurice ter Beek (Istituto di Scienza e Tecnologie dell'Informazione, Italy)
Ladjel Bellatreche (LISI-ENSMA, France)
Boualem Benattallah (University of New South Wales, Australia)
Djamal Benslimane (Claude Bernard University, France)
Moh Boughanem (IRIT Toulouse, France)
Athman Bouguettaya (CSIRO, Australia)
Danielle Boulanger (Lyon-Jean Moulin University, France)
Azedine Boulmakoul (FST Mohammedia, Morocco)
Omar Boussaid (Eric Lyon 2 University, France)
Vassilis Christophides (ICS-FORTH Crete, Greece)
Christine Collet (INPG, France)
Alain Crolotte (Teradata, USA)
Alfredo Cuzzocrea (ICAR-NRC, Italy)
Habiba Drias (USTHB, Algeria)
Todd Eavis (Concordia University, Canada)
Johann Eder (Klagenfurt University, Austria)
Mostafa Ezziyyani (University of Abdelmalek Essâdi, Morocco)
Jamel Feki (Sfax University, Tunisia)
Pedro Furtado (Coimbra University, Portugal)
Faiez Gargouri (Sfax University, Tunisia)
Ahmad Ghazal (Teradata, USA)
Dimitra Giannakopoulou (NASA, USA)
Matteo Golfarelli (University of Bologna, Italy)
Vivekanand Gopalkrishnan (Nanyang Technological University, Singapore)
Amarnath Gupta (University of California San Diego, USA)
Mohand-Said Hacid (Claude Bernard University, France)
Sachio Hirokawa (Kyushu University, Japan)
Eleanna Kafeza (Athens University of Economics and Business, Greece)
Anna-Lena Lamprecht (TU Dortmund, Germany)
Nhan Le Thanh (Nice University, France)
Jens Lechtenborger (Münster University, Germany)
Yves Ledru (Grenoble 1 University, France)
Li Ma (Chinese Academy of Science, China)
Mimoun Malki (Sidi Bel Abbès University, Algeria)
Nikos Mamoulis (University of Hong Kong, China)
Patrick Marcel (Tours University, France)
Tiziana Margaria (Potsdam University, Germany)
Brahim Medjahed (University of Michigan - Dearborn, USA)
Dominique Mery (LORIA and Université Henri Poincaré Nancy 1, France)
Mohamed Mezghiche (Boumerdes University, Algeria)
Mukesh Mohania (IBM India)
Kazumi Nakamatsu (University of Hyogo, Japan)
Paulo Novais (Universidade do Minho, Portugal)
Carlos Ordonez (Houston University, USA)
Aris Ouksel (Illinois University, USA)
Tansel Özyer (TOBB Economics and Technology University, Turkey)
Heiko Paulheim (SAP, Germany)
Filipe Mota Pinto (Polytechnic Institute of Leiria, Portugal)
Li Qing (City University of Hong Kong, China)
Chantal Reynaud (LRI INRIA Saclay, France)
Bernardete Ribeiro (Coimbra University, Portugal)
Manuel Filipe Santos (Universidade do Minho, Portugal)
Catarina Silva (Polytechnic Institute of Leiria, Portugal)
Alkis Simitsis (HP, USA)
Veda C. Storey (Georgia State University, USA)
David Taniar (Monash University, Australia)
Panos Vassiliadis (University of Ioannina, Greece)
Virginie Wiels (ONERA, France)
Leandro Krug Wives (Federal University of Rio Grande do Sul, Brazil)
Robert Wrembel (Poznan University, Poland)
Table of Contents
Keynotes

Personalization in Web Search and Data Management (Timos Sellis) ..... 1
Challenges in the Digital Information Management Space (Girish Venkatachaliah) ..... 2
Formal Modelling of Service-Oriented Systems (Antónia Lopes) ..... 3

Ontology Engineering

Automatic Production of an Operational Information System from a Domain Ontology Enriched with Behavioral Properties (Ana Simonet) ..... 4
Schema, Ontology and Metamodel Matching - Different, But Indeed the Same? (Petko Ivanov and Konrad Voigt) ..... 18
A Framework Proposal for Ontologies Usage in Marketing Databases (Filipe Mota Pinto, Teresa Guarda, and Pedro Gago) ..... 31
Proposed Approach for Evaluating the Quality of Topic Maps (Nebrasse Ellouze, Elisabeth Métais, and Nadira Lammari) ..... 42

Web Services and Security

BH: Behavioral Handling to Enhance Powerfully and Usefully the Dynamic Semantic Web Services Composition (Mansour Mekour and Sidi Mohammed Benslimane) ..... 50
Service Oriented Grid Computing Architecture for Distributed Learning Classifier Systems (Manuel Santos, Wesley Mathew, and Filipe Pinto) ..... 62
Securing Data Warehouses: A Semi-automatic Approach for Inference Prevention at the Design Level (Salah Triki, Hanene Ben-Abdallah, Nouria Harbi, and Omar Boussaid) ..... 71

Advanced Systems

F-RT-ETM: Toward Analysis and Formalizing Real Time Transaction and Data in Real-Time Database (Mourad Kaddes, Majed Abdouli, Laurent Amanton, Mouez Ali, Rafik Bouaziz, and Bruno Sadeg) ..... 85
Characterization of OLTP I/O Workloads for Dimensioning Embedded Write Cache for Flash Memories: A Case Study (Jalil Boukhobza, Ilyes Khetib, and Pierre Olivier) ..... 97
Toward a Version Control System for Aspect Oriented Software (Hanene Cherait and Nora Bounour) ..... 110
AspeCis: An Aspect-Oriented Approach to Develop a Cooperative Information System (Mohamed Amroune, Jean-Michel Inglebert, Nacereddine Zarour, and Pierre-Jean Charrel) ..... 122

Knowledge Management

An Application of Locally Linear Model Tree Algorithm for Predictive Accuracy of Credit Scoring (Mohammad Siami, Mohammad Reza Gholamian, Javad Basiri, and Mohammad Fathian) ..... 133
Predicting Evasion Candidates in Higher Education Institutions (Remis Balaniuk, Hercules Antonio do Prado, Renato da Veiga Guadagnin, Edilson Ferneda, and Paulo Roberto Cobbe) ..... 143
Search and Analysis of Bankruptcy Cause by Classification Network (Sachio Hirokawa, Takahiro Baba, and Tetsuya Nakatoh) ..... 152
Conceptual Distance for Association Rules Post-Processing (Ramdane Maamri and Mohamed Said Hamani) ..... 162
Manufacturing Execution Systems Intellectualization: Oil and Gas Implementation Sample (Stepan Bogdan, Anton Kudinov, and Nikolay Markov) ..... 170
Get Your Jokes Right: Ask the Crowd (Joana Costa, Catarina Silva, Mário Antunes, and Bernardete Ribeiro) ..... 178

Model Specification and Verification

An Evolutionary Approach for Program Model Checking (Nassima Aleb, Zahia Tamen, and Nadjet Kamel) ..... 186
Modelling Information Fission in Output Multi-Modal Interactive Systems Using Event-B (Linda Mohand-Oussaïd, Idir Aït-Sadoune, and Yamine Aït-Ameur) ..... 200
Specification and Verification of Model-Driven Data Migration (Mohammed A. Aboulsamh and Jim Davies) ..... 214

Models Engineering

Towards a Simple Meta-Model for Complex Real-Time and Embedded Systems (Yassine Ouhammou, Emmanuel Grolleau, Michael Richard, and Pascal Richard) ..... 226
Supporting Model Based Design (Rémi Delmas, David Doose, Anthony Fernandes Pires, and Thomas Polacsek) ..... 237
Modeling Approach Using Goal Modeling and Enterprise Architecture for Business IT Alignment (Karim Doumi, Salah Baïna, and Karim Baïna) ..... 249
MDA Compliant Approach for Data Mart Schemas Generation (Hassene Choura and Jamel Feki) ..... 262
A Methodology for Standards-Driven Metamodel Fusion (András Pataricza, László Gönczy, András Kövi, and Zoltán Szatmári) ..... 270
Metamodel Matching Techniques in MDA: Challenge, Issues and Comparison (Lamine Lafi, Slimane Hammoudi, and Jamel Feki) ..... 278

Author Index ..... 287
Personalization in Web Search and Data Management Timos Sellis Research Center "Athena" and National Technical University of Athens, Greece
[email protected] (joint work with T. Dalamagas, G. Giannopoulos and A. Arvanitis)
Abstract. We address issues on web search personalization by exploiting users' search histories to train and combine multiple ranking models for result reranking. These methods aim at grouping users' clickthrough data (queries, results lists, clicked results), based either on content or on specific features that characterize the matching between queries and results and that capture implicit user search behaviors. After obtaining clusters of similar clickthrough data, we train multiple ranking functions (using the Ranking SVM model), one for each cluster. Finally, when a new query is posed, we combine ranking functions that correspond to clusters similar to the query, in order to rerank/personalize its results. We also present how to support personalization in data management systems by providing users with mechanisms for specifying their preferences. In the past, a number of methods have been proposed for ranking tuples according to user-specified preferences. These methods include, for example, top-k, skyline, and top-k dominating queries. However, none of these methods has attempted to push preference evaluation inside the core of a database management system (DBMS). Instead, all ranking algorithms or special indexes are offered on top of a DBMS, hence they are not able to exploit any optimization provided by the query optimizer. In this talk we present a framework for supporting user preference as a first-class construct inside a DBMS, by extending relational algebra with preference operators and by appropriately modifying query plans based on these preferences.
Challenges in the Digital Information Management Space Girish Venkatachaliah IBM New Delhi, India
Abstract. This keynote will address the challenges in the digital information management space with specific focus on analyzing, securing and harnessing the information, what needs to be done to foster the ecosystem and the challenges/gaps that exist and the progress that is crying to be made in the coming decade.
Formal Modelling of Service-Oriented Systems Antonia Lopes Department of Informatics, Faculty of Sciences, University of Lisbon, Portugal
[email protected]
Abstract. In service-oriented systems interactions are no longer based on fixed or programmed exchanges between specific parties but on the provisioning of services by external providers that are procured on the fly, subject to a negotiation of service level agreements (SLAs). This research addresses the challenge raised for software engineering methodology by the need to declare such requirements as part of the models of service-oriented applications, reflecting the business context in which services and activities are designed. In this talk, we report on the formal approach to service-oriented modelling that we have developed, which aims at providing formal support for modelling service-oriented systems in a way that is independent of the languages in which services are programmed and the platforms over which they run. We discuss the semantic primitives that are being provided in SRML (SENSORIA Reference Modelling Language) for modelling composite services, i.e., services whose business logic involves a number of interactions among more elementary service components as well as the invocation of services provided by external parties. This includes a logic for specifying stateful, conversational interactions, a language and semantic model for the orchestration of such interactions, and an algebraic framework supporting service discovery, selection and dynamic assembly.
Automatic Production of an Operational Information System from a Domain Ontology Enriched with Behavioral Properties Ana Simonet Agim laboratory, Faculté de Médecine, 38700 La Tronche, France
[email protected]
Abstract. The use of a domain ontology and the active collaboration between analysts and end-users are among solutions aiming to the production of an Information System compliant with end-users’ expectations. Generally, a domain ontology is used to produce only the database conceptual schema, while other diagrams are designed to represent others aspects of the domain. In order to produce a fully operational Information System, we propose to enrich a domain ontology by behavioral properties deduced from the User Requirements, expressed by the input and output data necessary to the realization of the end-users’ business tasks. This approach is implemented by the ISIS (Information System Initial Specification) system. The behavioral properties make it possible to deduce which concepts must be represented by objects, literals or indexes in the generated Information System, where the Graphical User Interface enables the users to validate the expressed needs and refine them if necessary. Keywords: Ontology, Information System, User Requirements, Database Design.
1 Introduction The design of Information Systems (IS) is confronted with more and more complex domains and more and more demanding end-users. To enable a better acceptation of the final system, various solutions have been proposed, among which the active collaboration between analysts and end-users [6] and the use of a domain ontology [17]. Such collaboration during the design of an Information System favors a better understanding of the user requirements by analysts and thus limits the risks of the rejection of the IS. However, in the short term, such collaboration increases the global cost of the project as it requires a higher availability of both parties, which may question the very feasibility of the project [4]. It also requires a common language, mastered by both parties, in order to limit the ambiguities in the communication. Ontologies have been proposed to support the communication between the various actors in a given domain: a domain ontology (in short an ontology) expresses an agreement of the actors of a domain upon its concepts and their relationships. Reusing the knowledge represented in an ontology allows the analyst to define a more L. Bellatreche and F. Mota Pinto (Eds.): MEDI 2011, LNCS 6918, pp. 4–17, 2011. © Springer-Verlag Berlin Heidelberg 2011
comprehensive and consistent database conceptual schema more quickly [17]. Moreover, as an ontology plays the role of semantic referential in a given domain, it is easier to make the resultant IS collaborate with other IS, especially when they are based on the same referential. This property is particularly true when the link between the ontology and the IS is explicitly maintained, as in [8].

However, the use of an ontology is generally limited to the sole design of the conceptual schema of the database (class diagram or E-R schema) of the IS. According to the analysis method used (e.g., UML), other diagrams (e.g., use case diagram, state transition diagram, sequence diagram, activity diagram, …) have to be designed to represent other aspects of the domain under study. A common language cannot rely on such methods, because of the number and the complexity of the models end-users have to master in order to collaborate with the analyst [12].

We have chosen a binary relational model, the ISIS data model, as the support for the common language, because such models have a limited number of meta-concepts [1] [18], easily mastered by non-computer scientists. Moreover, in order to limit the number of meta-concepts, rather than using several models to represent various aspects of a domain, we chose to use a single model and enrich it with Use Cases modeling the user requirements.

The ISIS data model has three meta-concepts: concept, binary relation and ISA relation. In ISIS a domain ontology is represented by a graph, named Ontological Diagram (OD), where the nodes represent concepts and the arcs represent the relations between concepts1. This graph is enriched with constraints (e.g., minimal and maximal cardinalities of relations, relations defining the semantic key of the instances of a concept). The criticity and the modifiability of the relations of an OD are the two main behavioral properties we have identified in order to automatically transform concepts of the ontological level into computer objects of the implementation level. The criticity of a relation expresses that this relation is necessary for at least one of the Use Cases modeling the user requirements. The modifiability of a relation with domain A and range B expresses that there exists at least one instance of A where its image in B changes over time (non-monotonicity).

However, deciding on the criticity or the modifiability of the relations of an OD is outside the capabilities of end-users collaborating in the IS design, as their knowledge is centered on the data and rules they need to perform their business tasks. This data, made explicit in the ISIS methodology through the input and output parameters of each Use Case, allows us to infer which relations are critical and/or modifiable. We then deduce the concepts that – for a given set of Use Cases – can be omitted, thus leading to a sub-ontology of the original OD. The concepts that should be represented as objects, values or indexes in the implementation of the application are proposed. Following the designer's choices, ISIS proceeds to the automatic generation of the database, the API and a prototype GUI of the IS. These software artifacts enable end-users to verify the adequacy of the IS to their needs and refine them if necessary.

The paper is organized as follows. We first present some notions of the ontological and the implementation level, then the ISIS project, its model and some properties necessary to the production of an operational system.
Finally, we present the ISIS methodology and platform through an example.
Footnote 1: Nodes model classes and attribute domains of a class diagram; arcs model attributes and roles.
2 From Ontological Level to Implementation Level

The classical design of an IS entails the conceptualization of the domain under study and the representation in several models of the analyst's perception of the static and dynamic phenomena [3]. For example, in UML, these models are: 1) an object model that supports the representation of classes, attributes, relationships… of the entities of the domain under study; 2) dynamic models that support the description of valid object life cycles and object interactions; and 3) a functional model that captures the semantics of the changes of the object state as a consequence of a service occurrence [10]. These representations allow the analyst to increase his understanding of the domain, help communication between developers and users and facilitate the implementation of the information system.

Domain ontologies represent agreed domain semantics and their fundamental asset is their independence of particular applications: an ontology consists of relatively generic knowledge that can be reused by different kinds of application [7][16]. Unlike ontologies, the design of an object2 diagram takes into account the application under consideration. In order to support the production of the target system, the database designer chooses the entities of the domain which must be represented as (computer) objects and those which must be represented by literals [5].

In the ISIS project, we attempted to establish the behavioral properties that support automatic transformations leading from a domain ontology to an operational IS. To identify these properties we relied on the ODMG norm [5]. This norm specifies the criteria to distinguish between two categories of entities: objects and literals (also named values); objects as well as literals are entities whose values can be atomic, structured or collections. Contrary to an object, whose value can change during its lifetime3, the value of a literal is immutable. In order to automatically choose the entities that must be represented by objects, the computer system must know how to decide if the value of an entity is or is not modifiable. This raises an issue that is rarely considered as such: « what does value of an entity mean? »

In a binary relational model, the value of an instance of a concept A is an element of the Cartesian product of the concepts which are the ranges of all the binary relations having A as domain. In short, the value of an instance of A is given by the set of binary relations with domain A. Consequently, the value of an instance of A is modifiable iff at least one of the binary relations with domain A is modifiable. Thus, the problems we have to solve are:
1. Among all the binary relations "inherited" from a domain ontology or defined for the application, which ones are actually necessary to model the user requirements? Such relations are called critical.
2. Which binary relations are modifiable?
An Ontological Diagram enriched with the critical and modifiable properties enables the ISIS system to deduce which concepts should be represented as values, as objects, and which ones are potential indexes. However, to produce the prototype GUI we also need to consider the sub-graphs proper to each Use Case.
Footnote 2: Also called data models.
Footnote 3: "mutable" is the term used by the ODMG group to qualify values that change over time. In the following, we use the term "modifiable" as a synonym of "mutable".
3 The ISIS Project

ISIS is the acronym for Information Systems Initial Specification. It has two models, a data model and a Use Case model. It offers a methodology and a tool for the design of an Information System (database, API and prototype GUI), from a Domain Ontology and a set of Use Cases modeling the user requirements.

All the IS design methodologies use a central model4. In this article, we will refer to such a model as a conceptual model. Contrary to other approaches that use different models of a method to represent different aspects of a domain, we use a unique model – the so-called conceptual model – to express the static properties of the entities of the domain as well as their dynamic (behavioral) properties.

Naming the concepts (e.g., vendor, customer, product, price, quantity, address …) of the domain and their interrelationships is the first step in the design of an IS following the ISIS approach. This is ontological work, and the input to ISIS can be an existing domain ontology or a micro-ontology of the considered application domain that is built at the time of the design. In both situations, the ontological structure that constitutes the input to the design process is called an Ontological Diagram (OD).

The behavioral properties are deduced from the user requirements, expressed by the Use Cases needed for the users' business tasks. We have classified the Use Cases into two categories: those whose objective is to consult existing enterprise data (e.g., for a patient, the dates of their consultations and the name of the doctor), and those in which objects can be created, modified or suppressed (e.g., create a new patient). We call the former Sel-UC (for Selection Use Case) and the latter Up-UC (for Update Use Case). Usually, a Use Case can be modeled by a single query. Each query has input and output data, represented by concepts of the OD. For example, the above Sel-UC is interpreted as (input: patient, output: consultation date, name of doctor).

In the context of an OD, the set of Sel-UC enables ISIS to determine the subgraph of the OD that is actually needed to produce the database physical schema and the API of the functional kernel of the application. The whole set of Use Cases enables ISIS to produce the prototype GUI and the operational system.

3.1 The ISIS Data Model

The ISIS model belongs to the family of binary relational models [1] [18]. Its meta-concepts are concept, binary relation (or simply relation), and subsumption (or specialization) relation. The concepts of a given domain and their relationships are represented through the OD graph.

Definitions
- A concept is an intensional view of a notion whose extensional view is a set of instances.
- A binary relation R between two concepts A (domain) and B (range), noted R(A,B), is considered in its mathematical sense: a set of pairs (a, b) with a ∈ A and b ∈ B.
- The image of x through R, noted R(x), is the set of y such that R(x,y). R(x)t is the image of x through R at time t.
Footnote 4: class diagram in object methods, conceptual schema in E-R models, logical schema in relational databases.
- An association is a pair of binary relations, reverse of one another.
- A subsumption relation holds between two concepts A and B (A subsumes B) iff B is a subset of A.

In an OD, static properties (or constraints) of concepts and relations are given. Among these, only the minimal (generally 0 or 1) and maximal (generally 1, * or n) cardinalities of relations and unicity constraints are mandatory. Other constraints, such as Domain Constraints and Inter-Attribute Dependencies, may be considered for the production of Intelligent Information Systems under knowledge-based models such as Description Logics [13] [14].

In ISIS we consider three categories of concepts: predefined concepts, primary concepts and secondary concepts. Predefined concepts correspond to predefined types in programming languages, e.g., string, real, integer. Primary and secondary concepts are built to represent the concepts specific to a domain. Primary concepts correspond to those concepts whose instances are usually considered as atomic; a secondary concept corresponds to concepts whose instances are « structured ». We name valC the relation whose domain is a primary concept C and whose range is a predefined concept (e.g., valAge, valName). Fig. 1 represents the ISIS complete diagram designed to model persons with a name and an age.
[Figure 1 shows the concepts string and integer (predefined), name and age (primary) and person (secondary), linked by the relations valName, valAge, nameOf and ageOf (cardinality 1..1 towards the value side) and their reverse relations personWithName and personWithAge (cardinality 1..*).]
Fig. 1. ISIS complete OD modeling person name and age
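As a reading aid, the diagram of Fig. 1 can be encoded with the three ISIS meta-concepts roughly as in the sketch below. This is not part of the paper or of the ISIS tool: the Python classes and field names are our own illustrative choices, the cardinalities are read off the figure, and the critical/modifiable flags anticipate the behavioral properties introduced in the next subsections.

```python
from dataclasses import dataclass

# Concept categories used in the ISIS data model (Sect. 3.1).
PREDEFINED, PRIMARY, SECONDARY = "predefined", "primary", "secondary"

@dataclass(frozen=True)
class Concept:
    name: str
    kind: str                   # predefined | primary | secondary

@dataclass
class Relation:
    name: str
    domain: Concept
    range: Concept
    min_card: int = 1
    max_card: object = 1        # 1, an integer n, or "*" for unbounded
    critical: bool = False      # deduced later from the Sel-UC queries
    modifiable: bool = False    # deduced later from the Up-UC queries

# Ontological Diagram of Fig. 1: persons with a name and an age.
STRING  = Concept("string",  PREDEFINED)
INTEGER = Concept("integer", PREDEFINED)
NAME    = Concept("name",    PRIMARY)
AGE     = Concept("age",     PRIMARY)
PERSON  = Concept("person",  SECONDARY)

od = [
    Relation("valName",        NAME,   STRING,  1, 1),
    Relation("valAge",         AGE,    INTEGER, 1, 1),
    Relation("nameOf",         PERSON, NAME,    1, 1),
    Relation("ageOf",          PERSON, AGE,     1, 1),
    Relation("personWithName", NAME,   PERSON,  1, "*"),
    Relation("personWithAge",  AGE,    PERSON,  1, "*"),
]
```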
Representing the predefined concepts of an OD increases its complexity. Thus predefined concepts and their relations with primary concepts are masked in the external representation of the OD (see the ISIS diagrams that follow). The static constraints govern the production of the logical database schema. Behavioral constraints are necessary to automatically produce the physical database schema and the associated software. In ISIS, the main behavioral properties5 that are considered are the criticity and the modifiability6 of a relation [15].

3.2 Critical Relations

The ultimate purpose of an IS is expressed through its selection queries, hence our choice of these queries to decide which relations are critical. Our main criterion in selecting the critical relations is to consider the relations participating in at least one selection query of Sel-UC (critical query). However, the designer can decide to make any relation critical, independently of critical queries. The update queries are needed to ensure that, at every moment, data in the IS comply with data in the real world; they are not considered in the determination of the critical relations.
Footnote 5: In an object model the behavioral properties are expressed as class methods.
Footnote 6: A modifiable relation is a non-monotonous relation.
Definitions
- A selection query is critical iff it is part of Sel-UC.
- A selection query Q is defined by a triple (I, O, P) where: I is the set of input concepts of Q, O is the set of output concepts of Q, and P is a set of paths in the OD graph.
- The triple (I, O, P) defines a subgraph of the OD.
- A path p(i, o) in a query (I, O, P) is an ordered set of relations connecting i ∈ I to o ∈ O.
- A binary relation is critical iff it belongs to at least one critical selection query or if it has been explicitly made critical by the designer.
- Given a concept CC, domain of the critical relations r1, r2, …, rn, with C1, C2, …, Cn the range concepts of r1, r2, …, rn: the value of an instance cck ∈ CC is an element of the Cartesian product C′1 × C′2 × … × C′n, where C′i = Ci if ri is monovalued (max. card. = 1) and C′i = P(Ci)7 if ri is multivalued (max. card. ≥ 1).
- An association in an OD is critical iff at least one of its relations is critical.

3.3 Modifiable Relations and Concepts

Definition
- Given a relation R(A, B), R is modifiable iff there exists a ∈ A such that R(a)t is different from R(a)t+1.

To express the modifiability property we had to extend the classical binary relational model, which has only two categories of nodes8, to a model with three types of nodes (§3.1). This difference is illustrated by Fig. 1 and Fig. 2. Fig. 2, extracted from [1], represents the Z0 schema of a set person with two access functions9, ageOf and nameOf, where the notion of access function is derived from that of relation. This representation induces a representation of ageOf and nameOf as attributes of a class/entity (in the object/E-R model) or of a table (in the relational model) person.
[Figure 2 shows the set PERSON with the access functions nameOf (to STRING) and ageOf (to INTEGER), with cardinalities 1..1 towards the value sets and 1..* in the reverse direction.]
Fig. 2. Z0 schema modeling persons with name and age [1]
In ISIS (Fig. 1), person, age and name are concepts, and so are string and integer. Thanks to the behavioral properties, ISIS will propose that a given concept be represented as an object or as a literal, according to the ODMG classification [5], and among the objects propose those that are candidates to become database indexes.
Footnote 7: P(E) represents the set of parts of E.
Footnote 8: E.g., in Z0, concrete (structured) and abstract (atomic) sets; lexical and non-lexical in NIAM [18].
Footnote 9: « An access function is a function which maps one category into the powerset of another ».
Let us consider the concept person represented in Fig. 1 and Fig. 2, and the Use Case change the age of a person in the context of Korea and other Asian countries, where a person changes his age on Jan. 1st at 0h [11]. In common design situations, representing age by a literal or by an object is (manually) decided by the designer. If he is aware that an age update on Jan. 1st will concern millions or billions of persons, he will choose an object representation for age, which leads to at most 140 updates (if the age ranges from 0 to 140) instead of millions or billions with a representation as an attribute. This «best» solution cannot be automatically produced from the diagram of Fig. 2, where the only relation10 that may be modifiable is ageOf. In the ISIS representation (Fig. 1) two relations are potentially modifiable: ageOf and valAge. Making ageOf modifiable models the update of one person, whereas making valAge modifiable models the update of all the persons with a given age. The best modeling for Korea is then to consider valAge as a modifiable relation, but the best modeling for Europe is to consider ageOf as the modifiable relation. Distinguishing primary concepts and predefined concepts is necessary to differentiate these two ways of modeling the update of the age of a person. The same reasoning applies to primary concepts such as salary: either change the salary of one person (who has been promoted) or change the salary of all the persons belonging to a given category.

Definition
- A concept is said to be a concept with instances with modifiable values, or simply a modifiable concept11, iff it is the domain of at least one binary relation that is both critical and modifiable.
- A primary concept CP has a semantic identifier iff its relation valCP is not modifiable.
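The following small illustration is ours, not from the paper; the person count, dictionary layout and function names are arbitrary. It contrasts the two modeling choices: with ageOf modifiable, every person record is touched on Jan. 1st, whereas with valAge modifiable, only the shared age objects are.

```python
# "All persons get one year older on Jan. 1st" under the two modeling choices.
N_PERSONS = 100_000

# (a) European modeling: ageOf is the modifiable relation; age is stored
#     as a literal attribute of each person, so one update per person.
persons_eu = [{"name": f"p{i}", "age": i % 100} for i in range(N_PERSONS)]

def new_year_ageof_modifiable(persons):
    for p in persons:
        p["age"] += 1
    return len(persons)                  # number of updated entities

# (b) Korean modeling: valAge is the modifiable relation; persons reference
#     shared age objects whose value changes, so at most ~140 updates in total.
ages = {a: {"valAge": a} for a in range(0, 140)}
persons_kr = [{"name": f"p{i}", "ageOf": ages[i % 100]} for i in range(N_PERSONS)]

def new_year_valage_modifiable(age_objects):
    for obj in age_objects.values():
        obj["valAge"] += 1
    return len(age_objects)              # number of updated entities

print(new_year_ageof_modifiable(persons_eu))   # 100000
print(new_year_valage_modifiable(ages))        # 140
```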
Considering the example of Fig. 1, we can model person as a modifiable concept for European countries, whereas in the Korean case the modifiable concept is age. In a European context, age has a semantic identifier, whereas in the Korean one it has not.

3.4 Object/Literal Deduction

Conceptually, a computer object is represented by a pair (oid, value), which provides a unique representation of the value of an object, as the oid is non-modifiable and is used to reference the object wherever it is used. We present two of the rules used to infer which concepts should be represented by computer objects in an application.

Rule 1: A modifiable concept is an object concept.

Considering the object/value duality, only an object has an autonomous existence. A value does not exist by itself but through the objects that « contain » it, hence our second proposal.

Rule 2: A concept t, domain of a partial12 relation rel which belongs to a critical association, is an object concept.

A concept that is not an object concept is a value concept.
Footnote 10: Called access function in Z0.
Footnote 11: Note that it is not the concept itself that is mutable but its instances.
Footnote 12: Minimal cardinality equals 0.
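The two rules can be sketched as follows. The relation table reflects the European variant of Fig. 1; the dictionary layout and function names are illustrative assumptions of ours rather than ISIS code.

```python
# Relations of the European variant of Fig. 1, with the flags deduced from the Use
# Cases (critical, modifiable); "reverse" names the other relation of the association.
relations = {
    "nameOf":  dict(domain="person", min_card=1, critical=True, modifiable=False, reverse="personWithName"),
    "ageOf":   dict(domain="person", min_card=1, critical=True, modifiable=True,  reverse="personWithAge"),
    "valName": dict(domain="name",   min_card=1, critical=True, modifiable=False, reverse=None),
    "valAge":  dict(domain="age",    min_card=1, critical=True, modifiable=False, reverse=None),
}

def association_is_critical(rel_name):
    rel = relations[rel_name]
    rev = relations.get(rel["reverse"]) if rel["reverse"] else None
    return rel["critical"] or (rev is not None and rev["critical"])

def is_object_concept(concept):
    outgoing = [(n, r) for n, r in relations.items() if r["domain"] == concept]
    # Rule 1: domain of a relation that is both critical and modifiable (modifiable concept).
    rule1 = any(r["critical"] and r["modifiable"] for _, r in outgoing)
    # Rule 2: domain of a partial relation (min. card. = 0) in a critical association.
    rule2 = any(r["min_card"] == 0 and association_is_critical(n) for n, r in outgoing)
    return rule1 or rule2

print({c: "object" if is_object_concept(c) else "value" for c in ("person", "age", "name")})
# -> {'person': 'object', 'age': 'value', 'name': 'value'}
```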
3.5 Deduction of Potential Indexes

An index is usually perceived as an auxiliary structure designed to speed up the evaluation of queries. Although its internal structure can be complex (B-tree, Bitmap, BANG file, UB-tree … [9]), an index can be logically seen as a table with two entries: the indexing key, i.e., the attribute(s) used to index a collection of objects, and the address of the indexed object (tuple, record …). However, a deeper examination of indexing structures reveals a more complex situation. The component that manages an index:
1. is a generic component instantiated every time a new indexing is required;
2. contains procedures to create, modify, and delete objects from an index, i.e., objects of the indexing structure. These procedures are implicitly called by the procedures and functions that create, modify or delete objects from the class/table being indexed;
3. contains procedures to retrieve objects of the indexed class/table in an efficient manner. Again, a programmer does not explicitly call such procedures and functions.

From the moment at which the programmer asks for the creation of an index, for example on a table, the computer system fully manages this auxiliary internal structure. Therefore an index can be seen as a generic class of objects transparently managed by the computer system. The generic class index has two attributes: the first one, indexValue: indexingValue, provides its identifier, and the second, indexedElement: SetOf (indexedConcept), gives the address of the indexed object (or the indexed objects if duplicates are accepted). In order to allow the automatic management of the generic class « index », its identifier must be a semantic identifier.

When the index models a simple index, indexingValue is a primary concept and the (implicit) relation valIndexingValue is not modifiable. For example, if one wants to index the table person by age, valAge, with domain age and range integer, must not be modifiable, in order to play the role of semantic key of indexingValue, and, transitively, of index. When the index is a complex one, indexingValue represents an implicit secondary concept that is domain of the relations r1, r2, … rn whose ranges are the primary concepts that define the value of the indexing key. Each of these primary concepts must have a semantic identifier. The Cartesian product of these semantic identifiers constitutes the semantic key of the index concept.

Definition
A concept c is a Potential Index concept iff 1) it has a semantic identifier and 2) it is the domain of one and only one critical relation (apart from the relation defining its semantic identifier).

Following this definition, the primary concept indexingValue whose relation valIndexingValue is non-modifiable satisfies the first condition. Such concepts are Potential Index concepts if the second condition of the definition is also satisfied. Considering Fig. 1 and a European application, age may be a Potential Index because the relation valAge is a non-modifiable relation. It is effectively a Potential Index iff the relation personWithAge is a critical relation. In a Korean IS [11], where valAge is a modifiable relation, age cannot be a Potential Index concept because it does not have a semantic identifier.
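Under the same kind of illustrative relation table, the definition can be checked mechanically. The flag values below correspond to a European application where age is used as a search criterion; both the table layout and that assumption are ours, for the example only.

```python
# Relation table for the age concept in the European variant of Fig. 1
# (flag values are illustrative; in ISIS they are deduced from the Use Cases).
relations = {
    "valAge":        dict(domain="age", modifiable=False, critical=True, is_identifier=True),
    "personWithAge": dict(domain="age", modifiable=True,  critical=True, is_identifier=False),
}

def is_potential_index(concept):
    outgoing = [r for r in relations.values() if r["domain"] == concept]
    identifiers = [r for r in outgoing if r["is_identifier"]]
    other_critical = [r for r in outgoing if not r["is_identifier"] and r["critical"]]
    # 1) the semantic identifier exists and is not modifiable;
    # 2) exactly one other critical relation leaves the concept.
    has_semantic_id = bool(identifiers) and not identifiers[0]["modifiable"]
    return has_semantic_id and len(other_critical) == 1

print(is_potential_index("age"))   # True; False in the Korean variant, where valAge is modifiable
```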
4 ISIS Methodology and Tool through an Example

The first step in the ISIS approach is the design or the import of an OD. The OD can be checked for well-formedness (absence of cycles, no relation between two primary concepts …). Fig. 3 shows a simplified OD of a concert management application.
Fig. 3. OD of a concert management application
On the OD of Fig. 3 are represented: 1) primary concepts (e.g., style, name, date …); 2) secondary concepts (e.g., group, contact, concert …); 3) ISA relations (pastConcert is a subconcept of concert); and 4) binary relations (e.g., concertOfGroup13, styleOfGroup).
Pairs such as (1,1) or (0,*) represent respectively the minimal and maximal cardinalities of a binary relation. concertOfGroup and groupOfConcert form a binary association. The relation defining the semantic identifier (key) of a secondary concept is represented by an arc with black borders and a key symbol attached to the concept.

Representation of Use Cases

To enable ISIS to deduce the behavioral properties, one annotates the OD with the input and output parameters of each Use Case (UC). Let us consider:
UC1 – concerts given by a group: given a group (identified by its name), find the concerts it has given; for each concert, display its date, its benefit, its number of spectators, the name of the concert place, of the town and of the country.
UC2 – planned concerts of a group: given a group (name), display its style, the date and the price of each concert, the name of the city and its access, the name of the country and its currency.
UC3 – information about a concert place: for a concert place, display its max number of spectators, location price, phone and email, city and country, and the concerts (date, name and style of group).
UC4 – groups of a style: for a given style, display the name of the groups and their contact.
UC5 – new group: create a new group.
UC6 – new concert: create a new concert.
UC7 – new concert place: create a new concert place.
UC8 – update concert: update the benefit and the number of spectators of a concert.
Footnote 13: The name aaOfBb is automatically produced by ISIS for a relation with domain aa and range bb. This name can be changed. The name of a relation is optionally shown on the OD.
The first four are Sel-UC and the last four are Up-UC. For each UC the designer identifies the concepts it concerns and annotates them as input or output14 concepts.
UC1: in {name(group)}, out {date, benefit, nbSpectators, name(concertPlace), name(city), name(country)};
UC2: in {name(group)}, out {style, date, ticketPrice, name(city), name(country), currency};
UC3: in {name(concertPlace)}, out {maxSpectators, locationPrice, phone, email, name(city), name(country), date, name(group), style};
UC4: in {style}, out {name(group), name(contact)};
UC5: in {group}; UC6: in {concert}; UC7: in {concertPlace};
UC8: in {concert}, out {benefit, nbSpectators}.
Footnote 14: Input and output concepts correspond to input and output concepts of the procedures of the functional kernel of the IS.
Fig. 4 illustrates UC1. The designer annotates the concept name(group) as input concept (downward arrow) and the concepts date, benefit, nbSpectators, name(place) and name(city) as output concepts (upward arrow). ISIS calculates and presents the paths between the input and the output concepts. The intermediate concepts, such as concert, are automatically annotated with a flag. All the relations of a path in a query of Sel-UC are automatically annotated by ISIS as critical. In Fig. 4, groupOfName, concertOfGroup, dateOfConcert, … are critical. When different paths are possible between an input concept and an output concept, in order to ensure the semantics of the query, the designer must choose the intermediate concepts by moving the flag(s). When there are several relations between two concepts, the designer must select one of them.
Fig. 4. Subgraph of the Use Case UC1
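To illustrate how the annotations of a Sel-UC turn into critical relations, the rough sketch below runs UC1 over a hand-made fragment of the concert OD. The edge list and most relation names (benefitOfConcert, nameOfCity, …) are invented for the example, since only groupOfName, concertOfGroup and dateOfConcert are named in the text; associations are treated as navigable in both directions.

```python
from collections import deque

# Simplified fragment of the concert OD (Fig. 3); each edge stands for an
# association between two concepts, labelled by one of its relations.
edges = [
    ("name(group)",  "group",               "groupOfName"),
    ("group",        "concert",             "concertOfGroup"),
    ("concert",      "date",                "dateOfConcert"),
    ("concert",      "benefit",             "benefitOfConcert"),
    ("concert",      "nbSpectators",        "nbSpectatorsOfConcert"),
    ("concert",      "concertPlace",        "concertPlaceOfConcert"),
    ("concertPlace", "name(concertPlace)",  "nameOfConcertPlace"),
    ("concertPlace", "city",                "cityOfConcertPlace"),
    ("city",         "name(city)",          "nameOfCity"),
    ("city",         "country",             "countryOfCity"),
    ("country",      "name(country)",       "nameOfCountry"),
]

def path_relations(src, dst):
    """Relation labels on a shortest path between two concepts (BFS over associations)."""
    graph = {}
    for a, b, rel in edges:
        graph.setdefault(a, []).append((b, rel))
        graph.setdefault(b, []).append((a, rel))
    prev, queue, seen = {}, deque([src]), {src}
    while queue:
        node = queue.popleft()
        if node == dst:
            rels, n = [], node
            while n != src:
                n, rel = prev[n]
                rels.append(rel)
            return rels
        for nxt, rel in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                prev[nxt] = (node, rel)
                queue.append(nxt)
    return []

# UC1: in {name(group)}, out {date, benefit, nbSpectators, name(concertPlace), name(city), name(country)}
uc1_out = ["date", "benefit", "nbSpectators", "name(concertPlace)", "name(city)", "name(country)"]
critical = set()
for out in uc1_out:
    critical.update(path_relations("name(group)", out))
print(sorted(critical))   # every relation traversed by UC1 is marked critical
```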
Sub-Ontology Extraction

From the annotations of all the queries in Sel-UC, ISIS deduces a diagram that is the smallest subgraph that contains the subgraphs of all the queries of Sel-UC and proposes to suppress the concepts that are not needed. For example, if the only query of the IS is the one presented in Fig. 4, the relations of the associations between group-contact or between group-style are not critical. Thus, the designer must decide
either to suppress these associations or to make critical one of their relations. When a concept becomes isolated from the other concepts of the OD, it is suppressed. This constitutes the first phase of the simplification process where the objective is to determine the sub-ontology of an application.

Diagram Simplification

Before generating the IS, ISIS proceeds to a second phase of simplification, by proposing to eliminate the concepts that do not bear information significant for the business process. For example, considering again the query of Fig. 4 as the only query of the IS, the concept region can be eliminated, as it acts only as an intermediate concept linking city to country. Contrary to the first simplification phase, the result of the second one may depend on the order of the choices. Fig. 5 shows the simplified sub-ontology obtained by taking into account the whole set of Sel-UC. The concepts contact, region, address, name(region) … have been suppressed and will not appear in the generated application, because they are not used in any of the four Sel-UC of this example and the designer has agreed with their suppression.

Object, Value and Index Deduction

From the update queries of Up-UC, ISIS deduces which relations are modifiable (cf. § 3.3). For example, UC6 (new concert) enables ISIS to deduce that the relations concertOfGroup and concertOfConcertPlace are modifiable. Considering the critical modifiable relations, ISIS deduces which concepts should be represented as values or as objects, and among the latter which ones are proposed to become indexes of the generated database. Again, the designer may decide to make other choices. Fig. 5 also shows the object-value-index deductions on the example.
Fig. 5. Simplified OD with Object-Value-Index deductions for the concert example
• group, concert, pastConcert, concertPlace and city are object (secondary) concepts; access is an object (primary) concept.
• name(group), style, date, name(concertPlace) and name(country) are potential indexes.
• The other concepts are value concepts.
As country is not the domain of any modifiable critical relation, ISIS does not propose to implement it as an object, but the designer can decide to make a different choice. If the ISIS proposal is accepted, name(country) becomes a potential index of city.

Generation of Software Artifacts

In the last step ISIS generates the application, i.e., the database, the API (i.e., the code of the queries of the Use Cases) and a prototype GUI. Fig. 6 shows the GUI corresponding to UC1 (concerts given by a group) in the PHP-MySQL application that is automatically generated.
Fig. 6. Prototype GUI: screen copy of the window generated for UC1
The prototype GUI has Spartan ergonomics: first the monovalued attributes are presented in alphabetical order, then the multivalued attributes if any. In spite of these basic ergonomics, it enables users to verify the items and their type. They can also check whether the dynamics of windows corresponds to the needs of their business process.
5 Conclusion and Perspectives

The reuse of a domain ontology and the collaboration between analysts and end-users during the design phase of the IS are two of the solutions proposed to favor a better acceptance of the final system. Generally the domain ontology is only used to support the design of the conceptual schema of the IS database [8][17]. From our experience, a conceptual database schema (e.g., UML class diagram or E-R schema) concerns analysts rather than end-users, whose knowledge is not sufficient to master the meta-concepts that are used and who, consequently, are only able to validate the terms used. Moreover, as they interpret them in their own cultural context, two users validating the same schema may actually expect different systems. An active collaboration between designers and end-users necessitates a common language, mastered by both parties, in order to enable them to quickly identify possible misunderstandings [6]. It also requires a high degree of availability of both parties in order that user requirements and business rules be understood by the designer [4]. To avoid increasing the cost of the project, we propose a common
language based on a single model and aim at the automatic production of an operational IS that can be immediately tested by end-users. The common language is based on a binary relational model, which has a limited number of meta-concepts. Contrary to other methods that propose several models to represent the static and the dynamic properties of the entities of a domain, in ISIS we chose to enrich the ontological diagram with the Use Cases representing the user requirements. This enrichment allows deducing the subgraph proper to each functionality of the IS. It also allows the deduction of the behavioral properties of the concepts of a domain, properties which, in an object model, are expressed by the methods of the business classes. The two main behavioral properties we have identified are the criticity and the modifiability of a relation [15]. However, deciding which relations are critical or modifiable is outside the capabilities of end-users, whereas they know the data they use for their business tasks. This data is made explicit in ISIS through the input and output parameters of the queries of the Use Cases. From these parameters ISIS infers which relations are critical and/or modifiable. ISIS then deduces and proposes the concepts that should be omitted. For the concepts belonging to the sub-ontology of the application, ISIS proposes the concepts that should be represented as values, objects, or indexes at the implementation level. The designer can accept or refuse these proposals. ISIS then proceeds to the automatic generation of the database, the API and a prototype GUI of the IS. This approach leads to a reduction of the cycle « expression-refinement of needs / production of target system / validation » during the analysis process. Consequently, the number of these cycles can grow without increasing the global cost of the project and the final result can be close to the real needs of the users.

The current ISIS tool has been developed in Java with a dynamic web interface. It generates a PHP-MySQL application. A console also enables the programmer to write SQL code, which makes it possible to write more complex queries. ISIS is currently being used for the design of an ontological diagram of « quality » in computer-assisted surgery [2]. It will support the design of an IS to study the « quality » of an augmented surgery device. Future work encompasses the introduction of constraints as pre-conditions of a query in order to model the relationship between a Use Case and the state of the objects it uses, the generation of UML and E-R diagrams [15], and the use of the ISIS methodology for the integration of heterogeneous databases. Integrating linguistic tools to help the designer select the input and output concepts necessary for the Use Cases is also a future step of the ISIS project.

Acknowledgments. The author wants to thank Michel Simonet, who played a central role in the gestation and the development of the ISIS project. She also thanks Eric Céret, who designed the current web version of ISIS, and Loïc Cellier who continues its development. She is grateful to Cyr-Gabin Bassolet, who designed and implemented the early prototypes and participated actively in the first phases of the project.
References
1. Abrial, J.R.: Data Semantics. In: Klimbie, J.W., Koffeman, K.I. (eds.) Database Management, pp. 1–59. North-Holland, Amsterdam (1974)
2. Banihachemi, J.-J., Moreau-Gaudry, A., Simonet, A., Saragaglia, D., Merloz, P., Cinquin, P., Simonet, M.: Vers une structuration du domaine des connaissances de la Chirurgie Augmentée par une approche ontologique. In: Journées Francophones sur les Ontologies, JFO 2008, Lyon (2008)
3. Burton-Jones, A., Meso, P.: Conceptualizing Systems for Understanding: An Empirical Test of Decomposition Principles in Object-Oriented Analysis. Information Systems Research 17(1), 38–60 (2006)
4. Butler, B., Fitzgerald, A.: A case study of user participation in information systems development process. In: 8th Int. Conf. on Information Systems, Atlanta, pp. 411–426 (1997)
5. Cattell, R.G.G., Atwood, T., Duhl, J., Ferran, G., Loomis, M., Wade, D.: Object Database Standard: ODMG 1993. Morgan Kaufmann Publishers, San Francisco (1994)
6. Cavaye, A.: User Participation in System Development Revisited. Information and Management (28), 311–323 (1995)
7. Dillon, T., Chang, E., Hadzic, M., Wongthongtham, P.: Differentiating Conceptual Modelling from Data Modelling, Knowledge Modelling and Ontology Modelling and a Notation for Ontology Modelling. In: Proc. 5th Asia-Pacific Conf. on Conceptual Modelling (2008)
8. Fankam, C., Bellatreche, L., Dehainsala, H., Ait Ameur, Y., Pierra, G.: SISRO: Conception de bases de données à partir d'ontologies de domaine. Revue TSI 28, 1–29 (2009)
9. Housseno, S., Simonet, A., Simonet, M.: UB-tree Indexing For Semantic Query Optimization of Range Queries. In: International Conference on Computer, Electrical, and Systems Science, and Engineering, CESSE 2009, Bali, Indonesia (2009)
10. Isfran, I., Pastor, O., Wieringa, R.: Requirements Engineering-Based Conceptual Modelling. Requirements Engineering 7, 61–72 (2002)
11. Park, J., Ram, S.: Information Systems: What Lies Beneath. ACM Transactions on Information Systems 22(4), 595–632 (2004)
12. Pastor, O., Gomez, J., Insfran, E., Pelechano, E.: The OO-Method for information system modeling: from object-oriented conceptual modeling to automated programming. Information Systems 26, 507–534 (2001)
13. Roger, M., Simonet, A., Simonet, M.: A Description Logic-like Model for a Knowledge and Data Management System. In: Ibrahim, M., Küng, J., Revell, N. (eds.) DEXA 2000. LNCS, vol. 1873, p. 563. Springer, Heidelberg (2000)
14. Simonet, A., Simonet, M.: Objects with Views and Constraints: from Databases to Knowledge Bases. In: Patel, D., Sun, Y., Patel, S. (eds.) Object-Oriented Information Systems, OOIS 1994, pp. 182–197. Springer, London (1994)
15. Simonet, A.: Conception, Modélisation et Implantation de Systèmes d'Information. Habilitation à Diriger des Recherches. Université de Grenoble (2010)
16. Spyns, P., Meersman, R., Jarrar, M.: Data modeling versus Ontology engineering. SIGMOD Record 31(4), 12–17 (2002)
17. Sugumaran, V., Storey, V.C.: The role of domain ontologies in database design: An ontology management and conceptual modeling environment. ACM Trans. Database Syst. 31, 1064–1094 (2006)
18. Weber, R.: Are Attributes Entities? A Study of Database Designers' Memory Structures. Information Systems Research 7(2), 137–162 (1996)
Schema, Ontology and Metamodel Matching
Different, But Indeed the Same?

Petko Ivanov and Konrad Voigt

SAP Research Center Dresden, Chemnitzer Strasse 48, 01187 Dresden, Germany
{p.ivanov,konrad.voigt}@sap.com
Abstract. During the last decades data integration has been a challenge for applications processing multiple heterogeneous data sources. It has been faced across the domains of schemas, ontologies, and metamodels, inevitably imposing the need for mapping specifications. Support for the development of such mappings has been researched intensively, producing matching systems that automatically propose mapping suggestions. Since an overall relation between these systems is missing, we present a comparison and overview of 15 systems for schema, ontology, and metamodel matching. Thereby, we pursue a structured analysis of applied state-of-the-art matching techniques and the internal models of matching systems. The result is a comparison of matching systems, highlighting their commonalities and differences in terms of matching techniques and used information for matching, demonstrating significant similarities between the systems. Based on this, our work also identifies possible knowledge sharing between the domains, e.g. by describing techniques adoptable from another domain.
1 Introduction
For the last decades data integration has been a well-known challenge for applications processing multiple heterogeneous data sources [1]. The fundamental problem concerns the exchange of data and interoperability between two systems developed independently of each other. Usually, each system uses its own data format for processing. To avoid a reimplementation of a system, a mapping between the different system formats is needed. The specification of such a mapping is the task of matching, i.e. the specification of semantic correspondences between the formats' elements. This task is tedious, repetitive, and error-prone if performed manually; therefore, support by semi-automatic calculation of such correspondences has been proposed [2]. Several systems have been developed to support the task of schema, ontology, and metamodel matching by the calculation of correspondences. Although all systems tackle the problem of meta data matching, they were and are researched in a relatively independent manner. Therefore, we want to provide an overview of matching systems from all three domains. This overview facilitates the choice of
a matching system for a given matching problem. Furthermore, it identifies how knowledge from other domains can be reused in order to improve a matching system.
Prior to studying matching systems, one needs to clarify the relation of the domains of schemas, ontologies, and metamodels. In this work, we adopt the perspective of Aßmann et al. [3], who studied the relation of ontologies and metamodels. We extended the perspective by including XML schemas. Schemas, ontologies, and metamodels provide vocabulary for a language and define validity rules for the elements of the language. The difference lies in the nature of the language: it is either prescriptive or descriptive. Schemas and metamodels are restrictive specifications, i.e. they specify and restrict a domain in a data model and systems specification, hence they are prescriptive. As a complement, ontologies are descriptive specifications and as such focus on the description of the environment. Therefore, while using a similar vocabulary made of linguistic and structural information, the three domains differ in their purpose.
Having a different purpose, matching systems for the three domains of schema, ontology, and metamodel matching have been developed independently. First, (1) schema matching systems have been developed mainly to support business and data integration, and schema evolution [2,4,5,6,7]. Thereby, the schema matching systems take advantage of the explicitly defined tree structure. Second, with the advent of the Semantic Web, (2) ontology matching systems are dedicated to ontology evolution and merging, as well as semantic web service composition and matchmaking [8,9,10,11,12,13]. They are especially of use in the biological domain for aligning large taxonomies as well as for classification tasks. Finally, in the context of MDA, an area in which refinement and model transformation are required, (3) metamodel matching systems are concerned with model transformation development, with the purpose of data integration as well as metamodel evolution [14,15,16,17,18].
In this paper, we investigate 15 matching systems from the three domains of schema, ontology, and metamodel matching, showing their commonalities. We take a closer look at the matching techniques of each system, arrange the systems in an adopted classification, and present an analysis of the matching systems' data models to answer questions about their similarities and differences. This allows us to compare the matching systems and analyze transferable matching techniques, i.e. techniques from one domain which may be adopted by another. Moreover, from an overview of the state-of-the-art systems' internal data models we derive commonalities among these models and conclude with the transferability of matching techniques.
We organize our paper as follows: in Sects. 2 and 3 we introduce our approach to selecting and comparing the matching systems. In the subsequent Sect. 4 we present the classification of matching techniques and internal models and arrange the matching systems accordingly. Thereby, we highlight cross-domain matching techniques as well as the overall matching technique distribution. We conclude our paper in Sect. 5 by giving a summary and an outlook on open questions as well as future work.
2 Analysis Approach
Numerous matching systems have evolved which try to deal with the matching problem in different domains. The systems apply various matching strategies, use different internal representations of the data being matched, and apply different strategies to aggregate and to select final results. The present variety of matching systems confronts the user with a difficult choice as to which systems to use in which case. Aiming at an outline of commonalities and differences between the systems, we performed a systematic comparison of the matching techniques applied and the internal data models used by the matching systems. The comparison consists of several steps:
1. Selection of matching systems. In the schema and the ontology domain alone there are more than 50 different matching systems. Therefore, we base our selection of matching systems on the following criteria:
– Matching domain. The selected systems include representatives from all three domains where the matching problem occurs, namely the schema, ontology, and metamodel domain. We group the systems according to their main domain of application. It has to be noted that there exist several systems which can be applied in more than one domain, which is addressed in Sect. 4.
– Availability/Actuality. The selected systems from all domains are systems that were either developed after 2007 or are still being actively worked on.
– Quality. The selection of the systems is based on their provided matching quality, if available. For example, in the ontology domain, where evaluation competitions exist, only systems that give the best results were selected.
– Novel approaches. Additionally, we selected systems that represent approaches different from the classical one to deal with the matching problem.
2. Applied matching techniques. To cover the functionality of the matching systems we also studied the matching techniques that they apply. For this purpose we adopted an existing classification of matching techniques by Euzenat and Shvaiko [19]. The classification is based on and extends the classification of automated schema matching approaches by Rahm and Bernstein, presented in [20]. It considers different aspects of the matching techniques and defines the basic groups of techniques that exist nowadays. We arrange the selected systems according to the classification, additionally pointing out the domain of application, thus showing not only the applicability of the systems in the different domains but also the main groups of matching techniques that are shared between the systems. More details about this step are presented in Sect. 4.1.
3. Classification of data models. The internal data representation, also called internal data model or, for short, data model, of a matching system influences what kind of matching techniques can be applied by the matching system
depending on the information that the model represents. To examine the similarities and differences of the models, we extracted the information that an internal model could provide for matching and arranged the data models of the selected systems according to it. More information about this step is given in Sect. 4.2.
3 Selection of Matching Systems
This section describes the performed selection of fifteen matching systems from three domains, based on the criteria described in Sect. 2.
Schema Matching Systems. The selection in this domain is based on a survey of schema-based matching approaches [21], where three systems applicable in the schema matching domain were selected as representatives of the group. The selected systems are COMA++ [5], Cupid [6], and the similarity flooding algorithm [4]. Another system, whose main application domain is the schema domain and which has been actively developed during the last years, is GeRoMe [7]. It applies a novel approach by using a role-based model, which was another reason to include it in this work. It has to be noted that COMA++ was extended to be also applicable in the ontology domain. GeRoMe has a generic role model and is also applied in the ontology domain. Similarity flooding was first implemented for schema matching, but nowadays is adapted and used in systems in other domains, e.g. [10,7].
Ontology Matching Systems. The selection of matching systems that are primarily used in the ontology domain is based on their performance in the Ontology Alignment Evaluation Initiative (OAEI) contest (http://oaei.ontologymatching.org/). The goals of OAEI are to assess the strengths and weaknesses of matching systems. The contest includes a variety of tests, ranging from a series of benchmark and anatomy tests to tests over matching of large web directories, libraries, and very large cross-lingual resources. Since the creation of the OAEI in 2004, more than 30 different systems have taken part. For the purpose of the presented work, six systems were chosen that performed best in the benchmark tests of the 2007, 2008, 2009, and 2010 contests, namely Anchor-Flood [11], AgreementMaker [12], ASMOV [8], Lily [9], OLA2 [13], and RiMOM [10].
Metamodel Matching Systems. Although the research area of metamodel matching is developing, it does not yet have as many matching systems as the ontology and schema matching domains. It has to be noted that we consider only metamodel matching systems, and not model differencing tools as described in [22]. Five metamodel matching systems were included in the study, namely the Atlas Model Weaver [15] (extended by AML [23]), GUMM [17], MatchBox [18], ModelCVS [16] (implicitly applying COMA++), and SAMT4MDE [14].
4 Comparing Matching Techniques and Internal Models of Selected Systems
In this section we show the commonalities that matching systems from different domains share, based on the matching techniques that they apply as well as on the information that their internal models represent. We adopt an existing classification [19] of matching techniques from the ontology domain and show that systems from other domains apply, to a large extent, similar techniques. Furthermore, we examine the internal models of the matching systems and classify the information that they provide for matching. Based on this classification, we point out the similarity of information provided by internal models of different matching systems.

4.1 Matching Techniques in the Systems
The classification of matching techniques that we adopt [19] is taken from the ontology domain. The classification is based on and extends the classification of automated schema matching approaches by Rahm and Bernstein [20], thus further indicating commonalities of the matching techniques in both domains. More details about the classification can be found in [19]. For convenience, we show the graphical representation of the classification, with some naming adjustments, in Fig. 1. Here, we use the term mapping reuse instead of alignment reuse, as we consider the term "mapping" more general than the specific term "alignment" from the ontology domain. We change upper level domain specific ontologies to the more general domain specific information, and we rename model based techniques to semantically grounded techniques, since "model" can be an ambiguous term in the different domains.
The classification of matching techniques gives a detailed overview of the different matching techniques, the information used by them, and the way it is interpreted. We arranged the selected matching systems according to the basic matching techniques, as shown in Tab. 1. Furthermore, the table denotes the
Fig. 1. Classification of matching techniques adopted from [19] (element-level and structure-level techniques, grouped into syntactic, external, and semantic kinds: string-based, language-based, constraint-based, linguistic resources, mapping reuse, domain specific information, data analysis and statistics, graph-based, taxonomy-based, repository of structures, and semantically grounded techniques)
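As a simple illustration of the element-level, string-based techniques of Fig. 1, the sketch below scores candidate correspondences by a normalized edit-distance similarity between element names. It is a generic toy matcher written for this overview, not the implementation of any of the surveyed systems; the helper names and the 0.8 threshold are arbitrary choices.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1] based on edit distance."""
    a, b = a.lower(), b.lower()
    return 1.0 - levenshtein(a, b) / (max(len(a), len(b)) or 1)

def match_elements(source, target, threshold=0.8):
    """Propose (source, target, score) correspondences above a threshold."""
    return [(s, t, round(name_similarity(s, t), 2))
            for s in source for t in target
            if name_similarity(s, t) >= threshold]

print(match_elements(["CustomerName", "zipCode"], ["customer_name", "postalCode"]))
```

In the surveyed systems such element-level scores are typically only one signal among several and are combined with structure-level evidence (e.g. graph- or taxonomy-based techniques) before correspondences are selected.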
applicability of a system in the different existing matching domains. The systems are organized in groups, depending on the primary domain in which they are applied. In each group the systems are arranged alphabetically. The upper part of Tab. 1 shows the classification of the basic matching techniques used by the selected matching systems from the schema domain. All schema matching systems [4,5,6,7] apply string-, constraint-, and taxonomy-based matching techniques. Half of the systems apply language-based techniques [5,6]. Three out of four use graph-based techniques [5,4,7]. The majority [5,6,7] also applies external linguistic resources, such as WordNet, to obtain mapping results. Only one system [5] applies mapping reuse techniques.
Table 1. Basic matching techniques applied by the analyzed systems in their domains of application (ontology, schema, and metamodel matching). Rows list the fifteen selected systems grouped by primary domain (COMA++, Cupid, GeRoMe, Similarity flooding, Aflood, AgrMaker, ASMOV, Lily, OLA2, RiMOM, AMW, GUMM, MatchBox, ModelCVS, SAMT4MDE); columns list the basic matching techniques of Fig. 1; individual check-mark entries are omitted here.
The middle part of Tab. 1 represents the classification of the selected ontology matching systems, i.e. those whose primary domain is the ontology domain. It can be seen that all of these systems [8,9,10,11,12,13] exploit string-based, language-based, linguistic resources, and taxonomy-based matching techniques. Almost all also apply the constraint-based and the graph-based techniques.
One system uses domain specific information and mapping reuse techniques to produce mappings [8]. Only two systems apply semantically grounded techniques [9,10]. The lower part of Tab. 1 shows the classification of the basic matching techniques used by matching systems in the metamodel domain. All metamodel matching systems [14,15,16,17,18] apply constraint-, graph-, and taxonomy-based techniques. The majority [15,16,17,18] also applies string-based techniques. Only one system [16] applies mapping-reuse techniques.
Looking at the applied matching techniques, there are several things to point out. As a direct consequence of the classification being taken from the ontology domain, several techniques are used only by ontology matching systems, such as semantically grounded and data analysis and statistics techniques, as can be seen in Tab. 1. Semantically grounded techniques usually produce results based on reasoners, so these techniques are specific to the ontology domain. Domain specific information in the form of upper level ontologies is likewise used only in the ontology domain. None of the systems apply data analysis and statistical approaches, due to the lack of appropriate object samples. Nevertheless, if such input were available, this type of matching technique could be applied in every domain. Very few systems make use of mapping reuse [5,8,16] and repository of structures [5,16,14] techniques, as these approaches are relatively recent.
Fig. 2. Portion of systems applying a matching technique in the system's domain (bar chart over the matching techniques, with one series each for the schema, ontology, and metamodel domains)
Some matching techniques are rarely used due to their recentness, lack of appropriate input, or their specificity. These techniques show high potential for further investigation, to see how they could be adapted and reused in other domains. It is to be noted that the majority of matching techniques are applied across all three domains. Fig. 2 shows the portion of systems in each domain that apply a certain basic matching technique. As can be seen from the distribution, several matching techniques are applied by most of the systems: string-based, language-based, linguistic resources, constraint-based, taxonomy-based, and graph-based techniques. The logical question that follows is why exactly these techniques are common to the different domains. We found the answer in the internal model representation of the systems and the information that they expose for matching, which is the same across the
domains. In the following subsection we examine the information provided by an internal data model for matching and classify this information. Based on this we show the commonalities of internal models of matching systems from different domains.

4.2 Data Models for Matching Systems
The internal data model of a matching system affects the overall capabilities of the system, as it may provide only specific information for matching and thus may influence the applicability of certain matching techniques. In order to extract the features of the information provided by an internal model to the different matching techniques, it is helpful to consider the existing classification of matching approaches. To extract this information, those basic matching techniques need to be selected that only use information provided by the data model. All matching techniques that are classified as external are excluded, because they do not actually use information coming from the internal representation but from external resources. For that reason, matching techniques from the groups of mapping reuse, linguistic resources, domain specific information, and repository of structures are not considered. Analyzing the remaining matching techniques results in a classification of the information provided by internal models of matching systems. The classification is shown in Fig. 3. The information that internal models provide can be divided into two main groups:
Fig. 3. Classification of the information provided by the internal data model (entity information: name/label, annotation, value; structural information: internal structure with data type, cardinality, and ID/key, and relational structure with inheritance, containment, association, and attribute definition)
– Entity information. This is the information that entities of an internal data model provide to matching techniques. An entity is any class, attribute, property, relationship or an instance (individual) that is part of a model. Entities may provide textual information through their names or labels and optionally annotations if available. Annotations can be considered as additional documentation or meta information attached to an entity. Instance entities provide information about their values. Information coming from entities is usually exploited by terminological and extensional matching techniques, such as string-based, language-based, and data analysis and statistics matching techniques.
26
P. Ivanov and K. Voigt
– Structural information. This is the information provided by the structure of an internal data model. The structural information can be divided into internal and relational structure. Information provided by the internal structure includes data type properties of attributes, cardinality constraints, or identifiers and keys. Internal structure information is provided by the structure of the entities themselves, not considering any relation with other entities. In contrast, relational structure information considers the different types of relationships between entities. These can be inheritance, containment, or association relationships, as well as the relationship between an attribute and its containing class. Although the relationship between a class and an attribute can be considered a containment relationship in some domains, in others, such as the ontology domain, these entities are decoupled, which is the reason to also introduce the attribute definition type of relationship. To explore internal structure information, constraint-based techniques are applied, while graph-based and taxonomy-based matching techniques are used to exploit relational structure information.
Table 2. Classification of the information provided by the internal models used in the studied matching systems. Rows list the models in alphabetical order (DLG [4-6,16,17], Ecore [14,15], Genie [24], OL-Graph [13], OWL [8-12], role-based model [7]); columns list the entity and structural information categories of Fig. 3; check marks and grey fields are omitted here.
Table 2 shows the different internal models that have been used in the selected matching systems and classifies them according to the information that is actually provided by each model to the matching techniques. The models are arranged alphabetically in the table. A short summary of which model is used by which system is given below. Grey fields denote that a certain model could
support this type of information, but due to the main application domain or the applied matching techniques this information is not represented.
Directed Graphs – GUMM and the Similarity Flooding approach are the systems that use Directed Labeled Graphs (DLGs) as internal models. GUMM relies mainly on the similarity flooding approach, reusing it in metamodel matching. COMA++ (and ModelCVS, as it implicitly applies COMA++) uses a variation of DLGs, namely Directed Acyclic Graphs (DAGs), adding the constraint that there should be no cycles within the built graph. Cupid uses a simplification of DAGs, representing the input internally as trees. The concept of a DLG is a very generic representation and can be reused and utilized with different types of data, which is why the different matching systems have their own graph representation as internal model. The minimal set of information that is represented by the different systems is marked with check marks in Table 2; grey fields show that the whole set of information can actually be represented by a DLG.
Ecore – SAMT4MDE and AMW are the two metamodel matching systems that do not apply specific internal data models of their own, but directly use Ecore. Applied in the area of metamodel matching, the systems operate directly over the input format of the data, namely Ecore.
Genie – MatchBox introduces its own data model, Genie (GENeric Internal modEl). The model was designed to be generic and to cover the whole set of information for matching. For further details, see [24].
OL-Graph – OLA2 introduces the OL-Graph model and uses it as its internal data representation. OL-Graphs are similar to the idea of directed labeled graphs. In OL-Graphs, the nodes represent different categories of ontology entities, as well as additional ones like tokens or cardinality. The OL-Graph is specifically designed to serve ontology matching.
OWL – all of the ontology matching systems except OLA2 directly use OWL as their internal model. Anchor-Flood claims to have its own memory model, where the lexical description of entities is normalized, but no further details are available [11]. ASMOV also claims to have a model of its own, namely a streamlined model of OWL [8], but again no further information about the model is published.
Role-based model – GeRoMe takes the approach that entities within a model do not actually have an identity on their own, but play different roles that form their identity. Thus, GeRoMe uses a role-based model as its internal data model. More information can be found in [25].
It can be seen from the analysis that most of the models cover
the same set of information that can be used for matching, independently of the application domain. OWL and Genie are the two models that cover the full spectrum of information. Ecore provides all structural and almost all entity information, except values of instance entities, as the Ecore model does not deal with instances. GeRoMe's role-based model also covers almost all information, except annotations. OL-Graphs do not cover annotation information, nor the internal structural information about identifiers. DLGs, as applied in the different systems, do not cover the whole range of information that could possibly be provided, but it has to be noted that the concept of representing models as graphs is very generic and thus it is theoretically possible to represent all information from the classification.
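As a rough illustration of how compact such an internal representation can be, here is a minimal Python sketch of a generic data model covering the information categories of Fig. 3. The class and field names are our own and intentionally generic; they do not reproduce Genie, GeRoMe, OL-Graph, or any other model discussed above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class RelationKind(Enum):
    """Relational structure categories of Fig. 3."""
    INHERITANCE = "inheritance"
    CONTAINMENT = "containment"
    ASSOCIATION = "association"
    ATTRIBUTE_DEF = "attribute_definition"

@dataclass
class Entity:
    """Entity information plus internal structure (data type, cardinality, key)."""
    name: str                             # name/label
    annotation: Optional[str] = None      # documentation / meta information
    value: Optional[str] = None           # instance value, if the entity is an instance
    data_type: Optional[str] = None       # e.g. "string", "int"
    cardinality: Optional[str] = None     # e.g. "0..1", "1..*"
    is_key: bool = False                  # identifier / key flag

@dataclass
class Relation:
    kind: RelationKind
    source: Entity
    target: Entity

@dataclass
class Model:
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

# Toy usage: a class "Order" with a key attribute "id".
order = Entity("Order", annotation="a purchase order")
order_id = Entity("id", data_type="string", cardinality="1", is_key=True)
m = Model([order, order_id], [Relation(RelationKind.ATTRIBUTE_DEF, order, order_id)])
print(len(m.entities), len(m.relations))
```

Matchers that only need this information can be written once against such a model and reused across schemas, ontologies, and metamodels, which is essentially the argument made above.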
5 Conclusion and Further Work
This paper presents an overview of the applied matching techniques and the internal models of fifteen state-of-the-art matching systems from three different domains of matching. The overview pointed out a set of matching techniques that are shared among systems, independently of the domain in which they are applied. This standard set of matching techniques includes string-based, language-based, linguistic resources, constraint-based, taxonomy-based, and graph-based techniques. We conclude that the three analyzed domains share many commonalities in the applied matching techniques. Looking into the reasons why most techniques are shared to such a large extent among the systems from the different domains, we analyzed what information is provided for matching by the internal data models of the systems. We classified this information and pointed out that the models have a lot in common, which indicates that although the systems were developed for different domains, the core information used for matching is the same and the systems can benefit from knowledge sharing across the domains. A second issue we identified during our comparison is that further studies w.r.t. result quality, level of matching, and architecture across the domains are missing. Consequently, we see the following further work to be done:
1. Knowledge sharing.
(a) Transfer of matching techniques. Matching techniques such as semantically grounded techniques or a repository of structures are not applied in every domain and are thus worth investigating for transfer from one domain to another. Additionally, it is also of interest to apply a promising system of one domain in another to see which improvements its techniques may yield.
(b) Research of matching techniques. Some techniques, e.g. mapping reuse and statistics, are not very common and thus show a lot of potential for promising future work. It would be interesting to examine these techniques more deeply.
2. Further studies.
(a) Result Quality. In this work, we did not examine whether the same techniques perform with similar results, in terms of quality, in the different domains. As a direct consequence, it is necessary to develop a common platform and common test cases to cover the quality of the matching results. A similar initiative has already been started in the ontology domain, but it needs to be extended to cover all three domains. As we have shown that the models indeed cover the same information and to a large extent use the same techniques, such an extension of the test cases should be possible.
(b) Level of matching. Furthermore, we point out that our approach is limited to meta data matching and does not consider the area of object matching. Therefore, it is worth providing an overview in this area as well.
(c) Matching System Architecture and Properties. In this work, we examined the internal models and the matching techniques, but we did not focus on other architectural features of the matching systems, namely how results from different matchers are combined within a system. It would be interesting to see how different systems perform this task and whether the same similarities between the domains can be revealed in this respect.
Matching in general is a very active area in all three domains of schema, ontology, and metamodel matching; thus, cooperation and the adoption of insights between domains are quite beneficial.
References
1. Halevy, A., Rajaraman, A., Ordille, J.: Data integration: The teenage years. In: VLDB 2006: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB Endowment, pp. 9–16 (2006)
2. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. The VLDB Journal 10(4), 334–350 (2001)
3. Aßmann, U., Zschaler, S., Wagner, G.: Ontologies, Meta-models, and the Model-Driven Paradigm, pp. 249–273 (2006)
4. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In: ICDE 2002: Proceedings of the 18th International Conference on Data Engineering (2002)
5. Do, H.H., Rahm, E.: COMA – a system for flexible combination of schema matching approaches. In: VLDB 2002: Proceedings of the 28th International Conference on Very Large Data Bases, VLDB Endowment, pp. 610–621 (2002)
6. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with Cupid. The VLDB Journal, 49–58 (2001)
7. Kensche, D., Quix, C., Li, X., Li, Y.: GeRoMeSuite: a system for holistic generic model management. In: VLDB 2007: Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB Endowment, pp. 1322–1325 (2007)
8. Jean-Mary, Y.R., Kabuka, M.R.: ASMOV: Results for OAEI 2010. In: OM 2010: Proceedings of the 5th International Workshop on Ontology Matching (2010)
9. Wang, P., Xu, B.: Lily: Ontology alignment results for OAEI 2009. In: OM 2009: Proceedings of the 5th International Workshop on Ontology Matching (2009)
10. Zhang, X., Zhong, Q., Li, J., Tang, J.: RiMOM results for OAEI 2010. In: OM 2010: Proceedings of the 5th International Workshop on Ontology Matching (2010)
11. Hanif, M.S., Aono, M.: Anchor-Flood: Results for OAEI 2009. In: OM 2009: Proceedings of the 4th International Workshop on Ontology Matching (2009)
12. Cruz, I.F., Antonelli, F.P., Stroe, C., Keles, U.C., Maduko, A.: Using AgreementMaker to align ontologies for OAEI 2010. In: OM 2010: Proceedings of the 5th International Workshop on Ontology Matching (2010)
13. Kengue, J.F.D., Euzenat, J., Valtchev, P.: OLA in the OAEI 2007 Evaluation Contest. In: OM 2007: Proceedings of the 2nd International Workshop on Ontology Matching (2007)
14. de Sousa Jr, J., Lopes, D., Claro, D.B., Abdelouahab, Z.: A step forward in semi-automatic metamodel matching: Algorithms and tool. In: Filipe, J., Cordeiro, J. (eds.) ICEIS 2009. LNBIP, vol. 24, pp. 137–148. Springer, Heidelberg (2009)
15. Fabro, M.D.D., Valduriez, P.: Semi-automatic model integration using matching transformations and weaving models. In: SAC 2007: Proceedings of the 25th Symposium on Applied Computing, pp. 963–970 (2007)
16. Kappel, G., Kargl, H., Kramler, G., Schauerhuber, A., Seidl, M., Strommer, M., Wimmer, M.: Matching metamodels with semantic systems – an experience report. In: BTW 2007: Proceedings of Datenbanksysteme in Business, Technologie und Web (March 2007)
17. Falleri, J.R., Huchard, M., Lafourcade, M., Nebut, C.: Metamodel matching for automatic model transformation generation. In: Busch, C., Ober, I., Bruel, J.-M., Uhl, A., Völter, M. (eds.) MODELS 2008. LNCS, vol. 5301, pp. 326–340. Springer, Heidelberg (2008)
18. Voigt, K., Ivanov, P., Rummler, A.: MatchBox: Combined meta-model matching for semi-automatic mapping generation. In: SAC 2010: Proceedings of the 2010 ACM Symposium on Applied Computing (2010)
19. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg (2007)
20. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. The VLDB Journal 10, 334–350 (2001)
21. Shvaiko, P., Euzenat, J.: A survey of schema-based matching approaches. Journal on Data Semantics 4, 146–171 (2005)
22. Kolovos, D.S., Ruscio, D.D., Pierantonio, A., Paige, R.F.: Different models for model matching: An analysis of approaches to support model differencing. In: CVSM 2009: Proceedings of 2009 ICSE Workshop on Comparison and Versioning of Software Models, pp. 1–6 (2009)
23. Garcés, K., Jouault, F., Cointe, P., Bézivin, J.: Managing model adaptation by precise detection of metamodel changes. In: ECMDA 2009: Fifth European Conference on Model-Driven Architecture Foundations and Applications (2009)
24. Voigt, K., Heinze, T.: Meta-model matching based on planar graph edit distance. In: Tratt, L., Gogolla, M. (eds.) ICMT 2010. LNCS, vol. 6142, pp. 245–259. Springer, Heidelberg (2010)
25. Kensche, D., Quix, C., Chatti, M.A., Jarke, M.: GeRoMe: A generic role based metamodel for model management. Journal on Data Semantics 82 (2005)
A Framework Proposal for Ontologies Usage in Marketing Databases

Filipe Mota Pinto1, Teresa Guarda2, and Pedro Gago1

1 Computer Science Department of Polytechnic Institute of Leiria, Leiria, Portugal
{fpinto,pgago}@ipleiria.pt
2 Superior Institute of Languages and Administration of Leiria, Leiria, Portugal
[email protected]
Abstract. Knowledge extraction in databases is known to be a long-term and interactive project. Nevertheless, the complexity of the process and the different options for achieving knowledge constitute a research opportunity that can be explored through the support of ontologies. This support may be used for knowledge sharing and reuse. This work describes research on an ontological approach for leveraging the semantic content of ontologies to improve knowledge discovery in marketing databases. Here we analyze how ontologies and the knowledge discovery process may interoperate and present our efforts to propose a possible framework for a formal integration.

Keywords: Ontologies, Marketing, Databases, Data Mining.
1 Introduction

In artificial intelligence, an ontology is defined as a specification of a conceptualization [14]. An ontology specifies, at a higher level, the classes of concepts that are relevant to the domain and the relations that exist between these classes. Indeed, an ontology captures the intrinsic conceptual structure of a domain. For any given domain, its ontology forms the heart of the knowledge representation. In spite of the development and maturity of ontology-engineering tools, the integration of ontologies into knowledge discovery projects remains largely unexplored. The Knowledge Discovery in Databases (KDD) process is comprised of different phases, such as data selection, preparation, transformation, or modeling. Each one of these phases in the life cycle might benefit from an ontology-driven approach which leverages the semantic power of ontologies in order to fully improve the entire process [13]. Our challenge is to combine ontological engineering and the KDD process in order to improve the latter. One of the promising uses of ontologies in KDD assistance is process guidance. This research objective seems much more realistic now that semantic web advances have given rise to common standards and technologies for expressing and sharing ontologies [3].
There are three main operations of KDD that can take advantage of domain knowledge embedded in ontologies. At the data understanding and data preparation phases, ontologies can facilitate the integration of heterogeneous data and guide the selection of relevant data to be mined with regard to domain objectives. During the modeling phase, domain knowledge allows the specification of constraints (e.g., parameter settings) for guiding data mining algorithms, for instance by narrowing the search space. Finally, at the interpretation and evaluation phase, domain knowledge helps experts to visualize and validate extracted units. The KDD process is usually performed by experts. They use their own knowledge for selecting the most relevant data in order to achieve domain objectives [13]. Here we explore how an ontology and its associated knowledge base can assist the expert in the KDD process. Therefore, this document describes a research approach to leveraging the semantic content of ontologies to improve KDD. This paper is organized as follows: after this introductory part we present related background concepts. Then, we present related work in this area, followed by the presentation and discussion of the ontological assistance. The main contribution is presented in terms of ontological work, experiments, and deployment. Finally, we draw some conclusions and, based on this research, address further research directions for future KDD data environment projects.
2 Background

2.1 Predictive Model Markup Language

Predictive Model Markup Language (PMML) is an XML-based language that provides a way for applications to define statistical and data mining models and to share these models between PMML-compliant applications (Data Mining Group). Furthermore, the language can describe some of the operations required for cleaning and transforming input data prior to modeling. Since PMML is an XML-based standard, its specification comes in the form of an XML schema that defines language primitives as follows [5]: Data Dictionary; Mining Schema; Transformations (normalization, categorization, value conversion, aggregation, functions); Model Statistics; and Data Mining Model (a minimal illustrative sketch is given after Sect. 2.2).

2.2 Ontology Web Language

Ontologies are used to capture knowledge about some domain of interest. An ontology describes the concepts in the domain and also the relationships that hold between those concepts. Different ontology languages provide different facilities. The Ontology Web Language (OWL) is a standard ontology language from the World Wide Web Consortium (W3C). An OWL ontology consists of: Individuals (which represent domain objects); Properties (binary relations on individuals, i.e. properties link two individuals together); and Classes (interpreted as sets that contain individuals). Moreover, OWL enables the inclusion of expressions to represent logical formulas in the Semantic Web Rule Language (SWRL) [16]. SWRL is a rule language that combines OWL with the rule markup language, providing a rule language compatible with OWL.
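As announced in Sect. 2.1, the following minimal Python sketch assembles a toy PMML-like document containing a data dictionary and a mining schema. It is our own illustration of the standard's overall shape: the field names are invented, and the element and attribute spellings should be checked against the official PMML schema rather than treated as a valid instance.

```python
import xml.etree.ElementTree as ET

# Toy PMML-like document: a data dictionary plus a mining schema for a
# hypothetical classification model (structure is illustrative only).
pmml = ET.Element("PMML", version="4.1")

dd = ET.SubElement(pmml, "DataDictionary", numberOfFields="2")
ET.SubElement(dd, "DataField", name="age", optype="continuous", dataType="double")
ET.SubElement(dd, "DataField", name="churn", optype="categorical", dataType="string")

model = ET.SubElement(pmml, "TreeModel", functionName="classification")
schema = ET.SubElement(model, "MiningSchema")
ET.SubElement(schema, "MiningField", name="age", usageType="active")
ET.SubElement(schema, "MiningField", name="churn", usageType="target")

print(ET.tostring(pmml, encoding="unicode"))
```

A PMML-compliant tool would exchange such a document together with transformation and model-statistics elements, which are omitted here for brevity.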
2.3 Semantic Web Rule Language
To the best of our knowledge there is no standard OWL-based query language. Several RDF-based query languages exist, but they do not capture the full semantic richness of OWL. To tackle this problem, a set of built-in libraries was developed for the Semantic Web Rule Language (SWRL) that allows it to be used as a query language. OWL is a very useful means for capturing the basic classes and properties relevant to a domain, and these domain ontologies establish a language of discourse for eliciting more complex domain knowledge from subject specialists. Due to the nature of OWL, however, such more complex knowledge structures are either not easily represented in OWL or, in many cases, not representable in OWL at all. The classic example is the relationship uncleOf(X,Y). This relation, and many others like it, requires the ability to constrain the value of a property (brotherOf) of one term (X) to be the value of a property (childOf) of the other term (Y); in other words, the brotherOf property applied to X (i.e., brotherOf(X,Z)) must produce a result Z that is also a value of the childOf property when applied to Y (i.e., childOf(Y,Z)). This "joining" of relations is outside the representational power of OWL. One way to represent knowledge requiring joins of this sort is through the use of the implication (→) and conjunction (AND) operators found in rule-based languages such as SWRL. The rule for the uncleOf relationship appears as follows:
brotherOf(X,Z) AND childOf(Y,Z) → uncleOf(X,Y)
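To make the "joining" of relations concrete, the rule can be read as a relational join over two binary relations. The following minimal Python sketch is purely illustrative; the individuals are hypothetical and not taken from any real ontology:

    # The uncleOf rule expressed as a join over two binary relations.
    brother_of = {("tom", "ann"), ("joe", "ann")}   # (X, Z): X is a brother of Z
    child_of   = {("sam", "ann"), ("lea", "kim")}   # (Y, Z): Y is a child of Z

    # brotherOf(X, Z) AND childOf(Y, Z) -> uncleOf(X, Y)
    uncle_of = {(x, y)
                for (x, z1) in brother_of
                for (y, z2) in child_of
                if z1 == z2}

    print(sorted(uncle_of))   # [('joe', 'sam'), ('tom', 'sam')]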
3 Related Work
KDD assistance through ontologies should provide users with nontrivial, personalized "catalogs" of valid KDD processes, tailored to the task at hand, and help them choose among these processes in order to analyze their data. Despite increasing research on the integration of domain knowledge, by means of ontologies, into KDD, most approaches focus mainly on the DM phase of the KDD process [2] [3] [8], while the role of ontologies in the other phases of KDD has apparently been relegated. Other approaches to ontology–KDD integration are currently being investigated, such as ONTO4KDD [13] or AXIS [25]. The literature offers several knowledge discovery life cycles, which mostly reflect the background of their proponents' community, such as databases, artificial intelligence, decision support, or information systems [12]. Although the scientific community is addressing the improvement of KDD with ontologies, to the best of our knowledge there is at present no fully successful integration of the two. This research takes an overall perspective, from business understanding to knowledge acquisition and evaluation. Moreover, it focuses on the KDD process with regard to the selection of the best-fitting modeling strategy supported by an ontology.
4 Ontological Work
This research is part of a much larger project: database marketing intelligence supported by ontologies and knowledge discovery in databases. Since this paper focuses on ontological assistance to the KDD process, we concentrate on that research area. In order to develop the ontology for the data preparation phases we have used the METHONTOLOGY methodology [12][10][4]. This methodology best fits our project, since it proposes an evolving prototyping life cycle composed of development-oriented activities.
4.1 Ontology Construction
Through an exhaustive literature review we obtained a set of domain concepts, and relations between them, that describe the KDD process. Following METHONTOLOGY, we constructed our ontology in terms of its process-assistance role. Domain concepts and relations were introduced according to directives from the literature [4][24]. Moreover, in order to formalize the related knowledge we used relevant published work on KDD [1] [21] and on ontologies [17] [18]. Whenever some vocabulary is missing, it is possible to apply a research method (e.g., the Delphi method [7] [6] [19] [20]) to obtain such a domain knowledge thesaurus. At the end of the first step of METHONTOLOGY we identified the main classes shown in Figure 1. Our KDD ontology has three major classes:
1. The Resource class relates all the resources needed to carry out the extraction process, namely algorithms and data.
2. ProcessPhase is the central class, which uses resources (Resource class) and produces results (ResultModel class).
3. ResultModel is in charge of relating each KDD process instance, describing all the resources used, all the tasks performed and the results achieved in terms of model evaluation and domain evaluation.
Analysing the entire KDD process, we considered four main concepts below the ProcessPhase concept (OWL class): Data Understanding, Data Preprocessing, Modeling, and Evaluation & Deployment (cf. the phases in Fig. 2). In order to optimize our effort we introduced some tested concepts from another data mining ontology (DMO) [17], which has a similar knowledge-base taxonomy. Here we take advantage of an explicit ontology of data mining and standards, using OWL concepts to describe an abstract semantic service for DM and its main operations. In the DMO, for simplicity, two types of DM elements are defined, settings and results, which in our case correspond to the Algorithm and Data classes. The settings represent inputs for the DM tasks, whereas the results represent the outputs produced by these tasks.
Fig. 1. KDD ontology class taxonomy (partial view)
No strict distinction is made between inputs and outputs, because an output of one process can at the same time be used as an input of another process. We have therefore represented the above concept hierarchy in the OWL language, using the Protégé OWL editor. …
Following METHONTOLOGY, the next step is to create the domain-specific core ontology, focusing on knowledge acquisition. To this end we performed some data processing tasks and data mining operations, and also evaluated some of the resulting models. Each class belongs to a hierarchy, and each class may have relations with other classes (e.g., PersonalType is-a subclass of InformationType). In order to formalize such a schema we defined OWL properties for the class relationships, generically represented as
Modeling ∧ hasAlgorithm(algorithm)
and expressed in OWL code. Whenever a new attribute is presented to the ontology, it is evaluated with respect to the attribute class hierarchy and the related properties that act on it. In our ontology an Attribute is defined by three descriptive items: its information type, its structure type and the source it is allocated to. It is therefore possible to infer that Attribute is a subclass of Thing and is described as a union of InformationType, StructureType and Source. For instance, an attribute of structure type Date is associated with the tasks hasMissingValueTask, hasOutliersTask and hasAttributeDerive.
Similarly, an attribute of information type Personal and personal type Demographics is associated with the consistency-checking task hasCheckConsistency.
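The OWL listing produced in Protégé is not reproduced here. As an illustration only, the class hierarchy, the hasAlgorithm property and the union definition of Attribute described above could be written along the following lines with the owlready2 Python library; the ontology IRI and the selection of class names are assumptions made for this sketch, not the authors' original code:

    from owlready2 import Thing, ObjectProperty, get_ontology

    onto = get_ontology("http://example.org/kdd-ontology.owl")   # hypothetical IRI

    with onto:
        class Algorithm(Thing): pass
        class Modeling(Thing): pass
        class InformationType(Thing): pass
        class StructureType(Thing): pass
        class Source(Thing): pass
        class PersonalType(InformationType): pass      # PersonalType is-a InformationType

        class hasAlgorithm(ObjectProperty):            # Modeling ∧ hasAlgorithm(algorithm)
            domain = [Modeling]
            range  = [Algorithm]

        # Attribute described as a union of InformationType, StructureType and Source
        class Attribute(Thing):
            equivalent_to = [InformationType | StructureType | Source]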
5 Proposed Framework
One of the promising features of ontologies is the common understanding they provide for sharing and reuse. We exploit this characteristic to effectively assist the KDD process. This research provides KDD assistance at two levels: overall process assistance, based on the ResultModel class, and assistance within each KDD phase. Since our ontology has a formal structure related to the KDD process, it is able to infer results at each phase.
Fig. 2. KDD ontological assistance sequence diagram (user–ontology interaction across objective definition, data understanding, data pre-processing, modeling, and evaluation & deployment)
To this end, the user invokes the system rule engine (reasoner) with the relevant information, e.g., for the data preprocessing task:
swrl:query hasDataPreprocessingTask(?dpp, "ds")
where hasDataPreprocessingTask is an OWL property that infers from the ontology all the data-type preprocessing tasks (dpp) assigned to each attribute type within the data set "ds". The user is also assisted in terms of the ontology capability index, through the precision, recall and PRI metrics. Once a set of executed KDD processes has been registered in the knowledge base, whenever a new KDD process starts the ontology can support the user in the different KDD phases. For example, for a new classification process the user's interaction with the ontology follows the framework depicted in Figure 2. The ontology leads the user's efforts towards knowledge extraction by suggesting by context; that is, the ontology acts according to the user's question, e.g., at domain objective definition (given by the user) the ontology infers which types of objectives it knows about. All inference work depends on previously loaded knowledge; hence there is a limitation: the ontology can only assist KDD processes that share some characteristics with others already registered.
6 Experiments
Our system prototype follows the general KDD framework [9] and uses the ontology to assist each user interaction, as depicted in Figure 2. Our experimentation was carried out on a real oil company's fidelity (loyalty) card marketing database. This database has three main tables: card owner, card transactions and fuel station. To carry out the experiments we developed an initial set of SWRL rules. Since KDD is an interactive process, these rules operate at both the user and the ontological level. The logic captured by these rules is presented in this section using an abstract SWRL representation, in which variables are prefaced with question marks.
Domain objective: customer profile. Modeling objective: description. Initial database: fuel fidelity card. Database structure: 4 tables.
The most relevant rule extracted from the data by the algorithms was:
if (age < 27 and vehicleType = "Lig" and sex = "Female") then 1stUsed = "p"
From this model we may say that female card owners who are less than 27 years old and own a vehicle of the "Lig" (ligeiro) category tend to use a fuel station located less than 10 kilometers (p) from their home address.
7 Deployment
Each executed KDD process must be evaluated according to its results, in order to update the knowledge base for later reuse. The SWRL code takes the form:
getEvaluation: Model(?m) ∧ hasModeling(?met) ∧ hasAlgorithm(?alg) ∧ hasEvaluation(?m, ?met, ?alg) ∧ hasEvaluationParameters(?par) → Evaluation(?m, ?ev)
Each evaluation depends, for instance, on the model type or on the algorithms used. The record inserted into the knowledge base has the form:
INSERT record KNOWLEDGE BASE: hasAlgorithm(J48) AND hasModelingObjectiveType(classification) AND hasAlgorithmWorkingData({idCard; age; carClientGap; civilStatus; sex; vehicleType; vehicleAge; nTransactions; tLiters; tAmountFuel; tQtdShop; 1stUsed; 2stUsed; 3stUsed}) AND Evaluation(67.41%; 95.5%) AND hasResultModel(J48; classification; "wds"; PCC; 0.84; 0.29)
Once the evaluation has been performed, the system automatically updates the knowledge base with a new record. The registered information serves for future use, that is, for knowledge sharing and reuse. Moreover, the ontology itself is also evaluated through the precision and recall indices.
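Independently of the SWRL syntax, the record pushed to the knowledge base after each run can be pictured as a simple structured entry. The Python sketch below uses field names chosen for illustration only, and keeps the two evaluation figures as reported without fixing their exact semantics:

    # Illustrative knowledge-base record for one completed KDD run.
    kb_record = {
        "algorithm": "J48",
        "modeling_objective_type": "classification",
        "working_data": ["idCard", "age", "carClientGap", "civilStatus", "sex",
                         "vehicleType", "vehicleAge", "nTransactions", "tLiters",
                         "tAmountFuel", "tQtdShop", "1stUsed", "2stUsed", "3stUsed"],
        "evaluation": (0.6741, 0.955),          # the two measures reported above
        "result_model": ("J48", "classification", "wds", "PCC", 0.84, 0.29),
    }

    knowledge_base = []                         # stands in for the ontology's instance store
    knowledge_base.append(kb_record)            # automatic update after each evaluation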
8 Conclusions and Further Research
This work strove to improve the KDD process with the support of ontologies. To this end, we used a general domain ontology to assist knowledge extraction from databases through the KDD process. The success of KDD is highly user dependent. Through our framework it is possible to suggest a valid set of tasks that best fits the design of a KDD process. However, the capability to automatically run the data, develop modeling approaches and apply algorithms is still missing. Nevertheless, four main operations of KDD can take advantage of the domain knowledge embedded in ontologies: the data preparation phase; the mining step; the deployment phase; and, through the knowledge base, helping the analyst choose the best modeling approach based on the knowledge-base ranking index.
Future research work will be devoted to expanding the use of the KDD ontology by populating the knowledge base with more relevant concepts about the process. Another interesting direction is to represent the whole knowledge base in a way that allows its automatic reuse.
References
1. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD 1993, pp. 207–216 (1993)
2. Anand, S.S., Grobelnik, M., Herrmann, F., Lingenfelder, N., Wettschereck, D.: Knowledge discovery standards. Artificial Intelligence Review 27(1), 21–56 (2007)
3. Bernstein, A., Provost, F., Hill, S.: Toward intelligent assistance for a data mining process. IEEE Transactions on Knowledge and Data Engineering 17(4) (2005)
4. Blazquez, M., Fernandez, M., Gomez-Perez, A.: Building ontologies at the knowledge level. In: Knowledge Acquisition Workshop, Voyager Inn, Banff, Alberta, Canada (1998)
5. Brezany, P., Janciak, I., Tjoa, A.M.: Data Mining with Ontologies: Implementations, Findings, and Frameworks. Information Science Reference, pp. 182–210 (2008)
6. Chu, H.-C., Hwang, G.-J.: A Delphi-based approach to developing expert systems with the cooperation of multiple experts. Expert Systems with Applications 34, 2826–2840 (2008)
7. Delbecq, A.L., Ven, A.H.V.D., Gustafson, D.H.: Group Techniques for Planning – A Guide to Nominal Group and Delphi Processes. Scott (1975)
8. Domingos, P.: Prospects and challenges for multi-relational data mining. SIGKDD Explorations Newsletter 5(1), 80–83 (2003)
9. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery in databases. AI Magazine 17, 37–54 (1996)
10. Fernandez, M., Gomez-Perez, A., Juristo, N.: METHONTOLOGY: From ontological art towards ontological engineering. Technical report, AAAI (1997)
11. Gomez-Perez, A., Fernandez-Lopez, M., Corcho, O.: Ontological Engineering, 2nd edn. Springer, Heidelberg (2004)
12. Gottgtroy, P., Kasabov, N., MacDonell, S.: An ontology driven approach for knowledge discovery in biomedicine (2004)
13. Gruber, T.R.: A translation approach to portable ontology specifications. Knowledge Acquisition 5, 199–220 (1993)
14. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2001)
15. Horrocks, I., Patel-Schneider, P.F., Grosof, B., Dean, M.: SWRL: A Semantic Web Rule Language combining OWL and RuleML. Technical report, W3C (2004)
16. Nigro, H.O., Cisaro, S.G., Xodo, D.: Data Mining with Ontologies: Implementations, Findings and Frameworks. Information Science Reference (2008)
17. Phillips, J., Buchanan, B.G.: Ontology-guided knowledge discovery in databases. In: International Conference on Knowledge Capture, pp. 123–130. ACM (2001)
18. Pinto, F.M., Gago, P., Santos, M.F.: Marketing database knowledge extraction. In: IEEE 13th International Conference on Intelligent Engineering Systems (2009a)
19. Pinto, F.M., Marques, A., Santos, M.F.: Database marketing process supported by ontologies. In: Filipe, J., Cordeiro, J. (eds.) ICEIS 2009. LNBIP, vol. 24. Springer, Heidelberg (2009b)
20. Quinlan, R.: Induction of decision trees. Machine Learning 1, 81–106 (1986)
21. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Analysis of recommendation algorithms for e-commerce. In: Proceedings of the 2nd ACM Conference on Electronic Commerce (2000)
22. Seaborne, A.: RDQL – a query language for RDF. Technical report, W3C (2004)
23. Smith, R.G., Farquhar, A.: The road ahead for knowledge management: An AI perspective. American Association for Artificial Intelligence 1, 17–40 (2008)
24. da Silva, A., Lechevallier, Y.: AXIS Tool for Web Usage Evolving Data Analysis (ATWEDA). INRIA, France (2009)
Proposed Approach for Evaluating the Quality of Topic Maps Nebrasse Ellouze1,2, Elisabeth Métais1, and Nadira Lammari1 1 Laboratoire Cedric, CNAM 292 rue Saint Martin, 75141 Paris cedex 3, France [email protected], {metais,lammari}@cnam.fr 2 Ecole Nationale des Sciences de l’Informatique, Laboratoire RIADI Université de la Manouba, 1010 La Manouba [email protected]
Abstract. Topic Maps are used for structuring content and knowledge coming from different information sources and different languages. They are defined as semantic structures that organize all the subjects they represent, and they are intended to enhance navigation and improve information search in these resources. In this paper we propose to study the quality of Topic Maps. Topic Map quality covers various aspects: some are shared with conceptual schemas, others with information retrieval systems, and some are specific to Topic Maps. Here we limit our work to the quality aspect related to the volume of the Topic Map. Topic Maps are usually very large, since they can contain thousands of Topics and associations. This volume of information and the resulting complexity can lead to a badly organized Topic Map, making search through the Topic Map structure very difficult, so that users cannot easily find what they want. In this context, to manage the volume of the Topic Map, we propose a dynamic pruning method applied when the Topic Map is displayed, based on a list of meta-properties associated with each Topic. The first meta-property is the Topic score, which reflects its relevance over time; the second indicates the level to which the Topic belongs in the Topic Map.
Keywords: Topic Map (TM), quality, meta-properties, Topic Map visualization.
1 Introduction
Topic Maps [1] are used for structuring content and knowledge coming from different information sources and different languages. They are defined as semantic structures that organize all the subjects they represent, and they are intended to enhance navigation and improve information search in these resources. In our previous work we defined CITOM [2], an incremental approach to build a multilingual Topic Map from textual documents, and we validated this approach on a real corpus from the sustainable construction domain [3].
In this paper we propose to study the quality of the generated Topic Map. Topic Map quality covers various aspects: some are shared with conceptual schemas, others with information retrieval systems, and some are specific to Topic Maps. Here we limit our work to the quality aspect related to the volume of the Topic Map. Topic Maps are usually very large, since they can contain thousands of Topics and associations; this volume and complexity can lead to a badly organized Topic Map, in which searching for information becomes very difficult and users cannot easily find what they want. In this context, to manage the volume of the Topic Map, we propose a dynamic pruning method applied when the Topic Map is displayed, based on a list of meta-properties associated with each Topic: the first meta-property is the Topic score, which reflects its relevance over time, and the second indicates the level to which the Topic belongs in the Topic Map. The paper is structured as follows. Section 2 presents a brief state of the art on Topic Map quality management. Section 3 is devoted to our approach for Topic Map quality evaluation. Finally, Section 4 concludes and gives some perspectives for this work.
2 Proposals for Topic Map Quality Management
Quality is considered an integral part of every information system, especially given the volume of data that keeps increasing over time and the diversity of applications. Many research works have been proposed in this domain, concerning different aspects of quality: quality of data (for example the European project DWQ, Data Warehouse Quality, proposed by [4]), quality of conceptual models (such as the QUADRIS project, Quality of Data and Multi-Source Information Systems, proposed by [5], which aims at defining a framework for evaluating the quality of multi-source information systems), quality of the development process, quality of data treatment processes, quality of business processes, etc. The study of Topic Map quality should a priori consider several works carried out on the quality of ontologies and conceptual models. Within the context of this paper, we discuss the quality of Topic Maps, the related literature, our approach to evaluating Topic Map quality, and future directions. Based on the literature, we note that very few works [6], [7], [8], [9], [10] have been proposed to evaluate the quality of Topic Maps. We classify the existing approaches into two classes: those that evaluate the quality of the Topic Map representation and those that evaluate the quality of search through the Topic Map.
2.1 Approaches to Managing the Quality of the Topic Map Representation
In this class of approaches, the method presented in [6] uses representation and visualization techniques to enhance users' navigation and ease the exploration of the Topic Map. These techniques filter and cluster the data in the Topic Map using conceptual
classification algorithms based on Formal Concept Analysis and Galois lattices. In their work, the authors of [6] also provide representation and navigation techniques to facilitate the exploration of the Topic Map: the idea is to represent Topic Maps as virtual cities in which users can move and navigate to build their own cognitive map. The authors of [7] propose to use Topic Maps for visualizing heterogeneous data sources. Their approach aims at improving the display of Topic Maps, given the diversity and volume of the information they represent. The idea is to use the notions of cluster and sector in the TM Viewer tool (Ontologies for Education Group: http://iiscs.wssu.edu/o4e/). The whole Topic Map is visualized at different levels so that users can cope with the large number of Topics. This project was inspired by the work proposed in [8], which implements a tool called TopiMaker for viewing Topic Maps in a 3D environment and at several levels.
2.2 Approaches to Managing the Quality of Search through the Topic Map
In this class of approaches, the method presented in [9] studies search performance using Topic Maps. It evaluates a web application based on the Topic Map model and developed for information retrieval; this application was implemented and tested in the field of education. The authors of [9] compare this application with a traditional search engine using recall and precision measures computed for both tools. In addition, their evaluation process takes into account the points of view of students and teachers who tested both systems. The comparison showed that information search based on Topic Maps gives better results than the search engine. The same idea is adopted in [10] for evaluating a search system based on Topic Maps: the purpose of the study is to compare the performance of a Topic Map-based Korean folk music (Pansori) retrieval system with a representative current Pansori retrieval system. The study is an experimental effort with representative general users: participants are asked to carry out several predefined tasks as well as their own queries. The authors propose objective measures (such as the time taken by the system to find the searched information) and subjective measures (such as completeness, ease of use, efficiency, satisfaction, appropriateness, etc.) to evaluate the performance of the two systems.
3 Our Approach to Managing the Volume of the Topic Map
Based on the state of the art on the quality of ontologies, conceptual models and Topic Maps, we notice that Topic Map quality has not been studied much compared with the many works on ontology and conceptual schema quality. Moreover, the notion of Topic Map quality is not the same as ontology or conceptual schema quality, because of the differences between these models: Topic Maps are meant to be used directly by end users and reflect the content of documents, while an ontology is a formal and explicit specification of a domain that allows information exchange between applications.
In this work we are interested in the quality of the Topic Map representation. One of the major problems related to Topic Map quality is that the generated Topic Map is usually very large and contains a huge amount of information (thousands of Topics and associations). This volume and complexity can lead to a badly organized Topic Map and to many difficulties for users trying to find information in it. This is explained by the fact that a Topic Map, by design, is a usage-oriented semantic structure: it should represent the different views and visions of the subjects of the studied domain held by the various classes of users that might be interested in the Topic Map content. Because of this amount of information, it is difficult for users, especially those who are not experts in the domain, to find what they are looking for in reasonable time. One of the specificities of Topic Maps with respect to ontologies and conceptual schemas is the preponderance of the volume problem, because Topic Maps are intended to be viewed and used directly by the user. To manage the volume of a Topic Map, we propose a dynamic pruning method applied when the Topic Map is displayed, based on a list of meta-properties associated with Topics. Pruning is a central issue in our work, since a Topic Map is essentially used to organize the content of documents and to help users find relevant information in them; the Topic Map structure must therefore be maintained and enriched over time in order to satisfy users' queries and to handle changes in the document content. To maintain the Topic Map, we introduce information, which we call "meta-properties", about the relevance of a Topic according to its usage when users explore the Topic Map. This information can be exploited to evaluate the quality of a Topic Map.
3.1 Topic Notation
In our previous work [2], [3] we proposed to extend the TM-UML model by adding a list of meta-properties to the Topic characteristics. We have currently defined two meta-properties. The first reflects a Topic's pertinence over time: it is initialized when the Topic Map is created and then reflects the Topic's relevance according to its usage by Topic Map users. It is also exploited in the pruning process, in particular to filter out the Topics considered non-pertinent when the Topic Map is displayed. The second meta-property allows different layers to be implemented in the Topic Map.
Meta-property 1, Topic relevance in the Topic Map: we define a score (or level) for each Topic as a meta-property that reflects its importance in the Topic Map. As shown in Figure 1, the score is initialized when the Topic Map is created. It can be (a) very good, when the Topic is obtained from the three information sources, namely documents, thesaurus and requests; (b) good, when the Topic is extracted from two information sources; or (c) not very good, when the Topic is extracted from a single source. These qualities are translated into a mark between 0 and 1 in order to allow a pruning process in the visualization of the Topic Map: only Topics having a score greater than the required level are displayed.
Fig. 1. Score initialization
During the life of the Topic Map the mark is recomputed to reflect the popularity level of each Topic. For this purpose the mark is a weighted average of different criteria: the number of documents indexed by the Topic (DN), the number of FAQs referring to the Topic (FN) and the number of consultations of the Topic (CN). The formula is (α·DN + β·FN + γ·CN)/(α + β + γ). The weights are parameters; however, we suggest setting γ greater than α and β so that the mark better reflects the actual usage of the Topic.
Meta-property 2, the level to which the Topic belongs in the Topic Map: the second meta-property allows different layers to be implemented in the Topic Map. Our idea is to classify and organize the information (Topics, links and resources) in the Topic Map into three levels [3]: (1) the upper level contains "Topic themes", obtained as a result of the thematic segmentation applied to the source documents, and "Topic questions", extracted from user requests; (2) the intermediate level contains domain concepts, Topic instances, subtopics, Topic synonyms, synonyms of Topic instances, etc.; and (3) the third level contains the resources used to build and enrich the Topic Map, that is, the textual documents available in different languages, their thematic fragments, and all the possible questioning sources related to these documents. We exploit this meta-property to organize the Topic Map in order to enhance navigation and facilitate search through its links.
3.2 Analysis of Scores
We also introduce meta-meta-data attached to the scores in order to store their evolution profile and thus update them automatically, anticipating their popularity level. Indeed, the number of consultations of a Topic varies over time for several reasons, for example seasonal variation: in summer the "air conditioning" Topic is heavily consulted, while in winter people are more concerned with "heating devices". In this case the mark associated with "air conditioning" increases in summer and decreases in winter, and conversely for the "heating devices" Topic. Another example of score variation is news, such as a plane crash: the Topic reaches its maximal popularity very quickly and then its popularity continually decreases until almost nobody is interested in the event anymore. We therefore add meta-meta-data that capture the type of a Topic's score evolution (season dependent, time dependent, decreasing, increasing, etc.) in order to anticipate the score of a Topic and dynamically manage the pruning process when displaying the Topic Map.
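To fix ideas, the initialization and the periodic recomputation of the mark can be sketched as follows in Python; the concrete initial marks and default weights are illustrative assumptions, only the weighted-average formula itself comes from the text:

    # Initial mark from the number of information sources the Topic comes from
    # (documents, thesaurus, requests); the concrete values are assumptions.
    def initial_mark(source_count):
        return {3: 1.0, 2: 0.7, 1: 0.4}.get(source_count, 0.0)

    # Weighted average of document count (DN), FAQ count (FN) and consultation
    # count (CN); gamma is suggested to dominate so that usage drives the mark.
    def topic_mark(DN, FN, CN, alpha=1.0, beta=1.0, gamma=2.0):
        return (alpha * DN + beta * FN + gamma * CN) / (alpha + beta + gamma)

In practice the three counts would presumably be normalized so that the mark stays within [0, 1]; the seasonal or news-driven evolution profiles described above would then raise or lower this value over time.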
3.3 Using Meta-properties to Improve Topic Map Visualization
The main goal of a Topic Map is to enable users to find relevant information and access the content of the source documents. There are thus two kinds of requirements for Topic Map visualization: representation and navigation. A good representation helps users identify interesting spots, whereas efficient navigation is essential to access information rapidly. In our approach, the two meta-properties defined above (the Topic score and the level to which the Topic belongs) drive the dynamic pruning process applied when the Topic Map is visualized, in order to facilitate access to documents through the Topic Map. We use the Topic-level meta-property to improve the visualization by organizing the Topic Map into three levels: the first contains Topic themes and questions; the second contains Topics representing domain concepts, Topic instances and, possibly, Topic answers (which may also belong to the first level); and finally the resource level contains the documents and their fragments. This organization gives users different levels of detail about the content of the Topic Map and allows them to move from one level to another according to their browsing choices. Initially, we display only the first-level Topics, i.e., Topic questions and Topic themes; then, when navigating, a user interested in a particular subject or theme can continue the search and browse the subtree of the Topic Map containing all the domain-concept Topics related to the chosen theme, and can also access the documents and segments associated with a Topic. In this way the user builds a personal cognitive map containing the information of interest (depending on the parts visited). Figure 2 shows an example of multilevel Topic Map visualization generated with our application [3]. The application offers highlighting: whenever a Topic Map node is selected, it is highlighted, showing the part of the Topic Map currently related to it. More space is allocated to the focus node, while its parents and children, still in the immediate visual context, appear slightly smaller; grandparents and grandchildren remain visible but appear smaller still. Topic Map visualization thus facilitates exploration and information search through the Topic Map. In addition to the multilevel visualization, the scores assigned to the Topics are also used as selection criteria when visualizing the Topic Map. A Topic with a very good score is considered a main Topic and appears in the default visualization. The idea is that, instead of definitively deleting a Topic because it is rarely used, we prefer simply to lower its score: a Topic may be the target of very few queries in one season and come back among the most frequently asked Topics the next season (e.g., many questions concern air conditioners in summer, but very few in winter), although some Topics, generally related to items in the news, do decrease in importance definitively. We define a rule for displaying Topics: only Topics with a score above a threshold are displayed by default. This threshold is a parameter; in our case we set it to 0.5. The other Topics are displayed in gray, but the user can still view them if desired.
For example, depending on the season, some Topics such as "air conditioning" are grayed out in winter and displayed by default in summer.
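A minimal Python sketch of this default display rule follows; the example Topics and scores are hypothetical:

    DEFAULT_THRESHOLD = 0.5          # the parameter value chosen above

    def display_mode(score, threshold=DEFAULT_THRESHOLD):
        # Topics above the threshold are shown by default; the others are greyed
        # out but remain accessible if the user asks for them.
        return "visible" if score > threshold else "greyed"

    winter_scores = {"air conditioning": 0.2, "heating devices": 0.8}
    print({t: display_mode(s) for t, s in winter_scores.items()})
    # {'air conditioning': 'greyed', 'heating devices': 'visible'}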
Fig. 2. An example of Topic Map visualization with our developed tool [3]
4 Conclusion and Future Work
In this paper we presented an approach to evaluate the quality of Topic Maps. One of the specificities of Topic Maps with respect to ontologies and conceptual models is the volume problem; consequently, the main goal of our approach is to manage the large number of Topics and associations in a Topic Map. To address this problem we proposed a dynamic pruning process applied when the Topic Map is displayed, based on a list of meta-properties associated with each Topic. The first meta-property is the Topic score, which indicates the Topic's pertinence and usage when users explore the Topic Map; it is used to manage the evolution of the Topic Map, in particular to prune the Topics considered non-relevant. The second meta-property indicates the level to which the Topic belongs in the Topic Map, organized into three levels according to the meta-model defined in our previous work [3]. We use these meta-properties to improve Topic Map visualization in order to enhance users' navigation and their understanding of the Topic Map content. In future work we will discuss Topic Map quality criteria in more detail, in order to identify an exhaustive list of meta-properties that help manage the Topic Map during its evolution.
References
1. ISO/IEC 13250. Topic Maps: Information technology – document description and markup languages (2000), http://www.y12.doe.gov/sgml/sc34/document/0129.pdf
2. Ellouze, N., Lammari, N., Métais, E., Ben Ahmed, M.: CITOM: Incremental Construction of Topic Maps. In: Horacek, H., Métais, E., Muñoz, R., Wolska, M. (eds.) NLDB 2009. LNCS, vol. 5723, pp. 49–61. Springer, Heidelberg (2010)
3. Ellouze, N.: Approche de recherche intelligente fondée sur le modèle des Topic Maps. Thèse de doctorat, Conservatoire National des Arts et Métiers, Paris, France (December 2010)
4. Jarke, M., Lenzerini, M., Vassiliou, Y., Vassiliadis, P.: Fundamentals of Data Warehouses. Springer, Heidelberg (2000), ISBN 3-540-65365-1
5. Akoka, J., Berti-Équille, L., Boucelma, O., Bouzeghoub, M., Comyn-Wattiau, I., Cosquer, M., Goasdoué, V., Kedad, Z., Nugier, S., Peralta, V., Sisaïd-Cherfi, S.: A Framework for Quality Evaluation in Data Integration Systems. In: Proceedings of the 9th International Conference on Enterprise Information Systems (ICEIS 2007), pp. 170–175 (2007)
6. Legrand, B., Michel, S.: Visualisation exploratoire, généricité, exhaustivité et facteur d'échelle. In: Numéro spécial de la revue RNTI, Visualisation et extraction des connaissances (March 2006)
7. Godehardt, E., Bhatti, N.: Using Topic Maps for Visually Exploring Various Data Sources in a Web-Based Environment. In: Maicher, L., Garshol, L.M. (eds.) TMRA 2007. LNCS (LNAI), vol. 4999, pp. 51–56. Springer, Heidelberg (2008)
8. Weerdt, D.D., Pinchuk, R., Aked, R., Orus, J.J., Fontaine, B.: TopiMaker – An Implementation of a Novel Topic Maps Visualization. In: Maicher, L., Sigel, A., Garshol, L.M. (eds.) TMRA 2006. LNCS (LNAI), vol. 4438, pp. 32–43. Springer, Heidelberg (2007)
9. Dicks, D., Venkatesh, V., Shaw, S., Lowerison, G., Zhang, D.: An Empirical Evaluation of Topic Map Search Capabilities in an Educational Context. In: Cantoni, L., McLoughlin, C. (eds.) Proceedings of the World Conference on Educational Multimedia, Hypermedia and Telecommunications 2004, pp. 1031–1038 (2004)
10. Gyun, O.S., Park, O.: Design and Users' Evaluation of a Topic Map-Based Korean Folk Music Retrieval System. In: Maicher, L., Sigel, A., Garshol, L.M. (eds.) TMRA 2006. LNCS (LNAI), vol. 4438, pp. 74–89. Springer, Heidelberg (2007)
BH : Behavioral Handling to Enhance Powerfully and Usefully the Dynamic Semantic Web Services Composition Mansour Mekour and Sidi Mohammed Benslimane Djilali Liabes University - Sidi Bel Abbes, Computer Science Department, Evolutionary Engineering and Distributed Information Systems Laboratory(EEDIS) [email protected], [email protected]
Abstract. Service composition enables users to realize complex needs as a single request, and it has been recognized as a flexible way of sharing resources and integrating applications since the appearance of Service-Oriented Architecture. Many researchers have proposed approaches for dynamic service composition. In this paper we focus on behaviour-driven dynamic service composition, and more precisely on process integration and interleaving. To enhance dynamic task realization, we propose a way not only to select service processes, but also to integrate and interleave some of them, and we also take advantage of control-flow compatibility. Furthermore, our solution ensures correct service consumption at both the provider and requester levels, by fulfilling the services' behavioural constraints.
Keywords: Semantic web service, composition, behaviour, selection, integration, interleaving, control-flow compatibility.
1 Introduction
There are several benefits to dynamic service composition. Unlike static composition, where the number of services provided to end users is limited and the services are specified at design time, dynamic composition can serve applications or users on an on-demand basis. With dynamic composition, an unlimited number of new services can be created from a limited set of service components. Besides, there is no need to keep a local catalogue of available web services in order to create composite web services, as is the case with most static composition techniques. Moreover, the application is no longer restricted to the original set of operations that were specified and envisioned at design time: its capabilities can be extended at runtime. The customisation of software to the individual needs of a user can also be made dynamic through dynamic composition, without affecting other users of the system [19]. A dynamic composition infrastructure can also be helpful when upgrading an application.
Instead of the application being taken offline and all services suspended before upgrading, users can continue to interact with the old services while the new services are being composed. This provides seamless, round-the-clock upgrading of existing applications [19]. This paper proposes to tackle the dynamic web service composition problem using flexible process handling, i.e., selection, integration and/or interleaving. After a review of the literature in Section 2, Section 3 details the proposed approach: the service behaviour specification is presented, the scenario matching and composition are described, and the BH architecture is introduced. Section 4 discusses the results obtained in the experiments. Finally, Section 5 presents our conclusions and future work.
2 Related Works
Several dynamic service composition approaches have been proposed and implemented in the literature. In some works [9, 10, 14, 17, 20–22] services are described by their signature (inputs and outputs, and sometimes also preconditions and effects). [17] presents an approach to combine services without any previous knowledge about how services should be chained; its complexity is high, as all possible chaining schemes need to be investigated. In [14], web service composition is viewed as a composition of semantic links that refer to the semantic matchmaking between web service parameters (i.e., outputs and inputs), in order to model their connection and interaction. [10] presents a myopic method to query services for their revised parameters using the value of changed information, to update the model parameters and to recompose the WSC at run time. In [20], the authors present an approach for identifying service composition patterns from execution logs: they locate a set of associated services using the Apriori algorithm and recover the control flows among the services by analyzing the order of service invocations. [9] suggests an optimization approach to identify a reduced set of candidate services during dynamic composition. In [21], the authors propose an approach based on gradual segmentation, taking into account both the limited capacity of services and the utilization of historical data, to ensure equilibrium between the satisfaction degrees of temporally sequential requirements. However, in the above approaches unexpected capabilities may be employed, which generates uncertainty about how the user's information is manipulated, because the reliability properties of the composition behaviours are not taken into account. Some approaches improve on this by providing task decomposition rules in order to orient the service chaining process [22]. On the other hand, in several works [1–3, 6–8, 11–13, 15, 16] the authors argue that the process description is richer than the signature description, as it provides more information about the service's behaviour, thus leading to a more useful composition. In [11], two strategies are given to select component web services that are likely to successfully complete the execution of a given
sequence of operations. In [2], the authors propose a simple web service selection scheme based on the user's requirements for various non-functional properties and on interaction with the system. [8] takes user constraints into account during composition; they are expressed as a finite set of logical formulas in the Knowledge Interchange Format language. In [1, 3, 6], the provided service capabilities are matched against the capabilities required in the target user task. [12] describes services as processes and defines a request language named PQL1; this language allows finding, in a process database, those processes that contain a fragment responding to the request. [1] proposes a composition schema that integrates a set of simple services to reconstruct a task's process. In [16], the user's request is specified in a high-level manner and automatically mapped to an abstract workflow; the service instances that match those described in the abstract workflow, in terms of inputs, outputs, preconditions and effects, are then discovered to constitute a concrete workflow description. [15] focuses on the adaptive management of QoS-aware service composition in grid environments; the authors propose a heuristic algorithm to select candidate services for composition. In [13] the authors present a multi-agent-based semantic web service composition approach, which adopts a composition model with a dedicated coordinator agent and performs negotiation between the service requester agent and all the discovered service provider agents before the final provider agent is selected. [7] proposes a Petri-net-based hierarchical dynamic service composition to accurately characterize user preferences.
3 Our Proposal
This work is an improvement of our earlier contribution in [18]. It aims to support the realization of user needs (tasks) at run time. Our solution increases the chance of realizing the user task and ensures the correct usage of web services according to the following factors:
– flexible handling of web service behaviours (selection, integration and/or interleaving), together with control-flow compatibility, to enable full exploitation of the available services at composition time;
– the consideration of all provided service scenarios as primitives2;
– the ability for users to specify the required primitive scenarios3.
We consider these factors in order to enhance dynamic task realization through the useful and powerful exploitation of the provided services, and to fulfill both the provider's constraints and the requester's needs.
1 PQL: Process Query Language.
2 Primitive provided scenario: a scenario that must be invoked in its entirety, as specified by its provider.
3 Primitive required scenario: a scenario that must be retrieved in its entirety, as specified by its requester.
3.1 Service Behaviour Specification
The specification of service behaviour should rely on a formal model in order to enable automated reasoning that yields a valid integration of services realizing the user task. Both the services and the user task are considered complex and are described by complex behaviours. The behaviour of a web service defines the temporal relationships and properties between the service operations that are necessary for a valid interaction with the service [4]. We can therefore regard it as the set of scenarios provided by the service, each of which can be described by a set or list of capabilities interconnected by control flows. To find all the scenarios that a service behaviour can provide, we generate a formal grammar from the service behaviour description and then substitute every non-terminal element by the set of its production rules. For each composite service (CS) we build a production rule (pr): the left-hand side of the rule is the composite service and the right-hand side lists the services that compose it. The control flows sequence (•) and external choice (|), as well as the loop constructs, are represented implicitly by the formal grammar definition. In this work we adopt two kinds of production rules to describe the iterative constructs:
– C → C1 • C1 • C1 · · · • C1, for the execution of a service with a known number of iterations,
– C → C • C1 | C1, for the execution of a service with an unknown number of iterations.
The other control flows, namely parallel (‖), synchronization and unordered (?), are added to the grammar as terminal elements.
Grammar generation. Let us denote by
– Des: the set of all services (composite or atomic) and control flows appearing in the service description,
– CF: the set of all control flows that may be used in a service description, i.e., the sequence (•), external choice (|), parallel (‖), synchronization and unordered (?) constructs,
– G(S, N, T, P): the formal grammar that describes a service, where:
• S is the main service (the only service composed of all the others), which never appears on the right-hand side of any production rule;
• N is the set of all composite services (non-terminal elements);
• T is the set of all atomic services and control flows (terminal elements), except the sequence, external choice and iterative constructs;
• P is the set of production rules of all the composite services.
More formally, we define the grammar parameters as follows:
– Definition 1 (set of non-terminal elements): N = {x ∈ Des | x is a composite web service}
– Definition 2 (set of used control flows): UCF = {x ∈ Des | x ∈ CF}
– Definition 3 (set of atomic services): AS = {x ∈ Des | x is an atomic web service}
– Definition 4 (set of terminal elements): T = {x ∈ Des | x ∈ UCF ∪ AS}
– Definition 5 (set of production rules): P = {∀x ∈ N, ∃(ρ, ω) ∈ UCF × (N ∪ AS)+ / p : x → ρω}
– Definition 6 (the axiom): S = {∃!x ∈ N, ∀p ∈ P, ∃(ρ, y, (α, β)) ∈ UCF × N × A² / (p : y → ραβ) and (x, x) ≠ (α, β)}, where A ≡ (N ∪ AS)+
The algorithm for generating the formal grammar from a service description is given in Algorithm 1.
Algorithm 1. Grammar Generation
Input: SD /* service description */
Output: G(S, N, T, P)
  N ← ∅; T ← ∅; P ← ∅;
  for all C in SD do /* C is a composite service */
    switch σ do /* σ is the control flow of C */
      case construct with a known number of iterations:
        P ← P ∪ {C → (C1 • C1 • ... • C1)};
        if C1 is atomic then T ← T ∪ {C1};
      case construct with an unknown number of iterations:
        P ← P ∪ {C → C • C1 | C1};
        if C1 is atomic then T ← T ∪ {C1};
      otherwise:
        P ← P ∪ {C → (C1 σ C2 σ ... σ Cn)};
        T ← T ∪ {σ};
        for i ← 1 to n do
          if Ci is atomic then T ← T ∪ {Ci};
    N ← N ∪ {C};
  S ← GetMWS(P); /* function returning the main web service */
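For readers who prefer an executable form, a compact Python transcription of Algorithm 1 is sketched below. It assumes that the service description has already been parsed into (composite, control_flow, components) triples and that the two iterative constructs are flagged as "loop_known" and "loop_unknown"; both are representation choices made for this sketch only.

    # Sketch of Algorithm 1: grammar generation from a parsed service description.
    def generate_grammar(sdes, atomics):
        N, T, P = set(), set(), {}
        for composite, flow, components in sdes:
            if flow == "loop_known":              # C -> C1 • C1 • ... • C1
                P[composite] = " • ".join(components)
            elif flow == "loop_unknown":          # C -> C • C1 | C1
                c1 = components[0]
                P[composite] = f"{composite} • {c1} | {c1}"
            else:                                 # C -> C1 σ C2 σ ... σ Cn
                P[composite] = f" {flow} ".join(components)
                T.add(flow)
            T.update(c for c in components if c in atomics)
            N.add(composite)
        used = {c for _, _, comps in sdes for c in comps}
        S = next(c for c in N if c not in used)   # main service: never on a right-hand side
        return S, N, T, P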
3.2 Non-Terminal Substitution and Production Rule Refinement
Once the grammar has been generated, and as mentioned above, the different scenarios of the service behaviour can be found by substituting the non-terminal elements by their production rules. To generate all the scenarios that a service behaviour can provide, the substitution process must start from the main service S. For some iterative control flows, such as iterate, repeat-while and repeat-until, the number of iterations is unknown until the actual execution takes place, which prevents the scenario from being built. The authors of [5, 23] address this challenge by adopting prediction techniques that use the service invocation history to anticipate the number of iterations, but this estimate does not always hold, because it can change from one invocation to another according to the environment parameters4. Furthermore, the number of iterations matters less than the right choice of an appropriate service: if two services are ideally equivalent, they must have the same number of iterations under the same invocation constraints. To overcome this challenge, we keep the production rule C → C1 for the iterative construct with an unknown number of iterations and perform the substitution process over it. At this point the service behaviour is described by one global production rule (GPR), representing all the different scenarios the service behaviour can provide, together with a set of production rules for the iterative constructs with an unknown number of iterations; this global production rule no longer contains any non-terminal element, which enables the generation of all the different scenarios. To facilitate the matching operation and to generate the different scenarios of the service behaviour, we use the external choice control flow to decompose the global rule. These scenarios are then adapted, reorganized and divided into sub-scenarios called conversation units (CU). Each conversation unit is a set or list of services (atomic and/or composite) interconnected by the same control flow δ, and is described in prefix notation: CU : δC1C2 · · · Cm. The algorithm for non-terminal substitution and production rule refinement is given in Algorithm 2.
3.3 Conversation Unit Matching
To match conversation units, we compute the similarity between the control flows and between the services that constitute them. To increase the chance of realizing the requested conversation unit (RCU), we take advantage of the execution order: the CUs δC1C2C3, δC1C3C2 and δC3C1C2 are considered equal (provided δ is not a sequence). To carry out the RCU more effectively, we also take advantage of control-flow compatibility when the required flow is more generic than the provided one. For example, if the required flow is unordered and the provided one is a split construct, it is useful to consider them compatible. To compute the similarity Sim between the required and the provided information, we propose the following four formulas, used respectively for data (Di) "input/output", predicate (Pi) "precondition/effect", atomic service (ASi) and conversation unit (CUi) similarity.
4 The service to be invoked, the invocation parameters, etc.
Algorithm 2. Non-Terminal Substitution and Optimization
Input: P /* set of production rules */
Output: GPR
  for all p in P do
    repeat
      CanSubstitute ← False;
      SetCS ← GetAllCS(p); /* get all composite services appearing in p */
      ρ ← GetCF(p); /* get the control flow of p */
      for all CS in SetCS do
        if (ρCS = ρ) or (ρCS = |) then /* ρCS is the control flow of the rule of CS */
          p ← Ers(CS, P); /* substitute the production rules of CS in p */
          P ← Delete(CS, P); /* delete the production rules of CS */
          CanSubstitute ← True;
    until CanSubstitute = False;
  GPR ← Substitute(P); /* substitute all remaining production rules in S */
In these formulas, i denotes a required ("r") or an offered ("o") item, wx is the weight of information x, l is the lateral part of a predicate, op is the arithmetic operator part of a predicate (<, <=, >, >=, =, ≠), and cp ranges over the service parameters (data and predicate information). We use ontologies as the data description formalism.

Sim(Dr, Do) = Depth(Co) / Depth(Cr), where Di is an instance of the concept Ci ..... (1)

Sim(Pr, Po) = (wl · Sim(lr, lo) + wop · Sim(opr, opo)) / (wl + wop) ..... (2)

Sim(ASr, ASo) = [ Σcp_i (wp_i / np_i) · Σj=1..np_i Sim(Pr_j, Po_j) ] / Σcp_i wp_i ..... (3)

Sim(CUr, CUo) = [ (was / nas) · Σi=1..nas Sim(asr_i, aso_i) + wσ · Sim(σr, σo) ] / (was + wσ) ..... (4)
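As an illustration, formulas (1) and (4) translate directly into code; the Python sketch below assumes that the concept depths and the pairwise service similarities have already been computed, and the default weights are arbitrary:

    # Formula (1): data similarity as a ratio of ontology depths.
    def sim_data(depth_required_concept, depth_offered_concept):
        return depth_offered_concept / depth_required_concept

    # Formula (4): conversation-unit similarity combining the average similarity
    # of the matched atomic services with the similarity of the control flows.
    def sim_cu(service_sims, flow_sim, w_as=1.0, w_sigma=1.0):
        avg_services = sum(service_sims) / len(service_sims)
        return (w_as * avg_services + w_sigma * flow_sim) / (w_as + w_sigma)

    # e.g. two matched services (0.8 and 1.0) under compatible control flows:
    print(sim_cu([0.8, 1.0], flow_sim=1.0))       # 0.95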
3.4 Scenarios Composition
Since the service behaviour is an external choice between the different scenarios provided by a service, and each scenario is a set of CUs, the required conversation unit (RCU) can be obtained either by selecting, integrating and/or interleaving some of the provided ones (PCUs), see Fig. 1. This aspect is rarely considered by researchers; we investigate it in this work in order to fulfill the user requirements as fully as possible. For example, if the required conversation is ?C1C2C3C4 and the provided ones are ‖C1C3 and •C2C3, where the symbols "?, ‖, •" represent the unordered,
Fig. 1. Conversations handling
split and sequence constructs respectively, then the RCU can be obtained by interleaving them. Furthermore, to fulfill the provider's constraints, we assume that each provided scenario (set of PCUs) is primitive and must be consumed as specified by its provider. In the same way, to fulfill the requester's preferences, it is possible to specify the required primitive scenarios that must be obtained exactly as indicated by the requester. These required primitive scenarios enable the control of PCU integration and/or interleaving. The algorithm for PCU handling is given in Algorithm 3.
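A simplified Python sketch of the coverage test behind integration/interleaving is given below. The compatibility table, the CU encoding as a (flow, services) pair and the exact-coverage criterion are all simplifying assumptions; the full matching of Sect. 3.3 is score-based rather than boolean.

    from collections import Counter

    # required flow -> provided flows it may be served by ("||" stands for split/parallel)
    COMPATIBLE = {"?": {"?", "||", "•"}, "•": {"•"}, "||": {"||"}}

    def covered_by_interleaving(rcu, pcus):
        r_flow, r_services = rcu
        if any(p_flow not in COMPATIBLE[r_flow] for p_flow, _ in pcus):
            return False
        provided = Counter(s for _, services in pcus for s in services)
        return Counter(r_services) == provided     # every required service covered exactly once

    rcu  = ("?", ["C1", "C2", "C3", "C4"])                       # hypothetical request
    pcus = [("||", ["C1", "C3"]), ("•", ["C2", "C4"])]           # hypothetical provided CUs
    print(covered_by_interleaving(rcu, pcus))                    # True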
Architecture
Fig. 2. BH Architecture
Algorithm 3. PCUs Handling
Input:  Threshold, RCU, SetPCU;    /* set of PCUs */
Output: NewCU;                     /* new CU */
Max ← 0; Rang ← 1;
Depth ← GetDepth(RCU);             /* get the RCU depth */
repeat
    SetCombsPCU ← GetAllCombs(Rang, SetPCU);   /* all PCU combinations of rank Rang */
    for any CombPCU in SetCombsPCU do
        Sim ← GetSim(RCU, CombPCU);
        if (Sim > Max) then
            Max ← Sim;
            NewCU ← CombPCU;
    Rang++;
until (Rang = Depth + 1) or (Max = 0);
As indicated in Fig. 2, the proposed architecture consists of three engines.
– The CUBuilder: i) generates the grammar for a service/task description, ii) constitutes a global conversation scenario fitted out with a set of conversations for repetitive constructs, iii) builds a set of CUs (see Section 3.1), and iv) deploys the generated PCUs in the UDDI and the RCUs on the client machine.
– The Matchmaker: this engine uses the information submitted by the client to compose and combine the candidate PCUs (retrieved automatically from the UDDI), either by selecting, integrating and/or interleaving some of them, in order to constitute the RCUs (see Section 3.4).
– The Generator: the selected scenario is encoded by the generator in a given language as an orchestration file (e.g., the process model of OWL-S). This file is used by the client to invoke and interact with the services involved in the composition.
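To make the responsibilities of the three engines concrete, they can be pictured as the following Java interfaces. The engine names come from the text above, but every method signature is an assumption introduced only for illustration (the sketch reuses the ConversationUnit type from the earlier sketch).

import java.util.List;

/** Builds provided/required conversation units from service and task descriptions. */
interface CUBuilder {
    List<ConversationUnit> build(String serviceDescription);   // steps i) to iii)
    void deploy(List<ConversationUnit> providedUnits);         // step iv): publish the PCUs
}

/** Combines candidate PCUs by selection, integration and/or interleaving. */
interface Matchmaker {
    List<ConversationUnit> match(ConversationUnit required,
                                 List<ConversationUnit> candidates,
                                 double threshold);
}

/** Encodes the selected scenario as an orchestration file (e.g., an OWL-S process model). */
interface Generator {
    String generate(List<ConversationUnit> scenario);
}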
4 Experiments
The experiments (Fig. 3) show that the number of compositions obtained by interleaving is always greater than the number obtained by the other solutions, because the latter are specific cases of interleaving. Theoretically, the number of interleaving compositions is greater than the sum of those obtained by the other solutions. As indicated in Fig. 4, when the RCU depth is:
– lower than the PCU depth, no solution can be found;
– equal to the PCU depth, the number of solutions is the same for all strategies, and it is the number of selection solutions;
– greater than the PCU depth, no solution can be found by selection; the number of interleaving solutions is always greater than the number of integration solutions. We can also see that the number of integration/interleaving solutions increases as the difference between the RCU and PCU depths increases.
Furthermore, our approach minimises the number of scenarios to be combined (integrated and/or interleaved) (Fig. 5), which favours, firstly, the direct use of preconceived scenarios instead of constituting new ones and, secondly, the number of integrated services (Fig. 6). For example, if we have two solutions, the first containing two scenarios and the second containing three, the chosen solution will always be the first, whatever the number of involved services, because it is the best one with respect to the number of scenarios.
Fig. 3. RCU realization chance
Fig. 4. CU depth influence
Fig. 5. PCUs Count
Fig. 6. Services Count
5 Conclusions and Future Works
Service composition involves the development of customised services, often by discovering, integrating and executing existing services. It is not only about consuming services, however, but also about providing them: already existing services are orchestrated into one or more new services that better fit the composite application. In this article we have proposed a dynamic approach to compose semantic web services. It relies on process handling to constitute a concrete task description from its abstract description and from the full semantic description of the web services. This solution:
– takes into account the complexity of both the task and the services;
– ensures a correct consumption of services by considering the provided scenarios as primitives;
– fulfills the user preferences through the ability to specify primitive required scenarios;
– facilitates scenario matching by decomposing scenarios into CUs;
– increases the chance of realising the RCU, and hence the task, through: • PCU handling (selection, integration and/or interleaving), • taking control-flow compatibility into account;
– enables the selection and combination of composite web services, as well as the invocation of parts of a composite web service.
Future work aims at the following refinements of the engines:
– the Matchmaker engine, by allowing the combination of RCUs so as to use a single PCU, and by taking the quality of service into account;
– the Generator engine, to support all OWL-S constructs and to enable the manual invocation of composite services.
References 1. Aggarwal, R., Verma, K., Miller, J., Milnor, W.: Dynamic web service composition in meteor-s. In: Proc. IEEE Int. Conf. on Services Computing, pp. 23–30 (2004) 2. Badr, Y., Abraham, A., Biennier, F.: Enhancing web service selection by user preferences of non-functional features. In: The 4th International Conference on Next Generation Web Services Practices, pp. 60–65 (2008) 3. Benatallah, B., Sheng, Q.Z., Dumas, M.: The self-serv environment for web services composition. IEEE Internet Computing 7(1), 40–48 (2003) 4. Benmokhtar, S.: Intergiciel S´emantique pour les Services de l’Informatique Diffuse. PhD thesis, Ecole Doctorale: Informatique, T´el´ecommunications et Electronique de Paris, Universite De Paris 6 (2007) 5. Canfora, G., Penta, M.D., Esposito, R., Villani, M.L.: An Approach for QoSaware Service Composition based on Genetic Algorithms. In: GECCO 2005, Washington, DC, USA, pp. 25–29 (June 2005); Copyright 2005 ACM 1595930108/05/0006 6. Chakraborty, D., Joshi, A., Finin, T., Yesha, Y.: Service Composition for Mobile Environments. Journal on Mobile Networking and Applications, Special Issue on Mobile Services 10(4), 435–451 (2005) 7. Fan, G., Yu, H., Chen, L., Yu, C.: An Approach to Analyzing User Preference based Dynamic Service Composition. Journal of Software 5(9), 982–989 (2010) 8. Gamha, Y., Bennacer, N., Naquet, G.V.: A framework for the semantic composition of web services handling user constraints. In: The Sixth IEEE International Conference on Web Services, pp. 228–237 (2008) 9. Ganapathy, G., Surianarayanan, C.: Identification of Candidate Services for Optimization of Web Service Composition. In: Proceedings of the World Congress on Engineering, London, U.K., vol. I, pp. 448–453 (2010) 10. Harney, J., Doshi, P.: Selective Querying For Adapting Web Service Compositions Using the Value of Changed Information, pp. 1–16 (2009)
11. Hwang, S.Y., Lim, E.P., Lee, C.H., Chen, C.H.: On composing a reliable composite web service: a study of dynamic web service selection. In: Processing of the Fifth IEEE International Conference on Web Service, pp. 184–189 (2007) 12. Klein, M., Bernstein, A.: Towards high-precision service retrieval. In: The Semantic Web - First International Semantic Web Conference, Sardinia, Italy, pp. 84–101 (2002) 13. Kumar, S., Mastorakis, N.E.: Novel Models for Multi-Agent Negotiation based Semantic Web Service Composition. Wseas Transaction on Computers 9(4), 339– 350 (2010) 14. Lecue, F., Deltiel, A., Leger, A.: Web Service Composition as a Composition of Valid and Robust Semantic Links. International Journal of Cooperative Information Systems 18(1), 1–62 (2009) 15. Luo, J.Z., Zhou, J.Y., Wu, Z.A.: An adaptive algorithm for QoS-aware service composition in grid environments, pp. 217–226 (2009); Special issue paper: SpringerVerlag London Limited 2009 16. Majithia, S., Walker, D.W., Gray, W.A.: A framework for automated service composition in service-oriented architecture. In: 1st European Semantic Web Symposium, pp. 265–283 (2004) 17. Masuoka, R., Parsia, B., Labrou, Y.: Task computing- the semantic web meets pervasive computing. In: 2nd International Semantic Web Conference, pp. 866– 881. Springer, Heidelberg (2003) 18. Mekour, M., Benslimane, S.M.: “Integration Dynamique des Services Web Semantiques ` a Base de Conversations”. In: Doctoriales en Sciences et Technologies de l’Information et de la Communication STIC 2009, M’Sila, Alg´erie, pp. 40–45 (D´ecembre 2009) 19. Mennie, D.: An architecture to support dynamic composition of service components and its applicability to internet security. Masters Thesis, Carleton University, Ottawa, Ontario, Canada (October 2000) 20. Tang, R., Zou, Y.: An Approach for Mining Web Service Composition Patterns from Execution Logs, pp. 1–10 (2010) 21. Wang, X., Wang, Z., Xu, X., Liu, A., Chu, D.: A Service Composition Approach for the Fulfillment of Temporally Sequential Requirements. In: 2010 IEEE 6th World Congress on Services, pp. 559–565 (2010) 22. Dan, W., Bijan, P., Evren, S., James, H., Dana, N.: Automating DAML-S web services composition using SHOP2. In: Proceedings of 2nd International Semantic Web Conference, Sanibel Island, Florida, pp. 195–210 (2003) 23. Zeng, L., Benatallah, B., Ngu, A.H.H., Dumas, M., Kalagnanam, J., Chang, H.: QoS-aware middleware for web services composition. IEEE Transactions on Software Engineering 30(5), 311–327 (2004)
Service Oriented Grid Computing Architecture for Distributed Learning Classifier Systems Manuel Santos, Wesley Mathew, and Filipe Pinto Centro Algoritmi, Universidade do Minho, Guimarães, Portugal {mfs,Wesley}@dsi.uminho.pt, [email protected]
Abstract. Grid computing architectures are suitable for addressing the data mining challenges raised by distributed and complex data. A service oriented grid offers synchronous or asynchronous request/response services between the grid environment and end users. Gridclass is a distributed learning classifier system for data mining purposes; it combines different isolated tasks, e.g., managing data, executing algorithms, monitoring performance, and publishing results. This paper presents the design of a service oriented architecture to support the Gridclass tasks. Services are organised in three levels according to functional criteria: user level services, learning grid services and basic grid services. The results of an experimental test on the performance of the system are presented, and the benefits of this approach are discussed. Keywords: Distributed learning classifier system, user level services, learning grid services, and basic grid services.
1 Introduction Day by day, a phenomenal expansion of digital data is happening in all knowledge sectors. Isolated data mining faces two data-related challenges: the size and the location of data repositories. Executing data mining algorithms on a single computer is no longer sufficient to cope with this distributed explosion of data. These requirements led scientists to design higher-level data mining architectures such as distributed data mining architectures [11, 12]. Distributed Data Mining (DDM) architectures manage the complexity of data repositories distributed all over the world [1]. Grid computing has emerged from distributed and parallel technologies; moreover, it facilitates the coordinated sharing of computing resources across geographically distributed sites. A service oriented grid computing architecture makes it possible to develop flexible and suitable learning classifier system services on a grid platform. This paper presents the conceptual model of the grid services that are necessary for a distributed learning classifier system. A distributed learning classifier system basically generates two levels of learning models: the local learning model and the central learning model [2]. The local learning models are generated at the different distributed sites and the global model
is generated in the central system. The design of the service oriented grid computing architecture for distributed learning classifier systems presents three levels of services that are the user level services, the learning grid services and the basic grid services. The user level services are the top level of the hierarchical structure of the service oriented design of grid based learning classifier system. It also contains services for global system management. User level services acts as the interface for users to access the services in the learning grid. The learning grid services are the middle level services between user level services and basic grid services. Based on the demands of the user level service, the learning grid services will be invoked. The required services of the learning grid are the data access service, the local model induction service, the global model induction service, the local model evaluation service, the global model evaluation service, the execution schedule service and the resource allocation service. Learning grid services are executed with the supports of the basic grid services. The basic grid services are the core services that are provided in the grid computing applications. Basic grid services are the security management services, the data management services and the execution management services. The service oriented grid application can bring benefits to the area of Distributed Data Mining (DDM) application. The purpose of the DDM method is to avoid transferring the data between different local databases to the central database. The Service oriented approach has more flexibility to apply the data mining induction algorithm (local model induction service) to each database. Similarly, the global model induction service accumulates all local models that are generated by the local model induction service and generate global model. The main attraction of this design is, without dealing with huge amount of data in the central site and without transferring large volume of data through the network, to turn possible the data mining in a distributed environment. Remaining sections of the paper are organized as follows. Section 2 explains the learning classifier system’s technology. Section 3 presents the grid computing and its importance. Section 4 describes the Gridclass system and its structure. Section 5 explains the service oriented grid especially the services for the distributed learning classifier system. Section 6 presents some experimental results on the performance of the system and section 7 concludes de paper.
2 Learning Classifier Systems Learning Classifier System (LCS) is a concept formally introduced by John Holland as a genetic based machine learning algorithm. Supervised Classifier System (UCS) is a LCS derived from XCS [3, 13]. UCS adopted many features from the XCS that are suitable for supervised learning scheme. UCS algorithm was chosen to be applied in this implementation of grid based distributed data mining due to the supervised nature of the most part of the problems in this area. Substantial work has been done to parallelize and to distribute the LCS canonical model in order to improve the performance and to be suited to inherently distributed problems. Manuel Santos [4] developed the DICE system, a parallel and distributed architecture for LCS. A. Giani, Dorigo and Bersini also did significant research in the area of parallel LCS [5]. Other approaches can be considered in this group. For instance, meta-learning systems construct the global population of classifiers from a collection of inherently
distributed data sources [5]. GALE is a fine grained parallel genetic algorithm based on a classification system [6]. Finally, learning classifier system ensembles with rule sharing is another associated work related to the parallel and distributed LCS [8].
3 Grid Computing Grid computing developed from distributed computing and parallel computing technologies. In distributed computing only a few resources are shared among the other resources, whereas in grid computing all resources are shared; that is the main difference between the two. Cluster computing and internet computing are alternatives to grid computing. Cluster computing can share the resources of dedicated and independent machines within a particular domain [9]. Although cluster computing brings high performance, high availability, load balancing and scalability, it is only available within a single domain. Internet computing can share the resources of a local area network or a wide area network [9]. The resources in internet computing are connected on a voluntary basis, so the security of the data and of the processes is the main issue. Grid computing technology is a more suitable and reliable approach for resource sharing and distributed applications. A grid can exploit the computing power of distributed resources connected to different LANs or WANs in a reliable and secure manner. Sharing resources in a grid has many benefits [9]: 1) it improves the utilisation of resources; 2) it can execute large-scale applications that cannot be executed within a single resource; 3) it uses heterogeneous computing resources across different locations and different administrative domains; and 4) it is better suited for collaborative applications. A grid can be considered a virtual supercomputer, because geographically distributed resources provide it with large computational power. A grid is the union of a data grid, a computing grid and a service grid; the following lines introduce these concepts [9]. A data grid gives access to store and retrieve data across multiple domains; it manages the security of data access and policies and controls the physical data stores. Computing grids are developed to provide the maximum computing power to an application. A service grid provides services that are not limited to a single computer; it provides a collaborative work group that includes users and applications. Users are able to interact with applications through the services that are available in the service grid.
4 Gridclass System Gridclass system is a distributed and parallel grid based data mining system using a supervised classifier system (UCS) [2]. Two different styles for inducing data mining models may be applied in the distributed applications: Centralized Data Mining (CDM) and Distributed Data Mining (DDM) [2]. CDM is the conventional method for distributed data mining that first collects all data from every node in the distributed environment and then applies a mining algorithm to the accumulated data. DDM method has to do data mining at every node and send results (learning models) to the central system for developing the final result (global model). Gridclass system
adopts the DDM pattern in the grid environment. In Gridclass, seven different tactics are available for constructing the global model in DDM [8]: Generalized Classifier Method (GCM); Specific Classifier Method (SCM); Weighed Classifier Method (WCM); Majority Voting Method (MVM); Model Sampling Method (MSM); Centralized Training Method (CTM); and Data Sampling Method (DSM). Gridclass is mainly composed of three modules: the toolkit, the local nodes and the central system. The toolkit is an interface that makes it easy to submit work and analyse the results. Each local node in the grid holds its own data, so the individual UCS instances execute synchronously and generate local learning models. Each local node is connected to the central system, which collects all the local models to build the global model using a subset of the available DDM strategies. Gridgain, a Java based grid computing middleware, is used for the implementation of the Gridclass system [10]. This middleware combines a computational grid and a data grid, and it is a simple technology for implementing grid computing.
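To make the DDM pattern of Gridclass concrete, the following Java sketch shows the overall flow; it deliberately does not use the Gridgain or Gridclass APIs, and every type and method name is a hypothetical placeholder introduced for illustration only.

import java.util.ArrayList;
import java.util.List;

interface LocalNode {
    /** Runs a UCS instance on the node's local data and returns the local model. */
    RuleSet induceLocalModel(UcsParameters params);
}

interface GlobalModelStrategy {              // e.g., GCM, SCM, WCM, MVM, MSM, CTM, DSM
    RuleSet merge(List<RuleSet> localModels);
}

final class RuleSet { /* population of classifiers (omitted) */ }
final class UcsParameters { /* UCS configuration (omitted) */ }

final class CentralSystem {
    /** DDM: induce models where the data lives, then combine them centrally. */
    RuleSet buildGlobalModel(List<LocalNode> nodes,
                             UcsParameters params,
                             GlobalModelStrategy strategy) {
        List<RuleSet> localModels = new ArrayList<>();
        for (LocalNode node : nodes) {
            localModels.add(node.induceLocalModel(params)); // only models cross the network
        }
        return strategy.merge(localModels);
    }
}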
5 Service Oriented Grid
The service oriented grid operation simplifies the relationship between the user and the grid. Figure 1 shows the basic architecture of the service oriented grid for a learning classifier system.
Fig. 1. Basic structure of the service GRID based distributed learning classifier system
The service oriented grid provides different layered services for the distributed learning classifier system; those services can be invoked using web service technology. The user submits the requirements of the distributed learning classifier system at the application level; the application level then requests the service from the grid master, and the grid master executes those operations within the grid and returns the result to the application level. The application level contains different local nodes (Ni) and a central node. The grid master (service manager) is the middle layer between the application level and the grid level, where several resources, e.g., computers, are available (Ri). The learning process is executed at each distributed site. Figure 2 presents the layered structure of the services for the distributed learning classifier system. User level services are available at the application level, learning grid services at the grid master level and basic grid services in the grid. For example, suppose the user needs to execute the distributed learning classifier system in a service oriented grid with specific data on particular sites. Firstly, the user should specify the number of nodes and the locations of the data. In the learning grid, different services may be used for inducing the local models (LMi), so the user needs to indicate the specific service for inducing the learning model and the configuration parameters of that service. Similarly, many strategies are available for constructing the global model (GM), so the global model construction strategy and its configuration parameters should be specified. Besides, the output format of the result should also be mentioned. This is all the information that the user can specify in the user level services.
Fig. 2. Layered structure of the services for the distributed learning classifier system
The user level service will invoke the services that are available in the grid master for the execution of the user requirements. The grid master will pass information to the grid for the execution of the user request. 5.1 User Level Services The global system management service is the fundamental service in the user level service. This service acts as an interface between the user and the grid system, so the
global system management service will collect all the information from user about the execution of distributed learning classifier system and invokes the grid learning services accordingly. The global system management service will display information to monitor the execution of the distributed learning classifier system. 5.2 Learning Grid Services Learning grid services are the main processing unit of the service oriented distributed learning classifier system. There are two types of services available: 1) Resource management services; and 2) Execution management services. The resource management services contain five services for data access, local model induction, global model induction, local model evaluation, and for global model evaluation. Data access services provide functions for fetching or for writing data from/to the data repository. Different types of data files are supported (e.g., CSV, XML) therefore many versions of the data access services are required in the learning grid. Local model induction services are in charge for generating the local models. Various instances of the local model induction services can be available in the resource management services. Global model induction services are used for constructing the global model from the different local models. Another two services exist for evaluating the performance of the local model and the global model. Local model and global models are stored in text files therefore the evaluation function will fetch those files and generates some graphical presentation (ROC graph) and make it available to the user. The execution management services [5] contain the execution schedule services and the resource allocation services. The execution schedule service programs the services for the complete cycle of the distributed learning classifier system based on the user requirements and the complexity of the problem. The first step is to trigger the local model induction services based on the number of distributed sites then invokes suitable data access service for reading data from distributed sites. While executing the local learning services, the local model evaluation services will give feedback to the user about the progress of the learning process. After the execution of learning model induction services, the execution scheduling service will invoke the global model induction services. Then again the data access service fetches the local models from each distributed site and provides to global model induction services. After the execution of the global model induction service, the global model evaluation service will present the global model performance in a human understanding format. Resources allocation services [9] will assign the resources based on the tasks scheduled by the execution scheduling services. 5.3 Basic Grid Services The basic grid services include the security services, the data management services, and the execution management services. All these services are provided by the grid computing environment, therefore basic grid services are known as core services. The services in the learning grid perform their functions with the support of these core services. The end user of the grid is not necessary to be aware about the basic services that are available in the grid environment. The security services provide encryption
and decryption services for the data, authentication and authorization services [5]. The data and the service transaction protocols are defined in the security management services. Data access services in the learning grid works with the support of the data management service in the basic grid services. Execution management services direct the services to the resources. Execution management services play an important role for the load balancing and resources failure.
6 Experimental Work The Gridclass system does not parallelize any part of the UCS itself: various instances of the UCS are executed at different distributed sites on different sets of data. Using the conventional Centralized Data Mining (CDM) method, each distributed site s ∈ {1,..,NS} has to send its data to the central site. If each site generates Rs records, the total effort to generate a global model tends to:

Tcdm = Mcdm + Σ_{s=1..NS} T(Rs)

where Tcdm stands for the time needed to induce a global data mining model, Mcdm is the global modelling time, and T(Rs) is the communication time needed to transfer the Rs records from site s to the central site. Data security is another concern when sending raw data. The key advantage of the DDM method is that it avoids sending large volumes of data from each distributed site to the central site. The effort to induce a global model can then be computed as:

Tddm = Mddm + Σ_{s=1..NS} M(Ms)

where Mddm corresponds to the global modelling time and M(Ms) is the modelling time for the local model Ms. When the volume of data rises, Tddm tends to be much smaller than Tcdm (Tddm << Tcdm). DDM can therefore constitute an attractive solution, since the accuracies of the global models are similar. A set of experiments has been carried out to test the performance of the Gridclass system, comparing the execution times of the distributed and centralized data mining approaches. All the experimental work was done using the Gridgain platform, a Java based distributed computing middleware. Different numbers of nodes and different data mining model sizes at the distributed sites were considered. Eleven topologies were considered: 1, 2, 4, 6, 8, 10, 12, 14, 16, 18, and 20 nodes. To measure the computational effort, 8 different model sizes were considered: 400, 800, 1200, 1600, 2000, 3000, 4000 and 5000 rules. Figure 3 shows the speedup attained by the system for each configuration. The configuration parameters used in the UCS are: ProbabilityOfClassZero = 0.5, V = 20, GaThreshold = 25, MutationProb = 0.05, CrossoverProb = 0.8, InexperienceThreshold = 20, InexperiencePenalty = 0.01, CoveringProbability = 0.33, ThetaSub = 20, ThetaSubAccuracyMinimum = 0.99, ThetaDel = 20, ThetaDelFra = 0.10. The lines in the graph show the execution time for each population size: the vertical axis presents the time and the horizontal axis the number of nodes in the distributed setting. Each point in the graph corresponds to the average value obtained over 10 runs.
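As a purely illustrative numerical comparison of the two cost expressions above (assuming the summation form reconstructed here; all figures are invented and are not measurements from the experiments), a short Java sketch:

/** Toy comparison of the CDM and DDM cost expressions (hypothetical numbers). */
final class CostModel {
    public static void main(String[] args) {
        int sites = 10;
        double mCdm = 120.0;             // central modelling over all accumulated data (s)
        double mDdm = 15.0;              // central merging of the local models (s)
        double transferPerSite = 40.0;   // T(Rs): shipping the raw records of one site (s)
        double localModelling = 25.0;    // M(Ms): inducing one local model (s)

        double tCdm = mCdm + sites * transferPerSite;   // 120 + 400 = 520 s
        double tDdm = mDdm + sites * localModelling;    // 15 + 250 = 265 s
        System.out.println("Tcdm = " + tCdm + " s, Tddm = " + tDdm + " s");
    }
}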
The results demonstrate that the execution time of the monolithic configuration (a single node) is always higher than the execution time of the parallel execution on different nodes at the distributed sites.
Fig. 3. Gridclass performance
7 Conclusions This paper presented the conceptual model of a service oriented architecture for a grid based data mining system. Scalability of the services, flexibility and reliability are the three key benefits of this conceptual model. The first feature of this design is the scalability of the services: the user can add different implementations of the services to the learning grid without modifying the existing ones, and can therefore execute old and new services in the same system. The service oriented approach is reliable because the evaluation services for the local and global models deliver live progress reports of the execution, so the user can follow everything the system is doing. Flexibility is another key feature of this architecture, because the global system management service offers the opportunity to specify how the problem has to be executed, e.g., which service to use for the learning model, which service for the global model, and in what format the result should be delivered. After submitting these preliminary details, the user does not need to worry about the rest of the execution, because the execution schedule management service handles the distribution of the work and prioritises the services according to the complexity of the work and the availability of the resources. This architecture brings new facilities to the implementation of grid based learning classifier systems, allowing the most appropriate orchestration of services for each type of problem. Performance tests were conducted in order to demonstrate the efficiency of the approach. Acknowledgments. The authors would like to express their gratitude to FCT (Foundation of Science and Technology, Portugal) for the financial support through the contract GRID/GRI/81736/2006.
References 1. Stankovski, V., Swain, M., Kravtsov, V., Niessen, T., Wegener, D., Kindermann, J., Dubitzky, W.: Grid-enabling data mining applications with DataMiningGrid: An Architectural perspective. Future Generation Computer System 24, 256–279 (2008) 2. Santos, M.F., Mathew, W., Santos, H.: Gridclass: Strategies for Global Vs Centralized Model Construction in Grid Data Mining. In: Proceeding of the Ubiquitous Data Mining Workshop on ECAI, Lisbon (2010) 3. Santos, M.F.: Learning Classifier System in Distributed environments, University of Minho School of Engineering Department of Information System. PhD Thesis work (1999) 4. Giani, A.: Parallel Cooperative classifier system. Dottorato di ricerca in informatica Universita di Pisa, PhD Thesis TD-4/ 99 5. Cesario, E., Congiusta, A., Talia, D., Trunfio, P.: Data analysis services in the knowledge Grid. In: Dubbitzky, W. (ed.) Data Mining Techniques in Grid Computing Environments. Wiley-Blackwell, UK (2008) 6. Llora, X., Garrell, J.M.: Knowledge- Independent Data Mining with Fine Grained Parallel evolutionary Algorithm. In: Proceeding of the Genetic and Evolutionary Computation Conference, GECCO 2001 (2001) 7. Bull, L., Studley, M., Bagnall, A., Whittley, I.: Learning Classifier System Ensembles With Rule Sharing. IEEE 1089-778x (2006) 8. Santos, M., Mathew, W., Santos, H.: Grid Data Mining by means of Learning Classifier Systems and Distributed Model Induction. In: Proceedings of the 14th International Workshop on Learning Classifier Systems, GECCO 2011, Dublin, Ireland (July 2011) 9. Sanchez, A., Montes, J., Dubitzhy, W., Valdes, J.J., Perez, M.S., Miguel, P.d.: Data mining meets grid computing: Time to dance. In: Dubitzky, W. (ed.) Data Mining Techniques in Grid Computing Environments. Wiley-Blackwell, UK (2008) 10. http://www.gridgain.com/key_features.html (consulted on February 8, 2011) 11. Cannataro, M., Congiusta, A., Pugliese, A., Talia, D., Trunfio, P.: Distributed Data Mining on Grid: Services, Tools, and Applications. IEEE Transactions on System, Man, and Cybernetics- Part B: Cybetnetics 34(6) (December 2004) 12. Luo, J., Wang, M., Hu, J., Shi, Z.: Distributed data mining on Agent Grid: Issues, Platform and development toolkit. Future Generation Computer System 23, 61–68 (2007) 13. Orriols-Puig, A.: Further Look at UCS Classifier System. In: GECCO 2006, Seattle, Washington, USA, July 8-12 (2006)
Securing Data Warehouses: A Semi-automatic Approach for Inference Prevention at the Design Level Salah Triki1, Hanene Ben-Abdallah1, Nouria Harbi2, and Omar Boussaid2 1
Laboratoire Mir@cl, Département d’Informatique, Faculté des Sciences Economiques et de Gestion de Sfax, Tunisie, Route de l’Aéroport Km 4 – 3018 Sfax, BP. 1088 {Salah.Triki,Hanene.BenAbdallah}@Fsegs.rnu.tn 2 Laboratoire ERIC, Université Lyon 2, 5 avenue P. Mendès France 69676 Bron, Cedex, France {Nouria.Harbi,Omar.Boussaid}@univ-lyon2.fr
Abstract. Data warehouses contain sensitive data that must be secured in two ways: by defining appropriate access rights for the users and by preventing potential data inferences. Inspired by development methods for information systems, the first way of securing a data warehouse has been treated in the literature during the early phases of the development cycle. However, despite the high risk of inferences, the second way is not sufficiently taken into account in the design phase; it is rather left to the administrator of the data warehouse. Managing inferences during the exploitation phase, however, may induce high maintenance costs and complex OLAP server administration. In this paper, we propose an approach that, starting from the conceptual model of the data sources, assists the designer of the data warehouse in identifying multidimensional sensitive data and the data that may be subject to inferences. Keywords: Data warehouse, Security, Precise Inference, Partial inference.
1 Introduction Organizations have a significant amount of data that can be analyzed to identify trends, examine the effectiveness of their activities, and take decisions to increase their profits. By gathering and consolidating data issued from the organization's information system, a data warehouse (DW) allows decision makers to perform decision analyses and financial forecasts. In fact, several tools dedicated to data warehousing offer various operations for OnLine Analytical Processing (OLAP), assisting users in the decision analysis process. On the other hand, the data in an organization's DW are proprietary and sensitive and should not be accessed without control. Indeed, some data, like medical data or religious and ideological beliefs, are personal and may harm their owners if disclosed. For this reason, several governments have passed laws for the protection of citizens' private
lives. Among these laws, HIPAA1 (Health Insurance Portability and Accountability Act) aims to protect patient medical data by forcing American health care establishments to follow strict safety rules. Similarly, GLBA2 (Gramm-Leach-Bliley Act) requires U.S. financial institutions to protect customer data; on the other hand, Safe Harbor3 allows companies conforming to transfer and use data on European Internet ; and Sarbanes-Oxley4 Act guarantees the reliability of corporate financial data. Agencies must use strict safety rules to comply with these laws, otherwise they are punished. Securing a DW is a twofold task. The first fix the access rights of the DW users. Similar to information systems, this security task can be treated at a conceptual or logical level; the fixed access rights are enforced by the OLAP server. As for the second security task, it seeks to ban malicious users from infering prohibited information through permitted acceses. In fact, there are two types of inferences: precise inferences where the exact data values are deducted, and partial inferences where data values are partially disclosed. Inference prevention at design level reduces administration costs and maintenance of OLAP servers. Despite this, inference prevention at design level has not received enough interest from researchers. The aim of this paper is to propose an approach to model the prevention of inferences using the data source design represented as a class diagram. Our approach has two advantages over existing approaches. The first advantage is its genericity since it is applicable to any business domain. The second advantage is that it takes into account the data available to the malicious user to detect inferences; the majority of inference cases are produced by combining available data. The remainder of the paper is organized as follows: in section 2, we present a state of the art in the DW security domain at the requirement and design levels. In Section 3, we detail our approach. Section 4 presents an example illustrating the use of our approach. Finally, we summarize the work done and outline our work in progress.
2 Related Work The need for securing DW was felt long ago [1] [2]. Several proposed approaches tackled the DW security problem at the requirement, design or logical levels. At the requirement level, [3] propose a profile based on i* and an approach that can model the security requirements. The proposed profile takes into account the RBAC ("Role Based Access Control") model and MAC (" Mondatory Access Control ") model. Thus, for each data to be protected, a security class must be defined in terms of: security role, security level and compartment. Using this profile, the proposed approach to model security requirements operates on three stages : i) analyzing the rules and privacy policies that exist in the organization; ii) interviewing the securityin-charge personnel to define the data to be secured; and iii) affecting security classes for each data. This approach is informally presented. 1
1 http://www.hhs.gov/ocr/privacy/index.html
2 http://www.gpo.gov/fdsys/pkg/PLAW-106publ102/content-detail.html
3 http://www.export.gov/safeharbor/
4 http://www.soxlaw.com/
At design level, several studies have been carried out. [4] propose a UML profile for modeling security and extensions of OCL (Object Constraint Language) to specify security constraints. The UML profile, called SECDW (Secure Data Warehouse), includes new types, stereotypes and tagged values to model the RBAC and MAC models. [5] extended SECDW to represent the concept of conflict among multidimensional elements. However, neither work proposed an approach to design a secure DW model. On the other hand, [6] proposed an approach using the UML state-transition diagram to detect inferences in a DW design. In the state-transition diagram, the states represent the data to display and transitions represent users’ multidimensional queries. The approach takes into account the possibility of inferences from empty cells in a cube (i.e., unavailable data for measures), without addressing the possibility of inferences from available data. For their part, [7] proposed an approach specific to the field of market research in general and the particular case of the company GFK [9]. This approach addressed the case of partial and precise inferences. Precise inferences occur when the exact measure values is deduced, while partial inferences occur when “an idea” about the measure values is deduced. In this work, inferences were detected manually by studying the application domain of market research. At logical level, [5] treated the case of implementation in the multi-dimensional relational model based on the extension of the CWM (“Common Warehouse Metamodel”). This extension allows the definition of security constraints and audit rules for each element of the relational model. Security constraints allow the implementation of RBAC and MAC and audit rules can log access attempts to analyze problematic cases. After our review of the state of the art of DW security, we noticed the following four points: - Modeling access rights has been treated at the requirement, design and logical levels. Existing work ([3] [4] [5] [9]) were able to offer notations for modeling the MAC and RBAC models. However, the proposed approaches were informally described. - Prevention of inferences has been widely treated at the physical level ( [10] [11] [12] [13] [14]). This level can enduce high administrative costs and high maintenance. - Prevention of inferences at the design level has not been sufficiently addressed. The existing works ([6] [7]) do not take into account the potential inferences from the data available and are specific to a particular application domain. - Existing proposals lack assistance in identifying data from the DW that are potentially subject to inferences. In this paper, we treat the last two points by proposing at the design level: i) a UMLbased language for modeling data potentially subject to inferences, and ii) a semiautomatic approach to identify such data.
3 Proposed Approach The approach we propose is based on the data sources’ class diagram. In addition, it assumes that the DW schema is already designed and mapped to the data sources.
In fact, our approach fits in and complements the three types of DW design approaches: bottom-up ([15] ), top-down ([16]), and mixed ([17]). In all three types of design approaches, once the DW schema is developed, it must be matched with the data sources to indicate the source of the elements that will be used to load each element of the DW; this mapping is vital for the definition of the ETL procedures. In the case of bottom-up and mixed approaches, this mapping is produced by default since the DW schema definition is developed from the data sources. As for the topdown approaches, this mapping is needed to validate the specified DW schema. Our approach (see Fig. 1) comprises three phases. The first phase, carried out by the security designer, identifies the elements to be protected in the DW design. In the second phase, we first automatically build an inferences’ graph used to detect the elements which may lead to inferences; secondly, the designer distinguishes the elements that lead to precise inferences and those that lead to partial inferences. In the third phase, we automatically enrich the DW schema model by UML annotations highlighting the elements subject to both types of inferences. Note that we use in this paper the star schema to model the DW schema. 3.1 Definition of Sensitive Data Given a DW schema, the definition of sensitive data annotates the elements of the multidimensional model. It is made by the DW security designer who may be assisted by an expert in the field. The role of the domain expert is to identify the data to be protected. This data is indicated by annotations with the UML stereotype “Sensitive data” (see Fig. 1). 3.2 Inference Graph Construction Definition 1: An inference graph is a set of nodes connected by oriented arcs. The nodes represent the data (in the source) and the arcs indicate the direction of inference and the inference type (partial/precise). Graphical notations: An inference graph is graphically composed of: - Two types of nodes: nodes colored in gray represent the sensitive data, and nodes colored in white represent the non-sensitive data. - Two types of arcs: dotted arcs indicate partial inferences and solid arcs indicate precise inferences. Take the case of health, disease (sensitive data), treatment and service are represented by nodes. The correspondence between the disease and treatment is the inferences and their meaning (Fig. 2): Knowing the treatment, one can infer the disease. This inference is precise because two different diseases may not have the same treatment, so we have a solid arc treatment to illness. On the other hand, if in a hospital, each service treats a number of diseases, then, knowing the service, one can have an idea about the kind of disease but not its name; this is modeled by the dotted arc from service to illness.
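For illustration, the inference graph of Definition 1 can be encoded with a small data structure. The Java sketch below, including the encoding of the health example of Fig. 2, uses class and field names that are our own assumptions rather than part of the approach's tooling.

import java.util.ArrayList;
import java.util.List;

enum InferenceType { PRECISE, PARTIAL }

/** A node of the inference graph: one data element, gray when sensitive. */
final class Node {
    final String name;
    final boolean sensitive;
    Node(String name, boolean sensitive) { this.name = name; this.sensitive = sensitive; }
}

/** A directed arc: knowing 'from' lets a user infer 'to', precisely or partially. */
final class Arc {
    final Node from, to;
    final InferenceType type;
    Arc(Node from, Node to, InferenceType type) { this.from = from; this.to = to; this.type = type; }
}

final class InferenceGraph {
    final List<Node> nodes = new ArrayList<>();
    final List<Arc> arcs = new ArrayList<>();

    /** Example of Fig. 2: treatment -> illness is precise, service -> illness is partial. */
    static InferenceGraph healthExample() {
        InferenceGraph g = new InferenceGraph();
        Node illness = new Node("Illness", true);
        Node treatment = new Node("Treatment", false);
        Node service = new Node("Service", false);
        g.nodes.add(illness); g.nodes.add(treatment); g.nodes.add(service);
        g.arcs.add(new Arc(treatment, illness, InferenceType.PRECISE));
        g.arcs.add(new Arc(service, illness, InferenceType.PARTIAL));
        return g;
    }
}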
Fig. 1. Proposed Approach for DW schema security
Fig. 2. Inference graph: Example
The construction of an inferences graph involves the class diagram of data sources that will load the DW. We use the mapping to prune out the inference graph built from the data sources and restrict it to only the nodes corresponding to data used for the loading of the DW schema. In our approach (see Fig. 1), the inference graph is built automatically based on the cardinality of the class diagram of the data sources. To do this, we apply the following six rules: R1. Each class is represented by a node colored in gray if the corresponding data is sensitive and colored in white otherwise. R2. Each binary association / aggregation between two classes C1 and C2 will be represented by an arc according to the following three cases: Case 1 (see Fig. 3 (a, b)): if the association/aggregation has cardinality * on C1’s side and cardinality 1 or 0..1 on the C2’s side, then an arc from C1 to C2 is added to the inference graph (see Fig. 3 (c)) Case 2 (see Fig. 4 (a, b)): if the association/aggregation has cardinality * C1’s side and cardinality 1 or 0..1 on C2’s side and C1 is also connected to C3 by an association/agregation with cardinality * on C1’side and the cardinality of 1 or 0..1 on C3’s side, then two arcs are added to the inference graph. The first from C2 to C3 and the second from C3 to C2 (see Fig. 3 (c)) Case 3 (see Fig. 5 (a)): if the association has a class C3, then two arcs are added to the inference graph; one from C3 to C1 and another from C3 to C2 (see Fig. 5 (b)). C1
Fig. 3. Inference first case
Fig. 4. Inference second case
Fig. 5. Inference third case
Fig. 6. Representing composition
R3. Each composition (see Fig. 6 (a)) will be represented by an arc from the component to the composite (see Fig. 6 (b)). If in addition the cardinality of the component side is 1 or 0..1 (see Fig. 6 (c)), then a second arc from the composite is added to the component (see Fig. 6 (d)). R 4. Each n-ary association with cardinalities * and 1 or 0..1 will be represented by arcs from classes with cardinalities 1 or 0..1 to those with the cardinality *. For example, the ternary association in Fig. 7 (a) is represented by the graph in Fig. 7 (b).
Fig. 7. Representing n-ary association
R5. If an inheritance relationship exists between two classes C1 (parent) and C2 (child) and if C1 is connected to C3 by an association with a cardinality * on C1’s side and the cardinality on C3’s side is 1 or 0..1 (see Fig. 8 (a)), then an arc is added from C2 to C3 (See Fig. 8 (b)). R6. If an inheritance relationship exists between two classes C1 (parent) and C2 (child) and if C1 is connected to C3 by an association with a cardinality 1 or 0..1 on C1’s side and the cardinality on C3’s side is * (see Fig. 8 (c)), then an arc is added from C3 to C2 (See Fig. 8 (d)).
Fig. 8. Representing an inheritance relationship
The automatic construction of the inference graph does not indicate the type of the inferences, partial or precise. Unfortunately, this indication cannot be deduced automatically. Thus, after constructing the inference graph, the designer must distinguish the partial inferences (drawn as dotted arcs).
In addition, to ensure that all possible inferences have been determined, our approach continues with the automatic calculation of the transitive closure of the graph. On the resulting graph, we distinguish two types of paths:
- Precise path: a path where all connected nodes allow precise inferences.
- Partial path: a path with at least one node that allows partial inferences.
3.3 Enrichment of the DW Schema
The inference graph is used to enrich the DW schema (see Fig. 1). To do so, our approach assumes that the mapping between the elements of the DW schema and the source has already been done. We exploit this mapping to apply the two following enrichment rules:
- For each element of the inference graph belonging to a precise path, annotate its corresponding element in the DW schema with "Precise Inference: ElementName: NameInferredData".
- For each element of the inference graph belonging to a partial path, annotate its corresponding element in the DW schema with "Partial Inference: ElementName: NameInferredData".
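The transitive closure and the precise/partial distinction can be computed with a standard Floyd-Warshall pass over the arcs. The sketch below is our own illustrative encoding (0 = no inference, 1 = partial, 2 = precise), not the authors' implementation; a derived inference is only as strong as the weakest arc on the best path that produces it.

/** Closure of the inference graph: 0 = no inference, 1 = partial, 2 = precise. */
final class InferenceClosure {
    /** Floyd-Warshall on the (max, min) semiring over an adjacency matrix. */
    static int[][] close(int[][] arc) {
        int n = arc.length;
        int[][] c = new int[n][n];
        for (int i = 0; i < n; i++) c[i] = arc[i].clone();
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    int viaK = Math.min(c[i][k], c[k][j]);  // 0 if either hop is missing
                    if (viaK > c[i][j]) c[i][j] = viaK;     // keep the strongest known path
                }
        return c;
    }
}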
4 Example
Fig. 9 contains the class diagram of a fictitious data source in the healthcare domain. Table 1 contains the details of the various classes. Fig. 10 presents a DW schema that analyzes the costs and durations of diagnostics along the analysis axes: Disease, Treatment, Critical Illness, Transfer, Date Time, and Doctor's specialty. The latter takes the value generalist if the doctor who performed the diagnostic is a generalist, and the specialty of the doctor otherwise.
Fig. 9. Class diagram of the data sources
Table 1. Details of Fig. 9 classes
Class              Details
Date and Time      Date and time of a patient admission.
Admission          Patient admission.
Diagnostic         Patient diagnostic.
Doctor             Doctor who made the diagnostic.
Specialty          Specialty of the doctor who made the diagnosis.
Illness            Illness diagnosed.
Treatment          Treatment necessary to cure the disease.
Critical illness   Serious disease requiring the patient's transfer to another hospital where he will receive appropriate care.
Transfer           Association class containing the date of the patient's transfer and the receiving hospital.
Fig. 10. Multidimensional Model
4.1 Inference Graph
The illness of a given patient is sensitive information, since it is part of professional secrecy in medical activities. In our example, we look for the data that allow a patient's illness to be inferred. Fig. 11 contains the inference graph constructed from the cardinalities of the class diagram of the source data. In this graph, gray nodes are the sensitive data; dotted arcs represent the inferences that we considered partial, and those in solid lines are believed to be precise inferences. This graph shows seven partial inferences and one precise inference. The calculation of the transitive closure of the graph has highlighted other potential inferences; for the sake of clarity, we show only the new paths in Fig. 12. These paths are partial because they contain nodes that allow partial inferences. In Table 2, the first column shows some of the new inferences, composed of the paths listed in the second column. From Table 2 and Fig. 12, we can deduce that:
- A user with access to the dates and times of admission and to the transfer data may infer that the diagnosed illness was critical;
Illness
Treatment
Admission
Date and Time
81
Diagnostic
Transfer
Doctor
Specialty
Fig. 11. Initial Inferences graph
Fig. 12. Inference graph after calculating the transitive closure
- Access to the date and time of admission and to the specialty of the doctor who performed the corresponding diagnostic allows the user to infer the type of illness the patient has;
- Access to the treatment received by a patient allows a user to infer the disease the patient has.
Table 2. Partial paths
Inference                     Partial path
Date and Time → Diagnostic    Date and Time → Admission, Admission → Diagnostic
Date and Time → Doctor        Date and Time → Admission, Admission → Diagnostic, Diagnostic → Doctor
Date and Time → Transfer      Date and Time → Admission, Admission → Diagnostic, Diagnostic → Transfer
Date and Time → Illness       Date and Time → Admission, Admission → Diagnostic, Diagnostic → Illness
4.2 DW Schema Enrichment
Based on the inference graph and on the mapping between the data source and the DW schema, we automatically obtain the security annotations for the DW schema elements with potential inferences (see Fig. 13). In this model, the Date and Time dimensions carry the same annotation, to specify that together they can lead to inferences. In Fig. 13, for the sake of clarity, we have not listed all the annotations.
Fig. 13. DW schema annotated with the security information
5 Conclusion
In this paper, we presented an approach to produce a conceptual multidimensional model annotated with information for the prevention of inferences. Our approach has two advantages over existing approaches: the first is its independence from the data domain; the second is the use of the available data to detect inferences. Our approach constructs an inference graph based on the class diagram of the data sources. The class diagram allows us, with the assistance of the domain expert, to identify the
elements that lead to precise and partial inferences. These elements are then annotated in the multidimensional model. Currently, we are studying how to transfer the annotations defined at the design level to the logical level.
References 1. Bhargava, B.K.: Security in data warehousing (Invited talk). In: Kambayashi, Y., Mohania, M., Tjoa, A.M. (eds.) DaWaK 2000. LNCS, vol. 1874, pp. 287–288. Springer, Heidelberg (2000) 2. Pernul, G., Priebe, T.: Towards olap security design - survey and research issues. In: 3rd ACM International Workshop on Data Warehousing and OLAP DOLAP 2000, Washington, DC, Novembre 10, pp. 114–121 (2000) 3. Soler, E., Stefanov, V., Mazón, J.-N., Trujillo, J., Fernández-Medina, E., Piattini, M.: Towards comprehensive requirement analysis for data warehouses: Considering security requirements. In: The Third International Conference on Availability, Reliability and Security ARES 2008, Barcelone, Espagne, pp. 104–111. IEEE Computer Society, Los Alamitos (2008) 4. Soler, E., Villarroel, R., Trujillo, J., Fernández-Medina, E., Piattini, M.: Representing security and audit rules for data warehouses at the logical level by using the common warehouse metamodel. In: The First International Conference on Availability, Reliability and Security ARES 2006, Vienne, Autriche, pp. 914–921. IEEE Computer Society, Los Alamitos (2006) 5. Triki, S., Ben-Abdallah, H., Feki, J., Harbi, N.: Modeling Conflict of Interest in the design of secure data warehouses. In: The International Conference on Knowledge Engineering and Ontology Development 2010, Valencia, Espagne, pp. 445–448 (2010) 6. Carlos, B., Ignacio, G., Eduardo, F.-M., Juan, T., Mario, P.: Towards the Secure Modelling of OLAP Users’ Behaviour. In: The 7th VLDB Conference on Secure Data Management, Singapore, September 17, pp. 101–112. Springer, Heidelberg (2010) 7. Steger, J., Günzel, H.: Identifying Security Holes in OLAP Applications. In: Proc. Fourteenth Annual IFIP WG 11.3 Working Conference on Database Security, Schoorl (near Amsterdam), The Netherlands, August 21-23 (2000) 8. Icon Group Ltd. GFK AG: International Competitive Benchmarks and Financial Gap Analysis (Financial Performance Series). Icon Group International (2000) 9. Villarroel, R., Fernández-Medina, E., Piattini, M., Trujillo, J.: A uml 2.0/ocl extension for designing secure data warehouses. Journal of Research and Practice in Information Technology 38(1), 31–43 (2006) 10. Haibing, L., Yingjiu, L.: Practical Inference Control for Data Cubes. IEEE Transactions on Dependable and Secure Computing 5(2), 87–98 (2008) 11. Cuzzocrea, A.: Privacy Preserving OLAP and OLAP Security. In: Encyclopedia of Data Warehousing and Mining, pp. 1575–1158 (2009) 12. Zhang, N., Zhao, W.: Privacy-Preserving OLAP: An Information-Theoretic Approach. IEEE Transactions on Knowledge and Data Engineering 23(1), 122–138 (2011) 13. Terzi, E., Zhong, Y., Bhargava, B.K., Pankaj, Madria, S.K.: An Algorithm for Building User-Role Profiles in a Trust Environment. In: Kambayashi, Y., Winiwarter, W., Arikawa, M. (eds.) DaWaK 2002. LNCS, vol. 2454, pp. 104–113. Springer, Heidelberg (2002)
14. Bhargava, B.K., Zhong, Y., Lu, Y.: Fraud Formalization and Detection. In: Kambayashi, Y., Mohania, M., Wöß, W. (eds.) DaWaK 2003. LNCS, vol. 2737, pp. 330–339. Springer, Heidelberg (2003) 15. Golfarelli, M., Rizzi, S.: A Methodological Framework for Data Warehouse Design. In: ACM First International Workshop on Data Warehousing and OLAP, DOLAP 1998, Bethesda, Maryland, USA, pp. 3–9 (November 1998) 16. Feki, J., Nabli, A., Ben-Abdallah, H., Gargouri, F.: An Automatic Data Warehouse Conceptual Design Approach. In: Wang, J. (ed.) Encyclopedia of Data Warehousing and Mining, 2nd edn. (2008) 17. Luján-Mora, S., Trujillo, J.: A Comprehensive Method for Data Warehouse Design. In: Fifth International Workshop on Design and Management of Data Warehouses, DMDW 2003, Berlin, Germany (September 2003)
F-RT-ETM: Toward Analysis and Formalizing Real Time Transaction and Data in Real-Time Database
Mourad Kaddes1,2, Majed Abdouli1, Laurent Amanton2, Mouez Ali1, Rafik Bouaziz1, and Bruno Sadeg2
1 Multimedia, Information Systems and Advanced Computing Laboratory, Sfax University, Route de Tunis km 10, PB 242, 3021 Sakeit Ezzeit, Tunisia
{mourad.kaddes,majed.abdouli,mouez.ali,raf.bouaziz}@gmail.com
2 UFR Sciences et Techniques, Université du Havre, 25 rue Philippe Lebon, BP 540, 76058 Le Havre Cedex, France
{laurent.amanton,bruno.sadeg}@univ-lehavre.fr
Abstract. Due to the diversity of extended transaction models, their relative complexity and their lack of formalization, the characterization and comparison of these models is delicate. Moreover, these models capture only a subset of the interactions that can be found in the spectrum of possible interactions. Faced with this situation, the ACTA framework was introduced. Our contribution in this field is twofold: (i) we extend ACTA by adding several dependencies to capture new interactions between transactions in a real-time environment, and we extend ACTA to take into account the temporal characteristics of real-time data items; (ii) we present a meta-model that captures the concepts of an extended real-time transaction model using a UML class diagram, together with its formal description in the Z language. Keywords: Transaction, Real-Time, ACTA, Temporal Data, Data Freshness, Meta-model, Z language.
1 Introduction
In a number of real-time applications, e.g., stock trading and traffic control, real-time database systems (RTDBS) are required to process transactions in a timely fashion using a large number of temporal data, e.g., current stock prices or traffic sensor data, representing the real-world status. RTDBS are best suited for such applications since they handle both large amounts of data and time constraints. In the last two decades, a lot of real-time database research has been carried out, in which different scheduling protocols (EDF, GEDF, MSF) and concurrency control protocols (2PL-HP, OCC) have been proposed. The majority of this research adopts the flat transaction model, which remains the simplest and most common model in both traditional and real-time databases. However, in real-time applications, disconnections, aborts and recovery may lead transactions to no longer respect their deadlines even if
all the transactions are initially scheduled. Thus, new applications would profit advantageously from extended transaction models, which relax the Atomicity, Consistency, Isolation and Durability properties. Various extensions to the traditional model have been proposed, referred to herein as extended transactions, e.g., nested transactions, saga transactions, split/join transactions and adaptable transaction models. This diversity of models, their relative complexity and their lack of formalism encouraged Chrysanthis et al. [1] to define the ACTA framework. ACTA is a tool to specify and reason about the effects of transactions on objects and the interactions between transactions. Specifically, it can be used to specify the properties of atomic and extended transaction models and to synthesize new transaction models. However, it has not been used to specify time-related requirements, which are essential to the specification of real-time databases [2], and there have been few works that formalize the properties of transactions and data in RTDBS. In this paper, based on the ACTA framework, we attempt to overcome this shortcoming: we develop a transaction framework based on ACTA to facilitate the formal description of transaction properties in real-time databases. We present an overview of ACTA and our motivation in Section 2. Section 3 extends the ACTA formalism to take into account temporal characteristics. In Section 4, we propose a meta-model, called RT-ETM (Meta-model for Real-Time Extended Transactions), that captures the concepts of extended real-time transaction models using UML, and we give F-RT-ETM, a formal description of RT-ETM. We conclude our work in Section 5.
2 Overview of ACTA
ACTA is not a new transaction model, but rather a formalism allowing the formal description of the properties of complex transactions. Precisely, using ACTA, we can specify and reason about the effects of transactions on objects and the interactions between transactions in a particular model. They are formulated by: 1. the effects of the transactions on each other (see Fig. 1, continuous lines); 2. the effects of the transactions on the objects which they handle (see Fig. 2, continuous lines).
2.1 Effects of the Transactions on Each Other
The dependencies provide a practical way to simplify and reason about the behaviour of concurrent transactions. In fact, the dependencies describe the effects of transactions on other transactions, and represent constraints on the possible histories. By examining the possible effects of transactions on one another, it is possible to determine the dependencies which can develop between them. The first two, and most important, dependencies presented in [1] are the Commit dependency and the Abort dependency, defined in the following way:
1. Commit Dependency (ti CD tj). If ti and tj both commit, then the commit of tj must precede the commit of ti. 2. Abort Dependency (ti AD tj). If tj aborts, ti has to abort too. Many other dependencies have been added to extend ACTA in order to express more behaviours [3], [4], [5], [6], [7].
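As an illustration only (our own representation, not part of the ACTA notation), the following Python sketch checks these two dependencies against a history of termination events; the event encoding and function names are assumptions made for this example.

def commit_dependency_holds(history, ti, tj):
    # ti CD tj: if both ti and tj commit, the commit of tj must precede
    # the commit of ti. `history` is a list of (transaction_id, event)
    # pairs, event in {"commit", "abort"}, ordered by occurrence time.
    positions = {(t, e): pos for pos, (t, e) in enumerate(history)}
    ci = positions.get((ti, "commit"))
    cj = positions.get((tj, "commit"))
    if ci is None or cj is None:     # one of them did not commit:
        return True                  # the dependency imposes nothing
    return cj < ci

def abort_dependency_holds(history, ti, tj):
    # ti AD tj: if tj aborts, ti has to abort too.
    aborted = {t for t, e in history if e == "abort"}
    return (tj not in aborted) or (ti in aborted)

history = [("t2", "commit"), ("t1", "commit"), ("t3", "abort")]
print(commit_dependency_holds(history, "t1", "t2"))   # True: t2 committed first
print(abort_dependency_holds(history, "t1", "t3"))    # False: t3 aborted, t1 did not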
2.2 Effects of Transactions on Objects
A transaction invokes an operation on an object and modifies its state and its status, which characterize it. The state of an object is represented by its contents. The status of an object is represented by the synchronisation information associated with the object. A transaction's effects on objects are characterized by the set of effects which are visible to it, the set of conflicting operations that it carries out, and the set of effects that it delegates to other transactions. ACTA captures these effects by introducing three sets (ViewSet, AccessSet, ConflictSet) and by using the concept of delegation (cf. Fig. 1). In fact, ACTA allows a finer control over the visibility of objects by associating three entities, namely ViewSet, ConflictSet and AccessSet, with every transaction. By visibility, we mean the ability of one transaction to see the effects of another transaction on objects while they are executed. 1. The ViewSet contains all the objects potentially accessible to the transaction, i.e., the objects on which the transaction can operate. 2. The ConflictSet contains the objects that the transaction wants to access but which are already accessed by incompatible operations. 3. The AccessSet contains all the objects which have already been accessed by the transaction. 4. Delegation: traditionally, committing or aborting an operation is the responsibility of its invoker. However, the invoker and the transaction committing the operation may be different when a transaction delegates its responsibility to another transaction.
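To make these sets concrete, a minimal Python sketch is given below; the object model and method names are our own assumptions and are not part of the ACTA formalism.

class TransactionEffects:
    # Our own object model of the three sets and of delegation.
    def __init__(self, tid, view_set=None):
        self.tid = tid
        self.view_set = set(view_set or [])   # objects visible to the transaction
        self.access_set = set()               # objects already accessed
        self.conflict_set = set()             # objects held by incompatible operations

    def access(self, obj, busy_objects):
        # Record the access, or a conflict if an incompatible operation
        # already holds the object.
        if obj not in self.view_set:
            raise ValueError(f"{obj} is not visible to {self.tid}")
        if obj in busy_objects:
            self.conflict_set.add(obj)
            return False
        self.access_set.add(obj)
        return True

    def delegate(self, other, obj):
        # Transfer the responsibility for obj (committing or aborting its
        # operations) to the transaction 'other'.
        if obj in self.access_set:
            self.access_set.discard(obj)
            other.view_set.add(obj)
            other.access_set.add(obj)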
3 Temporal Characteristics in ACTA
A great deal of ACTA work concerns the formal description of the properties of extended transaction models, such as nested transactions, split/join transactions and transactions in active databases, and the synthesis of extended transaction models [1]. Basically, transaction models in non-real-time databases were the major concern of past work. Over the last two decades, a lot of work has been done in real-time databases that investigates transaction scheduling in real-time database systems with transaction and data timing constraints. Thus, Kaddes et al. [7] have proposed a Real-Time ACTA framework (CRT-ACTA) as
an extension of ACTA to specify the transaction and data timing constraints. Unlike RT-ACTA [2], which is based on the E-C-A (event-condition-action) model and deals only with active databases, CRT-ACTA deals with all RTDBs. In this section, we briefly recall the concept of the effects of transactions, we define new transaction dependencies, and we introduce the new concepts of temporal validity, priority and dependencies on occurrences. Then, we deal with the effects of transactions on data, data time constraints and data freshness.
3.1 Temporal Characteristics of Transactions
The distinction between real-time and non-real-time database systems lies in the timing constraints associated with transactions and data values in a real-time database. In real-time database systems, transactions have to complete before their deadlines, and they have to read data values which are temporally valid, because data values may become stale as time goes by. To reach this goal and to satisfy these constraints, transactions are handled and scheduled by the system according to their priority. In other words, a transaction is characterized not only by its effects on objects and other transactions, as we have seen in the previous section, but also by its temporal validity and priority (cf. Fig. 1).
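As an illustration of these two characteristics (our own representation, not the paper's notation), the sketch below models a real-time transaction by its temporal validity attributes and derives its priority under EDF, one of the scheduling policies mentioned in the introduction.

from dataclasses import dataclass
from typing import Optional

@dataclass
class RTTransaction:
    tid: str
    release: float             # release time
    deadline: float            # absolute deadline
    period: Optional[float]    # None for aperiodic transactions
    ttype: str                 # "hard", "firm" or "soft"

def edf_priority(txn):
    # Earliest Deadline First: the closer the absolute deadline,
    # the higher the priority (smaller value = higher priority here).
    return txn.deadline

ready = [RTTransaction("t1", 0.0, 50.0, 100.0, "firm"),
         RTTransaction("t2", 5.0, 30.0, None, "soft")]
next_txn = min(ready, key=edf_priority)    # t2 is scheduled first under EDF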
Fig. 1. Transaction characteristics and effects
Temporal validity. Temporal validity specifies the validity interval of a transaction, its periodicity and its type: hard, firm or soft. A hard transaction must absolutely meet its deadline; a firm transaction may miss its deadline under transient overload; a soft transaction can continue even if its deadline is missed. Priority. Priority specifies how a priority is assigned to a transaction. Priority is used in most conflict resolution protocols, so the priority assignment policy plays a crucial role. Effects of transactions. The effects of a transaction are divided into three subcategories: on transactions, on occurrences and on objects.
1. On transaction: Recent research on real-time transactions in RTDBS has focused on the idea of using imprecise computation results so that transactions meet their deadlines [6], [8]. In other words, transactions report estimated or approximate results when they cannot complete within their time quotas. Transactions are composed of two types of sub-transactions: optional and required. When the required sub-transactions of a transaction ti are executed, transaction ti pre-commits. Optional sub-transactions may be aborted during execution if they do not have enough time to complete; they strive to improve the result provided to users. Consequently, pre-committed transactions will not abort. Due to the introduction of the pre-commit event, transaction relationships and the behaviour between transactions are modified, so the dependencies described in the preceding section cannot express the new relationships between transactions. To fill these deficiencies, we propose several new dependencies to express these relationships. In former work, if two transactions ti and tj must commit, we can define only one dependency, "ti CD tj". This means that transaction ti cannot commit until tj commits. With such a dependency, we cannot express the relationships between pre-committed transactions. We define the following new dependencies for more precision in the execution of transactions: – Pre-Commit dependency (ti PCD tj). Transaction ti cannot pre-commit until transaction tj pre-commits. – Strict-Pre-Commit dependency (ti SPCD tj). Transaction ti cannot pre-commit until transaction tj commits. – Weak-Commit dependency (ti WCD tj). Transaction ti can commit if tj is already pre-committed. In the same way, we can introduce the pre-commit event in the exclusion dependency. In the literature, the exclusion dependency (ti ED tj) expresses that ti must abort if tj commits. In real-time applications, it is preferable to exclude transaction ti as soon as tj pre-commits, in order to release the system earlier; hence, ti is not forced to wait for the commit of tj. – Pre-Exclusive dependency (ti PED tj). If transaction tj pre-commits, then ti must abort. – Pre-Exclusive Pre-Exc(ti, tj). Two transactions ti and tj are mutually excluded iff each transaction develops a pre-exclusive dependency on the other. 2. On occurrence: The effects of a transaction on occurrences are defined by the dependencies between different occurrences of the transaction and permit to take into account its periodicity. In particular, when a transaction is repetitive, an occurrence of the transaction can depend on the other occurrences; e.g., in the (m,k)-firm model, only m occurrences must commit among k. If m occurrences are already committed, the remaining occurrences can be aborted or not initiated [9], [10]. So, to take into account the behaviour between different occurrences of a transaction, we define these dependencies:
– Conditional Not Admit occurrence on Commit Dependency CNADOCD (tji, k, m, [C]): the occurrence i of transaction tj is not admitted if the k previous occurrences are committed and if condition C is true. – Conditional Aborted Occurrence on Commit Dependency CAOCD (tji, k, m, [C]): the occurrence i of transaction tj is aborted if the k previous occurrences are committed and if condition C is true. In dynamic systems, such as web servers and sensor networks with non-uniform access patterns, the workload of an RTDB cannot be precisely predicted and, hence, the RTDB can become overloaded. As a result, uncontrolled deadline misses may occur during transient overloads. Several works therefore propose [11], [12], [13] not to admit, or to abort, an occurrence under some conditions. This implies modifications of the behaviour of the transactions and of their interrelationships. We can derive new dependencies by applying the concept of conditional dependency to the previous dependencies and relaxing them. A conditional dependency is noted as follows: ti (dependency [C]) tj, where the condition C is an optional part. When the condition is mentioned and satisfied, the dependency must be respected; if the condition is omitted, the dependency must be respected at all times. CNADOCD and CAOCD above are examples of such conditional dependencies (a sketch of this admission rule is given at the end of this subsection). 3. On objects: The effects of transactions on objects are described by the states and status of objects (cf. Section 3.2). Before we close this subsection, we note that in RTDBS we distinguish two classes of transactions: update transactions and user transactions, where update transactions are used to update the values of real-time data in order to reflect the state of the real world. This class can be divided into two sub-categories: sensor transactions, composed of a simple operation, executed periodically, and only writing a real-time data item; and recomputation (or sporadic) transactions, which derive a new data item from basic data items when the latter are updated [11]. User transactions, representing user requests, arrive aperiodically and may read real-time data items, and read or write non-real-time data items. Each type of transaction has its own structure and a different behaviour, e.g., different interactions with the other transactions and objects. For example, each sensor transaction updates its own data items, so no concurrency control is considered for sensor transactions.
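The following sketch illustrates one possible reading of the occurrence admission rule above, in the spirit of the (m,k)-firm model; the exact interpretation of the parameters k, m and the condition C, as well as the function name, are our own assumptions.

def admit_occurrence(previous_outcomes, k, m, condition=True):
    # Do not admit a new occurrence if at least m of the k previous
    # occurrences committed and the optional condition C holds; this is our
    # reading of CNADOCD(t_i^j, k, m, [C]) in the spirit of (m,k)-firm.
    window = previous_outcomes[-k:]
    committed = sum(1 for outcome in window if outcome == "commit")
    if condition and committed >= m:
        return False     # the occurrence may be skipped
    return True          # the occurrence must be admitted

history = ["commit", "abort", "commit"]
print(admit_occurrence(history, k=3, m=2))    # False: enough recent commits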
3.2 Temporal Characteristics of Data
An RTDB is composed of temporal data items and non-temporal data items. Both are considered as passive components. The non-temporal data
items are found in traditional databases and their validity is not affected as time goes by. The temporal data items, which can be sensor data items or derived data items, represent the states of real-time objects which change continuously [14]. These states may become invalid as time goes by. To reflect these changes, the data objects that model the entities need to be updated. Only non-temporal data items are taken into account by ACTA, so we propose to extend ACTA to consider temporal characteristics. We categorize data freshness into "database freshness" and "perceived freshness". 1. Database freshness (DF) describes the state of the data. In addition, it specifies the temporal status and describes the method of freshness [15]. 2. Perceived freshness (PF) is defined for the data accessed by user transactions [15]. Database freshness is described by four components: validity interval, value, method of freshness and impact on other data. 1. Validity interval: leads to the notion of temporal consistency. Temporal consistency has two classes: – Absolute consistency: the consistency between the environment and its reflection in the database. As mentioned earlier, this arises from the need to keep the controlling system's view consistent with the actual state of the environment. – Relative consistency: consistency among the data used to derive other data. An RTDB contains basic data items which record and model a physical real-world environment. These basic data items are summarized and correlated to derive views. When the environment changes, basic data items are updated, and subsequently view recomputations are triggered [16]. To formalize the notion of temporal consistency for continuous objects, we denote a real-time data item by d(value, avi, timestamp), where value denotes the current state of d, timestamp denotes the time when the observation relating to d was made, and avi denotes d's absolute validity interval. For the discrete data model, e.g. stock prices, it is difficult to assign a reasonable avi due to sporadic updates; the data is considered valid until the next sampling. Hence, a data object's value remains unchanged until an update arrives [16]. 2. Value: it defines the contents of the data. A data item can be imprecise and multi-versioned. – Data error: as we have already mentioned, an RTDBS can become overloaded and this overload is unpredictable. Hence, many approaches propose to discard some update transactions during the transient overload to reduce the workload, thereby avoiding the degradation of the system and eventually its crash. The discarding of update transactions implies a certain degree of deviation compared to the real-world value, and thus a decrease in data quality. In order to measure data quality, many works introduce the notion of data error, denoted DE. The data error gives an indication of how much
the current value of a data object di stored in the database deviates from the corresponding real-world value, given by the latest arrived transaction updating di. The upper bound of the error is given by the maximum data error (MDE). If MDE decreases, the number of discarded update transactions also decreases, and both data quality and user transaction quality increase, and vice versa. – Versioning: conflicts between transactions are one of the major factors that cause some transactions to be blocked, or aborted and restarted, which may lead transactions to miss their deadlines, especially in transient overload. To address this problem, many approaches propose multi-version data [11]. An object is defined by a set of values and can be accessed by many read and write operations belonging to different transactions. In this way, we limit data access conflicts between transactions, enhance concurrency and limit the deadline miss ratio. 3. Method of freshness: it is important to guarantee the freshness of the data independently [17]. Hence, many approaches, particularly for derived data items, have been proposed to guarantee data freshness while inducing an acceptable number of triggered recomputation transactions. We can mention delayed forced update, periodic update, on-demand update and deferred update. Similarly, for sensor data, different approaches have been proposed to maintain freshness, e.g., periodic update, on-demand update, deferred update, etc. 4. Impact on other data: defines the set of objects whose values are derived from the current object. A change of its value implies the obsolescence of the related objects and triggers their recomputation. Fig. 2 shows the structural relationship of these components and update transactions.
Fig. 2. Effects of user and update transactions on temporal and non-temporal data
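As a small illustration of the absolute validity interval introduced above (our own code; the data item representation is assumed), a continuous data item d(value, avi, timestamp) can be checked for absolute consistency as follows.

from dataclasses import dataclass

@dataclass
class RTDataItem:
    value: float
    avi: float                # absolute validity interval
    timestamp: float          # time of the observation
    discrete: bool = False    # discrete items (e.g. stock prices) stay valid until resampled

def is_absolutely_consistent(d, now):
    # A continuous data item is fresh while (now - timestamp) <= avi;
    # a discrete item is considered valid until a new sample arrives.
    if d.discrete:
        return True
    return (now - d.timestamp) <= d.avi

temperature = RTDataItem(value=21.5, avi=5.0, timestamp=100.0)
print(is_absolutely_consistent(temperature, now=103.0))   # True: still fresh
print(is_absolutely_consistent(temperature, now=106.0))   # False: stale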
Perceived freshness describes the effect of a user transaction on real-time data. It is represented by the AccessSet, ViewSet, delegation and Conflict User/Update Set. This latter describes the conflicts between user and update transactions. We extend the ViewSet and AccessSet by adding temporal constraints to allow finer control over the visibility of objects.
1. ViewSet: contains all the valid objects potentially accessible to the transaction. It specifies which objects can be operated on by the transaction, i.e., the state of these objects that is visible to the operations invoked by the transaction, and specifies temporal data constraints, i.e., only fresh data are allowed to be viewed, unless this constraint is relaxed. 2. AccessSet: contains all the objects already accessed by a transaction. The AccessSet is extended to specify the temporal access policy, e.g., a transaction must be temporally correct or relaxed: absolute consistency and/or relative consistency and/or timely commit. 3. Conflict User/Update Set: contains the objects that a user transaction wants to access but which are already locked by update transactions, and vice versa. Fig. 2 shows the structural relationship of the elements used by the perceived freshness of a user transaction with update transactions.
4 Formal Definition of RT-ETM
In this section, we propose a formal definition of real-time extended transactions. The class diagram presented in Fig. 3 shows a meta-model of real-time transactions, called RT-ETM. It represents the different concepts and their interrelationships defined previously; e.g., the class Transaction is specialized into two classes, "UpdateTransaction" and "UserTransaction". A UserTransaction can be composed of other UserTransactions, named sub-transactions. To describe that a sub-transaction can leave the scope of its upper transaction, we use an aggregation between the upper and the sub-transaction rather than a composition relation. The specialization of RT-ETM allows the description of different transaction models and the definition of new transaction models. Based on some rules in [18], [19], we present a formal definition of RT-ETM using the Z language. Due to space limitations, we show the formal definition of only a subset of the classes, relations and operations. [VALUE, TYPE SET, PRIORITY, PERIODICITY, DURATION, STATE TRANS, TYPE OPERATION, TYPE TRANS, FRESHNESS TYPE, STATE T DATA]
Fig. 3. The RT-ETM meta-model of real-time extended transactions (UML class diagram)
Example of methods
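As an informal illustration of the RT-ETM meta-model (our own transcription in code, not the authors' Z specification), the specialization of Transaction into UpdateTransaction and UserTransaction, and the aggregation of sub-transactions, can be sketched as follows.

class Transaction:
    # Root of the specialization hierarchy of the meta-model.
    def __init__(self, tid, deadline, priority):
        self.tid = tid
        self.deadline = deadline
        self.priority = priority

class UpdateTransaction(Transaction):
    # Sensor or recomputation transaction refreshing a real-time data item.
    def __init__(self, tid, deadline, priority, period, updated_item):
        super().__init__(tid, deadline, priority)
        self.period = period
        self.updated_item = updated_item

class UserTransaction(Transaction):
    # Aperiodic user request; sub-transactions are aggregated (referenced),
    # not composed, so a sub-transaction can leave its parent's scope.
    def __init__(self, tid, deadline, priority):
        super().__init__(tid, deadline, priority)
        self.sub_transactions = []

    def add_sub(self, sub):
        self.sub_transactions.append(sub)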
5 Conclusion
In this paper, we have presented the components of ACTA and then proposed several extensions to facilitate: (i) the formal description of the properties of transactions in real-time databases and reasoning about transaction interactions and effects on objects; for this purpose, we have added many new dependencies to capture more interactions between transactions in a real-time environment and we have introduced concepts such as perceived freshness and the Conflict User/Update Set to formalize and control the effect of user transactions on data; (ii) the formal description of the properties of temporal data; for this, we have defined many new concepts to permit a finer control of real-time data items. We have presented a meta-model of extended real-time transaction models and we have shown a part of its description using the Z language.
References 1. Chrysanthis, P.K., Ramamritham, K.: Synthesis of Extended Transaction Models Using ACTA. ACM Trans. Database Syst. 19(3), 450–490 (1994)
2. Xiong, M., Ramamritham, K.: Towards the Specification and Analysis of Transactions in Real-Time Active Databases. In: RTDB 1997, Burlington, Vermont, USA, pp. 327–348 (1997) 3. Schwarz, K., Türker, C., Saake, G.: Analyzing and Formalizing Dependencies in Generalized Transaction Structures. In: Proc. of Int. Workshop on Issues and Applications of Database Technology, Berlin, Germany, July 6-9 (1998) 4. Schwarz, K., Türker, C., Saake, G.: Transitive Dependencies in Transaction Closures. In: Database Engineering and Applications Symposium, Cardiff, Wales, UK, July 8-10 (1998) 5. Schwarz, K., Türker, C., Saake, G.: Extending Transaction Closures by N-ary Termination Dependencies. In: Symposium on Advances in Databases and Information Systems (ADBIS 1998), Poznan, Poland, September 8-11 (1998) 6. Abdouli, M.: Study of Extended Transaction Model Adaptation to Real-time DBMS. PhD Thesis, Le Havre University, France (2006) (in French) 7. Kaddes, M., Abdouli, M., Bouaziz, R.: Adding New Dependencies to the ACTA Framework. In: Proceedings of the 22nd European Simulation and Modelling Conference (ESM 2008), Le Havre, France, October 27-29 (2008) 8. Haubert, J., Sadeg, B., Amanton, L.: (m,k)-firm Real-Time Distributed Transactions. In: Proc. of the 16th WIP Euromicro Conference on Real-Time Systems, ECRTS (2004) 9. Hamdaoui, M., Ramanathan, P.: A Dynamic Priority Assignment Technique for Streams with (m,k)-Firm Deadlines. IEEE Transactions on Computers 44(4), 1325–1337 (1995) 10. Koren, G., Shasha, D.: Skip-over: Algorithms and Complexity for Overloaded Systems that Allow Skips. In: Real-Time Systems Symposium, pp. 110–117 (1995) 11. Bouazizi, E., Duvallet, C., Sadeg, B.: Multi-Versions Data for Improvement of QoS in RTDBS. In: Proceedings of the 11th IEEE International Conference on Real-Time and Embedded Computing Systems and Applications (IEEE RTCSA 2005), Hong Kong, China, August 17-19, pp. 293–296 (2005) 12. Amirijoo, M., Hansson, J., Son, S.H.: Specification and Management of QoS in Real-Time Databases Supporting Imprecise Computations. IEEE Transactions on Computers 55(3) (March 2006) 13. Amirijoo, M., Hansson, J., Son, S.H.: Specification and Management of QoS in Imprecise Real-Time Databases. In: 15th Euromicro Conference on Real-Time Systems (ECRTS 2003), p. 63 (2003) 14. Purimetla, B., Sivasankaran, R.M., Ramamritham, K., Stankovic, J.A.: Real-time Databases: Issues and Applications. In: Hill, P. (ed.) Principles of RTS (1994) 15. Kang, K., Son, S.H., Stankovic, J.A., Abdelzaher, T.F.: A QoS-Sensitive Approach for Timeliness and Freshness Guarantees in Real-Time Databases. In: The 14th Euromicro Conference on Real-Time Systems (2002) 16. Kao, B., Lam, K., Adelberg, B., Cheng, R., Lee, T.: Maintaining Temporal Consistency of Discrete Objects in Soft Real-Time Database Systems. IEEE Trans. Comput. 52, 373–389 (2003) 17. Xiong, M., Ramamritham, K.: On Earliest Deadline First Scheduling for Temporal Consistency Maintenance. Real-Time Systems 40(2), 208–237 (2008) 18. Dupuy-Chessa, S., du Bousquet, L.: Validation of UML Models Thanks to Z and Lustre. In: Oliveira, J.N., Zave, P. (eds.) FME 2001. LNCS, vol. 2021, pp. 242–258. Springer, Heidelberg (2001) 19. Ali, M.: Vérification et Validation Formelles de Modèles UML: Approches et Outils. Éditions universitaires européennes (December 2010) ISBN-10: 6131551359
Characterization of OLTP I/O Workloads for Dimensioning Embedded Write Cache for Flash Memories: A Case Study
Jalil Boukhobza, Ilyes Khetib, and Pierre Olivier
Université Européenne de Bretagne, France
Université de Brest; CNRS, UMR 3192 Lab-STICC, 20 avenue Le Gorgeu, 29285 Brest cedex 3, France
[email protected], {ilyes.khetib,pierre.olivier}@etudiant.univ-brest.fr
Abstract. More and more enterprise server storage systems are migrating toward flash-based drives (Solid State Drives) thanks to their attractive characteristics. They are lightweight, power efficient and supposed to outperform traditional disks. The two main constraints of flash memories are: 1) the limited number of achievable write operations, beyond which a given cell can no longer retain data, and 2) the erase-before-write rule, which decreases write performance. A RAM cache can help to reduce these problems; such caches are mainly used to increase performance and lifetime by absorbing flash write operations. RAM caches being very costly, their dimensioning is critical. In this paper, we explore some OLTP I/O workload characteristics with regard to the structure and configuration of flash memory cache systems. We try, through I/O workload analysis, to reveal some important elements to take into account to allow a good dimensioning of those embedded caches. Keywords: OLTP workloads, NAND flash memory, cache, performance, SSD, storage systems.
1 Introduction NAND flash memories are more and more used as main storage systems. We can find them in a huge set of electronic appliances: cameras, mp3 players, smartphones, etc. Beyond mobile applications, flash memory is considered as a replacement technology for traditional disk systems. For instance, Google announced their intention to migrate toward flash-based systems [1]. Another example is the presence on the market of laptop computers (from market giants like Dell, Sony and Samsung) entirely based on Solid State Drives (SSD). The continuous fall of the price per byte and the compatibility with traditional disk systems encourage datacenter administrators to seriously consider flash drives for future evolutions. They are mainly attracted by faster read performance, greater power efficiency and lower cooling costs [9]. When switching toward flash-based storage systems, enterprises are faced with a major problem related to the poor performance of flash [2]: some disks still outperform flash memories for write-intensive workloads, both sequential and
random ones. Unlike disk storage systems, flash memories show asymmetric read/write performance. This is mainly due to two factors: 1) it is not possible to directly achieve in-place data modification, because each write must be preceded by an erase operation that is very time consuming; 2) a given flash memory block can only be erased a limited number of times, after which data cannot be written anymore. This memory cell wear-out issue led researchers to find solutions to level the wear out over the whole memory area. Those algorithms are integrated into the Flash Translation Layer (FTL) and contribute to reducing the write operation performance. In order to improve flash memory write performance in SSDs, cache mechanisms have been designed on top of FTLs [3-9]. The purpose of such buffers is to reveal sequential patterns from a bunch of random requests. Those buffers also absorb write operations, thereby reducing the number of block erasures. In our previous work, we proposed such a cache structure, named C-lash [10][23]. In addition to the mentioned advantages, C-lash uses smaller RAMs and operates without an FTL underneath, thus saving energy and money without compromising on performance. Previous cache studies define one or two regions in a RAM having different operation granularities (see the related work section), but their objective is the same. This RAM being very expensive, we think that one should consider the dimensioning problem of such caches as a crucial step toward the design of an efficient and cost-effective flash-based storage system. This dimensioning cannot be achieved without an accurate characterization of the applied I/O workload [11][12][21]. We present, in this paper, a case study of dimensioning such caches through the C-lash example [10][23]. The considered I/O workloads are made available by the Storage Performance Council [14], and represent traces from online transaction processing (OLTP) applications running at two large financial institutions [13]. The aim of the study, beyond the C-lash-specific example for the studied workloads, is to analyze and discuss the interactions between the cache metrics, in terms of sizes and performance, and the I/O workload parameters, in terms of sequentiality, spatial and temporal localities, inter-arrival times, etc., and to try to reveal some important metrics to take into consideration for the configuration of flash cache systems. In fact, a small, well-dimensioned cache is much better in terms of performance and cost. The paper is organized as follows: Section 2 gives some background on flash memories and related work. Section 3 details the structure and algorithms of the studied cache. Section 4 discusses the I/O workload issues according to the cache. Finally, Section 5 summarizes the contribution and gives future perspectives.
2 Background and Related Work 2.1 Flash Storage System Background Flash memories are nonvolatile EEPROMs. They are mainly of two types, NOR and NAND, named after the logic gates used as the basic structure for their fabrication. NOR flash memories support random byte access and have a lower density and a higher cost. NAND flash memories are, by contrast, block addressed, but offer a higher bit density and a lower cost and provide good performance for large
read/write operations. Those properties make them more suitable for storing data [16]. Our study only concerns NAND flash memories. A flash memory is composed of one or more chips; each chip is divided into multiple planes. A plane is composed of a fixed number of blocks; each of them encloses a fixed number of pages that is a multiple of 32. Current versions of flash memories have blocks of between 128 and 1024 KB (with pages of 2, 4 or 8 KB). A page consists of a user data space and a small metadata area [17][2]. Three key operations are possible on flash memories: read, write and erase. Read and write operations are performed on pages, while erase operations are performed on blocks. NAND flash does not support in-place data modification. So, in order to modify a page, the entire block containing the page must be erased and the page rewritten in the same location or completely copied to another page. The fundamental constraint of flash memories is the limited number of write/erase cycles, which varies between 10^4 and 10^5-10^6 [18]. After the maximum number of erase cycles is reached, a given memory cell can no longer retain data. Due to data locality, some blocks are much more used than others. They consequently tend to wear out more quickly. This very central problem pushed researchers to seriously consider wear-leveling techniques to even out the erasures over the whole area. If the system has to modify a previously written page, it looks for a clean page to write the data while invalidating the previous version. If there are no clean pages, a garbage collector is run to recycle invalidated pages. Both wear-leveling and garbage collection algorithms are implemented through the Flash Translation Layer (FTL). 2.2 Cache for Flash Based Storage Systems Even though the designed FTL techniques are more and more efficient (and complex), the performance of write operations is still very poor. Buffering systems have been designed to cope with this issue by reorganizing non-sequential request streams before sending them to the FTL. Those buffers are generally placed on top of FTLs. The idea behind the Flash Aware Buffer (FAB) [3] is to keep a given number of pages in the buffer and flush the pages that belong to the fullest block. The Cold and Largest Cluster policy (CLC) [17] system implements two clusters of pages in the cache, a size-independent and a size-dependent one based on LRU. Block Padding Least Recently Used (BPLRU) is a write buffer [4] that uses three main techniques: block-level LRU, page padding and LRU compensation. Block-Page Adaptive Cache (BPAC) [7] partitions the cache into two parts: an LRU page list used to store data with high temporal locality, and a list of blocks divided into two data clusters organized by recency and size. LB-Clock [9] adds a block space utilization parameter to decide on the block to flush, while PUD-LRU [8] mixes frequency and recency to decide which block to evict. Our cache system (C-lash) is different from those state-of-the-art buffers as it does not rely upon a given FTL. 2.3 I/O Workload Studies To achieve a successful storage system deployment, it is necessary to include accurate I/O workload characteristics in the design process [11]. In fact, if well integrated, I/O
workload knowledge can provide designers with clues on implementation tradeoffs for file systems, buffers and caches, storage drivers and all the components involved in the storage subsystem [12]. There have been research studies to optimize database layout for flash storage [19], to design heterogeneous SSD storage systems to fulfill different I/O workload needs [20], or to integrate different NVRAMs (Non-Volatile RAM) in the storage system through the development of a solver that determines the least-cost storage configuration [21]. The work presented in this paper is different from, and complementary to, the previous ones. We propose a qualitative and quantitative study of the cache for flash systems with regard to different I/O characteristics of two well-known OLTP I/O workloads.
3 The C-lash Cache for Flash
For the sake of this study, we chose to use the cache system described in our previous work [10]; we find it to be representative of current buffering trends for flash memories. We roughly summarize in this section the cache structure and algorithms; for more details, the reader can refer to [10][23].
Fig. 1. Structure of the C-lash cache for flash system
In C-lash (see Fig. 1), the cache space is partitioned into two areas, a page space (p-space) and a block space (b-space). P-space consists of a set of pages that can come from different blocks, while b-space is composed of blocks that are directly mapped. C-lash is also hierarchical: it has two levels of eviction policies, one which evicts pages from the p-space to the b-space (G in Fig. 1) and another level in which blocks from the b-space are evicted into the flash media (I in Fig. 1). With this scheme, we ensure that the system always flushes blocks rather than pages to the flash memory. The p-space and b-space regions always contain either valid or free pages and blocks, respectively. When a read request arrives, data are searched for in both spaces. If we get a cache hit, data are read from the cache (B or E in Fig. 1); otherwise, they are read from the flash memory (A in Fig. 1). A read miss does not generate a copy into the cache. When a write operation is issued, in case of a cache hit, data are overwritten (C or D in Fig. 1) with no impact on the flash media. In case of a cache miss, data can only be written in the p-space (C in Fig. 1). If enough pages are available, we directly write the data. If not, we choose some pages to evict from the p-space to the b-space (G in Fig. 1) and copy the new data. Two eviction policies are implemented, one for each space (G and I in Fig. 1):
− P-space eviction occurs when C-lash needs to free some pages for new data. P-space eviction is only performed into the b-space and not into the flash media. The pages to evict are those forming the largest group of pages belonging to the same flash block. With this policy, we ensure that blocks with a large number of valid pages are flushed.
− The b-space eviction policy is LRU based. It evicts a block into the flash media. In case the flash still contains some
valid pages of the block to evict from the b-space, a merge operation (J in Fig. 1) is performed. This merge operation consists in reading the still-valid pages in the flash memory before flushing the whole block.
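A minimal sketch of the two eviction levels described above is given below; this is our own simplification of C-lash (in-memory dictionaries instead of raw flash pages, and an assumed block geometry), not the actual implementation.

from collections import defaultdict, OrderedDict

PAGES_PER_BLOCK = 64     # 128 KB blocks of 2 KB pages, as in Sect. 4.1

class CLashSketch:
    def __init__(self, p_capacity, b_capacity):
        self.p_space = {}                 # page_id -> data (pages from any block)
        self.b_space = OrderedDict()      # block_id -> {page_id: data}, LRU order
        self.p_capacity = p_capacity
        self.b_capacity = b_capacity

    def write(self, page_id, data, flash):
        # Writes always land in the p-space; evict pages first if it is full.
        if page_id not in self.p_space and len(self.p_space) >= self.p_capacity:
            self._evict_pages(flash)
        self.p_space[page_id] = data

    def _evict_pages(self, flash):
        # P-space eviction: move the largest group of cached pages belonging
        # to the same flash block into the b-space (never directly to flash).
        groups = defaultdict(list)
        for pid in self.p_space:
            groups[pid // PAGES_PER_BLOCK].append(pid)
        victim = max(groups, key=lambda b: len(groups[b]))
        if victim not in self.b_space and len(self.b_space) >= self.b_capacity:
            self._evict_block(flash)
        block = self.b_space.setdefault(victim, {})
        for pid in groups[victim]:
            block[pid] = self.p_space.pop(pid)
        self.b_space.move_to_end(victim)  # the block becomes most recently used

    def _evict_block(self, flash):
        # B-space eviction: the LRU block is flushed; still-valid pages left
        # in flash are read back first (merge), then the block is rewritten.
        block_id, pages = self.b_space.popitem(last=False)
        flash[block_id] = {**flash.get(block_id, {}), **pages}

flash_media = {}
cache = CLashSketch(p_capacity=128, b_capacity=6)   # e.g. 256 KB + 768 KB of RAM
cache.write(page_id=3, data=b"...", flash=flash_media)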
4 Workload Characterization for Cache Dimensioning For this study, we have chosen some widely used state-of-the-art I/O traces made available by the Storage Performance Council (SPC) [14] and hosted on the UMass Storage Repository [13]. The workloads describe traces of OLTP applications extracted from two large financial institutions. Those workloads are widely used for storage performance evaluation [2][7][8][9]. Global workload characteristics can be found in Table 1. Financial 1 includes an array of 24 disk drives while Financial 2 contains 19. As we can observe, Financial 2 is less write intensive and presents a smaller sequentiality rate. The information given in this table gives an approximate idea of the workload shape but is far from being satisfactory. The disparity between devices from the same workload is huge (see the min/max values), and so one has to dig a bit deeper into the details to understand the behavior. As stated in [12], means and variances are insufficient to characterize a workload, while histograms can provide good representations. Throughout the following sections, we try to find pertinent data representations according to the different chosen metrics.
Table 1. Global Financial 1 and Financial 2 I/O trace characteristics (mean, (min, max) over the devices)
Workload               | Write rate     | Sequential rate | Mean req. size (KB) | Inter-arrival times (ms) | Trace time (hr) | Request number
Financial 1 (24 disks) | 76% (4%, 100%) | 22% (0%, 99%)   | 3 (0.5, 3148)       | 183.55 (0, 34604979)     | ~12             | 5323018
Financial 2 (19 disks) | 17% (4%, 97%)  | 9% (3%, 43%)    | 2 (0.5, 256)        | 197.6 (0, 3125627)       | ~12             | 3697209
4.1 Storage Workload Metrics and Simulation Environment Throughout this paper, we try to analyze a subset of standard I/O metrics for the financial traces with regard to the cache structure. The main studied metrics are sequentiality, spatial and temporal locality, inter-arrival times and request sizes. Since we study a write cache, we will not detail all that is related to reads, even though C-lash also contributes to reducing their latencies [10]. We enriched the FlashSim module [5] integrated into DiskSim [6] with the capacity to simulate C-lash structures and algorithms. We simulated a flash memory with 2KB pages and 128KB blocks. The three operations have the following delays: 130.9µs for a page read, 405.9µs for a page write, and 2ms for a block erase [15]. From the cache point of view, as we have seen earlier, C-lash divides its space into a page region and a block region. Throughout the experiments, we will show that, depending on the workload characteristics, it can be more relevant to increase one of the two spaces, and we will try to identify up to what threshold it is cost effective.
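Using the timing parameters above, a back-of-the-envelope helper (our own, for illustration only) shows why flushing mostly-full blocks is cheaper per written page than flushing sparse blocks that require a merge.

T_READ, T_WRITE, T_ERASE = 130.9e-6, 405.9e-6, 2e-3     # seconds
PAGES_PER_BLOCK = (128 * 1024) // (2 * 1024)            # 64 pages per block

def block_flush_cost(dirty_pages, valid_pages_in_flash):
    # Evicting a b-space block: read back the still-valid flash pages (merge),
    # erase the block, then rewrite all pages of the block.
    merge = valid_pages_in_flash * T_READ
    rewrite = (dirty_pages + valid_pages_in_flash) * T_WRITE
    return merge + T_ERASE + rewrite

# A full, sequentially-filled block is cheaper per new page than a sparse one.
print(block_flush_cost(64, 0))    # ~0.028 s for 64 new pages
print(block_flush_cost(8, 56))    # ~0.035 s for only 8 new pages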
For many graphical representations, we still use the sector as the lowest data granularity. Even though in flash memory based storage systems the smallest unit is the page, we deliberately kept the sector representation because the traces were extracted from disk systems and migrating toward a page representation would alter the I/O workload in some cases. 4.2 Sequentiality of I/O Workloads Sequentiality is just a special case of spatial locality: requests R1 and R2 are sequential if they are strictly contiguous, i.e., the end address of R1 (+1) corresponds to the start address of R2. The sequentiality rate is one of the most characterized metrics for I/O workloads; it gives indications on the pertinence of the use of a caching system. For flash memories, sequentiality is extremely important as it contributes to minimizing erase operations by grouping write requests by blocks, and so optimizing both lifetime and I/O performance by reducing the number of erasures.
Fig. 2. Page and request size sequentiality rate for 10 disks of the Financial 1 workload
We think that measuring the sequentiality only from the request address point of view, as is done by many studies, can be misleading. In fact, external (address) sequentiality is only part of the global sequentiality, which must also take into consideration the internal sequentiality, that is, the request size: the larger the request size, the greater the internal sequentiality. Since data are to be stored in a given media, sequentiality must take into account the nature of the media itself. For the sake of this study, we used a flash memory with a 2KB page size. We think that the sequentiality rate must be aligned to this parameter. As we can observe in Fig. 2, representing both sequentiality rates in the same graphic for 10 devices of Financial 1, for the majority of devices the page sequentiality is very high and the gap between the request and page size sequentiality is significant due to large request sizes. The average sequentiality for Financial 1 and 2 goes respectively from (22%, 9%) to (57%, 41%), which must be considered differently. Another point to consider is that concurrent streams complicate the extraction of sequentiality. By buffering some data requests in the cache, C-lash allows this sequentiality to be highlighted. Fig. 3 shows on the left-hand side the start address of accessed data over time, and on the right-hand side the mean response time corresponding to the workload with different cache configurations. In the left-hand side graphic, the continuous lines represent sequential accesses; perfect horizontal lines represent requests with high temporal locality. The slopes of the continuous lines (if the access is perfectly sequential) are relative to the request size: the steeper the slope, the higher the request size. We can observe that there are approximately six sequential streams. The right-hand side graphic shows the response time according to the total cache size. We vary both the b-space size and the p-space size by 128KB steps (1 block for the b-space and 64 pages for the p-space). We clearly observe that we benefit more
from the cache size by increasing the block number to 6 blocks and 128 pages (6*128+128*2=1024KB). Increasing the p-space by the same sizes is not relevant in this case because of the sequentiality rate.
Fig. 3. A case of sequentiality on disk 2
4.3 I/O Request Size Fig. 4 shows the request size distribution for the Financial 1 and 2 workloads for both read and write operations and for all devices. The larger the request size, the greater the sequentiality. We clearly see that the global granularity of the request sizes is the sector. Most request sizes for Financial 1 range from 1 to 20 sectors (0.5 to 10 KB), while for Financial 2 they range from 1 to 6 sectors (0.5 to 3 KB). We can, however, observe that write request sizes are bigger than read request sizes. Once again, there is a big disparity of request sizes according to the chosen device, although all of them contain a bunch of small request size I/Os. We think that request sizes alone do not give a complete idea of workload characteristics. However, coupled with sequentiality (see the preceding section) or with temporal and spatial locality (see the next sections), they can give good indications. 4.4 Inter-Arrival Times Inter-arrival times are the time difference between two successive I/O requests on a given drive. This measure helps to understand the load applied on a storage system. Fig. 5 shows the distribution of the inter-arrival times for both Financial 1 and 2; the x-axis follows a logarithmic scale. For those figures, we summed the inter-arrival times of all the devices of the workloads. We also show the occurrence frequencies of inter-arrival times on the right axis (green curves). While showing a steeper curve, the Financial 1 workload seems to be less stressed; 70% of the inter-arrival times are less than 10ms, while for Financial 2, 80% of the inter-arrival times are less than 10ms. For both workloads, more than 95% of the requests arrive within one second. However, we observe that for both workloads there exist some I/O bursts. Indeed, we observe that for both workloads 22% of the inter-arrival times are less than or equal to 1ms.
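Tying together the page-aligned sequentiality metric of Sect. 4.2 and the request sizes of Sect. 4.3, the sketch below computes a page-aligned sequentiality rate from a trace; the trace format, page geometry and the exact alignment rule are our own assumptions.

PAGE_SECTORS = 4    # 2 KB pages of 512-byte sectors (assumption)

def page_sequentiality_rate(trace):
    # trace: list of (start_sector, size_in_sectors) write requests in arrival
    # order. A transition is counted as page-sequential when the next request
    # starts in the page where the previous one ended, or in the page after it.
    sequential = 0
    for (s1, n1), (s2, _) in zip(trace, trace[1:]):
        end_page = (s1 + n1 - 1) // PAGE_SECTORS
        if s2 // PAGE_SECTORS in (end_page, end_page + 1):
            sequential += 1
    return sequential / max(len(trace) - 1, 1)

trace = [(0, 8), (8, 4), (100, 2), (104, 2)]
print(page_sequentiality_rate(trace))    # 2 of the 3 transitions are page-sequential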
Fig. 4. Request sizes distribution for Financial 1 (left) and Financial 2 (right)
Fig. 5. Inter-arrival times distribution for Financial 1 (left) and Financial 2 (right)
There are many ways to reduce the effect of I/O bursts: 1) choosing flash memories that guarantee a good enough response time to cope with the bursts, so that the queuing system at the storage device stays as empty as possible; 2) implementing better algorithms at the queue level to make the requests waiting to be served as sequential as possible; 3) finally, acting at the cache level to absorb as many I/O requests of the burst as possible into the cache through a good management of the spatial and temporal locality. We focus, in this paper, on the last solution, which is discussed in the following sections. 4.5 Temporal Locality Consideration To represent the temporal locality, we have modified the method used in [22]. This method consists in defining a write reuse as an I/O write request (W2) to the same block address as a previous write request (W1). [22] defines the reuse distance as the time from issuing W1 to issuing W2. Instead of considering the time metric to define the reuse duration, we found it more significant to consider the addressed I/O workload space (the sum of all request sizes issued) during that time period, allowing us
to observe the temporal locality. This is done with the idea of making it possible to dimension the cache to take benefit of the temporal locality. Fig. 6 shows the cumulative distributions of write reuses (as compared to the total number of writes in the workload) for different address space windows. For instance, we can see that for the Financial 1 workload (averaged over 24 devices), for a window of 128KB (256 sectors), approximately 20% of the total write operations are reused, while for Financial 2 (averaged over 19 devices) less than 7% are reused.
Fig. 6. Temporal locality: average for Financial 1 and 2 and min. and max. for each
Those average results hide different behaviors depending on the devices: some show no temporal locality (no reuse, e.g. device 0 from Financial 1 and 2), while others can show very high values (e.g. device 17 of Financial 1 and device 15 of Financial 2). In the latter case, it would be interesting to know how to utilize the cache to satisfy this locality. In fact, temporal locality can be high for two main reasons: 1) the same block is reused many times; 2) many blocks are reused a few times. For example, reusing one data request a hundred times gives the same temporal locality as reusing ten data requests ten times each. In the left graph of Fig. 7, the sector addresses of the requests for device 17 of Financial 1 are shown; we can observe that the requested data fall in a limited pool of addresses (horizontal lines mean temporal locality). In the right graph of Fig. 7, corresponding to device 15 of Financial 2, we can see that there is a given number of horizontal lines, meaning that many data blocks are reused (this can be quantified from the trace). For the example above, it is only relevant to increase the cache size for device 15 of Financial 2, since for the first one only the same references are reused many times (increasing the cache size is useless).
Fig. 7. Addressed sectors in time for both disk 17 of Financial 1 and disk 15 of Financial 2
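The write-reuse metric of Sect. 4.5 can be sketched as follows; this is our own implementation of the idea (reuse measured against the volume of I/O issued since the previous write to the same block), with an assumed trace format.

def write_reuse_ratio(trace, window_bytes, block_size=128 * 1024):
    # trace: list of (start_byte, size_bytes, is_write) requests in arrival
    # order. A write is a reuse if its block was already written while less
    # than window_bytes of I/O traffic has been issued since that write.
    issued = 0                # total bytes issued so far
    last_write = {}           # block -> value of `issued` at the last write
    writes = reuses = 0
    for start, size, is_write in trace:
        if is_write:
            writes += 1
            block = start // block_size
            if block in last_write and issued - last_write[block] <= window_bytes:
                reuses += 1
            last_write[block] = issued
        issued += size
    return reuses / max(writes, 1)

trace = [(0, 4096, True), (4096, 4096, False), (512, 4096, True)]
print(write_reuse_ratio(trace, window_bytes=128 * 1024))    # 0.5: second write reuses block 0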
From the preceding example, one can infer that temporal locality alone is not sufficient for a good cache dimensioning. Consequently, to make temporal locality a good metric for cache dimensioning, one must consider two other parameters: 1) the total size of the reused data (without considering duplicates), in order to know how much space we should add to the cache, and 2) the sequentiality rate of the reused data: this metric indicates whether the cache to add should be in the p-space, in case of a low sequentiality rate, or in the b-space, in case of a high sequentiality rate. 4.6 Spatial Locality Consideration As for temporal locality, we modified the technique used in [22]. The aim of spatial locality is to detect I/O requests accessing neighboring data. We define the neighboring metric by identifying a spatial distance and conceptually partitioning the storage space into chunks of the given distance. We then browse the whole workload and count the number of writes to each chunk (a request is counted if its start address falls into the chunk). We only consider chunks that have been written at least once. We modified the metric used in [22] to take into account the cache and the flash memory structure by considering chunks whose sizes are multiples of the flash block size. This allows us to visualize how we would benefit from varying the cache size. The upper curves of Fig. 8 show the cumulative probability distribution of the number of write operations to a unit/chunk of a given size in terms of flash blocks. For example, taking the Financial 1 graph, if we consider chunks of 10 blocks (1280KB),
Fig. 8. CDF of the spatial locality metric for different unit sizes. The upper 2 curves represent spatial localities relative to request sizes while the lower ones are relative to flash page size.
80% of the chunks are accessed between 1 and 2000 times, and 15% (between 80% and 95%) of the chunks are accessed between 2000 and 4000 times. Note that a steep curve means a very poor spatial locality, which is the case for the Financial 2 curves. Once again, we observed large disparities between the different devices of the Financial 1 and Financial 2 workloads: some showed a very smooth curve (significant spatial locality) while others were very steep. Spatial locality alone gives a weak idea of the per-device observed locality. It gives the reference count in a given unit of data with no information on the sizes of those references. Spatial locality must be coupled with the request sizes and the device characteristics to extract relevant information for cache dimensioning. In the lower curves of Fig. 8, we show the same curves as the upper ones, but we have taken into account the request sizes and we counted the accesses to the flash pages for each reference. For instance, if we have a write operation of 32KB, we increment our counter by 32KB divided by the page size (2KB). This gives us a more precise idea of the spatial locality seen by the device. Taking the example seen earlier, for the Financial 1 graph, if we consider chunks of 10 blocks, 50% of the chunks have more (or fewer) than 2000 references to their pages and 20% have more than 4000 references. We can also observe that the curve of Financial 2 is still steep.
Fig. 9. Spatial locality for disk 10 of Financial 1 workload with the sectors requested over time
We can observe in Fig. 9 that the spatial locality increases with the chunk size. We can notice that for 10-block chunks, 30% of the units (from 70% to 100% in the curve) contain more than 2500 references. By looking at the addressed sector curves, the spatial locality is roughly confirmed through the visible thick lines. Fig. 10 shows the simulations with different cache configurations (different values for the b-space and the p-space). We can observe that when passing from 1 block in the b-space (3.9ms) to 10 blocks (1.5ms)
Fig. 10. Average response times according to cache size variation for disk 10 of Financial 1
the mean response time is reduced by 68%. We can also notice that, for this particular case, increasing either the b-space or the p-space leads to a comparable performance improvement. Choosing which space to increase depends highly on the nature of the spatial locality: if the references to each unit are sparse, it is more relevant to increase the b-space, but if they are localized (streams), the p-space can be increased. From Fig. 10, we can say that the references are moderately localized. In the preceding example we varied the cache size, but one can also choose a flash memory with larger blocks: larger block sizes coupled with a high rate of spatial locality decrease the number of erase operations.
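As a rough illustration only, the following hedged Java sketch (our own simplification, not the authors' tool) turns the two criteria discussed above — the total size of reused data and the sequentiality rate of that reused data — into a cache-sizing hint. The 50% threshold and the 128KB block size are illustrative assumptions.

    public class CacheDimensioningHint {
        enum Space { B_SPACE, P_SPACE }
        record Hint(long extraBlocks, Space where) {}

        static final long BLOCK_SIZE = 128 * 1024;   // 128KB flash block (assumed, as in the 10-block example)

        static Hint suggest(long reusedDataBytes, double sequentialityRateOfReusedData) {
            long extraBlocks = (reusedDataBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;  // how much cache to add
            Space where = sequentialityRateOfReusedData >= 0.5 ? Space.B_SPACE   // mostly sequential reuse
                                                               : Space.P_SPACE;  // mostly random reuse
            return new Hint(extraBlocks, where);
        }
    }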
5 Summary and Future Work

This paper highlights the importance of an adequate, device-specific I/O workload characterization to better dimension storage caches for flash devices. Indeed, such RAM caches are very expensive and are increasingly used in SSDs. To the best of our knowledge, this is the first study on the dimensioning of such cache systems. We characterized and analyzed two state-of-the-art OLTP workloads coming from financial institutions. We studied the workloads, extracted relevant metrics, and evaluated their impact on the performance of flash devices with caches. We focused our work on sequentiality and on temporal and spatial locality, and we defined new ways to represent those metrics for a better and easier dimensioning of caches for flash-based storage systems. Beyond the specific case study, the applied methodology and some of the results can be generalized to other caches and I/O workloads. We expect to pursue the study on more I/O workloads, different caches and, more generally, flash-based devices. As we have seen throughout this paper, given the disparity of workloads at the device level, devices may benefit from different per-device configurations. That is why we think it is relevant to develop an adaptive version of C-lash that, for a given size, changes its configuration according to the I/O workload.
References

1. Claburn, T.: Google plans to use Intel SSD storage in servers, http://www.informationweek.com/news/storage/systems/showArticle.jhtml?articleID=207602745 (accessed December 2010) 2. Gupta, A., Kim, Y., Urgaonkar, B.: DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings. In: ASPLOS (2009) 3. Jo, H., Kang, J., Park, S., Kim, J., Lee, J.: FAB: Flash-Aware Buffer Management Policy for Portable Media Players. IEEE Trans. on Consumer Electronics 52, 485–493 (2006) 4. Kim, H., Ahn, S.: BPLRU: A Buffer Management Scheme for Improving Random Writes in Flash Storage. In: USENIX FAST, pp. 239–252 (2008) 5. Kim, Y., Tauras, B., Gupta, A., Nistor, D.M., Urgaonkar, B.: FlashSim: A Simulator for NAND Flash-based Solid-State Drives, Tech. Report CSE-09-008, Pennsylvania (2009) 6. Ganger, G.R., Worthington, B.L., Patt, Y.N.: The Disksim Simulation Environment Version 3.0 Reference Manual, Tech. Report CMU-CS-03-102, Pittsburgh (2003) 7. Wu, G., Eckart, B., He, X.: BPAC: An Adaptive Write Buffer Management Scheme for Flash-Based Solid State Drives. In: IEEE 26th MSST (2010)
8. Hu, J., Jiang, H., Tian, L., Xu, L.: PUD-LRU: An Erase-Efficient Write Buffer Management Algorithm for Flash Memory SSD. In: MASCOTS (2010) 9. Debnath, B.K., Subramanya, S., Du, D.H., Lilja, D.J.: Large Block CLOCK (LBCLOCK): A write caching algorithm for solid state disks. In: MASCOTS (2009) 10. Boukhobza, J., Olivier, P.: C-lash: a Cache System for Optimizing NAND Flash Memory Performance and Lifetime. In: Cherifi, H., Zain, J.M., El-Qawasmeh, E. (eds.) DICTAP 2011 Part II. CCIS, vol. 167, pp. 599–613. Springer, Heidelberg (2011) 11. Riska, A., Riedel, E.: Evaluation of disk-level workloads at different time-scales. In: IISWC (2009) 12. Kavalanekar, S., Worthington, B.L., Zhang, Q., Sharda, V.: Characterization of storage workload traces from production Windows Servers. In: IISWC (2008) 13. OLTP Traces, UMass Trace Rep., http://traces.cs.umass.edu/index.php/Storage/Storage 14. Storage Performance Council, http://www.storageperformance.org 15. Micron: Small Block vs. Large Block NAND Flash Devices, Micron Technical Report TN-29-07 (2007), http://download.micron.com/pdf/technotes/nand/tn2907.pdf 16. Forni, G., Ong, C., Rice, C., McKee, K., Bauer, R.J.: Flash Memory Applications. In: Brewer, J.E., Gill, M. (eds.) Nonvolatile Memory Technologies with Emphasis on Flash, USA. IEEE Press Series on Microlelectronic Systems (2007) 17. Kang, S., Park, S., Jung, H., Shim, H., Cha, J.: Performance Trade-Offs in Using NVRAM Write Buffer for Flash Memory-Based Storage Devices. IEEE Transactions on Computers 58(6), 744–758 (2009) 18. Caulfield, A.M., Grupp, L.M., Swanson, S.: Gordon: Using Flash Memory to Build Fast, Power-efficient Clusters for Data-intensive Applications. In: ACM ASPLOS (2009) 19. Lee, S., Moon, B., Park, C., Kim, J., Kim, S.: A case for flash memory ssd in enterprise database applications. In: SIGMOD (2008) 20. Kim, S., Jung, D., Kim, J., Maeng, S.: HeteroDrive: Reshaping the Storage Access Pattern of OLTP Workload Using SSD. In: IWSSPS (2009) 21. Narayanan, D., Thereska, E., Donnelly, A., Elnikety, S., Rowstron, A.I.T.: Migrating server storage to SSDs: analysis of tradeoffs. In: EuroSys (2009) 22. Chen, S., Ailamaki, A., Athanassoulis, M., Gibbons, P.B., Johnson, R., Pandis, I., Stoica, R.: TPC-E vs. TPC-C: characterizing the new TPC-E benchmark via an I/O comparison study. SIGMOD Record (2010) 23. Boukhobza, J., Olivier, P., Rubini, S.: A Cache Management Strategy to Replace Wear Leveling Techniques for Embedded Flash Memory. In: SPECTS (2011)
Toward a Version Control System for Aspect Oriented Software

Hanene Cherait and Nora Bounour
Dept. of Computer Science, Badji Mokhtar University, BP.12, 23000, Annaba, Computer Science Research Laboratory (LRI), Algeria
{hanene_cherait,nora_bounour}@yahoo.fr
Abstract. During the lifetime of a software system, series of changes are made to the software, so many versions are produced. Version control systems contain significant amounts of data that can be exploited in the study of software evolution. Analyzing the source code of these versions can help to identify necessary changes, understand the impact of changes, track the changes, and deduce logical relations between changed entities. In this paper we are interested in the evolution analysis of aspect-oriented systems. These systems will become the legacy systems of the future and will be subject to the same evolutionary demands as today's software systems. We propose a version control system for aspect-oriented programs that uses the graph transformation formalism to manage and control their evolution.

Keywords: Version Control Systems, Aspect Oriented Programming, Software evolution, Graph rewriting.
1 Introduction

Software evolution analysis generally refers to progressive change in a software system's properties or characteristics. This process of change in one or more of its attributes leads to the emergence of new properties or to improvement, in some sense [26]. Various studies have shown that more time is spent on changing software than on developing it [3]. Version control systems contain large amounts of historical information that can give deep insight into the evolution of a software project. The majority of research has focused on examining the artifacts stored in a software repository and their associated metadata. Analyzing the source code of the software repository can help identify necessary changes, understand the impact of changes, track the changes, and deduce logical relations between changed entities. Aspect-oriented programming (AOP) languages provide a new kind of module, called an aspect, that allows one to modularize the implementation of crosscutting concerns which would otherwise be spread across various modules. In spite of these more advanced modularization mechanisms, aspect-oriented programs still suffer from evolution problems, i.e., more relationships are introduced by this paradigm
because aspects are not explicitly invoked but implicitly invoked [12]. Changes introduced with AOP are not directly visible in the base system's source code, making program comprehension more difficult. Aspects are usually stored in separate files, but the effects of this code can influence the whole system [4]. In this paper, we analyze the needs of aspect-oriented systems in order to propose a version control system that rigorously manages and controls their evolution. Until now, there has been no version control system specific to the aspect-oriented paradigm. Such a version control system is useful for system developers to manage the evolution of aspect-oriented source code (in our case, AspectJ source code), and the change repository can further be used to enhance other evolution analyses. It can be used to perform non-trivial evaluations of aspect-oriented systems. We can answer interesting questions, such as: which aspects contain a given pointcut? Which aspect changes frequently (hotspot)? Which concerns in the system are coupled, i.e., if one changes a concern (aspect), must another be changed too? Consequently, this can help to understand aspect-oriented software evolution, predict future changes, identify potential faults, detect new concerns, and develop new refactoring algorithms. The paper is organized as follows. In the next section, we describe the characteristics of current version control systems. Then, in Section 3, we explain the basic foundations of aspect-oriented programs. Section 4 gives the details of our version control system for aspect-oriented source code. Finally, we conclude our discussion and present future work in Section 5.
2 Current Version Control Systems

2.1 Version Control Systems

Software Configuration Management (SCM) is the control of the evolution of complex systems. More pragmatically, it is the discipline that enables us to keep evolving software products under control, and thus contributes to satisfying quality and delay constraints [14]. SCM is a critical element of software engineering; it is needed because of the increased complexity of software systems, the increased demand for software, and its changing nature [17]. Version control systems (or software configuration management systems) such as CVS [20] or Subversion [16] help coordinate team development for large, complex projects. They permit developers to work simultaneously on the same software system while ensuring that their modifications do not interfere with the work done by other team members. Source control systems provide a history of changes to the code of the software system [11]. There are at least 50 SCM systems that can be acquired for use today, most of them commercial products. The source code of the system is stored in a source repository; for each file in the software, the source repository records details such as the creation date of the file and the modifications made to the file over time, along with their size and a description of the lines affected by each modification. Furthermore, the repository associates with each modification the exact date of its occurrence, a comment typed by the developer to indicate the reason for the change, and in some cases a list of other files that were part
of the change described by the developer's comment. Such detailed records permit rolling the code back to any point in time, either to retrieve an old version of the code or to abandon new changes that were found to be irrelevant or buggy [11]. Additionally, a version control system provides facilities for merging changes, using one or more methods ranging from file locking to automatic integration of conflicting changes [16]. Researchers have described the many benefits of using the development history to gain a better understanding of bugs in source code, to locate hidden dependencies, or to assist in searching and browsing source code [11].

2.2 Analysis of Software Repositories

Software repositories contain a wealth of information about the software; the task is to analyze them and uncover that information. Some works visualize software evolution histories. There are three kinds of visualization techniques: the first visualizes metrics (code age, number of bug fixes, ...) on a flat representation of the software (e.g., [1]); the second also shows structural information (e.g., [2]); the third extracts recurring patterns from the software history using data-mining and visual data-mining techniques (e.g., [5]). Semantically coupled components may not structurally depend on each other [6]. Such logical dependencies (called evolutionary coupling) can be uncovered by analyzing the evolution history of a system. For example, in [7] the authors exploit historical data extracted from repositories such as CVS and focus on change couplings. Ying et al. [8] proposed a technique to determine the impact of changes based on association rules. In [6], the authors formalize logical coupling as a stochastic process using a Markov chain model. In [9], data mining is applied to version histories in order to guide programmers along related changes. Many techniques have been proposed to extract meta-information from software repositories. For example, in [10] the authors propose a graph model in which the different entities stored in the repository become vertices and their relationships become edges; they then define SCQL, a first-order, temporal-logic-based query language for source control repositories. In [11], using sound mathematical concepts from information theory (Shannon's entropy), the authors study the complexity of the development process by examining the logs of the source control repositories of large software projects. Little effort has been made for aspect-oriented software; in [19], the authors propose an approach for mining change patterns in AspectJ software evolution. They first analyze the successive versions of an AspectJ program, and then decompose their differences into a set of atomic changes. Finally, they employ the Apriori data-mining algorithm to generate the most frequent itemsets. Those change patterns can be used as a measurement aid and for fault prediction in AspectJ software evolution analysis.

2.3 Limits of Current Version Control Systems

A previous study [18] showed that most versioning systems in use today are indeed losing a lot of information about the system they version, so they are not fully satisfactory for evolution research. There are two shortcomings which have
major consequences and are the cause of most of the other ones: (1) most systems are file-based rather than entity-based, and (2) they are snapshot-based, not change-based, i.e., the program is frozen as a snapshot with a particular time stamp without recording the actual changes that happen between two subsequent snapshots. Robbes and Lanza [18] claim that the need for better versioning systems still holds, and they argue that an entity-based versioning system for a widespread language would be a great step forward for both software developers and researchers, as such systems handle a large part of the parsing task themselves. Such version control systems open new ways for both developers and researchers to explore and evolve complex systems. We build on this idea to propose our version control system for aspect-oriented software.
3 Foundations of Aspect Oriented Systems

Aspect-oriented programming (AOP) is a methodology that separates the subsidiary concerns that crosscut the main functionalities of a system [22]. Consider, for example, functionalities such as synchronization, authorization or logging, which generally intervene before or after the call of the main functions. Traditionally, the code associated with these subsidiary concerns was dispersed almost everywhere in the main program, and these scattered functions were inevitably harder to maintain. In the aspect paradigm, the code of the crosscutting concerns can be regrouped in special modules named "aspects", instead of being dispersed in the system classes. AOP solves the problems due to tangled and scattered code. It also makes it possible to modularize the implementation of crosscutting concerns, to create more evolvable systems, and to ensure better reuse of the code. Besides, studies show that the overhead introduced by aspect-oriented approaches is relatively low. Finally, aspect-oriented implementations show higher levels of adaptability and reuse than object-oriented implementations. In this paper, we use AspectJ as our target language to show the basic idea of our version control system for aspect-oriented software. Before presenting our approach, we first briefly introduce the background of AspectJ semantics. More information about AspectJ can be found in [23]. Below, we use a sample program to briefly introduce AspectJ. The program shown in Listing 1 changes the monitor to refresh the display as needed. It contains an aspect UpdateDisplay that updates the display when objects move, and a class Point. The decomposition of an AspectJ application reveals:
⎯ the base code, which defines the set of services (i.e., functionalities) achieved by the application; in other words, this code corresponds to the "what" of the application;
⎯ several complementary aspects, which specify the mechanisms governing the execution of the application, i.e., the non-functional aspects defining the "how" (for example synchronization, persistence or security).
Listing 1. The program UpdateDisplay

    class Point {
        private int _x = 0, _y = 0;
        int getX() { return _x; }
        int getY() { return _y; }
        void setX(int x) { _x = x; }
        void setY(int y) { _y = y; }
    }

    aspect UpdateDisplay {
        pointcut move(): call(void Point.setX(int)) || call(void Point.setY(int));
        before(): move() {
            System.out.println("figure is going to be displaced");
        }
        after(): move() {
            Display.update();
        }
    }
Aspects are used to regroup implementation choices that have an impact on the whole system and that would otherwise be scattered throughout the code. Every aspect is meant to be developed independently and then integrated into an application by a process called "aspect weaving": since aspects are modules defined separately, their integration rules must be defined in order to compose them and "construct" the application. Therefore, new concepts are introduced with AOP to allow developers to specify and implement the crosscutting concerns. Table 1 presents these concepts.

Table 1. Concepts of an AspectJ program

Concept       Role
Join point    A precise place in the program execution (Table 2); advices are inserted at the join points.
Pointcut      The means to specify a set of particular join points; a pointcut is often a regular expression.
Advice        A code fragment inserted at join points; it implements a crosscutting concern. There are three kinds of advice: before, after (or after returning/throwing) and around.
Introduction  Used by an aspect to add new fields, constructors, or methods (even with bodies) into given interfaces or classes.
Aspect        A unit regrouping one or several definitions of pointcuts, advices, associations of pointcuts to advices, and introductions.
Weaver        A special tool that applies the aspects to the base code.
According to this table, the key concepts that ensure the integration between the base code and the aspects of the system are the join points. They are particular points in the dynamic call graph. Table 2 describes these join points.
Table 2. Kinds of Join Points

Join point                      Description
Method call                     When a method is called
Method execution                When the method's body is executed
Constructor call                When a constructor is called
Constructor execution           When a constructor's body is executed
Static initializer execution    When the static initialization of a class is executed
Object pre-initialization       Before the initialization of the object
Object initialization           When the initialization of an object is executed
Field reference                 When a non-constant attribute of a class is referenced
Field set                       When an attribute of a class is modified
Handler execution               When an exception handler is executed
Advice execution                When the code of an advice is executed
If we compare these concepts with the program in Listing 1, we distinguish one aspect named "UpdateDisplay". This aspect contains a pointcut named "move", which specifies two join points of the type method call: "when the method setX is called" and "when the method setY is called" (methods of the class Point). The two are joined with the operator "or". The aspect also contains two advices (before and after).
4 Our Version Control System for Aspect Oriented Systems

The aim of our version control system for aspect-oriented software is to accurately model how this software evolves by treating change as a first-class entity. We model software evolution as a sequence of changes that take a system from one state to the next by means of semantic transformations. In short, we do not view the history of a software system as a sequence of versions, but as the sum of the changes which brought the system to its current state.

4.1 Program Representation

We use graph transformation (graph rewriting) as a formal technique to give a formal semantics to our versioning system. Graphs are based on a well-understood mathematical foundation (graph theory), which makes them very interesting from a formal point of view. From a practical point of view, graphs are also very useful, since they are often used as an underlying representation of arbitrarily complex software artifacts and their interrelationships [13]. Our program representation is based on a code model composed of the following parts: the base code model and the aspect models. The model associated with the aspect-oriented source code is a colored graph [25]. This graph is generated from the AspectJ source code. Formally, a colored graph G is represented by a 6-tuple (N, E, s, t, cN, cE). Here, N denotes the set of nodes and E denotes the set of edges; s is a mapping that maps the edges to their sources and t maps them to their targets; cN and cE are mappings
that map the nodes and the edges of the graph to fixed alphabets of node and edge colors, respectively.
1. The base code model: the base code of an AspectJ system is an object-oriented program. It is modeled as a colored graph. The nodes of the graph represent the entities of the system (Class, Attribute, Method, Parameter, Return value), and the edges describe the relations between these entities: (1) the attributes and methods that belong to a class, annotated with Has (private, public, protected): attribute, method, Takes parameter, or returns for the return value; (2) the connections between classes, annotated with Association, Aggregation, Generalization, Composition; (3) the calls between methods, annotated with call.
2. Aspect models: every aspect of the system is also modeled as a colored graph. This graph is similar to the graph that models an object-oriented system, but we must add other concepts proper to the AspectJ source code. A colored graph has a pair of color alphabets, one to color the edges and one to color the nodes. In our representation, the elements of the node color alphabet are:
⎯ Aspect
⎯ Attribute: «Type»
⎯ Method
⎯ Parameter: «Type»
⎯ Return value: «Type»
⎯ Pointcut
⎯ Advice: «Kind», where "Kind" is one of: before, after returning, after throwing, after, around
The edges in our model identify the relations between the different elements of the system (graph). We have three classes of relations:
⎯ The first depicts the elements that belong to an aspect; these edges are annotated with one of the following colors: Has (private, public, protected): attribute or method, Takes parameter, or returns for the return value, plus Has pointcut for the pointcuts.
⎯ The second class depicts the relations between the advices and the pointcuts, using the colors before, after (or after returning, or after throwing) and around. Note that if the advice is of kind "around", an edge colored Returns is created between this advice and its return value.
⎯ The third class depicts the relations between the objects of the aspect, i.e., the calls between the methods of the aspect, or between the advices and these methods. These edges are annotated with call.
3. Modeling the global system: to integrate the different sub-graphs that represent the classes and the aspects, we need two types of edges (which we also call dependence edges).
Fig. 1. The colored graph of the program UpdateDisplay
Call edges:
⎯ call edges from the methods of the aspects to the methods of the classes;
⎯ call edges from the advices of the aspects to the methods of the classes.
Crosscut edges:
⎯ edges that join every pointcut with its join points in the base code, annotated with crosscut.
Figure 1 presents the graph of the AspectJ program "UpdateDisplay" of Listing 1. In this figure there are two sub-graphs: on the left-hand side we depict the colored graph of the class Point, and on the right-hand side the colored graph of the aspect UpdateDisplay. The two sub-graphs are related by two dependence edges of type crosscut.

4.2 Change Representation

While the AspectJ source code is formalized by a colored graph, program changes are mapped to graph rewriting rules. Like graphs, graph rewriting is very intuitive to use; nevertheless, it has a firm theoretical basis. These theoretical foundations of graph rewriting can assist in proving correctness and convergence properties of aspect-oriented software evolution. We represent changes to the program as explicit rewriting rules on its colored graph.
The graph of Fig. 1 has been produced with the AGG tool (Attributed Graph Grammar), which is based on the notion of graph transformation rules.
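To make this representation concrete, here is a minimal Java sketch (ours, not the authors' implementation and not AGG) of the colored-graph model: nodes and edges carry colors from the alphabets listed above, and the source/target mappings of the 6-tuple are encoded in the Edge record. The class and enum names are assumptions for illustration only.

    import java.util.*;

    public class ColoredGraph {
        enum NodeColor { ASPECT, CLASS, ATTRIBUTE, METHOD, PARAMETER, RETURN_VALUE, POINTCUT, ADVICE }
        enum EdgeColor { HAS_ATTRIBUTE, HAS_METHOD, HAS_POINTCUT, TAKES_PARAMETER, RETURNS,
                         BEFORE, AFTER, AROUND, CALL, CROSSCUT,
                         ASSOCIATION, AGGREGATION, COMPOSITION, GENERALIZATION }

        record Node(String id, NodeColor color, String label) {}              // e.g. ("n1", POINTCUT, "move")
        record Edge(String id, Node source, Node target, EdgeColor color) {}  // s(e) = source, t(e) = target

        private final Map<String, Node> nodes = new LinkedHashMap<>();  // N, with cN encoded in Node.color
        private final Map<String, Edge> edges = new LinkedHashMap<>();  // E, with s, t and cE encoded in Edge

        Node addNode(String id, NodeColor color, String label) {
            Node n = new Node(id, color, label);
            nodes.put(id, n);
            return n;
        }
        Edge addEdge(String id, Node source, Node target, EdgeColor color) {
            Edge e = new Edge(id, source, target, color);
            edges.put(id, e);
            return e;
        }
        Collection<Node> nodes() { return nodes.values(); }
        Collection<Edge> edges() { return edges.values(); }
    }

Building the graph of Fig. 1 would then amount to addNode calls for the class Point, the aspect UpdateDisplay and the pointcut move, plus crosscut edges from the pointcut to the setX and setY call join points.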
Fig. 2. Basic Idea of Graph Rewriting [15]
When applied, a change rewrite rule takes a program state as input and returns an altered program state. Since each state is a colored graph, rewriting rules are graph operations, such as the addition or removal of nodes and modifications of the properties of a node. The calculus of graph transformation has a solid background; the interested reader is referred to seminal work in this area [25]. In this paper we present only as much theoretical background as needed to understand our approach. As shown in Fig. 2, a graph rewrite rule consists of a tuple (L, R), where L, the left-hand side of the rule, is called the pattern graph and R, the right-hand side of the rule, is the replacement graph. Moreover, we need to identify graph elements (nodes or edges) of L and R in order to preserve them during the rewrite. This is done by a preservation morphism p mapping elements from L to R; the morphism is injective, but needs to be neither surjective nor total. Together with a rule name n, we have r = (n, L, R, p) [15]. The transformation is done by applying a rule to a host graph H. To do so, we have to find an occurrence of the pattern graph L in the host graph. Mathematically speaking, such a match m is an isomorphism from L to a subgraph of H. This morphism may not be unique, i.e., there may be several matches. Afterwards we change the matched spot of the host graph, such that it becomes an isomorphic subgraph of the replacement graph R. Elements of L not mapped by p are deleted from H during the rewrite. Elements of R not in the image of p are inserted into H; all others (elements that are mapped by p) are retained. The outcome of these steps is the resulting graph H′. The rewrite rules can be specified in a textual ([15], [21]) or graphical [24] manner. A set of graph rules, together with a type graph, is called a graph transformation system (GTS). One of the main static analysis facilities for GTSs is the check for conflicts and dependencies between rules and transformations. We argue that the existing theoretical results for graph transformation can advantageously be used for analyzing potential conflicts and dependencies in aspect-oriented evolution. In the following, we present a simple graphical change rule for the graph of Fig. 1 using the AGG tool [24]. Every rewriting rule is made up of two parts: the Left-Hand Side (LHS) presents the pre-condition of the rule, and the Right-Hand Side (RHS) depicts its post-condition. In certain cases, negative application conditions (NACs), which are preconditions prohibiting certain graph parts, are also needed.
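The following deliberately simplified Java sketch (ours) illustrates only the delete/insert/retain step of the rewrite just described: graph elements are reduced to string identifiers, edge structure is ignored, and the match is assumed to be given rather than computed.

    import java.util.*;

    public class RewriteSketch {
        // r = (L, R, p): p maps the preserved elements of L to their counterparts in R
        record Rule(Set<String> left, Set<String> right, Map<String, String> preserve) {}

        static Set<String> apply(Set<String> hostElements, Rule r, Map<String, String> match) {
            Set<String> result = new HashSet<>(hostElements);
            for (String l : r.left())
                if (!r.preserve().containsKey(l))
                    result.remove(match.get(l));          // elements of L not mapped by p are deleted
            Set<String> imageOfP = new HashSet<>(r.preserve().values());
            for (String x : r.right())
                if (!imageOfP.contains(x))
                    result.add("new:" + x);               // elements of R not in the image of p are inserted
            return result;                                // preserved elements are retained as their matched images
        }
    }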
Fig. 3. Rewrite rule-based versioning system
For example, we change the program UpdateDisplay to control the value of x ("if x>10 then x=x-1"). This can be formalized as the addition of a new pointcut "control" that captures the method setX:

    pointcut control(): call(void Point.setX(int));
    int around(int x): control() && args(x) {
        if (x > 10) x = x - 1;
        return x;
    }

Fig. 4. The addition of the pointcut control
The rule in Fig. 4 adds the pointcut "control" detailed above to the aspect UpdateDisplay, provided it does not already exist. This condition is formulated using a negative application condition (NAC) depicted on the left side of the picture (i.e., the existence of two advices of the same kind for the same pointcut is prohibited). The RHS of the rule shows that the addition of the pointcut involves the addition of its advice (the around advice and its return value). The join points are specified with the crosscut edges.

4.3 The Change Repository

As shown in Section 2.2, we need to record the evolution history of an aspect-oriented system to use it later in the analysis and validation of this evolution. Changes are stored in a change repository, which can then be used to analyze the evolution of the aspect-oriented software. Our version control system follows this principle:
⎯ The change operations that serve to navigate between the different versions of the system are considered as rewrite rules on the source code graph.
⎯ We propose a rule-based repository as a versioning system: instead of recording the whole changed graph as a version, we only record the evolution sequences (the sets of rewrite rules) applied to this graph (Fig. 3.a), i.e., we can reproduce every version of the system by executing the associated sequence of rewrite rules. For example, in Fig. 3.b, to extract version 2 of the system, we only apply rule 2 to the base system graph.
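A minimal sketch of this rule-based repository idea is given below (our illustration, not the authors' tool): only the committed rewrite rules are stored, and checkout(k) replays the first k rules on the base graph. The Graph type parameter and the precondition method, which stands in for a NAC-style application condition, are assumptions.

    import java.util.*;

    public class RuleBasedRepository<Graph> {
        interface RewriteRule<G> {
            boolean precondition(G graph);   // NAC / application condition, e.g. "pointcut not already present"
            G apply(G graph);                // returns the transformed graph
        }

        private final Graph baseGraph;
        private final List<RewriteRule<Graph>> history = new ArrayList<>();

        RuleBasedRepository(Graph baseGraph) { this.baseGraph = baseGraph; }

        void commit(RewriteRule<Graph> rule) { history.add(rule); }   // store the change, not the resulting graph

        /** Reproduces version k by applying the first k committed rules to the base graph. */
        Graph checkout(int version) {
            Graph g = baseGraph;
            for (RewriteRule<Graph> rule : history.subList(0, Math.min(version, history.size())))
                if (rule.precondition(g))
                    g = rule.apply(g);
            return g;
        }
    }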
5 Conclusion

In this paper we proposed a version control system for aspect-oriented source code written in AspectJ, to gather the changes made to the system over time. Our approach consists in modeling the source code as a colored graph representing the different entities of the system and the relations between them. The evolution changes are formalized as rewrite rules on the system graph and stored in a rewriting-rule-based repository. Our version control system avoids the problems of current version control systems, which are file-based and snapshot-based (Section 2.3). Our repository is change-based (specifically, rewriting-rule-based); therefore we can reproduce every version of the system at any point in time, and we can follow the evolution of every entity of the system (entity-based). With our versioning system, we can manage in a reliable way the evolution of the concerns of an aspect-oriented system, by analyzing the change repository to answer the questions raised above. The most interesting evolution tasks in this area are the detection of logical coupling between the aspects of the system, which is very important for restructuring and refactoring, and the visualization of source code metrics to understand and predict future development. Although this is outside the scope of this paper, we believe that this approach is general enough to be applied to other aspect-oriented programming languages.
References

1. Taylor, C.M.B., Munro, M.: Revision towers. In: Workshop on Visualizing Software for Understanding and Analysis (VISSOFT), pp. 43–50. IEEE Computer Society Press, Paris (2002) 2. Gall, H., Jazayeri, M., Riva, C.: Visualizing software release histories: The use of color and third dimension. In: International Conference on Software Maintenance (ICSM), pp. 99–108. IEEE Computer Society, Oxford (1999) 3. Lehman, M.M., Ramil, J.F.: Metrics-Based Program Evolution Management. Position paper submitted to the Workshop on Empirical Studies of Software Maintenance (WESS), Bethesda, MD (1998) 4. Vollmann, D.: Visibility of join-points in AOP and implementation languages. In: Second Workshop on Aspect-Oriented Software Development, Bonn, Germany, pp. 65–69 (2002)
5. Van Rysselberghe, F., Demeyer, S.: Studying software evolution information by visualizing the change history. In: International Conference on Software Maintenance, pp. 328–337. IEEE, Los Alamitos (2004) 6. Wong, S., Cai, Y., Dalton, M.: Change Impact Analysis with Stochastic Dependencies. Department of Computer Science, Drexel University, Technical Report DU-CS-10-07 (October 2010) 7. Ratzinger, J., Fischer, M., Gall, H.: Improving Evolvability through Refactoring. In: MSR, Saint Louis, Missouri, USA (2005) 8. Ying, A.T.T., Wright, J.L., Abrams, S.: Source code that talks: an exploration of Eclipse task comments and their implication to repository mining. In: International Workshop on Mining Software Repositories (MSR), Saint Louis, Missouri, USA (2005) 9. Zimmermann, T., Weißgerber, P., Diehl, S., Zeller, A.: Mining version histories to guide software changes. In: 26th International Conference on Software Engineering, pp. 563– 572. IEEE Computer Society Press, Los Alamitos (2004) 10. Hindle, A.J.: SCQL: A Formal Model and a Query Language for Source Control Repositories. A Master Thesis. University of Victoria (2005) 11. Hassan, A.E., Holt, R.C.: The Chaos of Software Development. In: International Workshop on Principles of Software Evolution (IWPSE), Helsinki, Finland (2003) 12. Xu, J., Rajan, H., Sullivan, K.: Understanding aspects via implicit invocation. In: 19th IEEE International Conference on Automated Software Engineering, pp. 332–335 (2004) 13. Mens, T.: A Formal Foundation For Object Oriented Software Evolution. PH.D. Dissertation. Vrije Universities Brussel (1999) 14. Estublier, J.: Software Configuration Management: A Roadmap. In: International Conference on The Future of Software Engineering, New York, USA (2000) ISBN: 1-58113-253-0 15. Blomer, J., Geib, R., Jakumeit, E.: The GrGen.NET User Manual, http://www.grgen.net 16. Nagel, W.: Subversion Version Control: Using The Subversion Version Control System in Development Projects. Prentice Hall Professional Technical Reference (2005) 17. Koskela, J.: Software configuration management in agile methods. VTT Technical Research Centre of Finland (2003) ISBN 951–38–6259–3 18. Robbes, R., Lanza, M.: Versioning systems for evolution research. In: 8th International Workshop on Principles of Software Evolution, pp. 155–164. IEEE, Los Alamitos (2005) 19. Qian, Y., Zhang, S., Qi, Z.: Mining Change Patterns in AspectJ Software Evolution. In: International Conference on Computer Science and Software Engineering, pp. 108–111 (2008) 20. Fogel, K.: Open Source Development with CVS. Coriolos Open Press, Scottsdale (1999) 21. Glauert, J.R.W., Kennaway, R., Papadopoulos, G.A., Sleep, R.: Dactl: an experimental graph rewriting language. Journal of Programming Languages, 85–108 (1997) 22. Lopes, C.V., Hursch, W.L.: Separation of Concerns. College of Computer Science. Northeastern University, Boston (1995) 23. The AspectJ Team: The AspectJ Programming Guide, Online manual http://eclipse.org/aspectj/ 24. Ermel, T.S.C.: AGG Environnement: A Short Manual, http://tfs.cs.tuberlin.de/agg/ShortManual.ps 25. Ehrig, H., Ehrig, K., Prange, U., Taentzer, G.: Fundamentals of Algebraic Graph Transformation. EATCS Monographs in TCS. Springer, Heidelberg (2005) ISBN 978-3-540-31187-4 26. Lehman, M.M., Ramil, J.F.: Software Evolution and Software Evolution Processes. Annals of Software Engineering 14, 275–309 (2002); Kluwer Academic Publishers. Manufactured in the Netherlands
AspeCis: An Aspect-Oriented Approach to Develop a Cooperative Information System

Mohamed Amroune(1,2,3), Jean-Michel Inglebert(3), Nacereddine Zarour(2), and Pierre-Jean Charrel(3)

(1) University of Tebessa, Algeria
(2) Lire Laboratory, University Mentouri of Constantine, Algeria
(3) IRIT Laboratory, University of Toulouse, France
[email protected], [email protected], [email protected], [email protected]
Abstract. It is difficult for a single Information System (IS) to accomplish a complex task. One solution is to look for the help of other ISs and make them cooperate, leading to a so-called Cooperative Information System (CIS). Information systems cooperation is thus an active, ongoing area of research in the field of information systems, where reuse is an important issue. Besides, the aspect paradigm is a promising avenue for reuse. We therefore argue that it is interesting to propose an aspect approach to build a new information system capable of accomplishing complex tasks based on the reuse of previously developed system artifacts. To the best of our knowledge, few works have tackled this question. In this paper, we present an aspect-oriented approach called AspeCis, applied from the requirements phase through the development phases, in order to develop a CIS from existing ISs by using their artifacts such as requirements, architectures and designs.

Keywords: cooperative requirements, cooperative information system, aspect oriented modeling, aspect oriented requirements engineering, reuse, composition.
1 Introduction
It is frequently not easy for a single Information System (IS) to achieve a complex task. One solution is to make existing ISs collaborate to realize this task. In [4], systems engineering reuse is defined as the utilization of previously developed systems engineering products or artifacts such as architectures and requirements. The reuse of systems aims to deal with complexity in order to reduce costs and development time. This approach therefore stands in opposition to conventional development, in which the construction of a new system starts from nothing and requires reinventing everything. These ideas are not new and were proposed in [16], [5], [11]. The object paradigm is certainly one of the most prominent examples.
The research leading to these results has received funding from the EGIDE PHC TASSILI project under grant agreement 22375TG.
In order to be effective, reuse must be considered from the early phases: requirements engineering, analysis and design. The aspect paradigm is presented as a promising avenue for reuse, because it aims at providing means to identify, modularize, specify and compose crosscutting concerns. Our goal is to reuse system artifacts in order to build a new system through an aspect approach. Our contribution lies in the area of reuse combined with the use of techniques introduced by the aspect paradigm. IS composition is a promising solution for this issue and therefore an important way of reusing ISs. The result of IS composition is a so-called Cooperative Information System (CIS). The aim of this work is thus to propose an approach called AspeCis (for Aspect-oriented approach to develop Cooperative information systems). When a new requirement cannot be achieved directly by an existing IS, AspeCis composes and extends requirements issued from other ISs in order to fulfill this requirement. The main objectives of AspeCis are: (i) to separate, in the CIS, the existing requirements from the new ones; (ii) to provide a high degree of functional reuse, which helps to build the new requirements on other existing ISs; (iii) to propose an aspect approach, which allows weaving requirements on join points at the model level. The rest of the paper is organized as follows: Section 2 presents an overview of aspect-oriented requirements engineering approaches. Our approach is detailed in Section 3. Section 4 gives some examples of reuse and composition operators. Finally, Section 5 concludes and highlights some future work.
2 A Survey of Aspect-Oriented Requirements Engineering Approaches
Aspect-Oriented Requirements Engineering (AORE) approaches aim to treat crosscutting concerns. The emergence of these approaches is prompted by the need for systematic means for the identification, modularization, representation and composition of crosscutting requirements, in order to fill the gap left by traditional requirements engineering approaches. Below we present some AORE approaches. Aspect-oriented requirements engineering with ARCaDE was developed in [14]. This approach proposes techniques to identify and specify concerns and candidate aspects. It composes candidate aspects with the viewpoints that they cut across and handles conflicts. The approach only proposes composition rules that define how to compose crosscutting concerns with viewpoints; however, the composition of the viewpoints themselves is not specified. Aspect-oriented software development with use cases, proposed in [7], extends the traditional use case approach [6] with two main elements: pointcuts for use cases and the grouping of development artifacts into use case slices and use case modules. Use-case slices employ aspects to compose the different parts of a model. A use-case module contains the specifics of a use case over all
models of the system. Currently, it does not support conflict management. At the requirements level, this approach handles all concerns in a uniform way by means of use cases, use case slices and use case modules, so it is a symmetric approach [15]. In the context of concern-oriented requirements engineering, Moreira et al. [13] propose a multi-dimensional approach to separate concerns in requirements engineering, as well as a trade-off analysis of the requirements specification from such a multi-dimensional perspective. The basic ideas are: (i) to address the so-called tyranny of the dominant decomposition, promoting a uniform treatment of the various types of concerns; (ii) to take advantage of the observation that concerns in a system are, in fact, a subset, and concrete realizations, of abstract concerns in a meta concern space; and (iii) to provide a rigorous analysis of requirements-level trade-offs as well as important insights into architectural choices. Whittle et al. and Araújo et al. present an aspect-oriented scenario modeling approach to be used at the requirements level in [1], [17]. The motivation for this approach is to eliminate the need to repeatedly deal with the same crosscutting concerns. The non-aspectual and aspectual scenarios are modeled separately from each other and then merged as required, producing the complete scenarios [15]. Theme [2] is an approach that supports the separation of concerns for the analysis and design phases of the software lifecycle. The Theme approach expresses concerns in conceptual and design constructs called themes. It provides a model and a tool to support the identification of aspects in requirements specifications (Theme/Doc). At the design level, Theme/UML allows a developer to model features and aspects of a system, and specifies how they should be combined.
3 An Aspect-Oriented Approach to Develop a Cooperative Information System (AspeCis)
In contrast with the reviewed approaches [1], [7], [13], [14], [15], [17], where the construction of a new system starts from nothing and requires reinventing everything, AspeCis defines so-called Cooperative Requirements (CRs), which are the result of a composition process between Existing Requirements (ERs), possibly complemented with Additional Requirements (ARs).

3.1 Definitions
Before presenting the details of AspeCis, let us define the following concepts: Existing Requirements (ERs), Additional Requirements (ARs) and Cooperative Requirements (CRs). Several definitions of a requirement exist in the literature [10], but we adopt the following ones to differentiate between requirements.
Existing Requirement (ER). ERs are statements of services or constraints provided by an existing system, which define how the system should react to particular inputs and how the system should behave in particular situations.
Additional Requirement (AR). ARs are requirements which are not supported by any existing IS. In this case, other external information systems will be developed to fulfill these additional requirements.
Cooperative Requirement (CR). CRs are goal requirements that will be refined to rely on ERs and ARs, exhibiting which parts of the existing systems' requirements will be reused and composed, and which parts should be newly developed.
Fig. 1. Cooperative Requirement (CR) metamodel
3.2 Overview
AspeCis includes three main phases, presented in Fig. 2: (i) discovery and analysis of CRs, (ii) development of CR models, and (iii) preparation of the implementation phase.
Phase I: Elicitation and analysis of CRs. This phase is composed of four steps: (1) the definition of CRs, (2) the refinement of CRs, (3) the formulation of CRs in terms of the ERs, possibly with the definition of some ARs, and (4) the selection of a set of unary and composite operators to be applied to the ERs and the ARs to define the CRs, as can be seen in Fig. 1.
Fig. 2. Synopsis of AspeCis approach
Phase II: Development of CR models. Using the models of the ERs and ARs, we consider the operators defined in the previous phase as aspects; they are modelled by aspectual models, which are composed with the ER and AR models to produce the CR models. This phase includes the resolution of conflicts that can appear during the requirements composition process.
Phase III: Preparing the implementation phase. The purpose of this phase is to transform models into code templates.
In the present paper, we develop the first two phases of AspeCis.

3.3 Phase I: The Elicitation and Analysis of Cooperative Requirements
The objective of the first phase is to define a set of CRs expressed in terms of a set of ERs, possibly augmented with a set of ARs.
Elicitation of Cooperative Requirements. The requirements elicitation activity offers means for the identification and analysis of all requirements. The requirements engineer may use information-gathering techniques to obtain and analyze the requirements and to capture derived requirements that are a logical consequence of what the users and clients requested. In the literature, several techniques for requirements elicitation are defined [10]. The choice of an elicitation technique depends on the time and resources available to the requirements engineer and, of course, on the kind of information that needs to be elicited [9]. Nuseibeh presents a classification of elicitation techniques [12]. This phase must be followed by a refinement process.
Refinement of Cooperative Requirements. The refinement process is used to decompose the CRs into a set of basic requirements, together with the use of the inference relation, in order to avoid some problems related to the definition of requirements, such as redundancy and inconsistencies between CRs.
Decomposition of Cooperative Requirements. The refinement is used to decompose the CRs, regarded as high-level requirements, into a set of existing and additional (non-decomposable) requirements. The decomposition of a CR gives a tree of requirements, which are the leaves of the decomposition tree. These requirements are connected by conjunction or disjunction nodes. We can distinguish the ERs that can be used without any change in the definition of the CRs from the ERs that must be changed by means of appropriate operators; to the latter we must apply unary operators in order to make the appropriate changes. Finally, the ARs are the requirements that are not supported by the existing ISs. The result of a decomposition gives us the existing and additional requirements involved in the definition of a CR.
Cooperative Requirements Inference. Within the refinement phase, we can use the inference relation mentioned in [9], where it is defined as follows: "When a requirement is the immediate consequence of another set of requirements, the former is called the conclusion, the latter the premises, and they stand related through the inference relation. The inference relation can be used to connect the refined requirement to the requirements that refine it." The inference relation brings the following benefits: (i) it allows dealing with the redundancy of CRs, avoiding the definition of CRs that can be obtained simply by inference; (ii) it allows the requirements engineer to deal with the problem of ambiguity that occurs when one CR has several possible interpretations; (iii) it allows reducing development cost and project schedule.
Example. Let R1 and R2 be two requirements, R1 being "Courses are available on a web page" and R2 being "Students have all standard functionalities for reading courses". We can conclude the requirement R3, "The students can download the courses on the Web".
Expression of CRs in terms of ERs and ARs. In this sub-phase we determine the ERs and ARs involved in the definition of every CR which could not be inferred from others, i.e., we express each CR using a combination of ERs and/or ARs. A set of ERs must be modified in order to incorporate the changes needed to clearly define the CRs.
Selection of the operators. In the general case, the reuse of existing requirements needs some modifications in order to define CRs. These changes are ensured by unary and composite operators. For us, a modification of an ER is the result of weaving a new behavior on the ER, or of the logical composition (a set of conjunctions and disjunctions) of a set of requirements. We define two categories of operators: the unary operators (Op U), which act on a single requirement, and the composite operators (Op C), which act on a set of ERs and ARs.
Example. Let us define the following CR:

CR1 = Op U1(ER1)                 (1)
CR2 = Op U1(ER2)                 (2)
CR3 = Op C1(CR1, CR2)            (3)
CR4 = AR1                        (4)
CR5 = Op U2(ER3)                 (5)
CR6 = Op U2(ER4)                 (6)
CR8 = Op C2(CR5, CR4, CR6)       (7)
CR  = Op C3(CR3, CR8)            (8)
We can formally represent the two operators as:

Op U : E → C1                                      (9)
       ERi → CRj                                   (10)
Op C : (E′, A)^n → C                               (11)
       (ER1, ER2, ..., ERn, AR1, ..., ARm) → CR    (12)
where E is the set of ERs, C1 is the set of CRs (ERs modified by Op U), E′ = Union(E, C1), A is the set of ARs, and C is the set of CRs (which includes C1). So, in order to define CR, we must apply the operators mentioned in its definition to the ERs. Each operator can be applied to several requirements, so these operators are transverse with respect to the ERs: Op U1 is transverse to the requirements (ER1, ER2) and Op U2 to (ER3, ER4).
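As an illustration of equations (1)–(12), the following Java sketch (ours, not part of AspeCis) represents CRs as composition trees whose leaves are ERs and ARs and whose internal nodes name the applied operators; the type and operator names are assumptions. The main method rebuilds the CR of equations (1)–(8).

    import java.util.List;

    public class CrComposition {
        sealed interface Requirement permits ExistingRequirement, AdditionalRequirement, CooperativeRequirement {}
        record ExistingRequirement(String text) implements Requirement {}
        record AdditionalRequirement(String text) implements Requirement {}
        record CooperativeRequirement(String operator, List<Requirement> operands) implements Requirement {}

        // Op U : E -> C1 (acts on a single requirement)
        static CooperativeRequirement opU(String name, Requirement r) {
            return new CooperativeRequirement(name, List.of(r));
        }
        // Op C : (E', A)^n -> C (acts on several ERs, ARs or intermediate CRs)
        static CooperativeRequirement opC(String name, Requirement... rs) {
            return new CooperativeRequirement(name, List.of(rs));
        }

        public static void main(String[] args) {
            var er1 = new ExistingRequirement("ER1"); var er2 = new ExistingRequirement("ER2");
            var er3 = new ExistingRequirement("ER3"); var er4 = new ExistingRequirement("ER4");
            var ar1 = new AdditionalRequirement("AR1");
            var cr3 = opC("Op C1", opU("Op U1", er1), opU("Op U1", er2));        // equations (1)-(3)
            var cr8 = opC("Op C2", opU("Op U2", er3), ar1, opU("Op U2", er4));   // equations (4)-(7)
            var cr  = opC("Op C3", cr3, cr8);                                    // equation (8)
            System.out.println(cr);
        }
    }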
3.4 Phase II: Development of Models of CRs
In this second phase of our approach, we show how to create aspect-oriented models of the CRs. Aspect-oriented modeling (AOM) approaches distinguish three levels of weaving [3]: model-level weaving, code-level weaving, and executable-level (runtime) weaving. Our work is positioned at the first level. Therefore, we also make a clear distinction between the base system and the crosscutting concerns. Based on the observation that the operators are transverse to the ERs, we say that operators are woven on ERs. In Aspect-Oriented Software Development (AOSD), the join point is a key concept: it defines the places where two concerns, i.e., core and aspectual, crosscut each other [9]. Thus, since the concept of join point is used at the code and the conceptual levels, we can also define it at the analysis level. In this context we can define an aspect as follows: Aspect = (JP, Advice), where JP (the join points) are the ERs, and the Advice is the set of changes to be woven on these ERs. In the case of model-level weaving, crosscutting concerns are modeled separately, then woven together to yield a non-aspect-oriented model, which is then transformed into code. We want to build the models of CRs based on the existing models of ERs and ARs. To define unary operators as aspects (Aspect-Op U), we consider an aspect to be an entity composed of pointcuts (1) and advices (2), as defined by the AOSD community. This definition is found, for example, in the Java AspectJ weaver, as shown in the following example:

    package aop.aspectJ;
    public aspect TraceAspect {
        pointcut traced(): call(public void Order.AddItem(String, int));   // (1)
        void around(): traced() {                                          // (2)
            System.out.println("Before call AddItem");
            proceed();
            System.out.println("After call AddItem");
        }
    }

Initially, we use structural models, so the ERs and the operators are represented with class diagrams. The pointcuts for aspectual models are ER models or
elements of these models, so classes, attributes, methods and associations are examples of such pointcuts. To these categories of pointcuts correspond some advices. Table 1 shows some examples of pointcuts and advices.

Table 1. Pointcuts and advices

Pointcut       Advices
Class          Add Class; Remove Class; Rename Class
Association    Add/Remove Association; Add/Remove inheritance relationship; Add/Remove aggregation relationship; Add/Remove composition relationship; Add/Remove association or attribute

We consider composite operators as aspects (Aspect-Op C), which affect the ER models (i.e., either whole diagrams or parts of these diagrams). We distinguish the three following categories of pointcuts:
Category 1: merge of parts of diagrams / merge of class diagrams
Category 2: junction of diagrams (using AND)
Category 3: disjunction of diagrams (using OR)
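To illustrate model-level weaving with Table 1-style pointcuts and advices, here is a small hedged Java sketch (ours): a pointcut is a predicate over class-diagram elements and an advice is a model change such as adding an association. The ClassModel and ModelAdvice names are assumptions.

    import java.util.*;

    public class ModelWeaving {
        record UmlClass(String name, List<String> attributes) {}
        record Association(String from, String to, String label) {}

        static class ClassModel {
            final Map<String, UmlClass> classes = new LinkedHashMap<>();
            final List<Association> associations = new ArrayList<>();
        }

        interface ModelAdvice { void apply(ClassModel model, UmlClass joinPoint); }

        /** Weaves an advice at every class matched by the pointcut predicate. */
        static void weave(ClassModel model, java.util.function.Predicate<UmlClass> pointcut, ModelAdvice advice) {
            for (UmlClass c : List.copyOf(model.classes.values()))
                if (pointcut.test(c))
                    advice.apply(model, c);
        }

        public static void main(String[] args) {
            ClassModel erModel = new ClassModel();
            erModel.classes.put("Student", new UmlClass("Student", List.of("name")));
            // "Add association" is one of the advice kinds listed in Table 1
            weave(erModel, c -> c.name().equals("Student"),
                  (m, c) -> m.associations.add(new Association(c.name(), "Subscription", "holds")));
            System.out.println(erModel.associations);
        }
    }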
4 An Example of Operators
To give an example of the operators, let us assume that a new CIS is built for the management of a cooperative project involving several universities to provide a Graduate School. Each university is supported by its existing IS, and the new CIS is built on the basis of these existing ISs. The following operator illustrations refer to Phase I of AspeCis.

4.1 Unary Operator (Op U)
Let ER1, ER2 and ER3 be the following ERs, defined in each of the three universities:
ER1 = ER2 = ER3 = Every student may have a second subscription in the same university.
CR = Every student may have a second subscription in the same university PROVIDED THAT the number of hours of the second speciality does not exceed 50% of the number of hours of the first one.
CR is a composition of (ER1, ER2, ER3) of the type "adding a property to each ERi". Thus we apply the unary operator Op U "Add the property (PROVIDED THAT)" to ER1, ER2 and ER3 in order to check the new property PROVIDED THAT imposed by the definition of the cooperative requirement CR. Op U weaves this new property on the existing requirements (ERi).
4.2 Composite Operator (Op C)
Let ER'1, ER'2 and ER'3 be the authentication requirements in each university. Each existing IS authenticates its users with its own method. For instance: ER'1 = authentication by name, ER'2 = authentication by e-mail, and ER'3 = authentication by pseudonym. The CIS authentication requirement is as follows: CR = use one of the following compositions of the existing ISs' authentications. a) CR = OR(ER'1, ER'2, ER'3), i.e., one of the existing authentications; b) CR = AND(ER'1, ER'2) OR ER'3, i.e., either both of the first two or only the third one; c) CR = AND(ER'1, ER'2, ER'3), i.e., all existing ISs' authentications. Op C is the OR composition of the three logical compositions a), b), c) above.
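As a purely illustrative sketch (not part of the AspeCIS definition), the composite requirements a), b) and c) above can be read as boolean compositions over the authentications offered by the existing ISs; the encoding below is an assumption of ours.

    # Illustrative only: evaluate the composite requirements a), b), c) as boolean
    # compositions of the existing authentication requirements ER'1, ER'2, ER'3.
    def er1(user): return user.get("name") is not None        # authentication by name
    def er2(user): return user.get("email") is not None       # authentication by e-mail
    def er3(user): return user.get("pseudonym") is not None   # authentication by pseudonym

    def cr_a(user):  # OR(ER'1, ER'2, ER'3)
        return er1(user) or er2(user) or er3(user)

    def cr_b(user):  # AND(ER'1, ER'2) OR ER'3
        return (er1(user) and er2(user)) or er3(user)

    def cr_c(user):  # AND(ER'1, ER'2, ER'3)
        return er1(user) and er2(user) and er3(user)

    user = {"name": "alice", "email": None, "pseudonym": "al"}
    print(cr_a(user), cr_b(user), cr_c(user))  # True True False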
5 Conclusion and Future Work
Several reasons, including economic factors, lead organizations to interconnect their information systems in order to ensure a common global service. Consequently, they build a collaborative information system. In this context, we have presented in this paper an approach called AspeCIS (Aspect-oriented Approach to develop Cooperative Information Systems), which reuses artifacts previously developed for existing information systems. AspeCIS is based upon three phases. The first phase intends to elicit, analyze and formulate Cooperative Requirements (CRs), starting from Existing Requirements (ERs) and Additional Requirements (ARs). The second phase aims to develop the CRs models. The objective of the third phase is to generate code from the CRs models. AspeCIS defines operators corresponding to a set of behaviors to be woven on a set of ERs and ARs to build the CRs. The role of the unary operators is to make appropriate changes to the ERs, which are then used for the definition of the CRs conjointly with a set of Additional Requirements (ARs). The composite operators are used to compose ERs and ARs. In the future, we want to develop the second phase of AspeCIS. In line with our definition of this kind of CIS, we intend to propose a weaving metamodel dedicated to its development.
References
1. Araújo, J., Whittle, J., Kim, D.: Modeling and Composing Scenario-Based Requirements with Aspects. In: Proceedings of the 12th IEEE International Requirements Engineering Conference, pp. 58–67. IEEE Computer Society, Washington, DC (2004), ISBN 0-7695-2174-6
2. Baniassad, E., Clarke, S.: Finding Aspects in Requirements with Theme/Doc. In: Workshop on Aspect-Oriented Requirements Engineering and Architecture Design, at the 3rd International Conference on Aspect-Oriented Software Development, AOSD 2004 (2004)
3. Evermann, J.: A Meta-Level Specification and Profile for AspectJ in UML. In: 6th International Conference on Aspect-Oriented Software Development, AOSD (2007)
4. Fortune, J., Valerdi, R., Boehm, B.W., Settles, F.S.: Estimating Systems Engineering Reuse. In: 7th Annual Conference on Systems Engineering Research, CSER (2009)
5. Frakes, W.B., Isoda, S.: Success Factors of Systematic Reuse. IEEE Software (1994)
6. Jacobson, I., Christerson, M., Jonsson, P., Övergaard, G.: Object-Oriented Software Engineering: A Use Case Driven Approach, 4th edn. Addison-Wesley, Reading (1992)
7. Jacobson, I., Ng, P.W.: Aspect-Oriented Software Development with Use Cases. Addison-Wesley Professional, Reading (2005)
8. Jureta, I.J., Borgida, A., Ernst, N.A., Mylopoulos, J.: Techne: Towards a New Generation of Requirements Modeling Languages with Goals, Preferences, and Inconsistency Handling. In: International Conference on Requirements Engineering, RE (2010)
9. Muhammad, N., Muhammad Khalid, A., Khalid, R., Hafiz Farooq, A.: Representing Shared Join Points with State Charts: A High Level Design Approach. World Academy of Science, Engineering and Technology (2006)
10. Nuseibeh, B., Easterbrook, S.: Requirements Engineering: A Roadmap. In: 22nd International Conference on Software Engineering (ICSE 2000), pp. 35–46. ACM Press, Limerick (2000)
11. Poulin, J.S.: Populating Software Repositories: Incentives and Domain-Specific Software. Journal of Systems and Software 30(3) (1995)
12. Kiczales, G., Lamping, J., Mendhekar, A., Maeda, C., Videira Lopes, C., Loingtier, J.M., Irwin, J.: Aspect-Oriented Programming. In: Aksit, M., Auletta, V. (eds.) ECOOP 1997. LNCS, vol. 1241, pp. 220–242. Springer, Heidelberg (1997)
13. Moreira, A., Rashid, A., Araújo, J.: Multi-dimensional Separation of Concerns in Requirements Engineering. In: 13th IEEE International Requirements Engineering Conference (RE 2005), pp. 285–296. IEEE Computer Society, Paris (2005)
14. Rashid, A., Moreira, A., Araújo, J.: Modularisation and Composition of Aspectual Requirements. In: 2nd International Conference on Aspect-Oriented Software Development, AOSD (2003)
15. Sousa Brito, I.S.: Aspect-Oriented Requirements Analysis. PhD thesis, University of Lisboa (2008)
16. Wartik, S., Prieto-Díaz, R.: Criteria for Comparing Domain Analysis Approaches. In: Fourth Annual Workshop on Software Reuse (1991)
17. Whittle, J., Araújo, J.: Scenario Modeling with Aspects. IEE Proceedings – Software 151(4), 157–172 (2004)
An Application of Locally Linear Model Tree Algorithm for Predictive Accuracy of Credit Scoring Mohammad Siami1, Mohammad Reza Gholamian1, Javad Basiri2, and Mohammad Fathian1 1 Industrial Engineering Department, Iran University of Science and Technology [email protected], {gholamian,fathian}@iust.ac.ir 2 Department of computer, Majlesi Branch, Islamic Azad University [email protected]
Abstract. The economic crises of recent years have led banks to pay more attention to credit risk assessment. Financial institutes have used various kinds of decision support systems to reduce their credit risk. Credit scoring is one of the most important systems used by banks and financial institutes. In this paper, an application of the locally linear model tree (LOLIMOT) algorithm is investigated to improve the predictive accuracy of credit scoring. Using the Australian credit data from the UCI machine learning database repository, the algorithm showed an increase in predictive accuracy in comparison with some other well-known methods in the credit scoring area. The experiments also indicate that LOLIMOT obtains the best results in terms of average accuracy and Type I and II errors. Keywords: data mining, credit scoring, locally linear model tree, classification, finance and banking.
1 Introduction

Customer credit prediction is one of the most important topics in the financial industry. Many credit scoring models have been developed for customer credit evaluation, and the credit industry has widely used them to predict whether an applicant is a good or a bad customer. Using credit scoring, good customers can be distinguished from bad ones by considering attributes such as age, marital status, income and so on [1,2]. Many recent studies have highlighted the importance of credit scoring [2,3]. Also, using an accurate classifier for credit scoring is of major importance, since even a 1% gain in the accuracy of predicting the bad or good status strongly affects the system performance and responsiveness [2,4]. Generally speaking, credit scoring is a binary classification problem which divides loan applicants into two groups: applicants with good credit and applicants with bad credit [1]. Many algorithms, such as artificial neural networks [5,6], decision trees [7,8], genetic programming [9] and support vector machines [1,2,7 and 10], have been used for better credit prediction, but the problem nevertheless remains a hot topic in financial research.
“For good classifiers, superior accuracy may be one of the most important performance measures” [3]. In recent years, many methods have been proposed to improve the accuracy of credit scoring models, and a credit scoring model with a high ability to distinguish good and bad customers has high value for banks and financial institutes [2,3 and 11]. In this paper, we introduce an application of the LOLIMOT (Locally Linear Model Tree) algorithm to the credit scoring literature in order to increase the accuracy of the credit scoring task. To the best of our knowledge, the credit scoring literature does not contain any reference (yet) to the LOLIMOT algorithm. LOLIMOT is an incremental tree-based learning algorithm that starts with an optimal linear least-squares estimation and adds nonlinear neurons only if they enhance performance. Thus, the learning algorithm constructs the model automatically to obtain the highest generalization, and the need for linearity tests is eliminated [12]. This procedure combines tree models and neural networks to make the best use of their advantages; in this way the shortcomings of both approaches are avoided [12]. To evaluate the performance of the credit scoring model with the LOLIMOT algorithm, we apply it to the Australian credit scoring data. The proposed method can also be applied to many other datasets. Besides the introduction, the paper is organized as follows. Credit scoring and related works are reviewed in Section 2; in Section 3 the locally linear neuro-fuzzy model with the locally linear model tree learning algorithm is discussed. Experimental results on a benchmark dataset are presented in Section 4. Finally, our conclusions and further research work are discussed in Section 5.
2 Customer Credit Scoring

Credit scoring is a data mining method that has been used for forecasting the financial risk of customer lending. Many definitions have been proposed for credit scoring, but most of them state that credit scoring is a classification method which classifies customers into two main categories: customers with good credit and customers with bad credit [2,3 and 13]. Applying credit scoring to select good applicants has many benefits, among them [2]:
• evaluation of customer risk
• decreasing the cost of credit assessment
• making decisions on loan applications easily and quickly
Credit scoring is obtained from the customer experiences. Credit scoring analysts explore customer credit data from five perspectives. These five perspectives are known as the 5Cs and include: the Character of the consumer, the Capital, the Collateral, the Capacity and the economic Conditions (Fig. 1) [13].
Fig. 1. 5Cs Framework for credit scoring (Character, Capital, Capacity, Collateral and Conditions surrounding customer credit scoring)
Because the amount of customer credit data grows with the increasing number of loan requests, carrying out customer credit scoring based on this framework manually is impossible. Therefore, many methods and algorithms have been presented in this field to predict customer credit [2]. According to previous studies, credit scoring methods are classified into two groups, i.e., statistical and Artificial Intelligence (AI) methods [2,14]. Several statistical methods have been widely applied to build credit scoring models. These models require assumptions, such as multivariate normality of the independent variables, that are frequently violated in the practice of credit scoring, which makes the models invalid for finite samples. Linear Discriminant Analysis (LDA) and Logistic Regression Analysis (LRA) are the most famous statistical methods proposed in the credit scoring literature [13]. AI methods such as Decision Trees (DT) [3,8], Artificial Neural Networks (ANNs) [5,6], Support Vector Machines (SVM) [1,2,7 and 10] and Genetic Programming (GP) [9] have been applied to improve on or solve the defects of the statistical methods. There are some research gaps in the credit scoring literature, including:
• the low accuracy of statistical techniques [2,13]
• the large amount of customer credit data with an unbalanced structure [2]
• the difficulty of choosing the best input features and kernel parameters for the SVM method [10]
In order to improve credit scoring prediction, we propose a credit scoring model using an efficient classifier that is introduced in the next section.
3 Locally Linear Neurofuzzy Model with Model Tree Learning

In this study, the locally linear model tree classification technique is brought to the attention of banking researchers; to the best of our knowledge, the credit scoring literature does not yet contain any reference to this method. In this section, the technique is introduced and its details are discussed. The basic strategy of the locally linear neuro-fuzzy (LLNF) model is to divide the input space into small linear subspaces with fuzzy validity functions [15,16]. Generally speaking, the model can be considered as a neuro-fuzzy network with one hidden layer and a linear neuron in the output layer which computes the weighted sum of the outputs of the locally linear models. Fig. 2 shows the network structure.
Fig. 2. Network structure of a static local linear neuro-fuzzy model with M neurons for p inputs [18]
Each linear component is a fuzzy neuron with a validity function. Equation (1) denotes the model input, and (2) and (3) give the output of the network [12]:

    u = [u_1, u_2, ..., u_p]^T                                                    (1)

    ŷ_i = ω_i0 + ω_i1 u_1 + ω_i2 u_2 + ... + ω_ip u_p                             (2)

    ŷ = Σ_{i=1..M} ŷ_i Φ_i(u)                                                     (3)

where M is the number of neurons, p denotes the number of input dimensions and ω_ij are the linear estimation parameters of the i-th neuron. The validity functions are selected as normalized Gaussians according to (4) and (5) [16]:

    Φ_i(u) = μ_i(u) / Σ_{j=1..M} μ_j(u)                                           (4)

    μ_i(u) = exp( -1/2 [ (u_1 - c_i1)^2 / σ_i1^2 + ... + (u_p - c_ip)^2 / σ_ip^2 ] )   (5)

where σ_ij is the standard deviation and c_ij the center of the Gaussian validity functions. Two types of parameters are adjusted by learning techniques, namely the rule consequent parameters of the locally linear models (the ω_ij) and the rule premise parameters of the validity functions (the c_ij and σ_ij). Global optimization of the ω_ij is performed by the least-squares method over the global parameter vector given in (6) [16]:

    ω = [ω_10, ω_11, ..., ω_1p, ω_20, ..., ω_Mp]^T                                (6)

There are M × (p + 1) elements in this global parameter vector. Equations (7) and (8) define the regression matrix X for N data samples. By solving the weighted least-squares problem given by (9) and (10), the rule consequent parameters are obtained as indicated in (11) [12]:

    X = [X_1, X_2, ..., X_M]                                                      (7)

    X_i = [ 1  u_1(1) ... u_p(1) ;  1  u_1(2) ... u_p(2) ;  ... ;  1  u_1(N) ... u_p(N) ]   (8)

    ŷ_i = X_i ω_i ,   with ω_i = [ω_i0, ω_i1, ..., ω_ip]^T                        (9)

    Q_i = diag( Φ_i(u(1)), Φ_i(u(2)), ..., Φ_i(u(N)) )                            (10)

    ω̂_i = (X_i^T Q_i X_i)^{-1} X_i^T Q_i y                                       (11)
On the other hand, an incremental tree-based learning algorithm, namely the locally linear model tree (LOLIMOT), is appropriate for adjusting the rule premise parameters. LOLIMOT has four iterative stages. The initial model starts with a single locally linear neuron, and an optimal linear least-squares estimation over the whole input space is obtained; in this stage M = 1 and Φ_1(u) = 1. Next, by computing a local loss function such as the MSE for each of the i = 1, ..., M neurons, the worst performing neuron is found. Third, the worst performing neuron is considered for further refinement and all possible divisions are checked: the hypercube of this neuron is divided into two halves with an axis-orthogonal split. Divisions in all dimensions are tried, and for each of the p divisions the following stages are repeated [17]:
1. Construction of the multi-dimensional membership functions for both produced hypercubes.
2. Building all fuzzy validity functions.
3. Estimation of the rule consequent parameters for the newly generated locally linear models (LLMs). The standard deviations are usually set to 0.7 and the centers are the centers of the new divisions.
4. Computing the loss function for the current overall model.

In the fourth stage, the best of the p alternatives of the previous phase is chosen. If it reduces the loss function or the error indices on the validation and training data sets, the related validity functions and neurons are updated, the number of neurons is increased (i.e., M = M + 1) and the algorithm goes back to Step 2; otherwise the learning algorithm terminates. This automatic learning technique leads to the best linear or nonlinear model with maximum generalization [18]. It is worth mentioning that, in comparison with other methods, the computational complexity of the LOLIMOT algorithm increases only linearly with the number of neurons. The clear and intuitive construction of LOLIMOT makes it well suited for computing and adjusting the rule premise parameters [16].
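As a rough illustration only (not the authors' implementation), the following one-dimensional Python sketch mimics the LOLIMOT loop described above: each neuron owns an interval of the input axis, validity functions are normalized Gaussians, rule consequents are fitted by weighted least squares as in (11), and the worst interval is split in half as long as the global error decreases. The 0.7 factor for the standard deviations follows the description above; everything else (interval encoding, stopping limit, toy data) is an assumption of ours.

    import numpy as np

    def validities(x, intervals):
        # Normalized Gaussian validity functions, Eqs. (4)-(5); sigma = 0.7 * interval width
        centers = np.array([(lo + hi) / 2.0 for lo, hi in intervals])
        sigmas = np.array([0.7 * (hi - lo) for lo, hi in intervals])
        mu = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / sigmas[None, :]) ** 2)
        return mu / mu.sum(axis=1, keepdims=True)

    def fit_consequents(x, y, phi):
        # Weighted least squares per locally linear model, Eqs. (8)-(11)
        X = np.column_stack([np.ones_like(x), x])
        params = []
        for i in range(phi.shape[1]):
            XtQ = X.T * phi[:, i][None, :]                 # X_i^T Q_i
            params.append(np.linalg.lstsq(XtQ @ X, XtQ @ y, rcond=None)[0])
        return np.array(params)                            # one (w0, w1) row per LLM

    def predict(x, intervals, params):
        phi = validities(x, intervals)
        local = params[:, 0][None, :] + params[:, 1][None, :] * x[:, None]
        return (phi * local).sum(axis=1)                    # Eq. (3)

    def lolimot(x, y, max_neurons=8):
        intervals = [(x.min(), x.max())]                    # start with a single LLM (M = 1)
        params = fit_consequents(x, y, validities(x, intervals))
        best_err = np.mean((y - predict(x, intervals, params)) ** 2)
        while len(intervals) < max_neurons:
            phi = validities(x, intervals)
            residual2 = (y - predict(x, intervals, params)) ** 2
            worst = int(np.argmax((phi * residual2[:, None]).sum(axis=0)))  # worst LLM
            lo, hi = intervals[worst]
            mid = (lo + hi) / 2.0                           # axis-orthogonal split into halves
            cand = intervals[:worst] + [(lo, mid), (mid, hi)] + intervals[worst + 1:]
            cand_params = fit_consequents(x, y, validities(x, cand))
            err = np.mean((y - predict(x, cand, cand_params)) ** 2)
            if err >= best_err:                             # no improvement: terminate
                break
            intervals, params, best_err = cand, cand_params, err
        return intervals, params

    x = np.linspace(0.0, 1.0, 200)
    y = np.sin(2 * np.pi * x) + 0.05 * np.random.randn(200)
    intervals, params = lolimot(x, y)
    print(len(intervals), "locally linear models")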
4 Results and Analysis

In the following, the dataset used, our validation method and the results of the experiments are described.

4.1 Real World Data Set

The Australian credit dataset from the UCI repository of machine learning databases (http://www.niaad.liacc.up.pt/statlog/datasets.html) has been used to evaluate our classifier on customer credit data. This dataset consists of 690 samples, with 307 good applicants and 383 bad ones. Each sample contains 15 features, including 6 nominal and 8 numeric features; the 15th feature of each sample is the class label, which specifies whether the customer has good or bad credit. Table 1 summarizes the Australian dataset.

Table 1. Real world dataset from UCI repository

    Name        Classes   Number of Instances   Nominal features   Numeric features   Total features
    Australian  2         690                   6                  8                  14
4.2 Validation Method

In this paper we have used standard measures for evaluating the LOLIMOT classifier. These measures are Accuracy and the Type I and Type II errors, as given in equations (12), (13) and (14) [2]:

    Accuracy = (TP + TN) / (TP + FN + FP + TN)    (12)

    Type I Error = FN / (FN + TP)                 (13)

    Type II Error = FP / (FP + TN)                (14)
For calculating these measures we need the numbers of True Positives, True Negatives, False Positives and False Negatives. These values can be obtained from the confusion matrix illustrated in Table 2 [2].

Table 2. Credit scoring confusion matrix

                                      Actual
    Predicted          Good Customer                           Bad Customer
    Good Customer      True Positive (TP)                      False Positive (FP) (Credit Risk)
    Bad Customer       False Negative (FN) (Commercial Risk)   True Negative (TN)
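A small sketch (ours, not from the paper) of how the three measures in Eqs. (12)-(14) follow from the confusion matrix entries; the example counts are hypothetical.

    # Illustrative computation of Accuracy and the Type I / Type II errors.
    def credit_scoring_metrics(tp, fp, fn, tn):
        accuracy = (tp + tn) / (tp + fn + fp + tn)
        type_i = fn / (fn + tp)   # good customers wrongly predicted as bad (commercial risk)
        type_ii = fp / (fp + tn)  # bad customers wrongly predicted as good (credit risk)
        return accuracy, type_i, type_ii

    print(credit_scoring_metrics(tp=100, fp=20, fn=15, tn=80))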
We applied the ten-fold cross-validation method to the Australian credit dataset. The performance of the LOLIMOT classifier has been evaluated by comparing it with some well-known classifiers from the credit scoring literature, including Decision Tree (DT) [2,8], Support Vector Machine (SVM) [2,10], Linear Discriminant Analysis (LDA) and Artificial Neural Networks (ANNs) [5,6]. As mentioned before, the comparative measures are Accuracy and the Type I and Type II errors.

4.3 Numerical Results

In this study, all experiments were performed on a PC with a 2 GHz Intel Core 2 Duo CPU and 2 GB RAM, running the Windows XP Professional operating system. The results of LOLIMOT were compared with some basic learners such as DT, SVM and ANNs in [2]. In our experiments, as in [2], the Australian credit dataset was randomly divided into an 80% training set and a 20% testing set. First, LOLIMOT was compared with the other classifiers (ANN, SVM, LDA, DT) based on average accuracy (Fig. 3). As observed, LOLIMOT has the best accuracy among the classifiers, with an average accuracy of 87.54%. LOLIMOT achieves an increase of 3.15% over the decision tree, 0.98% over linear discriminant analysis, 1.87% over the support vector machine and 4.26% over the artificial neural networks. Fig. 4 shows the comparison of the LOLIMOT algorithm with the four other classifiers based on the Type I and Type II errors. In Type I error, LOLIMOT is in third position on this dataset. SVM and LDA have the best Type I error values, but the poor performance of these classifiers against LOLIMOT in average accuracy and Type II error makes them
unattractive for credit scoring. The other methods, decision tree and artificial neural networks, occupy the lower positions. As observed, LOLIMOT has the lowest Type II error. In credit scoring, the Type II error is very important for decision making: this measure gives the rate of bad customers that have been predicted as good customers, and this type of misclassification increases the credit risk of financial institutes. According to this measure, the decision tree is placed second, and linear discriminant analysis, artificial neural networks and support vector machine have the worst Type II error rates, respectively. As a result, based on our experiments it can be concluded that the LOLIMOT algorithm shows the best performance in comparison with the other applied methods. The maximum generalization capability of the LOLIMOT classifier has increased its performance significantly. Also, the iterative process of this method for branching the neurons makes it superior to simple neural networks. On the other hand, the locally linear neurons and the fuzzy validity functions increase the performance of the classification tree structure. The high performance of the LOLIMOT classifier shows that the tree characteristic of this method and the locally linear neurons complement each other in the customer credit scoring problem.

Fig. 3. The Mean Accuracy (LOLIMOT 87.54%, LDA [2] 86.56%, SVM [2] 85.67%, DT [2] 84.39%, ANNs [2] 83.28%)
Fig. 4. Type I and II Error (Type I: LOLIMOT 14.98%, LDA [2] 12.68%, SVM [2] 7.20%, DT [2] 18.00%, ANNs [2] 19.27%; Type II: LOLIMOT 10.44%, LDA [2] 14.05%, SVM [2] 20.04%, DT [2] 13.70%, ANNs [2] 14.68%)
5 Conclusions

In recent years, credit risk assessment has received more attention than before in financial discussions and has become one of the most important topics in the field of financial risk management. In this paper, we have applied a locally linear model tree learning classifier, called LOLIMOT, in order to improve the accuracy and reduce the misclassification errors of the credit scoring classification task. Since this classifier is easy to use and all of its parameters are adjusted automatically, it is a practical technique for banking practitioners and academics. The LOLIMOT algorithm was tested on the Australian credit dataset from the UCI Repository of machine learning databases. The comparison with some other well-known classification methods gives enough confidence in the superiority of LOLIMOT over other major classifiers. The experiments also show that the proposed method obtains the best results in terms of average accuracy and Type I and Type II errors. The results indicate that the LOLIMOT algorithm is an effective way to solve binary classification problems such as credit scoring. Some future research directions also emerge. First, larger datasets, with more exploration of credit scoring data structures, can be experimented with to further evaluate the LOLIMOT classifier. Another recommendation to improve the accuracy of credit scoring would be to ensemble LOLIMOT with other powerful classifiers using data fusion methods.

Acknowledgments. We thank the Iran Telecommunication Research Center for financial support. We wish to thank the developers of Weka. We also express our gratitude to the donors of the different datasets and the maintainers of the UCI Repository.
References [1] Chen, F., Li, F.: Combination of Feature Selection Approaches with SVM in Credit Scoring. Expert Systems with Applications 37, 4902–4909 (2010) [2] Wang, G., Hao, J., Ma, J., Jiang, H.: A Comparative Assessment of Ensemble Learning for Credit Scoring. Expert System with Applications 38, 223–230 (2011) [3] Zhang, D., Zhou, X., et al.: Vertical Bagging Decision Trees Model for Credit Scoring. Expert System with Applications 37, 7838–7843 (2010) [4] Chen, W., Ma, C., Ma, L.: Mining the customer credit using hybrid support vector machine technique. Expert Systems with Applications 36, 7611–7616 (2009) [5] Hsieh, N.C.: Hybrid Mining Approach in the Design of Credit Scoring Models. Expert System with Applications 28, 655–665 (2005) [6] Tsai, C.F., Wu, J.W.: Using Neural Network Ensembles for Bankruptcy Prediction and Credit Scoring. Expert Systems with Applications 34, 2639–2649 (2008) [7] TunLi, S., Shiue, W., Huang, M.H.: The Evaluation of Consumer Loans Using Support Vector Machines. Expert System with Application 30, 772–782 (2006) [8] Leea, T., Chiub, C.C., Chouc, Y.C., Lud, C.J.: Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Expert System with Applications 50, 1113–1130 (2006)
[9] On, C.S., Huang, J.J., Tzeng, G.H.: Building credit scoring model using genetic programming. Expert System with Application 29, 41–47 (2005) [10] Huang, C.L., Chen, M.C., Wang, C.J.: Credit Scoring with Data Mining Approach Based on Support Vector Machine. Expert System with Application 37, 847–856 (2007) [11] XiuXuan, X., Chunguang, Z., Zhe, W.: Credit Scoring Algorithm Based on Link Analysis Ranking with Support Vector Machine. Expert System with Application 36, 2625–2632 (2009) [12] Ghorbani, A., Taghiyareh, F., Lucas, C.: The Application of the Locally Linear Model Tree on Customer Churn Prediction. In: SoCPaR, pp. 472–477 (2009) [13] Thomas, L.C.: A Survey of Credit and Behavioral Scoring: Forecasting Financial Risk of Lending to Consumers. International Journal of Forecasting 16, 149–172 (2000) [14] Nanni, L., Lumini, A.: An Experimental Comparison of Ensemble of Classifiers for Bankruptcy Prediction and Credit scoring. Expert System with Applications 36, 3028– 3033 (2009) [15] Gholipour, A., et al.: Solar activity forecast: Spectral analysis and neurofuzzy prediction. Journal of Atmospheric and Solar-Terrestrial Physics 67(6), 595–603 (2005) [16] Nelles, O.: Nonlinear system identification: from classical approaches to neural networks and fuzzy models. Springer, Heidelberg (2001) [17] Sharifie, J., Lucas, C., Araabi, B.N.: Locally linear neurofuzzymodeling and prediction of geomagnetic disturbances based on solar wind conditions. Space Weather (April 2006) [18] Pedram, A., Jamali, M., Pedram, T., et al.: Local Linear Model Tree (LOLIMOT) Reconfigurable Parallel Hardware. International Journal of Applied Science, Engineering and Technology 1, 1 (2005)
Predicting Evasion Candidates in Higher Education Institutions Remis Balaniuk1, Hercules Antonio do Prado1,2, Renato da Veiga Guadagnin1, Edilson Ferneda1, and Paulo Roberto Cobbe1,3 1 Graduate Program on Knowledge and IT Management, Catholic University of Brasilia, SGAN 916 Avenida W5, 70790-160, Brasília, DF, Brazil 2 Embrapa - Management and Strategy Secretariat, Parque Estação Biológica - PqEB s/n°, 70770-90, Brasília, DF, Brazil 3 Information Technology Department, UniCEUB College, SEPN 707/907 Campus do UniCEUB - Bloco 1, 70790-075, Brasília, DF, Brazil {remis,hercules}@ucb.br, {renatov,eferneda}@pos.ucb.br, [email protected]
Abstract. Since the nineties, Data Mining (DM) has shown itself to be a privileged partner in business by providing organizations with a rich set of tools to extract novel and useful knowledge from databases. In this paper, a DM application in the highly competitive market of educational services is presented. A model was built by combining a set of classifiers into a committee machine to predict the likelihood that a student who completed his/her second term will remain in the institution until graduation. The model was applied to undergraduate student records in a higher education institution in Brasília, the capital of Brazil, and was shown to predict evasion with high accuracy. The unbiased selection of students with elevated evasion risk affords the institution the opportunity to devise mitigation strategies and preempt a decision by the student to evade. Keywords: Knowledge Discovery in Databases, Data Mining, Committee Machines, Higher Education Institutions, Student Retention.
1 Introduction

Information systems permeate every facet of modern business and produce massive amounts of data that are stored in increasingly larger databases. According to some estimations, the volume of data in these databases doubles in size every 20 months [1]. This growing mass of data holds many layers of information, including patterns or relations that are difficult to identify through simple analysis. Techniques like data mining make it possible for organizations to wade through masses of data and identify these relationships, producing meaning, and ultimately value, from previously unintelligible data. These techniques have been successfully used by many modern businesses in competitive markets, such as credit card companies, investment firms, and retailers, to produce advantage.
The modern economy strongly depends on a skilled labor force. Seeking to take advantage of this reality, a growing number of colleges and universities are being created. In Brazil, in particular, there has been tremendous growth in the number of higher education institutions (HEIs) in the last decade, resulting in steadily increasing competitive pressure within the education services market. Studies, in fact, indicate that this trend should continue in the foreseeable future. To thrive in this market it is essential that higher education service providers seek to gain competitive advantage. In this context, data mining has a role to play, by giving institutions the ability to predict possible future student intent and allowing them the opportunity to devise appropriate mitigation strategies. Gaioso [9] shows that there are several reasons for students to abandon their undergraduate studies, among them financial strain, lack of vocational orientation, deficiencies in basic education and schedule conflicts. These reasons, for the most part, remain undetected by HEIs until the moment a student initiates a transfer request, a leave of absence request, or drops out. Institutions that are able to identify students with high evasion risk, and manage to successfully overcome student grievances early on, may establish an environment of cooperation between the school and the student and foster those factors that promote student loyalty, and by so doing may gain significant advantage over competing HEIs. In addition, by fulfilling its business objectives the education service provider also fulfills its social charter, increasing its graduation rates and contributing, ultimately, to the progress of society as a whole. This article presents the results of a data mining experiment that demonstrates a method for early detection of students with elevated evasion risk. It uses data mining tools to analyze historic student records and predict whether students who completed their second term of undergraduate studies will go on to graduate or will abandon their studies prematurely. In the following sections, this paper provides a brief background of the Brazilian higher education services market, describes the methodology used for the selection of attributes, data extraction, preparation and processing, and presents the experiment results.
2 The Brazilian Higher Education Market

After years of limited growth due to sparse public investment and restrictive regulations that dampened the interest of private education service providers, changes in the so-called Law of Bases and Directives, signed in 1996, sparked renewed interest in this sector. After these changes, the number of Higher Education Institutions (HEIs) grew from 900 in 1997 to 2,252 in 2008, a two-and-a-half-fold increase. During this period, the number of undergraduate-level students enrolled in HEIs grew by a similar ratio, from 1.9 million in 1997 to just over 5 million in 2008. Of these 5 million students, approximately three quarters were enrolled in private higher education institutions [2],[3]. In spite of the growth in the undergraduate student population, Brazil still lags behind other Latin American countries in college enrollment rates. Data from
UNESCO [4] shows that, in 2007, 30% of Brazilian college-age students were enrolled in an HEI, compared to 52% and 68% of the college-age population in Chile and Argentina, respectively. Also, the sustained growth of emerging economies such as Brazil, with GDP growth for 2010 estimated at 7.5% [5], points to an increasing demand for a skilled workforce. Consequently, an education market that is already significant, worth roughly US$ 8.2 billion in 2005 [6] at current exchange rates [7], has significant growth potential, suggesting increasing competition among higher education institutions. Competition for students is already fierce, as evidenced by open vacancy ratios of around 50% [3] in Brazilian higher education institutions. In order to remain viable, HEIs are forced to engage in large and very expensive recruiting campaigns that even include mass media advertisements. But these efforts have no effect on retaining students, who abandon their studies before graduation at an estimated rate of 44.7% [3]. The Brazilian National Institute for Educational Research Anísio Teixeira (INEP) does not make clear what percentage of these students transfer to other HEIs or drop out altogether. It is well known, however, that college transfers are common practice among students of private colleges and universities. With excess capacity and high rates of evasion, it becomes critical for HEIs to maximize the retention of enrolled students. Bergamo et al. [8] indicate that student retention is a key issue for the survival of private colleges and universities, and that student loyalty is directly linked to factors such as trust, emotional commitment, satisfaction, expectations management and school reputation. HEIs that foster student loyalty gain competitive advantage through a solid revenue stream, increased reputation and reduced marketing and recruiting budgets.
3 Methodology

3.1 Boosting Prediction Accuracy

Predictions produced by machine learning algorithms vary in performance due to many factors, such as the nature of the data being analyzed, the size of the dataset, the number of patterns contained in the dataset, etc. Some algorithms may misclassify items that other algorithms classify correctly. No algorithm is perfect for every situation, each having strengths and weaknesses [10]. A committee machine is a method of combining the prediction results from various learning algorithms, leveraging the strength of each algorithm in order to achieve a combined result superior to what could be reached by any single learning algorithm alone [11].

3.2 The Experiment

The main goal of the experiment was to identify students with high evasion risk in order to provide college Deans and Bursar officials with the means to locate and interact with these students, and develop individual mitigation strategies that could preempt a decision to drop out.
This prediction experiment using machine learning algorithms was conducted with data gathered from the student records of 11,495 undergraduate students of a top-tier private higher education institution in Brasília, the capital of Brazil. Data was collected based on factors identified by Gaioso [9] as being important reasons behind student evasion. A number of student attributes were studied, extracted and transformed, resulting in a database that combined the socio-economic and academic information for each student. The attributes examined were the following: (i) age group, (ii) gender, (iii) neighborhood of residence, (iv) work status, (v) type of high school attended, (vi) family income, (vii) overall grade point average (GPA) for all classes attended, (viii) GPA in second semester classes, (ix) overall class attendance average, (x) attendance average in second semester classes, and (xi) number of failed classes in the two initial semesters. CRISP-DM [12] was the method adopted as a guide to drive the data mining application. To conduct the evasion prediction experiment, the WEKA (Waikato Environment for Knowledge Analysis) workbench software for machine learning [13] was selected. This software is open source and contains a collection of machine learning algorithms suitable for data mining projects. After constructing prediction models using regression, decision tree and neural network algorithms, the results were combined in a committee machine, which produced a report identifying each student and the probability that he or she would graduate normally or evade.

3.3 Data Preparation

This phase strongly relies on the expertise of the HEI managers. To help diagnose economic strain, family income and neighborhood of residence were selected as indicators of economic health. Anecdotal evidence shows that students occasionally misrepresent their family income for fear of having their enrollment rejected by the institution, so the neighborhood of residence is a particularly relevant indicator of family income. In general, family income drops the farther from downtown Brasília a person resides, with residents of the outer suburbs having lower incomes and residents of the South and North Lake neighborhoods having higher incomes on average. Neighborhoods were divided into six large regions: North/South Lake, Brasília-Proper (Plano Piloto), Near-Suburbs, Far-Suburbs, Outer Suburbs and Other States. To identify possible scheduling conflicts, the age group, gender and work status of the student were selected. It is expected that younger students, or those who do not work, are less likely to experience sustained scheduling conflicts. On the other hand, for female students, pregnancy and child rearing could lead to scheduling conflicts and, consequently, evasion. The type of high school attended (public, private or military academy) was selected as an indicator that may point to basic education deficiencies. The Brazilian public school system is known for its structural problems and overall poor performance when compared with private or military schools.
Grade and attendance records, along with the number of failed classes, were selected as measures of current academic success. Since the HEI studied grades students using a subjective grading scheme, the GPA had to be calculated using a grade conversion convention, adapted from that used by North American HEIs. The grading scheme used by the institution was converted to numerical values: the passing grades SS, MS and MM were converted to 4.0, 3.0 and 2.0, respectively, while the failing grades MI and SR were converted to 1.0 and 0, respectively. The GPA calculation does not take into consideration canceled or withdrawn classes. The weighted mathematical average of the converted grades was calculated using the number of credits for each class and transformed into a grade scale of A, B, C and D, with A equivalent to a GPA greater than or equal to 3.4, B representing a GPA between 2.7 and 3.4, C between 2 and 2.7, and D less than 2 points. The GPA was calculated for second-semester classes and as the overall cumulative GPA. Class attendance was also calculated using a similar scheme, in which the weighted average of absences was calculated for the second-semester classes and for the overall attendance average, using the number of class credits as weights. The average number of absences was also converted to a scale of Low, Medium and High, corresponding to an average of less than 8, between 8 and 16, and 16 or more absences, respectively. Attendance records of withdrawn or canceled classes were also not considered. The final measure of academic performance is a count of the number of failed classes in the first and second semesters. Three groups were created for this indicator: no failed classes, up to 2 failed classes, and 3 or more. Student records were divided into two groups. The first group, with 3,058 records, was used for training and testing the model, and included records of students who enrolled in 2005 and 2006, who completed the socioeconomic enrollment survey and graduated (within the average graduation time of 5 years for the institution) or dropped out prematurely. The second group, with the remaining 8,437 records, contained students regularly enrolled in classes in the second semester of 2010 who a) would not graduate at the end of the semester, b) were at minimum in their second year, and c) had completed their socioeconomic enrollment survey. These records were used to generate predictions.
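The conversion rules above can be summarized in a short sketch (our illustration; the grade letters, thresholds and credit weighting follow the description above, while the function names and example records are assumed):

    # Illustrative conversion of the institution's subjective grades into a GPA and
    # letter scale, and of average absences into the Low/Medium/High attendance scale.
    GRADE_POINTS = {"SS": 4.0, "MS": 3.0, "MM": 2.0, "MI": 1.0, "SR": 0.0}

    def gpa(records):
        # records: list of (grade, credits) pairs; withdrawn/canceled classes excluded
        total_credits = sum(c for _, c in records)
        value = sum(GRADE_POINTS[g] * c for g, c in records) / total_credits
        if value >= 3.4:
            letter = "A"
        elif value >= 2.7:
            letter = "B"
        elif value >= 2.0:
            letter = "C"
        else:
            letter = "D"
        return value, letter

    def attendance_level(avg_absences):
        if avg_absences < 8:
            return "Low"
        if avg_absences < 16:
            return "Medium"
        return "High"

    print(gpa([("SS", 4), ("MM", 2), ("MI", 4)]))   # (2.4, 'C')
    print(attendance_level(10))                      # 'Medium'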
4 Results

The training data was processed through WEKA using four classification algorithms: ZeroR, a simple baseline classifier; Averaged One-Dependence Estimators (AODE), a Bayesian probabilistic classifier; J48, an implementation of the C4.5 decision tree; and Multilayer Perceptron, a neural network. The ZeroR classifier was used to draw a prediction baseline against which the other classifiers are compared [13]. To establish the upper limit of accuracy, training and testing were done using the complete dataset, and to build the model later used for prediction, the data was split using ⅔ for training and ⅓ for testing. Prediction accuracy is summarized in Figure 1.
Fig. 1. Algorithm prediction accuracy in training (ZeroR baseline 56.5%; complete-dataset training vs. the 66%/33% split for AODE, J48 and Multilayer Perceptron)
When trained using the entire training dataset, all prediction algorithms showed significant improvement over the baseline results. As would be expected, these high rates drop when the models are trained using a smaller sample, 66% of the set, and tested using the remaining data. The accuracy reduction for AODE was negligible, while J48 and Multilayer Perceptron sustained greater accuracy reductions. In order to boost overall prediction accuracy, a committee machine was set up using all three prediction outputs (AODE, J48 and Multilayer Perceptron). After testing various committee schemes, the simple arithmetic mean was selected because it maintained an overall high prediction accuracy of 80.6% (slightly superior to that achieved by AODE and J48) while being extremely simple to implement. Table 1 presents the summarized training results using the ⅔/⅓ split.

Table 1. Summary of the prediction accuracy for each algorithm and the committee machine

                           AODE     J48      Multilayer Perceptron   Committee Machine
    Overall Accuracy       80.0%    79.8%    76.2%                   80.6%
    Graduate Prediction    80.5%    80.8%    80.9%                   81.9%
    Evade Prediction       79.2%    78.4%    70.7%                   78.7%
The confusion matrix produced by each training algorithm and the committee machine using their results is presented in Table 2.

Table 2. Confusion matrices for each training algorithm

                   AODE            J48             Perceptron      Committee
                   Grad.   Evade   Grad.   Evade   Grad.   Evade   Grad.   Evade
    Graduate       505     86      500     91      449     142     499     92
    Evade          122     327     119     330     106     343     110     339
The accuracy of the algorithms was also evaluated using receiver operating characteristic (ROC) curves, which indicate the performance of a given classifier [14]. The curves show the relation between true positives (Y axis) and false positives (X axis), expressed as a percentage of the sample. Higher arcs (closer to the x, y point of 0%, 100%) are evidence of more accurate classifiers. The ROC curves plotted for each prediction algorithm are shown in Figure 2.
Fig. 2. ROC Curves expressed for the split training sets (2/3–1/3 split; true positives vs. false positives for AODE, J48, Multilayer Perceptron and the committee machine)
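For illustration only, ROC points like those in Fig. 2 can be obtained from predicted class probabilities; the snippet below uses scikit-learn as a stand-in for the WEKA output, and the labels and scores are hypothetical.

    # Sketch: computing ROC curve points from hypothetical committee probabilities.
    from sklearn.metrics import roc_curve, auc

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # 1 = evade, 0 = graduate
    y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]   # predicted probability of evasion

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print(list(zip(fpr, tpr)))                           # points of the ROC curve
    print("AUC =", auc(fpr, tpr))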
After training, the 8,437 prediction records were processed using the models created with the ⅔/⅓ split. The outputs of the three predictors were combined by averaging each individual student's probabilities for each outcome (graduate or evade) using the simple arithmetic mean. The results were then converted to a committee prediction representing the most likely outcome expected for the student. Students for whom the prediction was ambiguous (equal probabilities) were predicted to graduate. A sample of the committee results report is presented in Table 3 below.

Table 3. Committee machine prediction for regularly enrolled students

                                 Committee Predictions
    Student ID   Course         Graduate   Evade    Predicted status
    185          Journalism     2.6%       97.4%    Evade
    186          Psychology     1.6%       98.4%    Evade
    189          Law            57.4%      42.6%    Graduate
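A minimal sketch (ours, not the authors' code) of the committee rule described above: the per-classifier probabilities are averaged and ties are resolved in favor of graduation. The probability values are hypothetical.

    # Combine the per-student evasion probabilities of the three classifiers by
    # arithmetic mean; ambiguous cases (equal probabilities) default to "Graduate".
    def committee_prediction(prob_evade_by_model):
        p_evade = sum(prob_evade_by_model) / len(prob_evade_by_model)
        p_graduate = 1.0 - p_evade
        status = "Evade" if p_evade > p_graduate else "Graduate"
        return p_graduate, p_evade, status

    # hypothetical AODE, J48 and Multilayer Perceptron outputs for one student
    print(committee_prediction([0.95, 0.97, 0.92]))  # (0.053..., 0.946..., 'Evade')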
The complete report indicates the predicted future status for every student enrolled, with the probability of each possible outcome (graduate or evade). This information allows school officials to decide whether to pursue a policy of attempting to retain students with a high predicted probability of dropping out, or to address borderline cases, which would, in theory, be easier to address. The model predicted that 3,250 of the 8,437 students enrolled would evade, which corresponds to 38.5% of the students. Although lower than the estimated national average of 44.7% according to INEP [2], the figure is compatible with historic evasion rates for the HEI. The remaining 5,187 students were predicted to graduate normally. Preliminary analysis by college Deans and Professors indicates that the predictions in fact flag many known at-risk students. Additionally, records of enrollment renewals for 2011 also confirm several of the evasion predictions contained in the full prediction report.
5 Conclusion

Many industries have experienced deep transformation through the use of techniques such as data mining and machine learning. In highly competitive business environments, the organizations that correctly implemented these techniques acquired great competitive advantage and, consequently, great success. To remain economically viable, higher education institutions have sought to gain competitive advantage through investments in information technologies and recruiting campaigns. But these efforts ignore one very important problem faced by HEIs, which is student retention. Perhaps because early detection has historically been very difficult, HEIs have attempted to deal with the problem at the exit point, after the student has decided to leave, a point at which it is often too late to reverse the decision. This article presents a successful experiment which employed data mining tools in early evasion-risk detection for students of a preeminent higher education institution in Brasília, the capital of Brazil. The results of this experiment indicate that it is possible to identify students with elevated evasion risk, even those who may not present overt signs of academic difficulty or financial strain, and this information, associated with appropriate pedagogical and financial grievance mitigation strategies, may prove to be an exceptionally valuable tool for colleges and universities that seek to reduce student evasion, providing the critical competitive advantage needed to thrive in the educational services market. The experiment resulted in a report containing predictions for each undergraduate student who completed their second term and was regularly enrolled in the institution studied. The report indicated, with an accuracy of 80.6%, whether the student would graduate or evade, expressed as the likelihood of one outcome or the other. Of the 8,437 students analyzed, the model indicated that 3,250 students (38.5%) were predicted to evade. By narrowing the student population to those at risk, the results of this experiment allow department Chairs, college Deans and Bursars officials to engage each student early in their undergraduate career and, in so doing, to identify his or her individual risk factors and to devise strategies to combat these factors, both at the level of individual solutions and of institutional changes. These may include pedagogical support
for basic education deficiencies, academic orientation, vocational orientation or financial aid. Some of these strategies may, in fact, uncover opportunities to create additional educational products that could be offered to students, creating new, and potentially profitable, revenue streams. Without this filtering of the student population, such engagement would be far less efficient, perhaps to the point of becoming unfeasible. Regardless of the strategy adopted, Bergamo et al. [8] indicate that developing closer contact with the student, by engaging him or her personally to discuss his or her needs, fosters loyalty and may by itself be sufficient to overcome evasion intent. This study provides an objective method to select the students who should receive such attention.
References 1. Hand, D., Mannila, H., Smyth, P.: Principles of Data Mining. The MIT Press, Cambridge (2001) 2. National Institute for Educational Research Anisio Teixeira (Inep): Higher Education Statistical Synopsis (2008), http://portal.inep.gov.br 3. National Institute for Educational Research Anisio Teixeira (Inep): Technical Summary – Higher Education Census 2008, Preliminary Data (2009), http://portal.inep.gov.br 4. UNESCO Institute for Statistics: International Standard Classification of Education Key Statistics (2011), http://www.uis.unesco.org 5. Cardoso, J.: Brazilian GDP expected to grow 7,5% in 2010 and 4,3% in 2011, forecasts OCDE. Jornal Valor Online, São Paulo, November 18 (2010) 6. Lima, M.C.: The WTO and the “Educational Market”. Reasons Behind the Interest and Possible Consequences. In: VI International Colloquium on Higher Education Management in South America, Blumenau (2006) 7. Campos, E.: Dollar Closes at R$ 1,740 and Negates Losses for the Year. Jornal Valor Online, São Paulo, November 16 (2010) 8. Bergamo, F., Farah, O.E., Giuliani, A.C.: Loyalty and Higher Education: Stategic Tool in Client Retention. Revista Gerenciais 6(1), 55–62 (2007) 9. Gaioso, N.P.L.: Student Evasion in Higher Education: Student and Management Perspectives. Universidade Católica de Brasília, Brasília (2005) 10. Wolpert, D.: The Lack of a Priori Distinctions between Learning Algorithms. Neural Computation 8(7), 1341–1390 (1996) 11. Tresp, V.: Committee Machines. In: Hu, Y.H., Hwang, J.-N. (eds.) Handbook for Neural Network Signal Processing. CRC Press, Boca Raton (2001) 12. Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., Wirth, R.: CRISP-DM 1.0: Step-By-Step Data Mining Guide (2000), http://www.crispdm.org 13. Bouckaert, R.R., Frank, E., Hall, M., Kirkby, R., Reutemann, P., Seewald, A., Scuse, D.: Weka Manual for Version 3-6-2. University of Waikato, Hamilton (2010) 14. Witten, I.H., Frank, E.: Data mining: practical machine learning tools and techniques. Elsevier, San Francisco (2005)
Search and Analysis of Bankruptcy Cause by Classification Network Sachio Hirokawa1, Takahiro Baba2 , and Tetsuya Nakatoh1 1 2
Research Institute for Information Technology, Kyushu University, hirokawa,[email protected] Graduate School of Information Science and Electrical Engineering, Kyushu University
Abstract. A simple document search is insufficient when we analyse corporate information. Not only a list of search results, but also a justification of why the results match the query condition is important. This paper proposes a method to extract causes of bankruptcy from news articles by applying co-occurrence analysis of words.
1 Introduction
Understanding the current economic situation is crucial for all people working in industry. In particular, most business people watch information about the companies in their related fields. The Web is a valuable and convenient tool to obtain such information. The present paper focuses on the bankruptcy information available on the Web. This information is expected to be more useful and valuable if we can extract from it the reasons why a company has gone bankrupt. We can use it to analyse the enterprises of a related field, and we can use it to judge in what kind of enterprise to invest. There are several studies that apply text mining to changes in the financial market and to financially bankrupt enterprises [6,8]. Visualization using graphs is one of the approaches to extract important documents and keywords and to discover the relations between them [4,10]. The present paper proposes a new text mining method based on formal concept analysis [1]. We construct the formal concept lattice from the bankruptcy documents and the keywords that appear in them. The lattice displays the co-occurrence relation of the keywords. We introduce a concise form of the lattice by restricting its nodes. The key feature of the proposed system is the visualization of the keywords that appear in the search result of bankruptcy information. However, the purpose of the system is not just to search for documents and feature words. It is a system to support users in obtaining hypotheses about bankruptcy. Many search engines provide a keyword expansion function that returns related words for the user's query (Fig. 1, left). The proposed system does not
Fig. 1. Search Engine (left) and Analysis Engine (right)
only give hints for the search. The system presents appropriate hints that become necessary during the analysis and supports interactive, repeated analytical work (Fig. 1, right). The users determine the initial query for the analysis by themselves. After that, however, the system presents directions for narrowing or broadening the analysis. The user can efficiently advance the analytical work by selecting the direction according to the purpose. Chances for discovery may also arise from the automatic presentation of related words that the user would not have thought of.
2 Related Work
The comparative analysis of enterprises is one of the main research topics in business administration, where the numerical data of business reports are analysed [12]. The progress of computers has expanded the scope of numerical analysis, and the analysis and forecasting of bankruptcy information is one of the hot topics of numerical analysis in the field [17,22,26]. In [22], SVM is applied to estimate parameters that determine the growth of property, economic performance, operating income and so on. [17] compares a data mining technique and a statistical technique for bankruptcy forecasting using financial ratios describing the state of the company and fragmentary variables. [26] compares the predictive performance of five feature extraction methods in bankruptcy forecasting. These analyses are based on numerical data, which is not easy for non-specialists to actually use to analyze bankruptcy information. The technique of visualization is known to be effective in many fields. In [11], the relations between enterprises are made visible using KeyGraph. However, the visualization does not give any explanation of the cause of such relations. In [8,10,18,27] it is reported that text-mining approaches based on keyword extraction and keyphrase extraction are effective for the analysis of annual reports and bankruptcy information. In [14], a text mining tool is used to analyze the business-condition part of the annual report. We can extract feature words from the reports; however, we cannot guess the relationships among the feature words that reflect the content of the report. To analyze the content of a document, most systems use the co-occurrence relation of words in the document; that is, the targets of analysis are documents. Analysis limited to a narrower range, such as each sentence,
has been receiving attention recently. [25] focuses on specific sentences in the document to obtain context information; they use the expressions “NI TSUKIMASHITE (concerning)” and “NO TAME (for)” to specify important sentences to be analysed in detail. [16] uses the co-occurrence of words not at the document level but at the sentence level to extract topics. The present paper does not consider documents but sentences as targets. The analysis of co-occurrence relations in such narrow scopes reveals detailed tendencies of a specific enterprise or of the enterprises selected by the user's input query. Visualization has been used mainly as the first step of document summarization, to describe the search result in brief. [19] uses graph visualization for the measurement of sentence similarity and generates an outline of the document. In [28], the relation of words is calculated based on co-occurrence in the same sentence, and the text of a consumer questionnaire is analyzed with a hierarchical keyword graph. In [24], the weights of the feature words are extracted by applying SVM to the frequency of words, the average path length and the clustering coefficient of the co-occurrence graph of words in sentences. Most of these approaches display only the relative positions of words: we can only see that the distance between two words represents their similarity or relation and that close words are strongly related to each other. On the other hand, the classification network, which we introduce in the present paper, is a hierarchical directed graph. An edge represents not only how similar two words are, but also that the word on the left of the edge appears only in sentences that include the word on the right. The notion of classification network comes from the formal concept lattice and the concept graph [1], which was introduced and studied in [21,13,20]. [21] used the concept graph to generate a hierarchical structure of words. [13] formulated the structure of researchers in particular companies in terms of the concept graph. [20] analysed financial reports with the concept graph. A concept graph is determined by a threshold α; if we construct a concept graph with α = 0.1, the relation between the words of an edge is very weak. The classification network can be considered as a concept graph with the threshold α = 1.0.
3 Feature Extraction by Classification Network

3.1 Formal Concept Lattice and Classification Network
Classification is the most basic technique for analyzing objects. The result of a classification depends on which attributes are chosen to represent the objects. Given a set of objects and a set of attributes, the formal concept lattice [3,5] is a lattice that displays the inclusion relation of concepts, where a concept is a pair of a set of objects and a set of attributes that determine each other. The adjacent concepts that appear below a concept represent the classification of that concept by attributes. In the present paper, the objects are sentences and the attributes are the words that appear in the sentences.
Table 1. An example of context matrix (rows: the keywords Lehman Shock, recession, Standard Building Law, customer, burst of the economic bubble; columns: the companies A–G; a checkmark indicates that the keyword appears for the company)
Fig. 2. Concept Lattice
Fig. 2 is the concept lattice for Table 1, which describes characteristics of the imaginary companies A–G. Each node represents a concept where the objects (company names) and the attributes (keywords) are displayed. Note that, for simplicity, a company name appears only once on the path from the left end to the right end. Consider the node "E/Standard Building Law", for example. It is understood that the word "Standard Building Law" appears at all lower nodes located to the right of the node and that the company name "E" appears at all upper nodes. The node "E/Standard Building Law" shows that the two companies "C" and "E" are completely characterised by the two words "Lehman Shock" and "Standard Building Law". This simplified representation generates nodes whose objects and attributes are empty. In fact, there are at most 2n pairs of object names and attribute names, where n is the larger of the numbers of objects and attributes. On the other hand, the number of concepts may be 2^n. The gap between 2n and 2^n implies that most of the nodes in a concept lattice have an empty label. Hence, the visualization does not help comprehension. In the present paper, we eliminate from the concept lattice all the nodes whose label is empty. We call the resulting directed graph a "classification network". In the actual visualization, we only display the set of attributes, i.e., the keywords of each node. With this visualization, we can interpret the global and local relationships of keywords more intuitively. Fig. 3 is the classification network for Table 1. We can see that "customer" and "burst of the economic bubble" are related and that "recession" is isolated.
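The construction can be sketched in a few lines of Python (a simplified illustration, not the authors' implementation): since the classification network corresponds to the concept graph with α = 1.0, a directed edge u → v exists exactly when every sentence containing u also contains v, so the edges can be read off the occurrence sets of the words. The toy sentences below are illustrative only.

from collections import defaultdict

def classification_network(sentences):
    # sentences: list of word lists; returns occurrence sets and directed edges
    occ = defaultdict(set)                    # word -> ids of sentences containing it
    for i, words in enumerate(sentences):
        for w in set(words):
            occ[w].add(i)
    edges = set()
    for u in occ:
        for v in occ:
            if u != v and occ[u] <= occ[v]:   # u occurs only in sentences that contain v
                edges.add((u, v))
    return occ, edges

# toy example with three "bankruptcy" sentences
sents = [["lehman_shock", "recession"],
         ["lehman_shock", "standard_building_law"],
         ["customer", "burst_of_bubble", "lehman_shock"]]
_, edges = classification_network(sents)
print(sorted(edges))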
Fig. 3. Classification Network
3.2 Feature Extraction by Classification Network
We introduce a score sc(w, q) of a keyword w with respect to a query q as follows. Firstly, we obtain the set of sentences S(q) that satisfy the query q. Secondly, we extract all the words W(S(q)) that appear in S(q). Thirdly, we construct the classification network CN(q) from the context matrix with respect to S(q) and W(S(q)). The score sc(w, q) of the keyword w is determined as the sum of the following four values:
(1) depth(w, CN(q)) – the depth of the node of w in CN(q)
(2) freq(w) – the frequency of the keyword w
(3) adj(w) – the number of adjacent nodes of w in CN(q)
(4) W(w) – the number of words in the node of w
We sort the words in W(S(q)) using the score sc(w, q) to obtain the feature words for the query q.
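A minimal sketch of the scoring and ranking step, assuming the four components have already been computed from CN(q); since the paper does not specify any weighting or normalization, an unweighted sum of raw values is assumed here.

def sc(w, depth, freq, adjacency, node_words):
    # sum of the four components (1)-(4); unweighted sum assumed
    return depth[w] + freq[w] + len(adjacency[w]) + len(node_words[w])

def rank_keywords(words, depth, freq, adjacency, node_words):
    # sort candidate feature words by descending score sc(w, q)
    return sorted(words,
                  key=lambda w: sc(w, depth, freq, adjacency, node_words),
                  reverse=True)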
4 Evaluation
In the present paper, we used real data available at [9]. This site provides brief information on companies that went bankrupt in recent years. Each document describes the outline of the company and how it went bankrupt. From this site, we collected 276 articles dated between 20050201 (February 1, 2005) and 20100228 (February 28, 2010). We use the sentences of these articles rather than the articles themselves to obtain detailed relationships between keywords. We split each article into sentences at punctuation marks. Each article contains 9.8 sentences on average. To compare and evaluate the accuracy of the extracted keywords, we manually selected, for each article, the keywords that represent the cause and the reason of the bankruptcy of the company.

4.1 Comparison with TF*IDF
In this section, we compare the accuracy of the proposed method and of TF*IDF for the extraction of the bankruptcy cause of each company.
Given the name q of a company as a query, we obtain the set of sentences S(q). We compare the top 20 keywords obtained by TF*IDF and by the proposed classification network method. As an evaluation measure, we use MAP (Mean Average Precision) [7]. MAP is calculated as the mean of the average precision. AP(n), the precision at rank n, is the number of correct guesses among the top n outputs divided by n; the average precision is the mean of AP(n) as n varies from 1 to the length of the list. The value of MAP is higher when the method places correct guesses at the top ranks.
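A small sketch of this measure as described above (the standard IR formulation averages precision only at the ranks of correct keywords; the variant below follows the description in the text and averages over every cut-off rank):

def precision_at(ranked, relevant, n):
    # fraction of the top-n keywords that are correct
    return sum(1 for w in ranked[:n] if w in relevant) / n

def average_precision(ranked, relevant):
    # mean of precision@n for n = 1..len(ranked), as described above
    return sum(precision_at(ranked, relevant, n)
               for n in range(1, len(ranked) + 1)) / len(ranked)

def mean_average_precision(runs):
    # runs: list of (ranked keyword list, set of correct keywords), one per company
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

print(round(average_precision(["price", "golf", "decline"], {"price", "decline"}), 3))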
Fig. 4. MAP of Classification Network and TF*IDF
We compared the MAP of the top 20 keywords by TF*IDF with that of the proposed method (CN). Fig. 4 displays the values of MAP for each company. The x-axis denotes the MAP by TF*IDF; the y-axis denotes the MAP by CN.

          CN      TF*IDF
MAP     0.121     0.081
The average MAP is 0.121 for CN and 0.081 for TF*IDF. The proposed method thus obtains a better evaluation than TF*IDF. A close analysis of Fig. 4 shows that there are 156 companies, among all 276 companies, whose MAP is zero with the proposed method CN. However, for those companies TF*IDF also yields a similarly low
MAP, below 0.2. This means that neither method achieves a good performance for them. It is worthwhile to note that most of the MAPs by TF*IDF are below 0.2, while most of the MAPs by CN are above 0.2. From these observations, we can say that the proposed method CN outperforms the baseline TF*IDF.

4.2 Re-ranking by Classification Network
To construct a classification network, we need to determine the order for each pair (u, v) of words. This takes O(n^2) time if there are n words. On the other hand, the keywords that we expect to use as features number at most 10 or 20. Therefore, most of the computation of the order of u and v might have nothing to do with the result. We therefore propose a practical method that uses CN for "re-ranking" of keywords: a set L of keywords is obtained as the top 20 keywords with respect to TF*IDF, then the classification network is constructed for L, and finally the list L is re-ranked with respect to the CN score. By limiting the number of words to 20, we save the computation time needed for ordering the words.
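A minimal sketch of this re-ranking step, assuming a cn_score function that wraps the sc(w, q) computation of Section 3.2 (the TF*IDF ranking itself is taken as given):

def rerank_by_cn(tfidf_top20, sentences, cn_score):
    # build the (small) classification network only over the 20 candidates,
    # so the O(n^2) ordering step stays cheap, then sort by the CN score
    scores = {w: cn_score(w, sentences) for w in tfidf_top20}
    return sorted(tfidf_top20, key=scores.get, reverse=True)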
Fig. 5. Re-ranking by Classification Network
Fig. 5 displays the MAPs obtained by TF*IDF and by re-ranking. The average MAP of the re-ranking method is 0.101, which is lower than that of the full CN method of the previous section but better than that of TF*IDF.

          CN      TF*IDF
MAP     0.101     0.081
Table 6 shows the ranked lists of words for the company "Izumo Airport Country Club" obtained with TF*IDF and with CN. The red colored words are the correct words chosen by a human. The top 3 words in TF*IDF disappear or get a lower ranking in CN.
Fig. 6. Classification Network of Izumo Airport Country Club Co
These one-character words are usually treated as stop words, although we did not use a stop-word list in the experiment. Thus, the re-ranking by CN performs a kind of stop-word elimination. Note that the three red words at the top rankings in CN do not appear at similarly high rankings in TF*IDF.
5 Case Study of Close Analysis of Classification Network
Fig. 7 is the classification network for the company "Izumo Airport Country Club". The correct words are marked with an asterisk "*" and the nodes that contain a correct word are emphasized with a red circle. The asterisk marks and the red circles are used only to help comprehension of the graph and are not used for the ranking of keywords. We can see such words as "price", "competition", "neighbor", "population" and "decline". We can form the hypothesis that price competition among neighboring golf courses and the decline of the golf population caused the deterioration in earnings. In fact, we found such a sentence in the article about the company. It is also worthwhile to note that the red-circled nodes contain many words with an asterisk mark. This implies that the correct feature words tend to occur in the same nodes. This observation is one reason why we introduced the value W(w) in the score sc(w, q). In fact, the correct words do not appear in every sentence but only in specific sentences.
6 Conclusion and Further Work
The present paper proposed the classification network (CN) of keywords and a word scoring method based on CN. The method was applied to discover the bankruptcy causes of companies from 276 bankruptcy reports. It is confirmed that the proposed method outperforms TF*IDF in extracting feature words of bankruptcy. Another method was introduced that uses CN for re-ranking words.
The result of re-ranking is better than that of TF*IDF but lower than that of the original CN method. We have two plans for further work. The first plan is to characterise the sentences that contain the bankruptcy cause words. We observed that the word "but" appears often in the red-circled nodes of many classification networks. The word "but" is not a correct bankruptcy cause word. However, the word "but" separates the initial description of the company from the explanation of how the bankruptcy occurred. Therefore, the chances are high that cause words appear after the word "but" in the same sentence. We think that there might be similar "clue" words that distinguish the sentences containing bankruptcy causes. The second plan is to compare the proposed CN-based method with other methods that use graph structures.
References

1. Baba, T., Liu, L., Hirokawa, S.: Formal Concept Analysis of Medical Incident Reports. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds.) KES 2010. LNCS, vol. 6278, pp. 207–214. Springer, Heidelberg (2010)
2. Aoshima, T., Fukuta, N., Yokoyama, S., Ishikawa, H.: A Proposal of Constrained Clustering of Micro-Blogs. In: DEIM 2010 B1-3 (2010) (in Japanese)
3. Carpineto, C., Romano, G.: Concept Data Analysis: Theory and Application. John Wiley and Sons, Chichester (2004)
4. Egoshi, R., Nagai, H., Nakamura, T.: Extraction of Important Articles from Related Articles using Small World Structure. IPSJ SIG. Notes (113), 17–22 (2008) (in Japanese)
5. Ganter, B., Wille, R., Franzke, C.: Formal Concept Analysis: Mathematical Foundations. Springer, Heidelberg (1999)
6. Izumi, K., Goto, T., Matsui, T.: Analysis of Financial Markets Fluctuation by Textual Information. Journal of JSAI 25(3), 383–387 (2010) (in Japanese)
7. Kishida, K.: Property of Mean Average Precision as Performance Measure in Retrieval Experiment. IPSJ SIG. Notes (74), 97–104 (2001) (in Japanese)
8. Shirata, C.Y., Takeuchi, H., Ogino, S., Watanabe, H.: Financial Analysis using Text Mining Technique: Empirical Analysis of Bankrupt Companies. Business Analysis Association Annual Report (25), 40–47 (2009) (in Japanese)
9. Chan, S.W.K.: Extraction of salient textual patterns: Synergy between lexical cohesion and contextual coherence. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans 34(2), 205–218 (2004)
10. Chuang, W.T., Yang, J.: Extracting sentence segments for text summarization: A machine learning approach. SIGIR Forum (ACM Special Interest Group on Information Retrieval), 152–159 (2000)
11. Goda, S., Ohsawa, Y.: Chance discovery in credit risk management – Time order method and directed KeyGraph for estimation of chain reaction bankruptcy structure. In: Satoh, K., Inokuchi, A., Nagao, K., Kawamura, T. (eds.) JSAI 2007. LNCS (LNAI), vol. 4914, pp. 247–254. Springer, Heidelberg (2008)
12. Huff, A.S. (ed.): Mapping Strategic Thought. Wiley, Chichester (1990)
13. Iino, Y., Hirokawa, S.: Time Series Analysis of R and D Team Using Patent Information. In: Velásquez, J.D., Ríos, S.A., Howlett, R.J., Jain, L.C. (eds.) KES 2009. LNCS, vol. 5712, pp. 464–471. Springer, Heidelberg (2009)
14. Kida, M.: Cognitive research of Asahi's organizational renewal – text mining of annual reports. Organizational Science, Academic Journal 39(4) (2006)
15. Koester, B.: Conceptual Knowledge Retrieval with FooCA: Improving Web Search Engine Results with Contexts and Concept Hierarchies. In: Perner, P. (ed.) ICDM 2006. LNCS (LNAI), vol. 4065, pp. 176–190. Springer, Heidelberg (2006)
16. Li, B., Zhou, L., Fen, S., Wong, K.-F.: A Unified Graph Model for Sentence-based Opinion Retrieval. In: Proc. 48th ACL, pp. 1367–1375 (2010)
17. Li, H., Sun, J., Wu, J.: Predicting business failure using classification and regression tree: An empirical comparison with popular classical statistical methods and top classification mining methods. Expert Systems with Applications 37(8), 5895–5904 (2010)
18. Liu, X., Webster, J., Kit, C.: An extractive text summarizer based on significant words. In: Li, W., Mollá-Aliod, D. (eds.) ICCPOL 2009. LNCS (LNAI), vol. 5459, pp. 168–178. Springer, Heidelberg (2009)
19. Ouyang, Y., Li, W., Wei, F., Lu, Q.: Learning similarity functions in graph-based document summarization. In: Li, W., Mollá-Aliod, D. (eds.) ICCPOL 2009. LNCS (LNAI), vol. 5459, pp. 189–200. Springer, Heidelberg (2009)
20. Mining System based on Search Engine and Concept Graph for Large-Scale Financial Report Texts. In: Proc. 2nd IEEE ICIFE, pp. 675–679 (2010)
21. Shimoji, Y., Wada, T., Hirokawa, S.: Dynamic Thesaurus Construction from English-Japanese Dictionary. In: Proc. The Second International Conference on Complex, Intelligent and Software Intensive Systems, pp. 918–923 (2008)
22. Shin, K.-S., Lee, T.-S., Kim, H.-J.: An application of support vector machines in bankruptcy prediction model. Expert Systems with Applications 28(1), 127–135 (2005)
23. http://www.tsr-net.co.jp/
24. Yamamoto, Y., Orihara, R.: Keyword Extraction using the Word Co-occurrence Network Properties that is Independent of Languages and Document Types and Its Evaluation by Prediction of Headline Words. Trans. JSAI 24(3), 303–312 (2009) (in Japanese)
25. Takeuchi, H., Ogino, S., Watanabe, H., Shirata, Y.: Context-based text mining for insights in long documents. In: Yamaguchi, T. (ed.) PAKM 2008. LNCS (LNAI), vol. 5345, pp. 123–134. Springer, Heidelberg (2008)
26. Tsai, C.-F.: Feature selection in bankruptcy prediction. Knowledge-Based Systems 22, 120–127 (2009)
27. Turney, P.D.: Learning algorithms for keyphrase extraction. Information Retrieval 2(4), 303–336 (2000)
28. Uchida, Y., Yoshikawa, T., Furuhashi, T., Hirao, E., Iguchi, H.: Extraction of important keywords in free text of questionnaire data and visualization of relationship among sentences. In: IEEE International Conference on Fuzzy Systems, art. no. 5277332, pp. 1604–1608 (2009)
Conceptual Distance for Association Rules Post-processing

Ramdane Maamri 1 and Mohamed Said Hamani 2

1,2 Lire Laboratory; 1 University of Mentouri, Constantine; 2 University of Farhat Abbas, Setif, Algeria
[email protected], [email protected]
Abstract. Data-mining methods have the drawback of generating a very large number of rules, sometimes obvious, useless or not very interesting to the user. In this paper we propose a new approach to find unexpected rules from a set of discovered association rules. The technique analyzes the discovered association rules using the user's existing knowledge about the domain, represented by a fuzzy domain ontology, and then ranks the discovered rules according to their conceptual distance.

Keywords: data mining; fuzzy ontology; unexpectedness; association rule; domain knowledge; interestingness; conceptual distance.
1 Introduction

Knowledge discovery in databases (data mining) has been defined in [2] as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from data. Association rule algorithms [1] are rule-discovery methods that discover patterns in the form of IF-THEN rules. It has been noticed that most data mining algorithms generate a large number of rules that are valid but obvious or not very interesting to the user [8, 3]. To address this issue, most approaches to knowledge discovery use objective measures of interestingness, such as confidence and support [1], for the evaluation of the discovered rules. The interestingness of a rule is, however, essentially subjective [8, 3]. Subjective measures of interestingness, such as unexpectedness [9], assume that the interestingness of a pattern depends on the decision-maker and does not solely depend on the statistical strength of the pattern. One way to approach this problem is by focusing on discovering unexpected patterns [8, 3], where the unexpectedness of discovered patterns is usually defined relative to a system of prior expectations. Ontologies allow domain knowledge to be captured in an explicit and formal way such that it can be shared among humans and computer systems. Highly related concepts are grouped together in the hierarchy. We propose a new approach for ranking rules according to their conceptual distance. The farther apart two concepts are, the less related they are to each other; the less related the concepts that take part in the definition of a rule, the more surprising, and therefore interesting, the rule is. With such a ranking, a user can check a few rules at the top of the list to extract the most pertinent ones.
1.1 Association Rules and Rule Interestingness Measures

Association rule mining finds interesting associations and/or correlation relationships among large sets of data items. Many algorithms, such as Apriori [1], can be used to discover association rules from data and to extract useful patterns. Past research in data mining has shown that the interestingness of a rule can be measured using objective measures and subjective measures. Objective measures involve analyzing the rule's structure and statistical significance. However, it is noted in [8] that objective measures are insufficient for determining the interestingness of a discovered rule. The two main subjective interestingness measures are unexpectedness [3,8] and actionability [8]:

• Unexpectedness: rules are interesting if they are unknown to the user or contradict the user's existing knowledge (or expectations).
• Actionability: rules are interesting if the user can do something with them to his/her advantage.
In this research, we focus only on unexpectedness.

1.2 Fuzzy Ontology

Although there is no universal consensus on the definition of ontology, it is generally accepted that an ontology is a specification of a conceptualization [16]. An ontology can take the simple form of a taxonomy (i.e., knowledge encoded in a minimal hierarchical structure) or of a vocabulary with standardized, machine-interpretable terminology. Ontologies are composed of concepts, relationships, instances and axioms. A concept represents a class of entities within a domain, whereas relationships describe the interactions between concepts or their properties [18]. Relationships can be classified into taxonomic and associative relationships [18]. Fuzzy ontologies have been introduced to represent fuzzy concepts and relationships, where each concept is related to other concepts in the ontology with a degree of membership µ (0 ≤ µ ≤ 1). A fuzzy ontology is a hierarchy of concepts within a domain, which can be viewed as a graph. Our approach uses fuzzy membership degrees in the "IS-A" relationships between concepts.

1.3 Conceptual Distance

Two main categories of algorithms for computing the semantic distance between terms organized in a hierarchical structure have been proposed in the literature: distance-based approaches and information content-based approaches. The general idea behind the distance-based algorithms [7] is to find the shortest path between two concepts in terms of the number of edges. Information content-based approaches [7] are inspired by the perception that pairs of concepts which share many common contexts are semantically related. The problem with the ontology distance is that it is highly dependent on the construction of the ontology. To address this problem, we associate weights with the concepts in the ontology along with the strength of the relation between concepts. Our approach uses fuzzy membership degrees in the "IS-A" relationships between concepts. In an IS-A semantic network, the simplest form of
determining the distance between two concept nodes, A and B, is the shortest path that links A and B, i.e. the minimum number of edges that separate A and B [7] or the sum of weights of the arcs along the shortest path between A and B [10].
2 Method Presentation

Data mining is the process of discovering patterns in data. The use of objective measures of interestingness, such as confidence and support, is a first step toward interestingness. Besides objective measures, our approach exploits domain knowledge represented by a fuzzy ontology organized as a DAG hierarchy. The nodes of the hierarchy represent the rules' vocabulary. For a rule like (x AND y → z), x, y and z are nodes in the hierarchy. The conceptual distance between the antecedent (x AND y) and the consequent (z) of the rule is a measure of interestingness: the higher the distance, the more unexpected, and therefore interesting, the rule. Based on this measure, a ranking algorithm helps in selecting the rules of interest to the user. The basic idea of our technique consists of generating association rules using any rule generation algorithm (the Apriori algorithm, for instance) and adjusting the objective measures, such as support and confidence, to the user's needs. The output of this process, i.e., the association rules, together with the domain ontology, becomes the input to our approach. Our technique analyzes the discovered rules and computes the conceptual distance of each rule. The higher the distance, the more interesting the rule.

2.1 Concept Semantic Distance

The semantic distance between two concepts A and B is the sum of the weights of the arcs along the shortest path between A and B [10]. In order to calculate the weights for fuzzy relations, an extension of the weighting function f(μ, ω): {0,1}×IR→IR defined for crisp hierarchy relations in (1) is needed.
f(μ, ω) = ω if μ = 1 (A, B connected); f(μ, ω) = 0 if μ = 0 (A, B disconnected).    (1)
Here µ is the membership degree of a concept to its parent and ω is the weight associated with this concept in the hierarchy. The function in (1) (with the Boolean variable µ ∈ {0, 1}) is extended to a weighting function f(μ, ω): [0,1]×IR→IR in (2) (with the continuous variable µ ∈ [0, 1]).
f(μ, ω) = ω + (1 − μ) * ω if μ ≠ 0; f(μ, ω) = 0 if μ = 0.    (2)
In order to compute the shortest path between two nodes we use Dijkstra's algorithm [15].
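A minimal Python sketch of this computation, assuming the fuzzy hierarchy is given as an adjacency map of (μ, ω) pairs; edges with μ = 0 are simply not stored, which corresponds to the "disconnected" case of (1) and (2).

import heapq

def fuzzy_weight(mu, omega):
    # weighting function (2) for mu > 0; the crisper the IS-A link, the closer
    # the edge cost is to omega
    return omega + (1 - mu) * omega

def semantic_distance(graph, source, target):
    # Dijkstra over the fuzzy IS-A hierarchy: graph[node] maps neighbours to
    # (mu, omega) pairs; returns the sum of fuzzy weights on the shortest path
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for nxt, (mu, omega) in graph.get(node, {}).items():
            nd = d + fuzzy_weight(mu, omega)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")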
2.2 Rule Conceptual Distance

In order to compute the distance between groups of concepts, for a given rule R: X → Y, where X = X1 ∧ … ∧ Xk and Y = Y1 ∧ … ∧ Ym, we use the Hausdorff distance: Distance(X, Y) = max(h(X, Y), h(Y, X)), where

h(X, Y) = max_{x ∈ X} min_{y ∈ Y} ||x − y||.
The function h(X,Y) is called the directed Hausdorff 'distance' from X to Y. It identifies the point Xi∈X that is farthest from any point of Y, and measures the distance from Xi to its nearest neighbor in Y. The Hausdorff distance, H(X,Y), measures the degree of mismatch between two sets, as it reflects the distance of the point of X that is farthest from any point of Y and vice versa [11]. In our approach we are looking for the surprising rules, those with a maximum distance between the antecedent and the consequent in a rule X1∧…∧Xk Y1∧…∧Ym with k Xi and m atomic Yj concepts respectively. Hausdorff distance is a good candidate for this case.
3 Experiments

The experiments were performed using a census income database [13]. To generate the association rules, we used an implementation of the Apriori algorithm [12] with a minimum support value equal to 0.2 and a minimum confidence value equal to 0.2. The number of generated rules is 2225. In order to perform the experiments, we created a taxonomy of 81 weighted concepts (Fig. 1) and defined two fuzzy concepts, 'Low_Level' and 'High_Level', for education as fuzzy sets based on the data set under study. The membership function of the High_Level fuzzy set is µHigh(Level_Rank) = Level_Rank/15, where Level_Rank is a sequential number ranging from 0 to 15, with 0 representing 'Preschool' and 15 representing 'Doctorate':
Level_Rank = {(Preschool=0), (1st-4th=1), (5th-6th=2), (7th-8th=3), (9th=4), (10th=5), (11th=6), (12th=7), (HS-grad=8), (Some-college=9), (Assoc-acdm=10), (Assoc-voc=11), (Bachelors=12), (Prof-school=13), (Masters=14), (Doctorate=15)}
Formally, the High_Level and Low_Level fuzzy sets can be defined as:
High_Level = {Preschool/0.00, 1st-4th/0.07, 5th-6th/0.13, 7th-8th/0.20, 9th/0.27, 10th/0.33, 11th/0.40, 12th/0.47, HS-grad/0.53, Some-college/0.60, Assoc-acdm/0.67, Assoc-voc/0.73, Bachelors/0.80, Prof-school/0.87, Masters/0.93, Doctorate/1.00}
Low_Level = {Preschool/1.00, 1st-4th/0.93, 5th-6th/0.87, 7th-8th/0.80, 9th/0.73, 10th/0.67, 11th/0.60, 12th/0.53, HS-grad/0.47, Some-college/0.40, Assoc-acdm/0.33, Assoc-voc/0.27, Bachelors/0.20, Prof-school/0.13, Masters/0.07, Doctorate/0.00}
Our experiments were conducted with weight = 7 for all education concepts and with different weights for the rest (Fig. 1). The results are presented in Fig. 2. The common subsumer of 'HS-Grade' and 'craft-repair' in rule (1) of Fig. 2 is 'census-income' (Fig. 1). These concepts are only weakly related to each other according to the ontology. The same applies to rule (2) in Fig. 2.
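A small sketch of these membership functions (the rank table mirrors the Level_Rank enumeration above):

LEVEL_RANK = {"Preschool": 0, "1st-4th": 1, "5th-6th": 2, "7th-8th": 3,
              "9th": 4, "10th": 5, "11th": 6, "12th": 7, "HS-grad": 8,
              "Some-college": 9, "Assoc-acdm": 10, "Assoc-voc": 11,
              "Bachelors": 12, "Prof-school": 13, "Masters": 14, "Doctorate": 15}

def mu_high(level):
    # membership degree in the High_Level fuzzy set
    return LEVEL_RANK[level] / 15.0

def mu_low(level):
    # membership degree in the Low_Level fuzzy set
    return 1.0 - mu_high(level)

print(round(mu_high("Some-college"), 2), round(mu_low("HS-grad"), 2))  # 0.6 0.47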
The common subsumer of 'some-college' and 'Never-Married' in rule (3) of Fig. 2 is 'census-income' as well; however, the membership degree of 'some-college' is 0.60 (Fig. 1). Note that the weight of both concepts 'craft-repair' and 'Never-Married' is 2. 'HS-Grade' has a membership degree of 0.53, which is less than the membership degree 0.60 of the concept 'some-college' (rule (3)); this makes it relatively far from what is expected and therefore more interesting. The common subsumer of the last two rules in Fig. 2 is the concept 'Personal' (Fig. 1). These rules express the relation between the concepts 'sex' and 'age', which are close to each other in the ontology. Rules such as (1) and (2) are more interesting, because they give information relating 'Education' and 'Occupation' and involve a higher-level decision maker than the rules concerning 'sex' and 'age' (the last two rules). The higher we move up in the hierarchy, the more important the decision and the broader and more strategic the vision of the decision maker. The approach makes no difference between rules x → y and y → x, though.
Fig. 2. Experiment rules ranking results
Fig. 1. Census-income Ontology
4 Related Works

The unexpectedness of patterns has been studied in [8, 3] and defined in comparison with user beliefs: a rule is considered interesting if it affects the level of conviction of the user. In [6] the focus is on discovering minimal unexpected patterns rather than using a post-processing approach, such as filtering, to determine the minimal
unexpected patterns from the set of all discovered patterns. In [5] unexpectedness is defined from the point of view of a logical contradiction between a rule and a conviction: a pattern that contradicts prior knowledge is unexpected. In [4], the unexpectedness of a discovered pattern is characterized by asking the user to specify a set of patterns according to his/her previous knowledge or intuitive feelings. This specified set of patterns is then used by a fuzzy matching algorithm to match and rank the discovered patterns. [5] proposes an association rule mining algorithm that can take item constraints specified by the user into the rule mining process, so that only those rules that satisfy the constraints are generated. [6] has taken a different approach to the discovery of interesting patterns by eliminating non-interesting association rules. In order to find subjectively interesting rules, most existing approaches ask the user to explicitly specify what types of rules are interesting or uninteresting, and then generate or retrieve the matching rules. These works on unexpectedness make a syntactic or semantic comparison between a rule and a belief. Our definition of unexpectedness is based on the structure of the background knowledge (hierarchy) underlying the terms (vocabulary) of the rule, namely the conceptual distance between the head and the body of the rule. In our approach the knowledge is expressed as a hierarchy of ontology concepts. Ontologies enable knowledge sharing. Sharing vastly increases the potential for knowledge reuse and therefore allows our approach to obtain knowledge for free simply by using existing domain ontologies.
5 Conclusion and Future Work

In this paper we proposed a new approach for ranking association rules according to their conceptual distance, defined on the basis of an ontological distance. The proposed ranking technique helps the user to identify interesting association rules, in particular expected and unexpected rules. It uses a fuzzy ontology to calculate the distance between the antecedent and the consequent of rules, on which the ranking is based. The higher the conceptual distance, the higher the degree of interest of the rule. In the future, we plan to integrate concept properties into the conceptual distance computation and to exploit other relation types of the ontology. We are also planning to conduct further evaluations of our approach on other datasets using other background knowledge.
References

[1] Agrawal, R., Imielinski, T., Swami, A.: Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering 5(6), 914–925 (1993)
[2] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery: An overview. In: Advances in Knowledge Discovery and Data Mining, pp. 1–34 (1996)
[3] Liu, B., Hsu, W.: Post-analysis of learned rules. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference, Menlo Park, August 4-8, pp. 828–834. AAAI Press/MIT Press (1996)
[4] Liu, B., Hsu, W., Mun, L.-F., Lee, H.-Y.: Finding interesting patterns using user expectations. IEEE Trans. Knowl. Data Eng. 11(6), 817–832 (1999)
[5] Padmanabhan, B., Tuzhilin, A.: On the discovery of unexpected rules in data mining applications. In: Procs. of the Workshop on Information Technology and Systems (WITS 1997), pp. 81–90 (1997)
[6] Padmanabhan, B., Tuzhilin, A.: On characterization and discovery of minimal unexpected patterns in rule discovery. IEEE Trans. Knowl. Data Eng. 18(2), 202–216 (2006)
[7] Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics 19(1), 17–30 (1989)
[8] Silberschatz, A., Tuzhilin, A.: What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering 8(6), 970–974 (1996)
[9] Uthurusamy, R., Fayyad, U.M., Spangler, W.S.: Learning useful rules from inconclusive data. In: Knowledge Discovery in Databases, pp. 141–158 (1991)
[10] Richardson, R., Smeaton, A.F.: Using WordNet in a knowledge-based approach to information retrieval. Technical Report CA-0395, School of Computer Applications, Dublin City University, Dublin, Ireland (1995)
[11] Huttenlocher, D.P., Kl, G.A., Rucklidge, W.J.: Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 15, 850–863 (1993)
[12] Borgelt, C.: http://www.borgelt.net/software.html
[13] Census income, ftp://ftp.ics.uci.edu/pub/machine-learning-databases/census-income/
[14] Ng, R.T., Lakshmanan, L.V.S., Han, J., Pang, A.: Exploratory mining and pruning optimizations of constrained association rules. In: SIGMOD 1998 (1998)
[15] http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
[16] Guarino, N.: Formal ontology and information systems. In: Guarino, N. (ed.) Proceedings FOIS 1998, pp. 3–15. IOS Press, Amsterdam (1998)
[17] Sahar, S.: On incorporating subjective interestingness into the mining process. In: ICDM, pp. 681–684 (2002)
[18] Mansingh, G., Osei-Bryson, K.-M., Reichgelt, H.: Using ontologies to facilitate post-processing of association rules by domain experts. Inf. Sci. 181(3), 419–434 (2011)
Manufacturing Execution Systems Intellectualization: Oil and Gas Implementation Sample

Stepan Bogdan, Anton Kudinov, and Nikolay Markov

Tomsk Polytechnic University, Lenin Avenue 30, 634050 Tomsk, Russia
{Bogdan,Kudinovav}@tpu.ru, [email protected]
Abstract. An up-to-date trend in industrial automation is the implementation of Manufacturing Execution Systems (MES) everywhere, including the oil and gas industry. The MES conception is constantly evolving. Many researchers suppose that analytical features available to low-end users (engineers, dispatchers, geologists, etc.) are necessary in manufacturing management, but today there is no ready-to-use framework applicable to building intelligent manufacturing systems for the oil and gas industry. A model-driven approach to MES intellectualization and an original iMES framework are proposed. iMES is based on the functions of traditional MES (within the MESA-11 model), business intelligence (BI) methods (On-Line Analytical Processing and Data Mining) and the production markup language (an industrial data standard for oil and gas production). A case study of well test results validation using the iMES framework is considered.

Keywords: Manufacturing Execution System, data mining in industry, Manufacturing Process Control, Intellectual Manufacturing Systems.
1 Introduction

Continuous process manufacturing is a very sophisticated object to manage, especially in the oil & gas production industry. To improve the controllability of production, management automation and information systems are traditionally implemented. The computer integrated manufacturing (CIM) conception promulgates a fundamental strategy of integrating manufacturing facilities and systems in an enterprise through computers and their peripherals to control the entire production process [1]. According to Williams [2] there are three main widely known architectures: CIMOSA, the reference model GRAI-GIM and PERA. Regardless of the selected architecture, according to CIM there are 5 hierarchical levels of information systems in a manufacturing enterprise [2]. The zero and first levels are for automation, such as sensors and measurement elements. Supervisory Control And Data Acquisition (SCADA), Manufacturing Execution System (MES) and Enterprise Resource Planning (ERP) systems are placed on the higher, corresponding levels. The MES market is rapidly growing [3] and many MES solutions are now implemented everywhere, including oil & gas production enterprises, but MES implementation is still a problem today [4].
The MES conception has evolved from the MESA-11 model to the c-MES model [5]. Nowadays, according to Littlefield [6], MES has outlived its original definition and the new conception of manufacturing operations management (MOM) software is the next generation of manufacturing systems. Nevertheless, most modern MES are based on the MESA-11 and c-MES models. The effect of MES implementation is greater when it is primarily focused on operational (main) business process management, i.e., production process management [7]. In this article, by the term management we mean planning and control of plan execution [8]. This article is about enhancing MES by using BI technologies and a meta-model-based warehouse to fulfill the requirements of a modern oil and gas production management system.
2 MES Intellectualization

According to Van Dyk [9], software can provide all the functions traditionally expected from MES, but if these functions are not integrated within the business process, then such software cannot be counted as a MES. In oil and gas production companies the main continuous operational process consists of four sequential parts: production, treatment, transportation and realization of hydrocarbon products. The management of this process crosses several time cycles (minutes, hours, days, months, years) and is usually carried out by process engineers (technologists), geologists and dispatchers by means of plans and monitoring systems. At the "minutes/seconds" time scale, field dispatch units continuously get information from field automation and asynchronously from field production units. At the "hour" time scale, the dispatching unit of every field makes a summary of production indicators to pass to the central dispatch unit. Every day the central dispatch unit distributes the manufacturing summary to the other organizational units, including the geological unit, and makes a daily mission plan for each field dispatch unit. The geological unit can intervene in the management process if the daily manufacturing summary indicates a risk of disorder, and it generates a well operating practices plan monthly. The process engineering unit acts as a global supervisor and every year generates a detailed production plan depending on the actual execution of the well operating plan. According to the MES conception, its functions cover only the dispatching functions in the hours/days time cycles of the management process. The production management process is not only data and document flow; every stage of it involves complex sub-processes of complicated analytics. So there is a problem with implementing such a management process within the classical CIM model [1, 2]: well operating practices and the plans for the production, treatment and realization of hydrocarbon products are outside the borders of ERP (because the geological and process engineering units, their functions and the data they use are logically related to the MES level) and of MES (because month and year planning periods and advanced analytics are not typical for MES [10]). Furthermore there is a data integration problem, because production data is usually used by different units in different ways. There are also various naming styles, which demands the implementation of data integration tools. Such a solution is associated with many negative effects such as high cost, integration problems, low maintainability and so on. The general problems are therefore: data integration, unification of naming systems and advanced analytical processing.
There are two general approaches to solving these problems: enhancing ERP to work with month and year technological data, and enhancing MES by adding BI systems to solve the analytical processing problem. Both of these approaches are generally based on the Intellectual Manufacturing System (IMS) idea [11] and allow using advanced analytics to solve manufacturing tasks, but the data integration and naming system unification problems remain unsolved. Enhancing ERP to include year planning and well operating practices generation entails the transfer of a big part of the industrial data to the fifth level of CIM. This contradicts the general idea of CIM, where the industrial data stream narrows while moving up to the next level [2]. Enhancing MES to include the whole production process management in a single MES seems a good solution. To use this approach there should be an appropriate framework. Unfortunately, state-of-the-art IMS theory does not describe an appropriate framework to use in the design of oil and gas production management systems [11]. To solve this problem we propose an industry-oriented approach based on MES intellectualization. Our thesis is: enhancing MES functions by using OLAP technologies, data mining techniques (which solve the analytical processing problems) and meta-model-based data marts (which solve the data integration and naming system unification problems) can provide an applicable solution for operational process management without the unwanted negative effects listed above. Such an enhanced MES we propose to call intellectual MES, or iMES. Traditionally MES are based on Online Transaction Processing (OLTP) technologies, which are optimized for bulk loading of real-time data. Unfortunately, OLTP-based MES show low performance for analytical processing. Traditionally, business intelligence technologies are used on the higher level of CIM where ERP systems are situated, but there are many analytical problems on this automation level [11]. For an oil and gas production company, data mining techniques are very useful, especially on the two upper levels of production management. Data mining techniques are used to solve some typical industrial analytical problems, but such solutions are usually isolated within specific software that does not enter everyday industrial management practice because of its low integration abilities [12]. The iMES conception allows any industrial analytics to be easily implemented within a single homogeneous information space to reduce production management uncertainty, decrease human influences, help in finding problem factors and so on. Intellectualizing MES provides many possible benefits for traditional MES functions. For convenience, the MES functions and the data mining goals [13] are assembled in Table 1. Let us consider some examples of data mining implementation in the oil and gas industry on the MES automation level. For oil and gas production companies, clustering can be successfully used in dispatching to determine normal operating practices, anomalies and possible trends of controlled processes. Regression methods can be successfully used in maintenance management to find the causes of breakdowns. Classification methods can be widely used all over industry. In the oil and gas industry, mining methods can be successfully used for well and reservoir modeling, oil and gas production transportation, forecasting the component composition of hydrocarbon production, well test analysis, etc. [14].
Table 1. Common data mining tasks in MES applications (rows: the MESA-11 functions RAS (Resource Allocation and Status), ODS (Operations/Detail Scheduling), DPU (Dispatching Production Units), DOC (Document Control), DCA (Data Collection/Acquisition), LM (Labor Management), QM (Quality Management), PM (Process Management), MM (Maintenance Management), PTG (Product Tracking and Genealogy), PA (Performance Analysis); columns: the data mining tasks Association, Classification, Clustering, Forecasting, Regression, Visualization and Sequence discovery; a '+' marks the tasks applicable to each function)
To let oil and gas specialists use the power of data mining in their daily work within the MES context, an iMES implementation is needed. We propose the iMES framework to make MES intellectualization easier and cheaper than providing separate solutions for every analytical problem that can appear in oil and gas production management. The proposed framework to enhance MES for oil and gas production is shown in Figure 1. Automatic control systems continuously obtain industrial data, which comes into the production database via the OPC1 protocol. After processing, the industrial data is loaded into the common warehouse. Data marts built on the common warehouse follow the oil and gas industry data definition standards (PRODML, WITSML, RESQML2). These standards define hierarchical data models for the most common groups of oil and gas tasks. They are tightly connected with each other but intended for different purposes. This problem can be solved by OLAP implementation. Data mining models and analytic queries use the general warehouse and/or the data marts as data sources. The results of data mining can be presented to the user by the iMES client or by other software. This framework is very useful: it consolidates and standardizes the heterogeneous data needed at every stage of operational process management; it allows analytics of any complexity; it can be a source of data for other applications (this is essential in populating the manufacturing summary data); etc. To improve the efficiency of the framework implementation, we highly recommend developing both the OLTP and the BI server parts of the iMES using an integrated data management platform (like Oracle BI Suite, Microsoft SQL Server with Analysis Services, etc.).
1 OLE for Process Control.
2 Production Markup Language, Wellsite Information Transfer Standard Markup Language, Reservoir Characterization Markup Language (www.energistics.org).
Fig. 1. iMES framework
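As an illustration of the data flow in Fig. 1, the following minimal Python sketch flattens a hierarchical (PRODML-like) well test record into a row of a relational data mart; the element and column names are illustrative only and are not taken from the actual PRODML schema.

def flatten_well_test(doc):
    # doc: a nested dict parsed from a PRODML-like XML document (names illustrative)
    test = doc["wellTest"]
    return {
        "well_id":          doc["well"]["id"],
        "test_date":        test["date"],
        "gas_rate":         test.get("gasRate"),
        "flowing_pressure": test.get("flowingPressure"),
        "bottom_hole_temp": test.get("bottomHoleTemperature"),
        "choke_size":       test.get("chokeOrificeSize"),
    }

row = flatten_well_test({"well": {"id": "W-42"},
                         "wellTest": {"date": "2010-02-28", "gasRate": 118.5}})
print(row)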
The next part of this article is a case study of solving a practically significant problem for the oil and gas industry using the proposed industry-oriented, metamodel-driven approach to MES intellectualization (the iMES framework).
3 Case Study: Well Test Validation

As an example of the practical usage of iMES, the well test validation problem was used. There are several different well test types; only production tests are considered in this article. There are three popular formal methods of data mining: KDD, SEMMA and CRISP-DM [15]. To describe a possible solution to this problem, the industry- and tool-neutral data mining process model (CRISP-DM) was applied. It involves the following sequential phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.

Business Understanding, Data Understanding and Data Preparation. Well dynamics are very complicated. Analytical processing of well test results is aimed at making more accurate well operating practices plans, an important phase of oil and gas production management [14]. There is a problem of classifying well test results as valid or invalid. This classification depends on many factors such as random distortion, features of field exploitation, production intensification activities, production management style and others. So in practice geologists use complex expert estimation to make a decision: to use or not to use the well test result data in later calculations. To support such decision making, statistical models can be implemented. The goal is to create a model which can automatically estimate well test results and suggest their validity using their history, with a predictive accuracy of at least 75% [14] for the most common test results. To solve such analytical problems, the iMES framework provides the data-mart level as a source of data for generating data models. For the well test validation, the PRODML-based
data mart of the iMES framework should be used. Actually PRODML is a hierarchical object model, so to use it for online analysis it must be transformed into a multidimensional model. In the iMES framework this model consists of six tightly connected parts: Installed System (describes well equipment), Measurements (contains the history of different on-field measurements), Product Operation (contains the history of on-field equipment usage), Product Flow (contains the history of product flow through nodes), Product Volume (contains the history of stored product volumes), and Well Test (contains factual information about the different types of well tests performed). The data is collected by the MES into the production database; after being filtered and validated, it comes into the common warehouse and goes into the PRODML-based data mart. The production data of each well test can be described by about 30–50 (depending on the types of fluids) standard attributes such as test date, well head temperature, flowing pressure, flow line pressure, pOverZ, choke orifice size, gas oil ratio, fluid velocity, gas potential, gas volume, pressure drawdown, gas rate, gas density, etc. As a base dataset we used the gas well production test history of JSC Vostokgazprom3 for the period 1998–2011. All data was previously loaded from paper documents into a database and verified by experts. The whole dataset consists of 3528 well test results. Of these, 2319 were classified by experts as valid and 1209 as invalid. For this dataset, missing values in the attributive part of well tests are common, so the model should handle them correctly.

Modeling. There are several types of classification methods, such as clustering, neural networks, decision trees and others. A comparison of different modeling methods is out of the scope of this article; the article shows the practical benefit of applying data mining at the MES level of an oil and gas production company. To generate a statistical model of well test results, the Microsoft Decision Trees algorithm [16] was used. As a result we obtained the tree-structured model shown in Figure 3.
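A minimal sketch of such a modeling step, using scikit-learn's DecisionTreeClassifier as a stand-in for the Microsoft Decision Trees algorithm used in the paper; the file name, column names and parameters are illustrative only.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# hypothetical export of the PRODML-based data mart; 'valid' is the expert label
df = pd.read_csv("well_tests.csv")
X = pd.get_dummies(df.drop(columns=["valid"]))   # one-hot encode categorical attributes
X = X.fillna(X.median())                         # simple handling of missing values
y = df["valid"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))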
Fig. 3. Tree-structured model of well test validity
Values in the circles indicate the share of well tests estimated as valid (in percent). This model shows that the hypothesis that well test estimation can be successfully done based on gas rate and pressure drawdown is invalid on the current dataset, because the

3 www.vostokgazprom.ru
obtained classifiers are annulus pressure, bottom hole temperature, pressure drawdown and stratum type. According to the proposed iMES framework, the tree-structured data model should be implemented within the analytic module on the appropriate level.

Evaluation and Deployment. 82% of well test results were correctly classified by the model. This is better than the 75% stated as the criterion above (other non-boosted tree algorithms did not show significantly better accuracy). So this model can be used for pre-estimation of the validity of well test results and help the geologist to make a decision. From the user side, the workflow of well test validation is very easy: every new well test is submitted through the user interface, which shows the model-based validity estimation. The geologist can agree or disagree with this recommendation. Each disagreement causes a model recalculation.

Further experience. The well test classification shown above is largely based on data manually entered through the user interface, but using the iMES conception specialists can easily mix such manual data and automatically gathered data. A practical example of this is using iMES to identify features of defective equipment. From time to time, field automation marks a sensor-measured value as "bad". There are many possible reasons for this. Using data mining we can find features that help experts to solve the problem of "bad" data. Automatically gathered data from thousands of sensors on the fields of Vostokgazprom were processed on the data preparation level of iMES and loaded on the warehouse level (Fig. 1). Then the data sources with the highest share of "bad"-marked values among all data were selected (gathering such statistics is resource-intensive for OLTP systems but easy for OLAP systems). After that, the processed data was combined with static data describing the data sources at the data mart level of iMES. Using the Microsoft SQL Server Analysis Services data mining algorithms, several mining models were built. As a result of subsequent querying, general features of the defective data sources were found. The analysis showed that a defective data source is usually a pressure sensor set on a pump of an exact type, produced in the 1998–2000 period by a single manufacturer and installed outdoors. This mined knowledge saved the specialists' resources in finding the reason for the "bad" data, and gave them useful analytical data for replacing defective equipment. These examples show that the iMES conception can be successfully used in oil and gas and possibly in other industries. Today most of the organizational units in Vostokgazprom are involved in the MES context, so deployment of this model should be made in tight connection with the MES "Magistral-Vostok"4 implemented there. Before the deployment of our model, MES "Magistral-Vostok" was intellectualized as described above. This allowed: consolidating the data needed for our model, deploying the model on the appropriate level of iMES among other possible data mining models and analytic functions, providing the feedback needed for continuous model improvement, easily using the results of automatic well test estimation at the MES client side, etc. MES "Magistral-Vostok" is made using Microsoft technologies (Microsoft SQL Server). The success of this solution has been proven by practical usage in Vostokgazprom for several years [17].
4 mes-magistral.ru
4 Conclusion

MES play a huge role in modern industry management. Despite the rapid evolution of the MES conception, classical MESA-11 [12] is a common architecture for such systems. We stand for the further intellectualization of MES. There are examples of IMS implementation in industry, but there are no actual frameworks which can be easily applied to the oil and gas industry. We propose a meta-model-driven approach to MES intellectualization and a framework to design iMES. The shown example of MES intellectualization for an oil and gas production company and its practical benefits let us assume that an analogous approach can be implemented almost everywhere where MES is suitable.
References

1. Nagalingam, S.V., Lin, G.C.I.: Latest developments in CIM. Robotics and Computer Integrated Manufacturing 15, 423–430 (1999)
2. AMICE Consortium: Open System Architecture for CIM, Research Report of ESPRIT Project 688, vol. 1. Springer-Verlag (1989)
3. Logica: MES Product Survey 2010. Logica, 526 (2010)
4. Shaohong, J., Qingjin, M.: Jinan Research on MES Architecture and Application for Cement Enterprises. In: ICCA 2007, May 30-June 1, pp. 1255–1259. IEEE, Guangzhou (2007)
5. MESA International, http://www.mesa.org/en/modelstrategicinitiatives/MESAModel.asp
6. Littlefield, M., Shah, M.: Management Operation Systems. The Next Generation of Manufacturing Systems, Aberdeen Group, 19 (2008)
7. Hammer, M., Champy, J.: Reengineering the Corporation: A Manifesto for Business Revolution. Harper Business, New York (1994)
8. Kanter, J.: Management-Oriented Management Information Systems, 2nd edn., p. 484. Prentice Hall, Englewood Cliffs (1977)
9. Van Dyk, L.: Manufacturing execution systems. M.Eng. dissertation. University of Pretoria, Pretoria (1999)
10. ANSI/ISA-95.00.03-2005 Enterprise-Control System Integration, Part 3: Models of Manufacturing Operations Management
11. Christo, C., Cardeira, C.: Trends in Intelligent Manufacturing Systems. In: ISIE 2007, June 4-7, pp. 3209–3214. IEEE, Vigo (2007)
12. Hand, D.J., Mannila, H., Smyth, P.: Principles of Data Mining. Massachusetts Institute of Technology, 378 (2001)
13. Ngai, E., Xiu, L., Chau, D.: Application of data mining techniques in customer relationship management: A literature review and classification. Expert Systems with Applications 36(2), 2592–2602 (2009)
14. Al-Kaabi, A.U., Lee, J.W.: Using Artificial Neural Nets To Identify the Well-Test Interpretation Model. SPE 28151 (1993)
15. Azevedo, A., Santos, M.P.: KDD, SEMMA and CRISP-DM: A parallel overview. In: IADIS European Conference Data Mining, Amsterdam, July 24-28, pp. 182–185 (2008)
16. Larson, B.: Delivering Business Intelligence With Microsoft SQL Server 2008, p. 792. McGraw-Hill Osborne Media, New York (2008)
17. Bogdan, S., Kudinov, A., Markov, N.: Example of implementation of MES Magistral-Vostok for oil and gas production enterprise. In: CEE-SECR 2009, October 28-29, pp. 131–136. IEEE, Moscow (2009)
Get Your Jokes Right: Ask the Crowd

Joana Costa 1, Catarina Silva 1,2, Mário Antunes 1,3, and Bernardete Ribeiro 2

1 Computer Science Communication and Research Centre, School of Technology and Management, Polytechnic Institute of Leiria, Portugal
{joana.costa,catarina,mario.antunes}@ipleiria.pt
2 Department of Informatics Engineering, Center for Informatics and Systems of the University of Coimbra (CISUC), Portugal
{catarina,bribeiro}@dei.uc.pt
3 Center for Research in Advanced Computing Systems (CRACS), Portugal
Abstract. Joke classification is an intrinsically subjective and complex task, mainly due to the difficulties of coping with contextual constraints when classifying each joke. Nowadays people have less time to devote to searching for and enjoying humour and, as a consequence, they are usually interested in having a set of interesting, filtered jokes that could be worth reading, that is, with a high probability of making them laugh. In this paper we propose a crowdsourcing-based collective intelligence mechanism to classify humour and to recommend the most interesting jokes for further reading. Crowdsourcing is becoming a model for problem solving, as it revolves around using groups of people to handle tasks traditionally associated with experts or machines. We put forward an active learning Support Vector Machine (SVM) approach that uses crowdsourcing to improve the classification of custom user preferences. Experiments were carried out using the widely available Jester jokes dataset, with encouraging results.

Keywords: Crowdsourcing, Support Vector Machines, Text Classification, Humour classification.
1
Introduction
Time is an important constraint due to the overwhelming kind of life modern societies provide. Additionally, people are overstimulated by information that is spread faster and efficiently by emergent communication models. As a consequence, people no longer need to search, as information arrives almost freely by several means, like personal mobile devices and social networks, like Twitter and Facebook, just to mention a few examples. The numerous facilities provided by these communication platforms have a direct consequence of getting people involved on reading small jokes (e.g. one-liners) and quickly emit an opinion that is visible instantly to all the connected users. Crowdsourcing emerged as a new paradigm for using all this information and opinion shared among users. Hence, this model is capable of aggregating talent, leveraging ingenuity while reducing the costs and time formerly needed to solve problems [1]. Moreover, crowdsourcing is enabled only through the technology of the web, which is a creative mode of user interactivity, not merely a medium between messages and people [1]. L. Bellatreche and F. Mota Pinto (Eds.): MEDI 2011, LNCS 6918, pp. 178–185, 2011. c Springer-Verlag Berlin Heidelberg 2011
Get Your Jokes Right: Ask the Crowd
179
In classification scenarios, a large number of tasks must deal with inherently subjective labels and there is a substantial variation among different annotators. One of such scenarios is text classification [2] and particularly humour classification, as it is one of the most interesting and difficult tasks of it. The main reason behind this subjectivity is related with the contextual meaning of each joke, as they can have religious, racist or sexual comments. However, in spite of the attention it has received in fields such as philosophy, linguistics, and psychology, there have been few attempts to create computational models for automatic humour classification and recommendation [3]. The SVM active learning approach we propose in this paper takes advantage of the best of breed SVM learning classifier, active learning and crowdsourcing, used for classifying the examples where the SVM has less confidence. Thus, we aim to improve the SVM baseline performance and provide a more assertive joke recommendation. The reason for using active learning is mainly to expedite the learning process and reduce the labelling efforts required by the supervisor [4]. The rest of the paper is organized as follows. We start in Section 2 by describing the background on SVM, crowdsourcing and humour classification and proceed into Section 3 by presenting the crowdsourcing framework for humour classification. Then, in Section 4 we introduce the Jester benchmark and discuss the results obtained. Finally, in Section 5 we delineate some conclusions and present some directions for future work.
2
Background
In what follows we will provide the background on Support Vector Machine (SVM), crowdsourcing and humour classification, which constitute the generic knowledge for understanding the approach proposed ahead in this paper. 2.1
Support Vector Machines
SVM is a machine learning method introduced by Vapnik [5], based on his Statistical learning Theory and Structural Risk Minimization Principle. The underlying idea behind the use of SVM for classification, consists on finding the optimal separating hyperplane between the positive and negative examples. The optimal hyperplane is defined as the one giving the maximum margin between the training examples that are closest to it. Support vectors are the examples that lie closest to thehyperplane. Once this hyperplane is found, new examples can be classifiedby determining on which side of the hyperplane they are. The output of a linear SVM is u = w × x − b, where w is the normal weight vector to the hyperplane and x is the input vector. Maximizing the margin can be seen as an optimization problem: minimize
1 ||w||2 , subjected to yi (w.x + b) ≥ 1, ∀i, 2
(1)
where x is the training example and yi is the correct output for the ith training example. Intuitively the classifier with the largest margin will give low expected risk, and hence better generalization.
180
J. Costa et al.
To deal with the constrained optimization problem in (1) Lagrange multipliers αi ≥ 0 and the Lagrangian (2) can be introduced: l
Lp ≡
1 ||w||2 − αi (yi (w.x + b) − 1). 2 i=1
(2)
In fact, SVM constitute currently the best of breed kernel-based technique, exhibiting state-of-the-art performance in diverse application areas, such as text classification [6, 7, 8]. In humour classification we can also find the use of SVM to classify data sets [9, 3]. 2.2
Crowdsourcing
Over the last few years, with the burst of communication technologies, virtual communities emerged. People are now easily connected and can communicate, share and join together. Considering this new reality, industries and organizations discovered an innovative low-cost work force, which could save time and money in problem solving, as online recruitment of anonymous, a.k.a. crowdsourcing, brings a new set of issues to the discussion [10, 1, 11, 12]. Since the seminal work of Surowiecki [13], the concept of crowdsourcing is expanding, mainly through the work of Jeff Howe [10], where the term crowdsourcing was definitely coined. The main idea underpinning crowdsourcing is that, under the right circumstances, groups can be remarkably intelligent and efficient. Groups do not need to be dominated by exceptionally intelligent people in order to be smart, and are often smarter than the smartest individual in them, i.e. the group decisions are usually better than the decisions of the brightest party. As an example, if you ask a large enough group of diverse, independent people to predictor estimate a probability, and then average those estimates, the errors each of them makes in coming up with an answer will cancel themselves out, i.e., virtually anyone has the potential to plug in valuable information [13, 14]. There are four conditions that characterize wise crowds [13]: 1. Diversity of opinion, as each person should have some private information, even if it is just an eccentric interpretation of the known facts. 2. Independence, related to the fact that people’s opinion is not determined by the opinions of those around them. 3. Decentralization, in which people are able to specialize and draw on local knowledge. 4. Aggregation, related to the existing mechanisms for turning private judgements into a collective decision. Due to its promising benefits, crowdsourcing has been widely studied for the last few years, being the focus of science and research in many fields like biology, social sciences, engineering, computer science, among others. [15]. In computer science, and particularly in machine learning, crowdsourcing applications are booming. In [16] crowdsourcing is used for the classification of emotion in speech, by rating contributors and defining associated bias. In [17] people contribute to image classification and are rated to obtain cost-effective labels. Another interesting application is presented in [18], where facial recognition is carried out by asking people to tag specific characteristics in facial images.
Get Your Jokes Right: Ask the Crowd
181
There are still few applications of crowdsourcing for text classification. In [19] economic news articles are classified using supervised learning and crowdsourcing. In this case subjectivity is not an issue, while in our application scenario subjectivity is of major importance. 2.3
Humour Classification
Humour research in computer science has two main research areas: humour generation [20, 21] and humour recognition [9, 3, 22]. With respect to the latter, research done so far considers mostly humour in short sentences, like one-liners, that is jokes with only one line sentence. Humour classification is intrinsically subjective. Each one of us has its own perception of fun, yetautomatic humour recognition is a difficult learning task. Classification methods used thus far are mainly text-based and include SVM classifiers, naïve Bayes and less commonly decision trees. In [9] a humour recognition approach based in one-liners is presented. A dataset was built grabbing one-liners from many websites with an algorithm and the help of web search engines. This humourous dataset was then compared with non-humourous datasets like headlines from news articles published in the Reuters newswire and a collection of proverbs. Another interesting approach [22] proposes to distinguish between an implicit funny comment and a not funny one. A 600,000 web comments dataset was used, retrieved from the Slashdot news Web site. These web comments were tagged by users in four categories: funny, informative, insightful, and negative, which split the dataset in humourous and non-humourous comments.
3
Proposed Approach
This section describes the proposed crowdsourcing SVM active learning strategy. Our approach is twofold. On one hand, we use the power of the crowd as a source of information. On the other hand, we define and guide the crowd with an SVM active learning strategy. Figure 1 shows the proposed framework.
Fig. 1. Proposed crowdsourcing active learning SVM framework
We start by constructing a baseline SVM model to determine which examples should be presented to the crowd. Then, we generate a new SVM model that benefits from the active examples obtained by the crowd classification feedback. The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the
182
J. Costa et al.
data from which it learns. An active learner may pose queries, usually in the form of unlabeled data instances to be labeled by an oracle [23, 24]. We use an SVM active learning strategy that determines the most uncertain examples and point them as active examples to be labeled using the SVM separating margin as the determining factor. When an SVM model classifies new unlabeled examples, they are classified according to which side of the Optimal Separating Hyperplane (OSH) they fall. Yet, not all unlabeled points are classified with the same distance to the OSH. In fact, the farther from the OSH they lie, i.e. the larger the margin, more confidence can be put on their classification, since slight deviations of the OSH would not change their given class. To classify the active examples, instead of using a supervisor as traditionally happens, we propose to use crowdsourcing, i.e., make available the set of examples to classify and let people willingly provide the classification. While in academic machine learning benchmark-based settings this may seem useless, in real situations where in fact the classification is not known, it can become remarkably important.
4
Experimental Setup
In this section we start by describing the Jester jokes data set, used in the experiments. We then proceed by detailing the pre-processing method and finally, we conclude by depicting the results obtained. 4.1
Data Set
The Jester dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users and is available at: http://eigentaste.berkeley.edu. It was generated from Ken Goldberg’s joke recommendation website, where users rate a core set of 10 jokes and receive recommendations from other jokes they could also like. As users can continue reading and rating and many of them end up rating all the 100 jokes, the dataset is quite dense. The dataset is provided in three parts: the first one contains data from 24,983 users who have rated 36 or more jokes, the second one data from 23,500 users who have rated 36 or more jokes and the third one contains data from 24,938 users who have rated between 15 and 35 jokes. The experiments were carried out using the first part as it contains a significant number of users and rates for testing purposes, and for classification purposes was considered that a joke classified on average above 0.00 is a recommendable joke, and a joke below that value is non recommendable. The jokes were split into two equal and disjoint sets: training and test. The data from the training set is used to select learning models, and the data from the testing set to evaluate performance. 4.2
Pre-processing Methods
A joke is represented as the most common, simple and successful document representation, which is the vector space model, also known as Bag of Words. Each joke is indexed with the bag of the terms occurring in it, i.e., a vector with one component for each term occurring in the whole collection, having a value that takes into account the number of times the term occurred in the joke. It was
Get Your Jokes Right: Ask the Crowd
183
also considered the simplest approach in the definition of term, as it was defined as any space-separated word. Considering the proposed approach and the use of text-classification methods, pre-processing methods were applied in order to reduce feature space. These techniques, as the name reveals, reduce the size of the joke representation and prevent the mislead classification as some words, such as articles, prepositions and conjunctions, called stopwords, are non-informative words, and occur more frequently than informative ones. These words could also mislead correlations between jokes, so stopword removal technique was applied. Stemming method was also applied. This method consists in removing case and inflection information of a word, reducing it to the word stem. Steaming does not alter significantly the information included, but it does avoid feature expansion. 4.3
Performance Metrics
In order to evaluate a binary decision task we first define a contingency matrix representing the possible outcomes of the classification, as shown in Table 1. Table 1. Contingency table for binary classification Class Positive Class Negative a b (True Positives) (False Positives) Assigned Negative c d (False Negatives) (True Negatives) Assigned Positive
Several measures have been defined based on this contingency table, such b+c a a as, error rate ( a+b+c+d ), recall (R = a+c ), and precision (P = a+b ), as well as combined measures, such as, the van Rijsbergen Fβ measure [25], which combines recall and precision in a single score: Fβ =
(β 2 + 1)P × R . β2P + R
(3)
Fβ is one of the best suited measures for text classification used with β = 1, i.e. F1 , an harmonic average between precision and recall (4). F1 = 4.4
2×P ×R . P +R
(4)
Results and Discussion
To test and evaluate the proposed approach, we used the margin-based active learning strategy presented in Section 3 and preprocessed the jokes according to the methodology described in Section 4.2. We select ten active jokes, correspond to those which would be more informative to the learning model, and then we let the crowd members to classify them. After collecting the 100 answers for each joke classification, we averaged the results. Table 2 summarizes the overall performance results obtained. Analysing
184
J. Costa et al.
Table 2. Performances of Baseline and Crowdsourcing Approaches Precision Baseline SVM Crowd SVM
81.40% 81.82%
Recall
F1
92.11% 86.42% 94.74% 87.80%
the table we can see that crowdsourcing introduced tangible improvements even with such preliminary setup. As both recall and precision were improved we were able to conclude that the enhancement was robust regarding both false positive and false negative values. However, further work might be necessary in order to guarantee that the used crowd is vast and diverse enough to optimize the performance of the proposed model, as classification errors (or at least inconsistency) made by the crown were verified. These errors might me explained not only by the general subjectiveness of humour, but also by the contextual meaning of some jokes, as the used crown was mostly non-English native, and some jokes were intrinsically related to the American culture.
5
Conclusions and Future Work
In this paper we have presented a framework for humour classification, based on an SVM active learning strategy that uses crowdsourcing to classify the active learning examples. Our aim was to evaluate the improvement of performance with this strategy, when compared with baseline SVM. For that purpose, we have conducted a set of experiments using the Jester data set, by comparing the baseline SVM model with our twofold active learning approach, which consisted in using the crowd source information to classify the examples in which SVM has less confidence. The preliminary results obtained are very promising and in line with some previous published work. We were able to observe that crowdsourcing can improve the baseline SVM. Although the improvement is still slight, probably due to the constrains referred in Section 4.4, an overall performance improvement was achieved, i.e., users can be more confident in the assertiveness of the joke recommendation when using crowdsourcing. Our future work will include a diverse crowd and possibly a more restrictive contextual jokes.
References 1. Brabham, D.C.: Crowdsourcing as a Model for Problem Solving: An Introduction and Cases. Convergence: The International Journal of Research into New Media Technologies 14(1), 75–90 (2008) 2. Raykar, V., Yu, S., Zhao, L., Valadez, G., Florin, C., Bogoni, L., Moy, L.: Learning from crowds. The Journal of Machine Learning Research 99, 1297–1322 (2010) 3. Mihalcea, R., Strapparava, C.: Making computers laugh: investigations in automatic humor recognition. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 531–538 (2005)
Get Your Jokes Right: Ask the Crowd
185
4. Baram, Y., El-Yaniv, R., Luz, K.: Online choice of active learning algorithms. In: Proceedings of ICML 2003, 20th International Conference on Machine Learning, pp. 19–26 (2003) 5. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1999) 6. Joachims, T.: Learning Text Classifiers with Support Vector Machines. Kluwer Academic Publishers, Dordrecht (2002) 7. Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. The Journal of Machine Learning Research 2, 45–66 (2002) 8. Antunes, M., Silva, C., Ribeiro, B., Correia, M.: A Hybrid AIS-SVM Ensemble Approach for Text Classification. In: Dobnikar, A., Lotrič, U., Šter, B. (eds.) ICANNGA 2011, Part II. LNCS, vol. 6594, pp. 342–352. Springer, Heidelberg (2011) 9. Mihalcea, R., Strapparava, C.: Technologies That Make You Smile: Adding Humor to Text-Based Applications. IEEE Intelligent Systems 21(5), 33–39 (2006) 10. Howe, J.: The Rise of Crowdsourcing. Wired (June 2006) 11. Hsueh, P.-Y., Melville, P., Sindhwani, V.: Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria, pp. 1–9 (May 2009) 12. Nov, O., Arazy, O., Anderson, D.: Dusting for science: motivation and participation of digital citizen science volunteers. In: Proceedings of the 2011 iConference, pp. 68–74 (2011) 13. Surowiecki, J.: The Wisdom of Crowds. Doubleday (2004) 14. Greengard, S.: Following the crowd. Communications of the ACM 54(2), 20 (2011) 15. Leimeister, J.: Collective Intelligence. In: Business & Information Systems Engineering, pp. 1–4 (2010) 16. Tarasov, A., Delany, S.: Using crowdsourcing for labelling emotional speech assets. In: ECAI - Prestigious Applications of Intelligent Systems, pp. 1–11 (2010) 17. Welinder, P., Perona, P.: Online crowdsourcing: rating annotators and obtaining cost-effective labels. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 25–32 (2010) 18. Chen, Y., Hsu, W., Liao, H.: Learning facial attributes by crowdsourcing in social media. In: WWW 2011, pp. 25–26 (2011) 19. Brew, A., Greene, D., Cunnigham, P.: The interaction between supervised learning and crowdsourcing. In: NIPS 2010 (2010) 20. Stock, O., Strapparava, C.: Getting serious about the development of computational humor. In: IJCAI 2003, pp. 59–64 (2003) 21. Binsted, K., Ritchie, G.: An implemented model of punning riddles. arXiv.org, vol. cmp-lg (June 1994) 22. Reyes, A., Potthast, M., Rosso, P., Stein, B.: Evaluating Humor Features on Web Comments. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation, LREC 2010 (May 2010) 23. Settles, B.: Active learning literature survey. CS Technical Report 1648, University of Wisconsin-Madison (2010) 24. Silva, C., Ribeiro, B.: On text-based mining with active learning and background knowledge using svm. Soft Computing - A Fusion of Foundations, Methodologies and Applications 11(6), 519–530 (2007) 25. van Rijsbergen, C.: Information Retrieval. Butterworths ed. (1979)
An Evolutionary Approach for Program Model Checking Nassima Aleb, Zahia Tamen, and Nadjet Kamel Computer Sciences Department, University of Sciences and Technologies Houari Boumediene, BP 32 EL ALIA Bab Ezzouar, 16111, Algiers Algeria [email protected], [email protected], [email protected]
Abstract. In this paper we use a genetic algorithm to verify safety properties of C programs. We define a new method for program modeling: A Separation Modeling Approach: ASMA, in which programs are represented by two components: Data Model DM, and Control Model CM. The safety verification problem is expressed by means of reachability of some erroneous location L in the program. First, we compute the “Access chain” of L: a string where each position represents the required value of CM elements guards to reach L. Then, the genetic algorithm starts by generating each time a new population which tries to provide an execution which is "conform" to the Access chain. An individual of the population is a set of intervals each one representing an input variable. Our technique allows handling programs containing pointers and function calls. Keywords: Program modeling, Static verification, Genetic algorithms.
1 Introduction Software model checking has been an active area of recent research [1,3,7,12,14,20]. The input to a software model checker is the program source and a temporal safety property. The specification is usually given by program instrumentation [2,5]. The output of the model checker is ideally either a proof of program correctness that can be separately validated [16,17], or a counterexample in the form of an execution path of the program. Recently, two abstraction techniques have given good results in software verification: Predicate abstraction and abstract interpretation. Predicate abstraction is used in a key paradigm: Counterexample-Guided Abstraction Refinement (CEGAR) [1,4,6,11,18,21,22]. The model checker starts by verifying the property on a coarse abstraction of the program, in the case of imprecision, the abstraction is refined. The CEGAR paradigm is used by the Slam project [1] and the Blast model checker [10]. Abstract interpretation [8] uses abstract domains : intervals, octagonal.., to capture the semantic of programs. In this paper we present another approach for program model checking. We model programs by using the ASMA modeling approach. In ASMA, each function of the program is represented by two models: Data Model (DM) representing operations on variables and Control Model (CM) expressing the control structure. Integers representing locations are used to number each operation in DM. The erroneous location L is characterized by an access chain: “ACCESS”. A genetic algorithm is used to generate and improve individuals representing various input variables values such that their execution paths become closer to ACCESS. The objective is achieved if there is some individual L. Bellatreche and F. Mota Pinto (Eds.): MEDI 2011, LNCS 6918, pp. 186–199, 2011. © Springer-Verlag Berlin Heidelberg 2011
An Evolutionary Approach for Program Model Checking
187
which access chain matches with ACCESS. We exploit the concept of weakest precondition, defined in [9] to perform a kind of “symbolic execution”. The rest of the paper is organized as follow: The section 2 exposes our modeling approach. In the section 3 are defined: Symbolic executions, weakest precondition and execution paths. Section 4 develops the verification technique. Functions are developed in the section 5, while the section 6 is devoted to pointers and aliasing. Some attempts to “ensure” desirable formal properties are discussed in the section 7. In the section 8, we expose some experimental results, the last section concludes by highlighting contributions of our work and exposing some future directions.
2 Program Modeling We prepare our program by applying the following operations: 1. 2. 3. 4.
for and do while instructions are replaced by equivalent while instruction Switch instruction is replaced the if then else statement. Post and Pre-increment(and decrement)are transformed into standard forms. Output statements (printf, .)are eliminated: They have no effect on execution.
So, a program contains: Assignments, conditionals, repetitions and function calls. Usually, programs are represented by control flow graphs, we define another representation: ASMA. It separates a program into two models: The first describes all operations affecting variables values, we call it: Data Model; the second is called Control Model, it summarizes the control structure of the program in a compact way. 2.1 Data Model: DM It models variable’s declarations, assignments, and inputs. These instructions are numbered with integers representing locations. 1.
2. 3. 4. 5.
Variable declarations: A declaration of the form: type idf is modeled by idf = type_idf0, meaning that idf has the type type and has not yet a known value. Global variables declarations are all designed by location 0. Local declarations are numbered by the location where they are performed. Assignments: Are represented in the same way as in the source program. Simultaneous declarations and assignments: A statement of the form: type idf=val is modeled by idf=type_val. For example: float x=2.5 : x=float_2.5 Inputs: An input assigns some value to a variable. So, the input of a variable v is modeled by v=$v, where $v is interpreted as an unknown constant. Predefined functions rand and malloc: A call having the form: v=rand(..) or v=malloc(..)is modeled by v=£v where £v is an unknown constant.
2.2 Control Model: CM Control Model describes constraints that make possible the execution of each DM instruction. It models conditional statements and loops. Conditional statements: There are two sorts of conditional statements: alternative statement (with the else branch) and the simple conditional (without the else branch). An alternative statement is modeled by I=(Cd,Si,Se,Sf) where :
188
-
N. Aleb, Z. Tamen, and N. Kamel
Cd : is a Boolean expression representing the condition of the statement. Si : is the location of the first instruction to perform if Cd is true. Se : is the location of the first instruction to perform if Cd is false. Sf : is the location of the first instruction after the conditional statement.
For an element I of CM: Then(I)=[Si,Se[and Else(I)=[Se,sf[. [Sk,Sl[ is locations set from Sk (included) to Sl (excluded). A simple conditional statement is represented by (Cd,Si,Sf) with Cd, Si and Sf having the same meaning as the alternative statement. Loops: A while statement is modeled by I=(Cd,Si,Sf)* where Cd, Si and Sf have the same meaning as the alternative statement.We call Body(I)=[Si,Sf[. We call also: CM[i]:The ith element of CM; Cd[i] its condition ; and Begin(i) the first location in it. ITE: The subset of CM elements having the form (Ci,Si,Se,Sf) ; IT: The subset of CM elements having the form (Ci,Si,Se); and LOOP: The subset of CM elements having the form (Ci,Si,Sf)*. 2.3 Modeling Example Source Code 0: int x,y,z,t; 1:x=1 ;
2: scanf(“%d”,&y); 3:scanf(“%d”,&z); if (x>y+z) 4:{ x=z; 5: t=y-1;
if(y>0) 6: {x=t ; 7 : t=0 ; } else
8: t=t+x ; }; else 9: { x=y; 10: t=z+x ; if (t>y)
{While(x0) if (t>10) 16: t=t-10;
Data Model 0:x=int_x0 0 :y=int_y0 0 :z=int_z0 0 :t=int_t0 1: x=1 2: y=$y 3 :z=$z 4: x=z 5: t=y-1 6:x=t 7: t=0 10: t=z+ 8: t=t+x 9:x=yx 11: x=x+2 12: t=t-1 13:t=t+x 14:x=x+1 15:t=x 16:t=t-10
Control Model 1:(x>y+z,4,9,16) 2:(y>0,6,8,9) 3:(t>y,11,15,16) 4:(x0,16,17) 7:(t>10,16,17)
Fig. 1. Program Example Prog with its modeling
An Evolutionary Approach for Program Model Checking
189
2.4 Location Access Chain Computing Let L be a location, the Access chain ACCESS of L expresses the required guard value of each element of CM from the beginning to L. ACCESS is a string constituted of characters ‘1’,’0’ or ‘x’. Let’s note the expression “must be equal” by “≡”. So: ‘1’ if Cd[i] ≡ True ACCESS[i] =
‘0’ if Cd[i]≡False ‘x’ in the other cases.
If CM[i] is an element of ITE then if L∈Then(i) then Cd[i] ≡True else if L∈Else(i) then Cd[i]≡False. A same reasoning is performed on the other CM subsets. Examples : In the program Prog: L =12: ACCESS=0x11; L=13: ACCESS=0x1x1.
3 Symbolic Executions We use the concept of weakest precondition to do “symbolic executions”. This allows computing just the considered guard in the desired point of the program instead of running all the statements of the program from the beginning until the considered location. 3.1 Weakest Precondition For a statement S and a predicate C, let WP(S,C) denotes the weakest precondition of C with respect to (w.r.t.) the statement S. WP(S,C) is defined as the weakest predicate whose truth before S entails the truth of C after S terminates. Let v=exp be an assignment, where v is a variable and exp is an expression. Let C be a predicate, by definition WP(v=exp,C) is C with all occurrences of v replaced with exp, denoted C[exp/v]. For example: WP(v=v+2, v>8) = (v+2)>8 = (v>6). In the subsequent, we denote WP(Si,C) the weakest precondition of the predicate C w.r.t. the statement having the location Si in the Data Model. We use the concept of weakest precondition to evaluate CM element’s conditions which represent positions values of individual’s access chain. Hence, since we use intervals to design different parts of CM elements: Then(I), Else(I) and Body(I). We call the two first intervals: simple intervals and the last one iterative interval. So, in the subsequent, we extrapolate the definition of weakest precondition to be applied to simple locations intervals. Weakest precondition for iterative intervals will be exposed afterward. We define the weakest precondition of a predicate C w.r.t. an interval [Si,Sj[, denoted by WPI([Si,Sj[,C), as the weakest predicate whose truth before Si entails the truth of C after Sj-1 terminates. The idea is to compute successively the weakest preconditions of C with respect to each location within [Si,Sj[ starting by the end until we attain Si, or we obtain a constant meaning that there are no variables occurring in C. For each location Sk∈[Si,Sj[, the result obtained from computing WP(Sk,Ck) is given as predicate to compute its weakest precondition w.r.t. Sk-1 and so on. So:
190
N. Aleb, Z. Tamen, and N. Kamel
C If no variable occurs in C WP(Si,C) If Sj=Si WPI([Si,Sj-1[,WP(Sj-1,C)) Otherwise
WPI([Si,Sj[,C)v=
Example: WPI([0,4[,x>y+z)=WPI([0,3[,WP(3,x>y+z))=WPI([0,2[,WP(2,x>y+$z)) =WPI([0,1[,WP(1,x>$y+$z))=WPI[0,0[,1>$y+$z)=1>$y+$z Which means that the condition x>y+z is satisfied just before the location 4 if and only if the input values of variables x and y verify the constraint 1>$y+$z. We define also : WPI([Si,Sj[∪[Sk,Sl[,C)=WPI([Si,Sj[,WPI([Sk,Sl[,C)) 3.2 Execution Path We define the execution path P as the succession of CM intervals targeted by some input values. Let ω(Cd[k]) be the value of Cd[k], and P[k] the portion of the path associated to CM[k]. The path P is defined as the union :P=P[1]∪P[2]..∪P[m] where :
P[k] =
Then(k) if CM[k]∈ (IT ∪ITE)and ω(Cd[k])=’1’ Else (k) if CM[k]∈ ITE and ω(Cd[k])=’0’ (Body(k))n if CM[k]∈LOOP and ω(Cd[k])=’1’ n is the iterations number.
(Body(k))n represents the union of the interval Body(k) n times. 3.3 CM Guards Value Computing Let’s call Pk the prefix having the length k of the path P. 1- CM[k]∉ LOOP : ω(Cd[k]) =
WPI([0,Begin(k)[) WPI(P(k-1),Cd[k])
If k=1 Otherwise
2- CM[k]∈LOOP: Let CM[k]=(C,Sb,se)*, we note ω(Cd[k])j Cd[k] in the path P in the iteration j: ω(Cd[k])j =
the value of
ω(Cd[k]) If j=1 WPI(P(k-1) ∪[Sb,Se[j-1 ,Cd[k]) If j>1 and ω(Cd[k])j-1=True
The loop iterations number: Is the least integer n such that ω(Cd[k])n+1=False
4 A Genetic Algorithm for Program Model Checking We use a genetic algorithm [19] for safety property verification. Having a program represented by its data model DM and its control model CM, and a safety property expressed as some erroneous location L, we try to find a set of input variables values that allow reaching L. Our technique is presented by the algorithm of the figure 2.
An Evolutionary Approach for Program Model Checking
191
First, we use L and CM to compute the access chain ACCESS of the location L, the problem is then: Is there any execution that has an access chain which matches with ACCESS?. To answer this question we use a genetic algorithm which starts with a population of individuals each one representing a possible initialization of input variables. Each input value is represented by an interval. This allows us to “correct” gradually some undesirable behavior instead of rejecting systematically each unwanted results. For each individual i, we compute its access chain Chaini recording the sequence of CM elements executed by the individual i. The fitness function computes the distance between ACCESS and Chaini. The objective is reached if we find an individual i* such that the distance between Chaini* and ACCESS is zero. In the contrary case, the population must be improved. We define a combination operator and three mutation operators. 4.1 Individuals Access Chain Computing First, let’s notice the two following points: 1- The value of a CM guard is not at all times known. For example, for the individual i such that y=[-10,50] and z=[-200,100], the value of Cd[1]: x>y+z is not known. 2- It is not always required to know the truth values of all CM guards. This is due to the fact that CM elements are often opposite. So, for example, if an individual is such that Cd[1] is false, it is not necessary to evaluate Cd[2]. Consequently, the access chain of an individual contains the characters: ‘1’,’0’, ‘u’ or ‘x’ meaning respectively: True, False, Unknown or not required. When we find the first ‘u’, we stop the computing by completing all the remainder positions by ‘u’. Let i be an individual, and let ωi(Cd[k])be the valuation of Cd[k] for the individual i. ωi(Cd[k]) is computed in the same manner than ω(Cd[k]) by using the data of the individual i. The access chain of i noted by Chaini is the string : a1a2a3..an such that :
ak =
‘1’ ‘0’ ‘u’ ‘x’
if ωi(Cd[k])=True if ωi(Cd[k]) =False if ωi(Cd[k]) is unknown Or ωi(Cd[k-1])=’u’ if Cd[k] is not required
The execution path of an individual i is computed as described in the section 3.2, we add the case where ak is not required : if ak=’x’ then P[k]=∅.
192
N. Aleb, Z. Tamen, and N. Kamel
Algorithm 1. Genetic Algorithm for programs verification 1: Inputs: A program Pg, a location L, max_iter, CardPop 2: Outputs: Eventually An execution leading to L. 3: Initializations: Pop0 : a population of individuals 4: Compute The acces chain of L : ACCESS 5: Generate an initial Population ; i=1; success=False; 6: while (i<=max_iter) and (success=False) 7: { j=1; 8: while(j<=CardPop) and (success=False) 9: { Compute Chain[j] 10: Fitness[j]=Distance (ACCESS ,Chain[j]) 11: If Fitness[j]=0 then success=True 12: else j=j+1; 13: } 14: If (success=False) 15: then { Recombination(); 16: Narrowing (); 17: StrongMutate(); 18: WeakMutate(); 19: i:=i+1; 20: } } 21: If Success 22: then The set of input values is an execution allowing to reach L ; Chain[j] is a path to L. 23: else No solution found. Fig. 2. Genetic algorithm for pro7grams verification
4.1.1 Interval Operations We adopt the same definitions for arithmetic operations as in interval abstract interpretation [16]. Logical operations are our own definitions since we use a three valuated logic. Arithmetic operations
Logical operations
n= [n, n] [a, b]+[c, d] = [a + c, b + d] [a, b]- [c, d] = [a − d, b − c] -[a, b] = [−b,−a] [a, b]*[c, d] =[Min,Max] With: Min=min(ac,ad,bc,bd) Max=max(ac,ad,bc,bd)
([a,b]=[c,d])=T if(a=b=c=d) F if [a,b][c,d]= U else ([a,b]<[c,d])=T if (b[c,d])= F if (bd) U else
Truth Table F 1 0 u
G FG FG G u 1 u u u u 0 u u u u u
An Evolutionary Approach for Program Model Checking
193
Intervals union, intersection, inclusion and appurtenance are defined as ordinary intervals operations. 4.1.2 Individual Access Chain Computing Example Let’s consider the program Prog, each individual is composed of two intervals having the type integer representing the variables y and z. A possible individual i is [-10,-5[ ; [-200,0[, let’s compute its access chain Chaini to the location 6, let’s note Chaini =a1a2 . So, let’s compute ωi(Cd[1]) and ωi(Cd[2]). We use the formulas defined in 3.3. ωi(Cd[1])=WPI([0,4[,x>y+z)=1>$y+$z=True=>a1=1 (since $y+$z=[-210,-5]) ωi(Cd[2])=WPI(([0,4[∪[4,6[),y>0)=$y>0=False=>a2=0. Consequently, Chaini =10 4.2 Population Initialization To guarantee some desirable properties, initial population is generated in such a way that ensures the following points: 1- Diversity 2- Acceptable quality Each created individual is first evaluated to verify the two precedent properties. To force diversity, we privilege individuals having different values in ACCESS positions represented by ‘x’ since they are those positions which can provide different access paths to the same location. An individual has an acceptable quality if its access chain has at least one correct position. It is also advantageous to use large size intervals in the initialization stage, and to narrow them progressively. Thus, let |Pop| be the number of initial population individuals. So, as in [21], in the initialization phase, we generate |Pop|+K (K>0) individuals but we preserve only the |Pop| “best” ones. 4.3
Fitness Function
For an individual i, the fitness function, Fitness(i) measures the distance between ACCESS and Chaini. The computing of the fitness is performed in the following manner: Let Fit be a string such that: if(ACCESS[k]=’x’)OR(ACCESS[k]=Chaini[k]) then Fit[k]=0 else Fit[k]=1. Fitness(i) is the decimal number obtained by converting the binary number represented by Fit. Example: Let ACCESS=10xx1; and let i an individual such that: Chain i=1101U ; Fit=01001 ; Fitness(I )= 9. We remark that the Fitness function represents truthfully the distance between the desired behavior and the behavior of the considered individual. In fact, if we consider for example two individuals i1 and i2 such that Fit1=1000 and Fit2=0001 despite the fact that these two individuals have both one faulty position which does not match with ACCESS, their Fitness must be different because the first individual has failed in the first guard so it has taken a path completely different from ACCESS and it represents an execution that is completely deviating. While i2 has matched with ACCESS until the last position so it is closer to ACCESS. So, i2 is better than i1, which is effectively expressed by our fitness function since: Fitness(i1)=8 and Fitness(i2)=1. The goal is to find an individual i* such that Fitness(i*)=0.
194
N. Aleb, Z. Tamen, and N. Kamel
4.4 Population Improvement To ameliorate the population, we adopt a guided approach which increases the probability of obtained individuals to be effectively better than their parents. However, since recombination and mutation may be performed many times in a genetic algorithm, thus, they must be as simple as possible. So, we perform a gradual amelioration. It consists to ‘correct’ the first faulty position of each individual of P: We call faulty position a position whose value in the access chain and in ACCESS are different, and its value in ACCESS is either ‘0’ or ‘1’. A faulty position could be an unknown position or an erroneous one. We categorize individuals considering their fitness (fit or unfit) and their faulty positions values (wrong or unknown). Let’s call FU, FE, UU, UE respectively the individual’s categories: Fit with Unknown positions, Fit with Erroneous positions, Unfit with Unknown positions and finally, Unfit with Erroneous positions. 4.4.1 Recombination Recombination operator is applied on the class FU to correct progressively individual’s faulty positions. Let i1 and i2 be two individuals such that i1 has the position p as first unknown position and p is not a faulty position for i2. The idea is to use the data of the individual i2 to correct the unknown position p of i1. However, since in a program variables are strongly correlated to each other, modifying some data of an individual to correct some guard could in the same time alter negatively other guards. To avoid this situation, we modify data corresponding to some faulty position in a “conservative” way. So we perform an intersection between the data, occurring in the position p, of i2 and those of i1. Consequently, all the guards which had a known value conserve their value, those which had unknown values could have a known ones. Hence, the recombination operator is defined as follow: Let the individuals : i1,i2 and i3.such that p is a faulty position of i1, and let x1i,x2i,x3i the values intervals of the input variable xi respectively for i1,i2 and i3. Recombination (i1,i2,p)=(i3,i2) such that for all input variable xi : x1i ∩x2i x1i
x3i =
If xi occurs in Cd[p] Otherwise
Example: Let ACCESS= 1x1101; let i1 an individual such that Chaini1 = 011001, so the faulty positions of i1 are: 1 and 4. The recombination point will be the position 1. Let i2 be an individual such that Chaini2= 101000, so, 1 is not a faulty position of i2. So, we use i2 to correct i1. Input variables occurring in Cd[1] are y and z, so : Recombination (i1,i2,1)=(i3,i2) such that : y3=y1∩y2 and z3=z1∩z2, where yi and zi are the intervals of variables y and z of the individual i. i1: [0,200 ] y1
[-50,100]
z1
i2: [-10,25 ] y2
[20,50 ]
z2
i3: [0,25 ] y3
[20,50 ]
z3
An Evolutionary Approach for Program Model Checking
195
4.4.2 Mutation Operators We define three mutation operators: Weak mutation, Strong mutation and Narrowing. Weak mutation makes a little perturbation on individuals of the class FE. It consists to modify the interval of a unique variable. The candidate variable to change must appear in the faulty position. To improve the category UE, we use Strong mutation. It consists to change randomly all variables values. The narrowing operator consists to reduce, in various ways, the input values intervals. It is used to eliminate faulty positions in individuals of category UU, since unknown values are due to intervals large sizes.
5 Functions Each function is represented by its two models: DM and CM. Without loss of generality, we suppose the two following assumptions: 1- Each function has exactly one return statement: We call it return location 2- Every call of a function fi is of the form Sk:v=fi(…) where Sk is the call location. Let P be a program represented by DMp and CMp, Each statement of the form Sk: v=fi() in P is modeled as follow : -
In DMp it is represented as an ordinary assignment. So, we have: Sk: v=fi(). In CMp , it is modeled by j:(Refi , Sk). Where j is the current position in CMp and Refi is a reference to the function fi. An efficient way is to reference each function by an integer. Let Sri be the return location of the function fi ; let L0 be the call location, and finally, let’s note ACCESS(Sri) the access chain of the location Sri in the function fi ; and ACCESSl the access chain of the location L in the overall program.
Definition: Let P=(DMp,CMp) be a program such that Sk: v=fi()∈DMp and j:(Refi,Sk)∈CMp ; let L be a location in P. The access chain of L is ACCESSl=a1 a2 a3.… aj-1 ^ ACCESS(Sri)^ aj+1… an.‘^’ is concatenation operator. - The chain “a1 a2 a3.… aj-1” is computed as explained in previous sections. - ACCESS(Sri): the access chain of Sri in the function referenced by Refi. - The chain “aj+1… an” is computed as explained in previous sections. Let L be a location in the caller program and let’s check if L is reachable. We have three cases regarding the position of L we verify w.r.t. L0 1- LL0 : We use the previous definition. 3- The location L is in the body of the function: To reach L, we must first reach L0. So the problem is transformed in two parts: First, consider the reachability of L0 in the caller program, then the reachability of L in the function. Let’s call ACCESSlf : the access chain of the location L in the function f. So we have : ACCESSlp = ACCESSL0 ^ ACCESSLf
196
N. Aleb, Z. Tamen, and N. Kamel
Hence we have calculated the access chain of the studied location, the computing of the access chain of individuals is done as in previous sections. Global variables modifications: Global variables are handled in a natural way. Let vg be a global variable. Two cases are possible: -
-
Case 1: vg is declared in the function. In the weakest precondition computing of a predicate w.r.t. some location coming after the call of fi, the position 0 of fi 0: v=type_v0 is encountered before exiting the function (in the reverse order), so in the limit case, the variable v is assigned the constant type_v0. Case 2: vg is not declared in the function. In the weakest precondition computing of a predicate w.r.t. a location coming after the fi procedure call, we will first encounter either the position in fi where vg have been modified or if it does not exist we find the last modification of vg before entering the function.
So, in the two cases the correct substitutions are done for the variable vg.
6 Pointers and Aliasing Our approach allows handling programs manipulating pointers. We regroup variables references in sets representing equivalence classes. Each class has one representing element. Every modification of any element of a class is expressed on its representing element meaning that all the elements of the considered class are modified in the same manner. This method allows us to resolve the problem of aliased variables in a very natural way. So, during the DM computing we perform these actions: 1. To each declaration of the form type *v, we create the class corresponding to v Cv={(v,0)} with v its representing element. 2. For every variable x, the first assignment having the form Si:v=&x , has as effect to create the class corresponding to x references, containing the elements (&x,0) and (v,Si) meaning that &x is a reference of x since the location 0 and v is a reference of x since the location Si. &x is the representing element of the class. 3. For every assignment of the form Si:a=b such that b is an element of a given class C, the couple (a,Si) is added to the class C. 4. Each assignment of the form *c=d such that c is an element of a class having e as representing element, has as effect to assign the value d to *e. So, it is modeled by: *e=d. In this manner, all the operations done on variables referenced by different names are expressed on the class representing element. 5. The weakest precondition of a predicate with a dereference *a is computed in the same manner than precedent cases except that we use the representing element of equivalence class instead of the variable a. Example: Let’s consider the following portion of program and generate its DM: Program Data Model i=0; 1: i=0 a=&i; 2: a=&i /* Creation of C&I insert(a,2) */ b=&i; 3: b=&i /* Insert (b,3) in C&i */ *b=1; 4: i=1 /* Since b∈ C&i */ a=&j; 5: a=&j /* Creation of C&j, insert (a,5) */
An Evolutionary Approach for Program Model Checking
c=a; *a=2;
6: c=a 7: j=2
197
/* Insert (c,6) in C&j */ /*Since a∈C&j */
The created classes are: C&i={(&i,0) , (a,2) , (b,3)} and C&j={(&j,0) , (a,5) , (c,6)}. We have in these two sets all aliasing information. For example, for the variable a, from the location 2 to 4 a is a reference to i, while from the location 5 to the end it is a reference to j. We have also, for example, from 3 to 5 a and b are aliased and from 6 to the end a and c are aliased. The weakest preconditions computing is performed on the representing element, in the same manner as previousely. Let’s compute WPI([1,5[,*a=1).WP(4,*&i=1)=WP(4,i=1)=(1=1)=True. So, WPI([1,5[,*a=1)=True.
7 Formal Properties Soundness : The soundness property guarantees that if a location L is reachable then we must find an execution leading to it. Under some constraints, our technique can satisfy the soundness property. We can assert the following implication: In the initial population, if there is for each element of CM, at least one individual satisfying the guard and one individual satisfying the opposite guard, then if a location L is reachable then the technique finds an execution allowing to reach it. To attain this achievement, the following conditions must be established: 1- Initial intervals must be rather large to avoid early erroneous positions. If the location is reachable then reduce progressively the interval in the adequate way. 2- The iterations number must be greater than ACCESS length : Since we ‘correct’ faulty positions gradually, so in the worst case where all positions are unknown, the iteration number must allow to correct one faulty position each iteration. This represents the case where all population individuals are of bad quality. Precision: If it is proven that a property is violated then it exhibits is some execution that violates it. This is satisfied by our technique since, the core of the method itself, consists to find an execution path that leads to the erroneous location.
8 Experiments We do several experimentations to evaluate our approach. Genetic algorithm parameters like the maximal number of iterations, population size, and the proportion of “fit’ individuals, are adjusted during experimentation. Our aim is to measure the influence of the predicates number on the performance of the tool. We report here the results obtained for 5 unsafe programs, in each program Prn, the access chain ACCESS of the location has a length n. The column Proved indicates if our tool has proved the program unsafe(Y) or not (N). We report the iterations number, the population size and the execution time. We had noticed three important points: -The quality of initial population is decisive for the tool performances. -Program size does not influence the performance of the method since only the path leading to the location is analyzed. -In our actual tool, we did not focus on optimizations issues, so we can improve it in diverse manners to deal with larger number of predicates in ACCESS. Yet, obtained results are already very encouraging.
198
N. Aleb, Z. Tamen, and N. Kamel
ProgramName
Proved
Pr5 Pr8 Pr10 Pr16 Pr20
Y Y Y Y Y
Population Size 50 50 100 100 100
Iterations Number 3 3 4 4 200
Execution Time(sec) 0.05 0.06 0.1 0.06 5
9 Conclusion and Future Work We have presented an original approach to the program model checking problem. The results obtained are encouraging. Our work presents several contributions: 1. Our technique for program modeling, ASMA, is very powerful; it is as simple as advantageous. In fact, it permits to manipulate a program very easily and to have control over it. We can characterize, by a set of conditions, each region of the program independently of the rest of the program and without being forced to cross the program entirely from the beginning. 2. Symbolic execution process is novel. It presents several advantages: • The definitions of weakest precondition w.r.t. intervals and control structures are novel; they have never been defined before. Furthermore, these definitions are done in a natural way. • The computing of weakest precondition in a reverse order, and by using in each location, L, the result found in the location L+1, allows using at each location, the more recent value of variables. • The weakest preconditions are computed only on the considered path in the needed location, and not over the entire program. 3. A substantial characteristic of our approach is compositionality: Functions can be analyzed by our method straightforwardly. 4. Despite the fact that genetic algorithms does not ensure any formal property like convergence or soundness, the methodology followed in our work has succeed to take advantages of two opposite worlds: formal world and the randomness world. 5. Even though the pointer analysis problem was not our primordial objective; pointers are represented and manipulated in a simple and natural way. A lot of aliasing and points-to information can be deduced from the data model without significant effort. There are two future directions to our work. The first is to investigate, with the same modeling, other evolutionary approaches. The second is the automatic test generation.
References 1. Ball, T., Rajamani, S.K.: The Slam project: Debugging system software via static analysis. In: Proc. POPL, pp. 1–3. ACM, New York (2002) 2. Ball, T., Rajamani, S.K.: SLIC: A specification language for interface checking of C. Tech.Rep.MSR-TR-2001-21, Microsoft Research (2002)
An Evolutionary Approach for Program Model Checking
199
3. Chaki, S., Clarke, E.M., Groce, A., Jha, S., Veith, H.: Modular verification of software components in C. IEEE Trans. Softw. Eng. 30(6), 388–402 (2004) 4. Clarke, E.M., Grumberg, O., Jha, S., Lu, Y., Veith, H.: Counterexample-guided abstraction refinement. In: Emerson, E.A., Sistla, A.P. (eds.) CAV 2000. LNCS, vol. 1855, pp. 154– 169. Springer, Heidelberg (2000) 5. Clarke, E.M., Grumber, O., Peled, D.: Model Checking. MIT, Cambridge (1999) 6. Clarke, E.M., Kroening, D., Sharygina, N., Yorav, K.: SatAbs: SAT-based predicate abstraction for ANSI-C. In: Halbwachs, N., Zuck, L.D. (eds.) TACAS 2005. LNCS, vol. 3440, pp. 570–574. Springer, Heidelberg (2005) 7. Corbett, J.C., Dwyer, M.B., Hatcliff, J., Pasareanu, C., Robby, L.S., Zheng, H.: Bandera: Extracting finite-state models from Java source code. In: Proc. ICSE, pp. 439–448. ACM, New York (2000) 8. Cousot, P., Cousot, R.: Abstract interpretation: A Unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Principales of Programming Languages, POPL 1977, pp. 238–252 (1977) 9. Dijkstra, E.: A discipline of programming. Prentice Hall, Englewood Cliffs (1976) 10. Beyer, D., Henzinger, T.A., Jhala, R., Majumda, R.: The software model cheker Blast. Int. J. Softw. Tools Technol. Transfer (2007) 11. Esparza, J., Kiefer, S., Schwoon, S.: Abstraction refinement with Craig interpolation and symbolic pushdown systems. In: Hermanns, H. (ed.) TACAS 2006. LNCS, vol. 3920, pp. 489–503. Springer, Heidelberg (2006) 12. Godefroid, P.: Model checking for programming languages using VeriSoft. In: Proc. POPL, pp. 174–186. ACM, New York (1997) 13. Graf, S., Saidi, H.: Construction of abstract state graphs with PVS. In: Grumberg, O. (ed.) CAV 1997. LNCS, vol. 1254, pp. 72–83. Springer, Heidelberg (1997) 14. Havelund, K., Pressburger, T.: Model checking Java programs using Java PathFinder. STTT 2(4), 366–381 (2000) 15. Hao, J.K.: Memetic algorithms. A book chapter 16. Henzinger, T.A., Jhala, R., Majumdar, R., Necula, G.C., Sutre, G., Weimer, W.: Temporalsafety proofs for systems code. In: Brinksma, E., Larsen, K.G. (eds.) CAV 2002. LNCS, vol. 2404, pp. 526–538. Springer, Heidelberg (2002) 17. Henzinger, T.A., Jhala, R., Majumdar, R., Sanvido, M.A.A.: Extreme model checking. In: Dershowitz, N. (ed.) Verification: Theory and Practice. LNCS, vol. 2772, pp. 332–358. Springer, Heidelberg (2004) 18. Henzinger, T.A., Jhala, R., Majumdar, R., Sutre, G.: Lazy abstraction. In: Proc. POPL, pp. 58–70. ACM, New York (2002) 19. Holland, J.: Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor (1975) 20. Holzmann, G.J.: The Spin model checker. IEEE Trans. Softw. Eng. 23(5), 279–295 (1997) 21. Ivancic, F., Yang, Z., Ganai, M.K., Gupta, A., Shlyakhter, I., Ashar, P.: F- Soft: Software verification platform. In: Etessami, K., Rajamani, S.K. (eds.) CAV 2005. LNCS, vol. 3576, pp. 301–306. Springer, Heidelberg (2005) 22. Kroening, D., Groce, A., Clarke, E.M.: Counterexample guided abstraction refinement via program execution. In: Davies, J., Schulte, W., Barnett, M. (eds.) ICFEM 2004. LNCS, vol. 3308, pp. 224–238. Springer, Heidelberg (2004) 23. Young, M., Pezze, M.: Software Testing and Analysis: Process, Principles and Techniques. Wiley, New York (2005)
Modelling Information Fission in Output Multi-modal Interactive Systems Using Event-B Linda Mohand-Oussa¨ıd, Idir A¨ıt-Sadoune, and Yamine A¨ıt-Ameur Laboratory of Applied Computer Science (LISI) ENSMA and University of Poitiers, France
Abstract. Output multi-modal human-machine interfaces combine output media and modalities semantically in order to increase the interaction capabilities of the machine. To provide a rigorous development approach for these interfaces, we have proposed a generic formal model that formally describes the construction of the output multi-modal interface starting from the information generated by the functional core. This formal model is composed of two sub-models: the first is dedicated to the semantic fission of information and the second to the allocation of modalities and media to information. This paper presents an Event-B implementation of the semantic fission sub-model. Keywords: Multi-modal interaction, semantic fission, formal modelling.
1 Introduction
Output multi-modal interfaces have been developed to improve communication between humans and machines in terms of efficiency, ease of use and conviviality. Several design approaches have therefore been proposed to develop this kind of interface, and multi-modal systems have appeared in many areas: telecommunications, industry, medicine, etc. However, when the interactive system is critical, the existing design approaches (section 2) become less powerful, because the same rigour is required for the design of both the functional core and the interface. This situation has led to the emergence of approaches addressing the formal design of multi-modal interfaces. Most of this work has been devoted to input multi-modal interfaces. Output multi-modality is less prevalent, and has therefore produced fewer informal specification models and thus fewer formal specification models. Our work addresses the modelling of output multi-modal interfaces. In [9], we proposed a generic formal model for the specification of a multi-modal output interactive system. Starting from the information generated by the functional core as the result of a given computation, it models the output multi-modal presentation to be delivered to the user by means of two successive transformations. The first transformation produces a representation of the output information as a combination of the elementary information units obtained from the semantic fission of the information. The second builds the multi-modal presentation by allocating (modality, media) pairs to each elementary information unit.
In this paper, we present an Event-B implementation of the semantic fission model proposed in [9]; it constitutes the first sub-model of the generic approach we had proposed. The paper is structured as follows. After a brief presentation of related work in section 2, we describe in section 3 the formal approach proposed for the development of an output multi-modal human-machine interface. In section 4, we present the Event-B development process for the proposed fission model, and in section 5 we present the binary fission Event-B models. A case study illustrating our proposal is presented in section 6. Finally, we conclude and give some perspectives on this work.
2 State of the Art
Several approaches have been devoted to the design of multi-modal interfaces in order to master the interface design process and to enhance interface usability. As stated above, input multi-modality has been the subject of more research work than output multi-modality. Indeed, input multi-modality has been addressed in several studies, which led to various formal models such as Interactive Cooperative Objects (ICOs) ([2], [3] and [4]) or to the definition of the generic formal model proposed in [5]. The Event-B method has also been used for the formal modelling of multi-modal human-machine interfaces in [8]. Output multi-modality has been addressed in two main approaches: the SRM (Standard Reference Model) [6] and the WWHT (What, Which, How, Then) model [7]. These approaches propose a process for describing the fission of output modalities in order to deliver the output multi-modal information to the users. They remain semi-formal.
3 A Generic Design Model for Output Multi-Modal Interfaces
When the designer needs to guarantee a dependable design of the multi-modal interface based on concise specifications, and to validate functional or usability properties, the models outlined above are not powerful enough. To overcome this drawback, we have proposed a generic formal model for handling the description of the output multi-modal interface design (for more details, see [9]). This model is based on the WWHT model cited above. It expresses the output multi-modal interface within a concise framework (syntax, static and dynamic semantics). The proposed model (see Figure 1) formally describes the construction of the output multi-modal interface (multi-modal presentation) according to the designer's interface choices. This multi-modal presentation is instantiated by determining the lexico-syntactic contents and morphological attributes of the multi-modal presentation. The multi-modal presentation is obtained by two successive decompositions (steps) of the output information generated by the functional core. Therefore, two formal models (the fission model and the allocation model) compose our global formal model. 1. The semantic fission model. It expresses the semantic fission, or decomposition, of the information generated by the functional core into elementary information units to be delivered to the user.
2. The allocation model. It formalizes the construction of the multi-modal presentation for each elementary information unit resulting from the fission process. The multi-modal presentation corresponds to (modality, media) pairs combined with the complementary, redundant, choice and iteration operators. Next, we present in more detail the fission model, for which we conducted the Event-B formal developments.
Fig. 1. The proposed formal model
3.1 The Fission Model
The fission model describes the basic information unit composition (static semantics) and its temporal occurrence (dynamic semantics). The description of the semantic fission model includes the description of the syntax, the static semantics and the dynamic semantics.
Syntax. Let I be the set of continuous information (whose restitution to the user takes a significant time) to fission, and UIE be the set of elementary information units. The description of the fission step is given by the following BNF (Backus Naur Form) rules:

    I ::= UIE | (optemp, opsem)(I, I) | It(n, I)   where n ∈ ℕ

where
- optemp is a temporal operator (see Table 1);
- opsem is a semantic operator (see Table 2);
- It is a binary temporal operator expressing iteration.
The temporal and semantic binary operators are defined on traces of events that express the production of the information ii ∈ I resulting from the fission description. Their signatures are:

    optemp : I × I → I
    opsem  : I × I → I
    It     : ℕ × I → I
In order to define the meaning of the introduced temporal and semantic operators, let ii and ij be two information elements of I; Table 1 and Table 2 then give the informal meaning of these operators.

Table 1. The temporal operators
  Anachronic          An(ii, ij)   ij occurs after an interval of time following the end of ii
  Sequential          Sq(ii, ij)   ij occurs immediately when ii ends
  Concomitant         Ct(ii, ij)   ij occurs after the beginning of ii and ends after ii ends
  Coincident          Cd(ii, ij)   ij occurs after the beginning of ii and ends before ii ends
  Parallel            Pl(ii, ij)   ii and ij begin and end at the same moment
  Choice              Ch(ii, ij)   deterministic choice between ii and ij
  Independent order   In(ii, ij)   ii occurs after ij or ij occurs after ii
  Iteration           It(n, ii)    ii occurs sequentially n times
Table 2. The semantic operators
  Concurrent                    Cc(ii, ij)   the semantics of ii and ij are independent
  Complementary                 Cp(ii, ij)   the semantics of ii and ij are independent but complementary
  Complementary and redundant   Cr(ii, ij)   the semantics of ii and ij are complementary and a part of their semantics is redundant
  Partially redundant           Pr(ii, ij)   the semantics of ii is included in the semantics of ij, or the semantics of ij is included in the semantics of ii
  Totally redundant             Tr(ii, ij)   ii and ij have the same semantics
Static and dynamic semantics. The static semantics describes the static properties of the information to fission. It defines two functions:
- the function int : I → D, associating to each information element its interpretation over a semantic domain D. This function defines the meaning of the information delivered to the user. It is an explicit semantics that is not handled in the interface design;
- the function T, which defines the restitution duration of information by introducing the time boundary functions start and end.
The dynamic semantics of the fission model addresses the temporal and semantic relationships between fissioned information using the functions introduced in the static semantics. First, it defines the temporal operators (optemp) by means of the start and end functions. For example, for the anachronic operator:

    ∀ ii, ij ∈ I with T(ii) = (start(ii), end(ii)) and T(ij) = (start(ij), end(ij)):
    T(An(ii, ij)) = (start(ii), end(ij))  with  end(ii) < start(ij)

Second, it expresses the semantic operators (opsem) by means of the int function. For example, for the concurrent operator, Cc(ii, ij) means that int(ii) and int(ij) are independent. The dynamic semantics also expresses the iteration operator (It) using the sequential operator Sq:

    ∀ ii ∈ I, ∀ n ∈ ℕ:  It(n, ii) = (...((ii Sq ii) Sq ii)... Sq ii)   (n times)
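By analogy with the anachronic case, and following the informal meanings of Table 1, the sequential and parallel operators used later in the Event-B development admit similar formalizations; the following reading is our own reconstruction and is not stated explicitly in the source:

    T(Sq(ii, ij)) = (start(ii), end(ij))  with  end(ii) = start(ij)
    T(Pl(ii, ij)) = (start(ii), end(ii))  with  start(ii) = start(ij)  and  end(ii) = end(ij)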
Case study. To show how the model described above can be used to model an output multi-modal interface, we choose to model a basic output interface that generates a warning message: Abnormal temperature over 50°C. This message is issued from the computations performed in the functional core of the application. Thus, we consider:
- I, the set of output information to fission, containing the elements info_wr, info_tp, info_at and info_dg;
- UIE, the set of elementary information units, consisting of info_tp and info_dg;
- D, the interpretation domain of I, containing the elements warning, temperature, attention and danger:
    int(info_wr) = warning
    int(info_tp) = temperature
    int(info_at) = attention
    int(info_dg) = danger
The fission result of the information info_wr is expressed by the sequential (temporal) and complementary (semantic) combination of the information info_tp and info_at as follows:
    info_wr = (Sq, Cp)(info_tp, info_at)
The information info_at being a non-elementary information unit, it is in turn subject to fission as a three-times iteration of the information info_dg:
    info_at = It(3, info_dg)
Thus, the final result of the fission process of the information info_wr is the following:
    info_wr = (Sq, Cp)(info_tp, It(3, info_dg))
4 Event-B Implementation
The proposed formal model is independent of any specification language. We propose to formalize it with the Event-B method. This choice is motivated by the presence of the refinement operation in the B method, which corresponds to the decomposition process of the WWHT architecture model. More precisely, this operation is used to formalize the decomposition process associated with fission and allocation in WWHT. This approach will be applied to the previous case study.
4.1 The Event-B Method
The Event-B method [10] is an evolution of the B method [11]. It is based on the weakest precondition of Dijkstra [12] and is supported by the Rodin platform [13]. Event-B is a formal method based on mathematical foundations, namely first-order logic and set theory. An Event-B model encodes a state transition system where the variables represent the state and the events represent the transitions from one state to another. The Event-B modelling process is incremental: it starts from an abstract model of the system which evolves progressively towards a concrete one by adding design details through successive refinement steps. The description of an Event-B model is associated
with proof obligations ensuring the consistency of the model. These proof obligations are automatically generated and must be proved in order to ensure the correctness of the model. An Event-B model is divided into two components (see Fig. 2): one, called CONTEXT, lists the static properties of the model, and the other, called MACHINE, describes the dynamic properties of the model (its behaviour).
Fig. 2. The Event-B model structure
A MACHINE is defined by a set of clauses. Briefly, the clauses are:
– VARIABLES: describes the model's variables, corresponding to the state of the defined transition system;
– INVARIANTS: first-order logic expressions expressing typing and safety properties. These properties shall remain true in the whole model and in further refinements. Invariants need to be preserved by the initialization and by the events;
– THEOREMS: first-order logic expressions that can be proved from the invariants;
– VARIANT: a decreasing variable determining the order of event execution;
– EVENTS: describes all the events that may occur in a given model (transition definitions). Each event is described by a guard and a body. The guard is a first-order logic expression involving variables; an event is fired when its guard is true. The body describes changes operated on variables by means of generalized substitutions [12].
A MACHINE may refer to a CONTEXT. A CONTEXT can extend another one; it consists of the following clauses:
– SETS: declares the abstract and enumerated sets;
– CONSTANTS: declares the constants;
– AXIOMS: first-order logic expressions describing the properties of the constants;
– THEOREMS: first-order logic expressions that can be deduced from the axioms.
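As a minimal illustration of how these clauses fit together, the following sketch (in ASCII Event-B syntax, written in the same style as the machines presented below) shows a context and a machine for a hypothetical bounded counter; the names bound_ctx, limit, counter and increment are illustrative only and play no role in the fission model developed in the remainder of the paper.

    CONTEXT bound_ctx
    CONSTANTS limit
    AXIOMS
      axm1 : limit : NAT & limit > 0      /* the bound is a positive natural number */
    END

    MACHINE counter
    SEES bound_ctx
    VARIABLES count
    INVARIANTS
      inv1 : count : 0 .. limit           /* typing and safety property preserved by every event */
    EVENTS
      INITIALISATION =
      BEGIN
        act1 : count := 0
      END
      EVENT increment =
      WHEN
        grd1 : count < limit              /* the event is fired only when its guard is true */
      THEN
        act1 : count := count + 1
      END
    END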
4.2 The Event-B Models for the Fission Process
For the Event-B development associated with our fission model, we used the principle introduced in [14]. The authors encode process algebra operators in the Event-B method by successive refinements, where the left side of a BNF rule is represented by the abstract machine and the right side is represented by the refined machine. The same development process is applied to the syntactic expression of the fission model that we previously introduced:

    I ::= UIE | (optemp, opsem)(I, I) | It(n, I)   where n ∈ ℕ
When applying this principle to our syntactic description, the obtained Event-B model for the semantic fission model represents:
– the left side of the above syntactic expression by the abstract machine;
– the right side (the disjunction between UIE, (optemp, opsem)(I, I) and It(n, I)) by the refined machine.
The UIE refinement being trivial, because a basic information unit is not subject to fission, we choose to develop the refinements representative of the binary fission (optemp, opsem)(I, I) and of the iterative fission It(n, I). The obtained model shall express the static and dynamic properties defined in the semantic descriptions of the fission model. Thus, according to the Event-B model structure, the context includes the information interpretation function, which expresses a static property in the sense of our generic model. The temporal function T has not been explicitly modelled: we exploit the temporal scheduling implicitly expressed by the basic Event-B model (the use of event activation guards and of the variant allows the developer to express the temporal scheduling of events), so modelling T becomes unnecessary in our Event-B formalization. The temporal and semantic operators, which express the dynamic information scheduling and relationships, are defined in the machine. Some static properties related to the semantic operators (definitions for some particular values, and set declarations) are described in the context component. Due to space constraints, we present in Section 5 only the Event-B development process relative to the first fission rule (the binary fission rule).
5 The Binary Fission Model
The abstract and refined models relative to the binary fission describe the refinement of an information i into two information units i1 and i2 associated with a combination of a temporal and a semantic operator. The temporal operator optemp describes the temporal ordering of the two information occurrences and the semantic operator opsem expresses the semantic relationship between the two information units:

    i = (optemp, opsem)(i1, i2)

5.1 The Context
The abstract and the refined models share the same context interface_bin (see Table 3). It describes the static properties of the information. Thus, the context interface_bin is composed of:
Table 3. The binary interface context

CONTEXT interface_bin
SETS I; D
AXIOMS
  axm1  : Inf = I \/ {empty_I}
  axm2  : Dom = D \/ {empty_D}
  axm3  : Int : Inf --> Dom
  axm4  : Int[Inf1] <: Dom1
  axm5  : Int[Inf2] <: Dom2
  axm6  : Int(empty_I) = empty_D
  axm7  : I1 <: I
  axm8  : I2 <: I
  axm9  : D1 <: D
  axm10 : D2 <: D
  axm11 : Inf1 = I1 \/ {empty_I}
  axm12 : Inf2 = I2 \/ {empty_I}
  axm13 : Dom1 = D1 \/ {empty_D}
  axm14 : Dom2 = D2 \/ {empty_D}
END

Table 4. The semantic context

CONTEXT semantic
EXTENDS interface_bin
AXIOMS
  axm1 : D_sem <: Dom1 ** Dom2
  axm2 : Sem : D_sem --> Dom
  axm3 : Sem(empty_D, empty_D) = empty_D
  axm4 : !(d1, d2).((d1, d2) : D_sem => Sem(d1, d2) = ...)   /* operator-specific, see below */
  axm5 : R(D1, D2)                                            /* operator-specific, see below */
END
- the declaration of I, the set of information to fission (i and its components), and of D, the interpretation domain of I under the Int function;
- axioms axm1 and axm2, declaring the Inf and Dom sets: they extend the I and D sets with the initial values empty_I and empty_D;
- axioms axm3 to axm6, defining the interpretation function Int;
- axioms axm7 and axm8, declaring the I1 and I2 sets of the two information units i1 and i2 subject to fission;
- axioms axm9 and axm10, declaring the D1 and D2 sets, the interpretation domains of I1 and I2 respectively under the Int function;
- axioms axm11 and axm12, declaring the Inf1 and Inf2 sets: they extend the I1 and I2 sets with the initial value empty_I;
- axioms axm13 and axm14, declaring the Dom1 and Dom2 sets: they extend the D1 and D2 sets with the initial value empty_D.
The interface context is extended for each semantic operator opsem (see Table 4) in order to include its definition, by the following declarations:
- axiom axm1, defining the semantic operator's domain (D_sem);
- axiom axm2, declaring the semantic operator function Sem from D_sem to Dom;
- axiom axm3, defining the semantic operator function Sem for empty values;
- axiom axm4, expressing the complementarity properties through the definition of each semantic operator function Sem;
- axiom axm5, expressing the complementarity and redundancy properties through the relationship R between D1 and D2.
The definition of the Sem operator and of the relation R, introduced respectively in axioms axm4 and axm5, are given for each semantic operator as follows:
Concurrent:
  Sem(d1, d2) = d ∈ Dom with
    Cc(d1, empty_D) = d1,  Cc(empty_D, d2) = d2,  Cc(d1, d2) = d
  R(D1, D2): D1 ∩ D2 = ∅

Complementary:
  Sem(d1, d2) = d ∈ Dom with
    Cp(d1, empty_D) = empty_D,  Cp(empty_D, d2) = empty_D,  Cp(d1, d2) = d
  R(D1, D2): D1 ∩ D2 = ∅

Complementary and redundant:
  Sem(d1, d2) = d ∈ Dom with
    Cr(d1, empty_D) = empty_D,  Cr(empty_D, d2) = empty_D,  Cr(d1, d2) = d
  R(D1, D2): D1 ∩ D2 ≠ ∅

Partially redundant:
  Sem(d1, d2) with
    Pr(d1, empty_D) = d1,  Pr(empty_D, d2) = d2,
    (D1 ⊆ D2 ⇒ Pr(d1, d2) = d2,  D2 ⊆ D1 ⇒ Pr(d1, d2) = d1)
  R(D1, D2): D1 ⊆ D2 ∨ D2 ⊆ D1

Totally redundant:
  Sem(d1, d2) with
    Tr(d1, empty_D) = d1,  Tr(empty_D, d2) = d2,  Tr(d1, d2) = d1 = d2
  R(D1, D2): D1 = D2
5.2 The Abstract Machine
The abstract machine information (see Table 5) describes the production (by the functional core) of an information i. It sees the semantic operator (op_sem) context that links the information produced by the fission. It is composed of a single event info that computes d, the interpretation of any produced information i before the fission (act1), initialized to the empty interpretation. The info event is activated under the condition that the information interpretation Int(i) belongs to the range of the semantic operator (grd2), i.e. the semantic interpretation of the information i can be expressed as the corresponding semantic combination of two information interpretations. Finally, the invariant of the abstract machine simply expresses the typing property for d.

Table 5. The abstract machine

MACHINE information
SEES [op_sem]
INVARIANTS
  inv1 : d : Dom
EVENTS
  INITIALISATION =
  BEGIN
    act1 : d := empty_D
  END
  EVENT info =
  ANY i WHERE
    grd1 : i : I
    grd2 : Int(i) : ran(op_sem)
  THEN
    act1 : d := Int(i)
  END
END
5.3 The Refined Machine
The refined machine represents the information subject to fission, corresponding to the right side of the binary fission syntactic expression (optemp, opsem)(i1, i2). It refines the abstract machine information (see Table 5) and sees the semantic operator (op_sem)
context. Thus, in addition to the refined event info, expressing the production of the information i after the fission process, the refined machine describes the production of the i1 and i2 information units by the events info1 and info2 respectively. The temporal interleaving of i1 and i2 relative to optemp is expressed by the scheduling of info1 and info2, and their semantic relationship relative to opsem is given by d, the interpretation of i expressed in terms of the interpretations of i1 (d1) and i2 (d2). For the temporal scheduling of i1 and i2, we refer to [14], where the authors dealt with process algebra composition and identified the sequential, parallel, choice and iteration operators as basic operators. They showed that it is possible to express all process algebra composition operators using these basic operators. So, we propose to extend these models with the variables relative to the semantic operators and to adapt them to the Event-B model. We choose to present in Table 6 the refined machine based on the sequential temporal scheduling. The anachronic, concomitant and coincident operators can be translated in terms of the basic operators as follows:
- The anachronic composition of two events x and y is defined by the sequential composition of the three events x, empty_event and y, where empty_event is an event whose action consists of the skip substitution. The sequence is an associative operator, so the sequential composition of three events can be expressed by two successive applications of the binary sequence. Thus, the anachronic operator is defined as below:
    An(x, y) = Sq(Sq(x, empty_event), y)
- The concomitant composition of two events x and y is defined only if x and y can be expressed in terms of the sequential composition of two sub-events (x1 and x2 for x, y1 and y2 for y). The concomitant operator is then defined as the sequential composition of the three events x1, the parallel composition of x2 and y1, and y2:
    x = Sq(x1, x2),  y = Sq(y1, y2)  ⇒  Ct(x, y) = Sq(Sq(x1, Pl(x2, y1)), y2)
- The coincident composition of two events x and y is defined only if the first event x can be expressed in terms of the sequential composition of three sub-events x1, x2 and x3. The coincident operator is consequently defined as the sequential composition of the three events x1, the parallel composition of x2 and y, and x3:
    x = Sq(Sq(x1, x2), x3)  ⇒  Cd(x, y) = Sq(Sq(x1, Pl(x2, y)), x3)
In the sequential refinement (Table 6), the temporal scheduling of info1 and info2 according to i1 and i2 is formalized by the introduction of the variant var_seq, a decreasing variable that activates sequentially info1, info2 and finally the refined event info, using the guard grd2 in the three events. The information interpretations (d, d1 and d2) are initialized to the empty interpretation and the variant variable var_seq is set to 2. The info1 event is activated when var_seq equals 2 (grd2); it computes, for a given i1 in I1 (grd1), its interpretation d1 (act2) and decreases var_seq to 1 (act1). The info2 event is activated when var_seq equals 1 (grd2) (after event info1); it computes, for a given i2 in I2 (grd1), its interpretation d2 (act2) and decreases var_seq to 0 (act1). The info event is activated when var_seq equals 0 (grd2) (after event info2), under
Table 6. The sequential machine

MACHINE sequential
REFINES information
SEES [op_sem]
INVARIANTS
  inv1 : d : Dom
  inv2 : d1 : Dom1 & d2 : Dom2
  inv3 : var_seq : {0, 1, 2}
  inv4 : d : ran(op_sem)
VARIANT var_seq
EVENTS
  INITIALISATION =
  BEGIN
    act1 : d := empty_D
    act2 : d1 := empty_D
    act3 : d2 := empty_D
    act4 : var_seq := 2
  END
  EVENT info1 =
  STATUS convergent
  ANY i1 WHERE
    grd1 : i1 : I1
    grd2 : var_seq = 2
  THEN
    act1 : var_seq := 1
    act2 : d1 := Int(i1)
  END
  EVENT info2 =
  STATUS convergent
  ANY i2 WHERE
    grd1 : i2 : I2
    grd2 : var_seq = 1
  THEN
    act1 : var_seq := 0
    act2 : d2 := Int(i2)
  END
  EVENT info =
  ANY i WHERE
    grd1 : i : I
    grd2 : var_seq = 0
    grd3 : (d1, d2) : D_op_sem
  WITH
    i : Int(i) = op_sem(d1, d2)
  THEN
    act1 : d := op_sem(d1, d2)
  END
END
the condition that the pair (d1, d2) belongs to the definition domain of the semantic operator (D_op_sem) (grd3), and given that Int(i) = op_sem(d1, d2) (witness). It computes d as the semantic combination of d1 and d2 (act1). The gluing invariant (inv4) expresses that d, defining the interpretation of i, belongs to the range of the semantic operator, i.e. there exist two information interpretations d1 and d2 such that d is the semantic combination of d1 and d2.
6 Case Study
We illustrate our Event-B development process with the temperature warning example presented in section 3.1. The semantic fission process of info_wr that we have set up produces, in two steps, the following decomposition:
    info_wr = (Sq, Cp)(info_tp, info_at)
    info_at = It(3, info_dg)
The corresponding Event-B developments involve two successive refinements, the first modelling the binary fission of info_wr in terms of info_tp and info_at, and the second expressing the iterative fission of info_at into three repetitions of info_dg. Thus, the presented Event-B process consists of encoding these two decompositions in three steps (an abstract model and two refinements), as follows:
Step 1. An abstract model describing the information info_wr. It is defined by means of the:
- binary warning context (Table 7), which describes the environment of the information info_wr: its component information (info_tp, info_at, info_dg) and their interpretations (temperature, attention, danger);
Table 7. The binary warning context

CONTEXT warning_bin
SETS
  I = {info_wr, info_tp, info_at, info_dg}
  D = {warning, temperature, attention, danger}
AXIOMS
  axm1  : Inf = I \/ {empty_I}
  axm2  : Dom = D \/ {empty_D}
  axm3  : Int : Inf --> Dom
  axm4  : Int[Inf1] <: Dom1
  axm5  : Int[Inf2] <: Dom2
  axm6  : Int(empty_I) = empty_D
  axm7  : I1 = {info_tp}
  axm8  : I2 = {info_at, info_dg}
  axm9  : D1 = {temperature}
  axm10 : D2 = {attention, danger}
  axm11 : Inf1 = I1 \/ {empty_I}
  axm12 : Inf2 = I2 \/ {empty_I}
  axm13 : Dom1 = D1 \/ {empty_D}
  axm14 : Dom2 = D2 \/ {empty_D}
  axm15 : Int(info_wr) = warning
  axm16 : Int(info_tp) = temperature
  axm17 : Int(info_at) = attention
  axm18 : Int(info_dg) = danger
END

Table 8. The complementary context

CONTEXT complementary
EXTENDS warning_bin
AXIOMS
  axm1 : D_Cp = {(temperature, attention), (empty_D, attention), (temperature, empty_D), (empty_D, empty_D)}
  axm2 : Cp : D_Cp --> Dom
  axm3 : Cp(empty_D, empty_D) = empty_D
  axm4 : Cp(temperature, empty_D) = empty_D
  axm5 : Cp(empty_D, attention) = empty_D
  axm6 : Cp(temperature, attention) = warning
  axm7 : D1 /\ D2 = {}
END
Table 9. The warning abstract machine

MACHINE warning
SEES complementary
INVARIANTS
  inv1 : d : Dom
EVENTS
  INITIALISATION =
  BEGIN
    act1 : d := empty_D
  END
  EVENT info_wr =
  BEGIN
    act1 : d := warning
  END
END
- complementary warning context (Table 8), which extends the warning context with the complementarity and redundancy properties of the interpretations;
- abstract warning machine (Table 9), which describes the production of info_wr before the binary fission and the computation of its interpretation (warning).
Step 2. A first refined model, which describes the binary fission of info_wr into the sequential and complementary combination (Sq, Cp)(info_tp, info_at); it uses the complementary warning context and defines a refined machine sequential. In the sequential machine, the event info_tp produces the temperature information and computes its interpretation temperature, and the event info_at produces the attention information and computes its interpretation attention. The event info_wr produces the warning information and computes its interpretation warning by the complementary combination of the temperature and attention interpretations.
Step 3. A second refined model describing the iterative fission It(3, info_dg), which we do not present in this paper.

Table 10. The warning sequential machine

MACHINE sequential
REFINES warning
SEES complementary
INVARIANTS
  inv1 : d : ran(Cp)
  inv2 : d1 : Dom1
  inv3 : d2 : Dom2
  inv4 : var_seq : {0, 1, 2}
  inv5 : var_seq <= 1 => d1 = temperature
  inv6 : var_seq = 0 => d2 = attention
VARIANT var_seq
EVENTS
  INITIALISATION =
  BEGIN
    act1 : d := empty_D
    act2 : d1 := empty_D
    act3 : d2 := empty_D
    act4 : var_seq := 2
  END
  EVENT info_tp =
  STATUS convergent
  WHEN
    grd1 : var_seq = 2
  THEN
    act1 : var_seq := 1
    act2 : d1 := temperature
  END
  EVENT info_at =
  STATUS convergent
  WHEN
    grd1 : var_seq = 1
  THEN
    act1 : var_seq := 0
    act2 : d2 := attention
  END
  EVENT info_wr =
  WHEN
    grd1 : var_seq = 0
  THEN
    act1 : d := Cp(d1, d2)
  END
END
Table 11. The warning example proof obligations synthesis

  Component       Number of POs   Automatic proofs   Interactive proofs
  warning_bin     4               4                  0
  complementary   4               4                  0
  warning         2               2                  0
  sequential      23              20                 3
  Total           33              30                 3
The first two steps of the Event-B implementation process for the warning example induced thirty-three proof obligations (POs) (Table 11). Thirty proof obligations were discharged automatically and the three others were proved interactively.
7 Conclusion
This paper is devoted to the modelling of output multi-modal human-computer interfaces. It extends previous work in which we proposed a formal model for these interfaces. More precisely, this paper addresses the formalization in Event-B of the first step of the formal model we proposed: the semantic fission. We proposed transformation models that allow a developer to generate, from the semantic fission description, the corresponding Event-B models that specify the output multi-modal human-computer interface.
This work is ongoing: we are developing Event-B models for the second step of our formal model in order to fully support the interface development process in Event-B and the verification of usability properties.
References
1. Nigay, L., Coutaz, J.: Espaces conceptuels pour l'interaction multi-média et multi-modale. Spécial Multi-média et Collecticiel, 1195–1225 (1996)
2. Palanque, P., Schyn, A.: A model-based approach for engineering multi-modal interactive systems. In: 9th IFIP TC13 International Conference on Human Computer Interaction (2003)
3. Schyn, A., Navarre, D., Palanque, P., Nedel, L.P.: Description Formelle d'une Technique d'Interaction Multi-modale dans une Application de Réalité Virtuelle Immersive. In: Proceedings of the 15th French-Speaking Conference on Human-Computer Interaction (IHM 2003), Caen, France, November 25-28 (2003)
4. Navarre, D., Palanque, P., Bastide, R., Schyn, A., Winckler, M., Nedel, L.P., Freitas, C.M.D.S.: A Formal Description of Multi-modal Interaction Techniques for Immersive Virtual Reality Applications. In: Costabile, M.F., Paternò, F. (eds.) INTERACT 2005. LNCS, vol. 3585, pp. 170–183. Springer, Heidelberg (2005)
5. Kamel, N.: Un cadre formel générique pour la spécification et la vérification des interfaces multi-modales. Cas de la multi-modalité en entrée. Ph.D thesis, Université de Poitiers (2006)
6. Bordegoni, M., Faconti, G., Maybury, M.T., Rist, T., Ruggieri, S., Trahanias, P., Wilson, M.: A Standard Reference Model for Intelligent Multi-media Presentation Systems. Computer Standards and Interfaces 18(6-7), 477–496 (1997)
7. Rousseau, C.: Présentation multi-modale et contextuelle de l'information. Ph.D thesis, Université Paris-Sud XI, Orsay (2006)
8. Aït-Ameur, Y., Aït-Sadoune, I., Baron, M., Mota, J.: Vérification et validation formelles de systèmes interactifs fondées sur la preuve: application aux systèmes Multi-Modaux. Journal d'Interaction Personne-Système (JIPS) 1(1) (Septembre 2010)
9. Mohand-Oussaïd, L., Aït-Ameur, Y., Ahmed-Nacer, M.: A generic formal model for fission of modalities in output multi-modal interactive systems. In: 3rd International Workshop on Verification and Evaluation of Computer and Communication Systems, VECoS 2009, Rabat, Morocco (July 2009)
10. Abrial, J.-R.: Modeling in Event-B: System and Software Engineering. Cambridge University Press, Cambridge (2010)
11. Abrial, J.-R.: The B-Book. Cambridge University Press, Cambridge (1996)
12. Dijkstra, E.W.: A Discipline of Programming. Prentice-Hall, Inc., Englewood Cliffs (1976)
13. Rodin: European Project Rodin (2004)
14. Aït-Ameur, Y., Baron, M., Kamel, N., Mota, J.: Encoding a process algebra using the Event-B method. International Journal on Software Tools for Technology Transfer (STTT) 11(3), 239–253 (2009)
Specification and Verification of Model-Driven Data Migration
Mohammed A. Aboulsamh and Jim Davies
Department of Computer Science, University of Oxford, Oxford OX1 3QD, UK
{Mohammed.Aboulsamh,Jim.Davies}@comlab.ox.ac.uk
Abstract. Information systems often hold data of considerable complexity and value. Their continuing development or maintenance will often necessitate the 'migration' of this data from one version of the system to the next: a process that may be expensive, time-consuming, and prone to error. The cost and time of data migration may be reduced, and its reliability improved, in the context of modern, model-driven systems development: the requirements for data migration may be derived automatically from the list of proposed changes to the system model. This paper shows how this may be achieved through the definition of a 'language of changes'. It shows also how a formal semantics for this language allows us to verify that a proposed change is consistent with representational and semantic constraints, in advance of its application. Keywords: Data modeling, information systems evolution, data migration, the B method.
1 Introduction
In the model-driven engineering (MDE) approach to software engineering, system artifacts are generated automatically from abstract models. This tends to facilitate the process of system evolution: the models are easier to change than the more detailed, platform-specific source code they replace [1]. However, the problem of data migration remains: if the system is updated on the basis of a new model, which might specify new data structures, how can we move, or migrate, data from the old version to the new version of the system? In many systems, the data may be of considerable complexity and value: it may be of critical importance to the owning organisation. Migrating data to a new version may require careful consideration of the differences in the way that information is represented: the range of values that may be assigned to each data item, and the relationships between values assigned to different items. In general, data migration is expensive, time-consuming, and error prone; it can present a significant barrier to continuing development and perfective maintenance. Fortunately, if both old and new versions of the system have been produced using a model-driven approach, we may apply that approach also to the production of a data migration function. The list of changes made to the model,
suitably annotated, may be used to produce a data transformation: one that would serve as a translation between the old and the new data representations. Furthermore, if this transformation admits a formal, mathematical semantics, we can determine in advance whether it is applicable to the existing data: that is, whether or not the results would satisfy the model constraints of the new system. In this paper, we explain briefly how a list of changes to a system model may be described as a sequence of (instances of) model operations. As the main contribution of the paper, we show then how a formal, mathematical semantics for the language of model operations, constructed using the Abstract Machine Notation (AMN) and Generalised Substitution Language (GSL) of Abrial [2] can be used to calculate the applicability of the corresponding data transformation, and hence to provide timely feedback on any proposed change, in terms of its effect upon existing data.

Fig. 1. Evolution Metamodel
2 Modeling Data Model Evolution
Figure 1 shows the main concepts of our proposed metamodel for evolution, presented in more detail in [18]. The shaded classes are drawn from the Meta Object Facility (MOF) and the Object Constraint Language (OCL). An instance of the Evolution class is a description of the changes to a source model. An instance of Evolution can use OclExpressions to describe invariant properties and well-formedness rules; it also contains a set of EvolutionElements, which may be specialised to address different types of changes. A change could be primitive, corresponding to a single model element: for example, addClass() and
deleteAssociation(). Alternatively, it could be an instance of CompoundEvolutionOperation, acting upon multiple elements: for example, extractClass() [16], which may be specified as extractClass(srcClass, tgtClass, linkAttribute : String), where srcClass is the class to be split, tgtClass is the class to receive the properties, and linkAttribute is the association between them.
Example. We will use a simple UML data model of an Employee Assignment Tracking system to illustrate our approach. Figure 2(a) represents an initial version, consisting of two classes, Employee and Department. When an employee is hired, he/she is assigned to a department and becomes part of the employees association. Once in a department, an employee may, in addition, be assigned to a specific project and becomes part of the team association. The current version of the model includes a constraint, written in OCL, shown below the diagram. The initial version of the model was instantiated in a valid model instance, shown in Figure 2(b).
/* employees and department are bi-directional associations */
context Employee inv C1: self.department.employees->includes(self)

Fig. 2. Data and object models of a simplified Employee Tracking System
In a subsequent version, a new Project class is extracted from the Department class. This change can be described as extractClass(Department, Project, projects)
or, equivalently, as

(addClass(Project);
 addAssociation(Department, projects : Project, 0, *);
 moveAttribute(
   addAttribute(Project, location, [Project.location = Department.location]);
   deleteAttribute(Department, location)) ||
 moveOperation(
   addOperation(Project, setProjectAssigmt);
   deleteOperation(Department, setProjectAssigmt)) ||
 ...
 moveAssociation(
   addAssociation(Project, team : Employee, [Project.team = Department.team]);
   deleteAssociation(Department, team)) )
Figure 3 shows the evolved version of the model.
context Employee inv C1: self.department.employees->includes(self)

Fig. 3. Employee Information System model (evolved)
The model evolution operations above constitute abstract specifications of the data model changes. Using model transformation, these specifications may be suitably translated into an executable program in a platform-specific implementation: [18] shows how this can be done using SQL.
3 Formal Modeling with B
3.1 Overview of B
Introduced by Abrial [2], B is a formal method dedicated to the specification of software systems. It consists of two notations: AMN, which is used to specify the structure and the state space of a system, and GSL, which is used to specify the system operations that manipulate the state space. A B model is constructed from one or more abstract machines. Each abstract machine has a name and a set of clauses that define the structure and the operations of the machine. Figure 4 shows the main clauses of B AMN. The SETS clause contains the definitions of sets; VARIABLES defines the state of the system, which should conform to the properties stated in the INVARIANT clause. The INITIALIZATION of variables and the variable manipulations in the OPERATIONS clause should also preserve the invariant properties. OPERATIONS are based on GSL, whose semantics is defined by means of predicate transformers and the weakest precondition [4].
Fig. 4. Main elements of B-method AMN and GSL notations (abstract machine clauses; partial list of GSL operators)
A generalized substitution is an abstract mathematical programming construct, built up from basic substitutions. For example, the assignment operator takes the form x := E, corresponding to the assignment of expression E to state variable x. The preconditioning operator P|S executes as S if the precondition P is true; otherwise, its behaviour is non-deterministic and not even guaranteed to terminate. The statement @x.(P ==> S) represents an unbounded choice operator, which chooses an arbitrary x that satisfies predicate P and then executes S with that value of x. Other GSL operators include SKIP, bounded choice, guarding, and sequential and parallel composition, as can be seen in Figure 4; [2] provides more details on GSL and AMN. B supports the notion of data refinement. Design decisions that are closer to executable code are stated in refinement machines, as opposed to abstract machines. A refinement machine must include a linking invariant, which relates the abstract state to the refinement state. In addition, a refinement machine has exactly the same operations as the abstract machine, with exactly the same input and output parameters. Furthermore, operations in the refinement machine are required to work only within the preconditions given in the abstract machine, so those preconditions are assumed to hold for the refined operations.
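As a brief illustration of this weakest-precondition reading of GSL (standard B theory rather than anything specific to this paper), the assignment and preconditioning operators satisfy:

    [x := E] Q  ≡  Q[E/x]            (assignment)
    [P | S] Q   ≡  P ∧ [S] Q          (preconditioning)

For instance, [x := x + 1](x ≤ 10) ≡ (x + 1 ≤ 10) ≡ (x ≤ 9): the weakest precondition characterizes exactly those states from which the substitution is guaranteed to establish the postcondition.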
3.2 Formal Semantics
To be able to reason about a proposed evolution and analyze its applicability and effect, we need to define appropriate formal semantics. This formal semantics can be resolved into two aspects: semantics for the data modeling language (UML) and semantics for the evolution metamodel, which we summarize from [18]. Our proposed AMN semantics draws on previous work translating UML and OCL into B, for example [11] and [19]. It clarifies the definition of UML model concepts such as classes, attributes, associations and constraints. It also elaborates on the instantiation of model concepts into data instances. Our mapping may be characterized as follows:
Model-Driven Data Migration
219
MACHINE DataModel
SETS CLASS; ObjectID; ATTRIBUTE; ASSOCIATION; VALUE; TYPE
VARIABLES class, attribute, association, value, link
INVARIANT
  class       : CLASS +-> POW(ObjectID) &
  attribute   : CLASS +-> ATTRIBUTE +-> TYPE &
  value       : CLASS +-> ATTRIBUTE +-> ObjectID +-> VALUE &
  association : CLASS +-> ASSOCIATION +-> CLASS &
  link        : CLASS +-> ASSOCIATION +-> ObjectID +-> POW(ObjectID)

In the SETS clause, the machine presents the sets of possible names for classes, attributes, and associations. ObjectID represents the set of object references, Type the set of possible attribute or parameter types, and Value the set of possible values of attributes. In the INVARIANT clause, the function class maps a model class name to a power set (denoted by POW) of ObjectIDs; this function is partial (denoted by +->): not every class name will be instantiated. The function attribute maps each attribute, identified first by the name of the class, and then by the name of the attribute itself, to a corresponding type. The function association, similarly indexed by class names and association names, yields a functional relation between class names. The function value returns the current value of the named property for each object. The function link, similarly indexed by class names and association names, yields a relation between object identifiers. We may give semantics to our evolution operations by mapping each operation to a substitution on the variables of the machine state: class, attribute, association, value, link. These substitutions may be composed, sequentially or in parallel, to produce a specification of the data migration corresponding to a compound, evolutionary change. To give a semantics to complex operations and evolution patterns, we may expand the substitution obtained from the sequential or parallel combinators, or produce a direct definition using raw GSL.
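For example, a plausible substitution for the primitive operation addClass(c) over this machine state is sketched below. It is our own illustration, written in the same style as the addAttribute() substitution shown in Section 4.2; the precondition chosen here (the class name is not already in use) is an assumption rather than a definition taken from the paper.

    addClass(c) =
      PRE
        c : CLASS & c /: dom(class)        /* assumption: the class name is fresh */
      THEN
        class := class \/ { c |-> {} }     /* the new class starts with an empty set of objects */
      END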
4 Verification of Data Model Evolution
In this section, we focus on verifying two main kinds of evolutionary change. First, we consider refactoring changes, i.e. changes that do not add or remove functionality but aim at improving the quality of the current data model. Second, we discuss data model evolution that results from changes in requirements and typically involves adding or deleting model features.
4.1 Data Model Refactoring
The essence of refactoring is to improve the design of a software artifact without changing its externally observable behavior [17]. Although the notion of refactoring
has been widely investigated, for example in [16], a precise definition of behavior preservation is rarely provided [17]. The refinement proof in B establishes that all invariant properties of the current data model and all pre-post properties of the model operations are also valid in the target (evolved) model. Thus, it gives us a means to verify the behavior-preservation property of a refactoring and, at the same time, to guarantee that the data migration will succeed: that is, that the data values after the changes will satisfy the target model constraints. This enables designers to perform a wide range of data model changes (those characterized as refactorings) while ensuring that data persisted under the current model can be safely migrated. The mapping from UML to B described in the previous section can be used for such verification. We map the current data model to an AMN abstract machine and the target model to an AMN refinement machine. In the refinement machine, we need to define a linking invariant: an invariant property that relates data in both machines and acts as a data transformation specification. Currently, this invariant needs to be stated manually by the designer to complement the generated B specifications. Synthesizing this invariant from the evolution specifications is a future work item. By discharging the B refinement proof obligations, we can prove that the target model, after evolution, is a data refinement of the current model and, hence, establish that the behavior of the two models is equivalent and that the data migration can be applied. Below, we state the proof obligations in general and then apply them to an example. Where A and R represent an abstract and a refinement machine; J represents the linking invariant; TA, TR, SA and SR represent the initialization and any possible execution of the abstract and refinement machines respectively; and I and P represent the invariant and precondition properties of the abstract machine, the B refinement proof obligations state the following (note that ¬[S]¬T means that there exists at least one execution of S that establishes T):
1. Initialization. [TR] ¬[TA]¬ J : every initial state [TR] of the refinement machine (representing the target model) must have a corresponding initial state [TA] in the abstract machine (representing the current model) via the linking invariant J.
2. Operations. I ∧ J ∧ P ⇒ [SR] ¬[SA]¬ J : the consequent of this implication states that every possible execution [SR] of the refinement machine must correspond (via the linking invariant J) to some execution [SA] of the abstract machine. This is required to be true in any state that both the abstract machine and the refinement machine can jointly be in (as represented by the invariants I ∧ J) and when the operation is called within its precondition (P).
3. Operations with outputs. I ∧ J ∧ P ⇒ [SR[out1/out]] ¬[SA]¬ (J ∧ out1 = out) : this proof obligation has exactly the same explanation as proof obligation 2 above, with the added condition that the output out1 of the refinement operation must also be matched by an output out of the abstract machine operation.
(a) The abstract machine:

 1  MACHINE EmployeeTracking
 2  SETS EMPLOYEE; DEPARTMENT
 3  VARIABLES
 4    employees, team
 5  INVARIANT
 6    ...
 7    employees : DEPARTMENT <-> EMPLOYEE &
 8    team : DEPARTMENT <-> EMPLOYEE
 9    ...
10  INITIALISATION
11    employees := {} ||
12    team := {}
13  OPERATIONS
14    ...
15
16    setProjAssgmt (emp, dep) =
17      PRE emp : EMPLOYEE & dep : DEPARTMENT
18      THEN
19        team := team \/ {dep |-> emp}
20      END;
21
22    response <-- getProjAssgmt (emp) =
23      PRE emp : ran(team)
24      THEN response := team~[{emp}]
25      END
26  END

(b) The refinement machine:

 1  REFINEMENT EmployeeTrackingR
 2  REFINES EmployeeTracking
 3  SETS PROJECT
 4  VARIABLES employeesr, projectsr, teamr
 5  INVARIANT
 6    ...
 7    projectsr : DEPARTMENT <-> PROJECT &
 8    teamr : PROJECT <-> EMPLOYEE &
 9    team = (projectsr ; teamr)
10    ...
11  INITIALISATION
12    employeesr := {} || projectsr := {} || teamr := {}
13  OPERATIONS
14    ...
15
16    setProjAssgmt (emp, dep) =
17      BEGIN
18        ANY pp WHERE pp : projectsr[{dep}]
19        THEN teamr := teamr \/ {pp |-> emp}
20        END
21      END;
22    response <-- getProjAssgmt (emp) =
23      BEGIN
24        VAR ppr IN
25          ppr := teamr~[{emp}];
26          response := projectsr~[ppr]
27        END
28      END
29  END

Fig. 5. Mapping of data models in Figure 2(a) and Figure 3 to AMN (partial)
Example. Figure 5(a) shows a partial mapping of the initial version of the UML data model, presented in Figure 2(a), to an AMN abstract machine. Figure 5(b) shows a similar mapping, to an AMN refinement machine, of the same data model after applying the extractClass() refactoring step, as shown in Figure 3. Note that the last invariant conjunct in the refinement machine (line 9 in Figure 5(b)) describes a linking invariant in the form of a data transformation of the team association. While in the source model this association was a relation between the Department and Employee classes (line 8 in Figure 5(a)), in the target model it is a relational composition through the Project class (lines 7 and 8 in Figure 5(b)). Applying the refinement proof obligations to the example above, we get the following outcome:
1. Initialization. This condition holds, since the empty sets in the Initialization clauses can be related to each other.
2. Operations. Considering the setProjAssgmt() operation (line 16 in Figure 5(b)), this condition holds: this operation in the refinement machine has no explicit precondition (it works under the assumption of the precondition of the corresponding setProjAssgmt() operation in the abstract machine), and every execution of this operation in the refinement machine [SR] updates the teamr relation, which is (according to the linking invariant J) equivalent to the team relation updated by the setProjAssgmt() operation in the abstract machine (line 16 in Figure 5(a)).
3. Operations with outputs. Applying this proof obligation to the getProjAssgmt() operation (line 22 in Figure 5(b)), we find that it holds.
Every execution of this operation in the refinement machine generates a response output assigned to an input employee. This is matched by the execution of the corresponding operation in the abstract machine via the linking invariant team = (projectsr ; teamr). We conclude that a refinement relation exists between the two machines.

context Employee inv C1: self.department.employees->includes(self)
/* a project assignment must occur before the project is closed */
context Employee inv C2: projects->forAll(p | p.assignmentDate < p.closingDate)

Fig. 6. Employee Assignment Tracking Model - version 3
4.2 Data Model Evolution
Where the evolution represents a change in requirements, we should not expect the new model to preserve the behavior or establish the invariant properties of the current model; on the contrary, we should expect to find that different properties are now constrained in different ways. In this case, we disregard the constraints of the current model, and instead calculate the weakest precondition [4] for the data migration to succeed (that is, for the operation to achieve the constraints of the target model) and then check to see whether this applies to the current data.
Example. Assume that the current version of the Employee Tracking System model, shown in Figure 3, has been evolved further into the model shown in Figure 6, with the following specifications describing the changes:

    addAssociationClass(Assignment, Employee, projects, Project, team);
    addAttribute(Assignment, assignmentDate : Date [team.startDate]);
    addAttribute(Assignment, status : Status ['active']);
    addConstraint(Employee, C2,
      [context Employee inv C2: projects->forAll(p | p.assignmentDate < p.closingDate)])

where the operations addAssociationClass() and addConstraint() have the obvious interpretations, and the Date and Status parameters to addAttribute()
represent the intended types of the properties being added. Applying the mapping discussed in Section 3.2, we get the following substitution, in part:

    attribute := attribute \/ { Assignment |-> assignmentDate |-> Date } ;
    ! o : class(Assignment) .
      value := value \/ { Assignment |-> assignmentDate |-> o |->
                          value (Employee) (startDate) (link (Assignment)(team)(o)) }
This is the GSL semantics for the operation addAttribute(Assignment, assignmentDate : Date [team.startDate]). The AMN semantics of the target model includes the constraint

    ! ee : class (Employee) . ! pp : link (Employee) (projects) (ee) .
      value (Assignment) (assignmentDate) (pp) < value (Project) (closingDate) (pp)

and the weakest precondition for the substitution to achieve this constraint is

    ! ee : class (Employee) . ! pp : link (Employee) (projects) (ee) .
      value (Employee) (startDate) (ee) < value (Project) (closingDate) (pp)

That is, for any object ee of class Employee, all of the project assignments must have a closingDate that is greater than the employee's startDate. The weakest precondition calculation, which can be automated for this class of operations, returns a constraint expressed entirely in the notation of the current model. This is thus a condition that may be applied to the current data: if it is true, then the data migration specified by the GSL operations would succeed, leaving the data in a state that conforms to the new, evolved model.
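As a gloss on how this precondition is obtained (our own explanation; the paper states the result without showing the calculation): after the substitution above, for every project pp linked to an employee ee,

    value (Assignment) (assignmentDate) (pp)  =  value (Employee) (startDate) (ee)

so replacing this term in the target-model constraint, in accordance with the assignment rule [x := E]Q ≡ Q[E/x], yields exactly the condition on startDate and closingDate displayed above.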
5 Discussion and Related Work
Being able to produce a new version of an information system quickly and easily, simply by changing the system model, has significant benefits. These benefits are sharply reduced if changes to the model may have unpredictable consequences for the existing data. In this paper we have outlined a possible solution: capturing the changes to a model using a language of model operations, mapping each operation to a formal specification of the data model and the corresponding data transformation, checking a sequence of operations for consistency with respect to the model semantics and the existing data, and, for a specific platform, automating the process of implementation. This approach can be applied regardless of technology, but its value is most obvious in the context of Model-Driven Engineering (MDE). Here, we are likely
to find models that are faithful abstractions of the working system, with a suitable implementation framework already in place, ready to translate constraints and operations at the model level into implementation-level checks and actions. The work presented here is an extension of the main concepts we outlined in [12], [13] and [18], where we discussed how changes to a precisely defined object model can be reflected in the structure of the model and in the representation of the data stored, but without the detailed description of the verification techniques presented here. More generally, the work described in this paper lies at the intersection of two main research areas: database schema evolution and Model-Driven Engineering (MDE). Schema evolution is the process of applying changes to a schema in a consistent way and propagating these changes to the instances while the database is in operation [5]. Schema evolution has been widely discussed in the literature and various approaches have therefore been proposed. Some of the most relevant approaches to the general problem of information system evolution are [5], [6], and [7]. While these and other attempts provide solid theoretical foundations and an interesting methodological approach, a lack of abstraction was observed in [8] and remains largely unsolved after many years. The advent of the MDE paradigm has promoted the idea of abstracting from implementation details by focusing on models as first-class entities. The abstraction-based answer to the issue of information system evolution that we are addressing here builds upon some of the most recent results in the model-driven engineering literature, such as model transformation [15], model weaving [9] and model refactoring [10]. These and other approaches in MDE may help in characterizing information systems evolution; however, they remain largely general-purpose and offer no specific support for information systems evolution tasks such as data migration. The approach presented here provides a solid foundation for a tool implementation that, given a suitable formalisation, can offer immediate feedback to information system designers. As they edit a data model, changing properties, assigning new types or values, the development environment could indicate whether or not the current data, transformed accordingly, would fit the new model. If not, then they could make further changes: relaxing a constraint, or choosing to update the model in a different way. Furthermore, we would like to investigate the possibility of extending our consistency checking to include updates to the data modeling language as well as updates to the data model, resulting in a single mechanism for managing updates and constraints at both the modeling and metamodeling levels of design.
References 1. Bezivin, J.: On the unification power of models. Software and Systems Modeling 4(2), 171–188 (2005) 2. Abrial, J.R.: The B-book: Assigning Programs to Meanings. Cambridge University Press, Cambridge (1996)
3. Object Management Group (OMG), UML 2.0 infrastructure specification, v2.1.2 (2007), http://www.omg.org/spec/UML/2.1.2/ (retrieved February 09, 2011) 4. Dijkstra, E.W.: A Discipline of Programming. Prentice Hall, Englewood Cliffs (1976) 5. Banerjee, J., Kim, W., Kim, H.-J., Korth, H.: Semantics and implementation of schema evolution in object-oriented databases. In: ACM SIGMOD International Conference on Management of Data (SIGMOD 1987), pp. 311–322. ACM, New York (1987) 6. Ferrandina, F., Meyer, T., Zicari, R., Ferran, G., Madec, J.: Schema and database evolution in the O2 object database system. In: Very Large Database. Morgan Kaufmann, San Francisco (1995) 7. Jing, J., Claypool, K., Rundensteiner, E.: SERF: Schema Evolution through an Extensible, Re-usable and Flexible Framework. In: Int. Conf. on Information and Knowledge Management (1998) 8. Rashid, A., Sawyer, P., Pulvermueller, E.: A flexible approach for instance adaptation during class versioning. In: Dittrich, K.R., Oliva, M., Rodriguez, M.E. (eds.) ECOOP-WS 2000. LNCS, vol. 1944, pp. 101–113. Springer, Heidelberg (2001) 9. Fabro, M., Bezivin, J., Jouault, F., Breton, E., Gueltas, G.: AMW: a generic model weaver. In: Proceedings of the 1´ere Journ´ee sur l’Ing´enierie Dirig´ee par les Mod´eles (2005) 10. Sunye, G., Pollet, D., Traon, Y., J´ez´equel, J.-M.: models. In: The 4th International Conference on The Modeling Languages, pp. 134–148. Springer, Heidelberg (2001) 11. Laleau, R., Mammar, A.: An Overview of a Method and Its Support Tool for Generating B Specifications from UML Notations. In: IEEE Proceedings Automated Software Engineering, pp. 269–272 (2000) 12. Aboulsamh, M., Crichton, E., Davies, J., Welch, J.: Model-driven data migration. In: Parsons, J., Saeki, M., Shoval, P., Woo, C., Wand, Y. (eds.) ER 2010. LNCS, vol. 6412, pp. 285–294. Springer, Heidelberg (2010) 13. Aboulsamh, M., Davies, J.: A Metamodel-Based Approach to Information Systems Evolution and Data Migration. In: The 2010 Fifth International Conference on Software Engineering Advances (ICSEA 2010). IEEE Computer Society, Washington, DC (2010b) 14. Davies, J., Crichton, C., Crichton, E., Neilson, D., Sorensen, I.H.: Formality, Evolution, and Model-driven Software Engineering. Electron. Notes Theor. Comput. Sci. 130, 39–55 (2005) 15. Sendall, S., Kozaczynski, W.: Model transformation: The heart and soul of modeldriven software development. IEEE Software 20(5), 42–45 (2003) 16. Fowler, M., Beck, K., Brant, J., Opdyke, W., Roberts, D.: Refactoring: Improving the Design of Existing Code. Addison-Wesley Professional, Reading (1999) 17. Mens, T., Tourwe, T.: A Survey of Software Refactoring. IEEE Trans. Softw. Eng. 30(2), 26–139 (2004) 18. Aboulsamh, M., Davies, J.: A Formal Modelling Approach to Information Systems Evolution and Data Migration. In: Halpin, T., Nurcan, S., Krogstie, J., Soffer, P., Proper, E., Schmidt, R., Bider, I. (eds.) BPMDS 2011 and EMMSAD 2011. LNBIP, vol. 81, pp. 383–397. Springer, Heidelberg (2011) 19. Lano, K., Clark, D., Androutsopoulos, K.: UML to B: Formal Verification of Object-Oriented Models. In: Boiten, E.A., Derrick, J., Smith, G.P. (eds.) IFM 2004. LNCS, vol. 2999, pp. 187–206. Springer, Heidelberg (2004)
Towards a Simple Meta-model for Complex Real-Time and Embedded Systems
Yassine Ouhammou, Emmanuel Grolleau, Michael Richard (LISI - ENSMA, Futuroscope, France) and Pascal Richard (Université de Poitiers)
{ouhammoy, grolleau, richardm}@ensma.fr, [email protected]
Abstract. We introduce an open meta-model that can easily be enriched to cover new real-time scheduling models and techniques. On the one hand, it will be possible to connect several independent schedulability analysis tools, following closely the advances in real-time scheduling theory, as long as they deal with a temporal model covered by our meta-model. On the other hand, we will use model transformation techniques in order to extract information from different design methodologies. This extraction can be done at different stages of the design (early for sensitivity analysis, at a later stage for temporal validation) to create a temporal model of the designed system, without requiring the designer to be an expert in scheduling theory. We are currently working on the meta-model phase. The objective is to cover enough concepts to be able to represent a large part of real-time scheduling models and problems, without introducing too much complexity. This paper uses the UML profile mechanism to describe such a meta-model.
1 Introduction
The amount of functionality in real-time and embedded systems has increased drastically in the last decade, and critical systems, such as automotive or avionic systems, have grown in hardware and software complexity. The design phase of such systems can take place over several years, and designers have to cope with new hardware and/or new software during the design process. In order to allow scalability and re-usability of the design, the object-oriented paradigm is gaining popularity among designers of real-time systems, especially through the Unified Modeling Language (UML), which has become a common language for system engineering modeling and enables compliance with model-driven engineering. Real-time systems require a temporal analysis phase to prove that temporal constraints are met. Since temporal behavior depends on the software, the executive, and the hardware, several design languages and methods have been proposed to facilitate schedulability analysis: they offer a task model, a hardware model, and an operational model including scheduling algorithms. As a result, they offer a schedulability analysis
(worst-case response time analysis, simulation, etc.). We briefly detail the main features of these design languages. Architecture Analysis & Design Language (AADL) [1] is a language designed for the specification, analysis, and automated integration of real-time performance-critical distributed computer systems. Since AADL may lack semantics related to schedulability analysis, property sets are one of the ways to extend the language and customize an AADL specification to represent scheduling-specific properties of requirements. System Modeling Language (SysML) [25] is designed to provide simple and powerful constructs for modeling a wide range of systems engineering problems. It is particularly effective in specifying requirements, structure, behavior, and constraints on system properties to support engineering analysis. The EAST Architecture Description Language (EAST-ADL) [6] is defined to handle the software and electronics architecture of vehicle electronics with enough detail to allow modeling for design and analysis. These actions require system descriptions on several abstraction levels, from top-level user features down to tasks and communication frames in processing units and communication links. In 2009, the OMG (Object Management Group) adopted the Modeling and Analysis of Real-Time and Embedded systems language (MARTE) [21], a UML profile providing support for the specification, design, and validation stages. MARTE is structured around two main concerns: one to model the features of real-time and embedded systems and the other to annotate application models in order to support analysis of system properties. Several methods and tests have been developed to analyze schedulability in real-time systems [13,12,26,5] by extracting the main information from the analyzed system model. Many commercial and free schedulability analysis tools provide some subsets of these tests, to help designers in the analysis phase, such as SymTA/S [10], MAST [16], Cheddar [24], etc. Each of these tools uses a different set of concepts to create the input models for simulation and analysis. The metamodels of these timing analysis tools differ; therefore, an analysis of models based on existing design languages is only possible if the designer follows specific methodologies [9,19,15,3] during the design phase and then transforms those design models using transformation techniques (Acceleo, ATL, Kermeta, etc.), usually targeting only one specific tool. The existing schedulability analysis tools use a classical model for task systems, and can run a finite set of schedulability tests over this model (worst-case response time analysis, time demand analysis, simulation, etc.). Nevertheless, the schedulability analysis research community is very active, and offers various temporal analysis models and methods. Recent advances concern the software model, with the study of task models closer to the actual task behavior; the executive model, with new scheduling algorithms, ways to take overhead into account, etc.; and the hardware model, including new architectures, multi-core and distributed systems. Moreover, the provided analyses are not only schedulability tests but can also address dimensioning, resource allocation, or quality of service concerns. Dimensioning techniques, like sensitivity analysis [29],
allow an early validation of temporal constraints. A tool cannot offer all the scheduling models and techniques; as a result, scheduling models and methods can take years before being included in a schedulability analysis tool. This integration requires a lot of effort, especially if it implies changes in the task model and in the way the task model can be built from a source language, which might itself need some additional semantics, standardized or ad hoc. In this paper, we suggest how we could split the problem into two parts using a rich and flexible meta-model as a pivot language. The aim of this meta-model is to offer enough concepts and semantics to cover, with few modifications, most temporal analysis models and methods available today as well as future evolutions. Then, on the one hand, model transformation methods can be used to extract the required temporal, software, hardware and operational information from several design methodologies and languages. On the other hand, several temporal analysis tools can be used to analyze such a model. This meta-model would be used as a bridge between design languages/methodologies and temporal analysis tools, covering any future change from both sides with minimal effort. The remainder of this article is organized as follows. In the next section, we give an overview of real-time scheduling concerns. Section 3 gives a general idea of the elements constituting the meta-model and their interactions, and Section 4 presents an example through a case study. Finally, Section 5 summarizes and concludes this article.
2 Real-Time Scheduling Concerns
A typical real-time system applies control and command to a critical physical process. For this, it is implemented as a set of parallel, interdependent functionalities. Parallelism is often ensured by the multitasking paradigm, relying on an operational layer offering task scheduling. This layer can be a real-time operating system, and offers hierarchical schedulers sharing resources (memories, processors, networks, input/output, etc.). The temporal validation of real-time multitasking systems is mainly based on scheduling theory and on model checking of the system described using a formal model (time Petri nets, timed automata, etc.). The system designer analyzes the temporal behavior of a set of tasks submitted to a scheduling algorithm using algebraic methods called feasibility tests, in order to prove that temporal constraints will be met at run-time. Model checking of formal models can also be used for validating real-time systems, but these approaches are out of the scope of our paper. The real-time problem is based on three axes: the task models, the hardware equipment and the scheduling theory. Recently, task models have been evolving to support industrial problems, like multiframe models for multimedia systems [18], transaction models for serial communication systems, etc. The purpose of these models is to obtain a fine design granularity in order to avoid an over-dimensioning of critical systems. In order to ensure performance and robustness, the hardware architecture of industrial real-time systems has been steadily improved. Architectures
are becoming more and more complex, especially for multi-processor, multi-core, and networked control systems. This improvement of the hardware calls for continual improvements of task modeling and scheduling methods. Scheduling theory was originally studied for the basic Liu and Layland model [13], and has been extended to cover more advanced and precise task models. The basic task model is mathematically simple (periodic independent jobs), but is a coarse abstraction for most practical applications. Since scheduling analysis performs a worst-case analysis, imprecision in models and methods implies a pessimistic analysis leading to a severe over-dimensioning of critical systems. Several problems have been studied, based on seminal works like worst-case response time analysis [11,26], sensitivity analysis [28,4], priority assignment [2], and multi-criteria optimization [17]. During the last decade, many practical factors, like self-suspending tasks [23], precedence constraint anomalies [22], transactions and their generalization [27], approximation of response times [20], etc., have been investigated. The original analyses applied to these models lead to combinatorial problems which require adapted methods. These studies could have a big impact on industrial systems by improving their performance, provided the design of such systems is fine-grained enough to support them. Unfortunately, these models and methods are not supported by the current standard design languages, and since they keep being studied, they are highly volatile and subject to constant improvement. In such a critical real-time system context, a framework supporting this variability is strongly needed. Thus, an intermediate meta-model is proposed.
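To make the kind of feasibility test mentioned above concrete, the classical worst-case response time analysis for independent periodic tasks under fixed-priority scheduling ([11,26]) iterates the recurrence R_i = C_i + Σ_{j∈hp(i)} ⌈R_i/T_j⌉·C_j to a fixed point. A minimal sketch, assuming deadlines equal to periods and no blocking (standard textbook material, not the paper's own tooling):

```python
import math

def response_times(C, T, prio):
    """Worst-case response times for independent periodic tasks under fixed priorities.
    C[i]: worst-case execution time, T[i]: period (= deadline), prio[i]: priority
    (a lower value means a higher priority). Task i is schedulable iff R[i] <= T[i]."""
    n = len(C)
    R = [0] * n
    for i in range(n):
        hp = [j for j in range(n) if prio[j] < prio[i]]   # higher-priority tasks
        r = C[i]
        while True:
            r_next = C[i] + sum(math.ceil(r / T[j]) * C[j] for j in hp)
            if r_next == r or r_next > T[i]:   # fixed point reached, or deadline exceeded
                break
            r = r_next
        R[i] = r_next
    return R

# Example task set in the Liu and Layland style.
print(response_times(C=[1, 2, 3], T=[4, 6, 12], prio=[0, 1, 2]))  # [1, 3, 10]
```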
3 Metamodel
A designer can choose his design methodology and languages depending on several criteria, and, as in scheduling analysis, new languages and methodologies are constantly proposed and improved. The purpose of our work is not to introduce a new design language or to propose a new methodology; rather, we suggest adding a design phase through an intermediate meta-model framework (see Fig. 1). The purpose of this framework is to partition the efforts between the design tools and the schedulability analysis tools, which are both in constant evolution and improvement in distinct scientific communities.
3.1 Functioning Description
The intermediate meta-model framework is a bridge between the real-time design languages and the temporal analysis tools. On the one hand, it interacts with design methodologies, which offer the possibility to combine many design languages [7]; on the other hand, this framework should allow many schedulability analysis techniques and feasibility tests, despite their dissimilarities, to get their inputs and provide their results in a way useful to the designers. This partitioning offers two big advantages. The first one is to give the designers a free area
Fig. 1. The intermediate meta-model framework interactions
for modeling their systems without any obligation to follow a specific methodology. Even if one follows a real-time methodology [14,8], the design would be accepted by the framework, thanks to the intermediate meta-model, because it is built regardless of the design approach followed. The second advantage is the optimization of the transformation process. Currently, transformation must be done several times and it affects the source model, since the transformation process is often accompanied by a modification of the source model to make it compatible with the desired tool. Using the intermediate meta-model framework, the designer would not need to transform the source design for each temporal analysis tool; all the transformation processes would be encapsulated without any impact on the source model. In this article, we present the intermediate meta-model. Its conception has to take into consideration the designer's lack of schedulability analysis experience, and the analyst's lack of system modeling experience. The intermediate meta-model is presented as a UML profile, so one can use it directly via a UML editor. A UML profile is a generic extension mechanism for customizing UML models for particular domains and platforms; it operates by using three complementary kinds of elements: stereotypes, tagged values and constraints. Scheduling analysis techniques offer several patterns, each pattern depending on the completeness of the model. A real-time modeling phase is called complete when the designer provides three kinds of models: the application model represents the functional side of the system, the operational/behavioral model corresponds to the temporal task model, and the hardware model is a way to take the hardware support of the system into account. Validation is one of the patterns which can be applied to a complete design: it enables choosing the best scheduling model and tests, provides the designer with the test results (e.g. worst-case response time analysis, simulation, etc.), and reports them on the design used. The dimensioning pattern is also very useful at an early stage: it helps the designer complete the model by providing a set of techniques like sensitivity analysis, task mapping, priority assignment and minimization of the number of processors.
3.2 Architecture Description
The intermediate meta-model is based on three concepts. The platform concept is mandatory to design the system architecture and to obtain both software and hardware operational models. The behavior concept gives dynamism to this architecture in order to express a behavioral model. Finally, the analysis concept allows extra functionalities to be added to the models listed previously in order to obtain an analyzable system design. These concepts are reflected in the intermediate meta-model architecture (see Fig. 2).
Fig. 2. Architecture of the meta-model
The global architecture view shows three main packages: the SoftwarePlatform package, the HardwarePlatform package, and the RealTimeProperties package. The SoftwarePlatform package represents the software operators and the software behavior side of real-time systems. It contains two sub-packages: the first one, called SoftwareOperators, contains the main elements found in a task model, for instance tasks, messages, shared resources, etc. The second sub-package, named SoftwareBehavior, contains the interactions between the software operators, for example a task precedence relationship, a task triggered by an event's arrival, transactions, etc. Through the SoftwareBehavior package, the model can be linked to a functional model without modifying it, the latter being referenced by the action step elements. Our meta-model also takes the hardware operational side into account; it is represented by the HardwarePlatform package, which enables the modeling of a processor, a multi-processor system, communication networks, etc. All the listed packages interact with the RealTimeProperties package in order to enrich models by adding real-time features for a prospective temporal analysis. The RealTimeProperties package represents the model library of the intermediate meta-model. Fig. 3 presents an excerpt of the intermediate meta-model, where each element's name begins with the initials of its container package name. For readability, this excerpt contains only the main elements; it also shows the relations between the elements that constitute the different design views.
Fig. 3. Excerpt of the meta-model
The intermediate meta-model has been designed independently from the application/functional model, so it does not require any entry point. Moreover, it can be used at two granularity levels. The first one is when one designs a model without integrating a functional model; the model can then be considered as a simple task model. The second level is reached when one constructs links between the functional model and the behavioral model (e.g. through action steps) in order to obtain a finer analysis. Whatever the granularity, the functional model remains unchanged for any future analysis domain. The design result can then be analyzed by any real-time scheduling analysis tool supported by the intermediate framework (see Fig. 4).
Fig. 4. Architecture of an application modeled with the meta-model
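To fix ideas, the sketch below gives a drastically simplified, purely illustrative rendering of the kind of information such a pivot model gathers from the software, hardware and real-time property packages. The real meta-model is a UML profile; none of these Python class names are its actual stereotypes, and the numeric values are invented:

```python
# Illustrative only: a toy "pivot model" mixing software, hardware and real-time properties.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ProcessingUnit:            # HardwarePlatform side
    name: str

@dataclass
class SchedulableTask:           # SoftwareOperators side, enriched with RealTimeProperties
    name: str
    period_ms: int               # periodicity
    priority: int                # fixed priority
    wcet_ms: int                 # worst-case execution time
    host: Optional[ProcessingUnit] = None

@dataclass
class PivotModel:
    units: List[ProcessingUnit] = field(default_factory=list)
    tasks: List[SchedulableTask] = field(default_factory=list)

robot = ProcessingUnit("processor_robot")
model = PivotModel(
    units=[robot],
    tasks=[SchedulableTask("Regulate", period_ms=10, priority=1, wcet_ms=2, host=robot)],
)
print(len(model.tasks), model.tasks[0].host.name)  # 1 processor_robot
```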
4 Case Study
In this section we illustrate the application of the intermediate meta-model to a case study in the embedded systems domain. The hardware architecture is composed of two processors interconnected through a wireless LWAPP network. The first
processor is a leader station; it implements a simple graphical user interface allowing the robot to be commanded and its battery energy level to be displayed. The second processor is an embedded system on the robot that hosts the robot's operators. The software architecture is described in the class diagram given in Fig. 5. The robot operator contains three tasks and one shared resource. The first task, Regulate, receives instruction data from the station, shares it with the display task via a blackboard communication (a mutual-exclusion-protected memory area) and sends commands to the actuators. The third task acquires the battery energy level and sends it to the station through the network. The software of the station processor consists of three tasks: the acquisition task acquires direction and power and communicates with the treatment task, which sends instruction data to the robot via the network; the third task displays the battery energy level.
Fig. 5. Software operational model
4.1 Operational View
Fig. 5 illustrates the software side of the application described previously. It shows the different tasks and their communications. Each task, stereotyped <<SoSchedulableTask>>, can use many communication resources to exchange data with other tasks in a synchronous or asynchronous manner. By using the tagged values, each task can be linked to its process space, to the scheduler and to the processing units. Fig. 6 represents the hardware side of the application; for example, the network is designed using the <> stereotype, and it contains a communication channel which connects the robot processor to the station processor.
Fig. 6. Hardware platform model
4.2 Behavioral View
Fig. 7 shows the system behavior; it expresses the precedence relationships and communications between the different tasks and the way tasks are triggered. Through the behavioral view, we can express the periodicity, the priority and the ready time of each task. The behavioral model contains a set of activities; for example, acquisitionActv represents how the acquisition task behaves, and it can be activated by two kinds of triggers, stereotyped <<SbExternalEventTrigger>> and <<SbTimeTrigger>>. Let us remark that we did not use a functional model in this example.
Fig. 7. Software Behavior model
5 Conclusion
We presented an intermediate meta-model for schedulability-aware design of real-time systems. We outlined the architecture of the meta-model and explained how one can use it through an example application. Our future work will consist in extending the meta-model by adding modeling constraints in order to formally obtain a correct design before forwarding it to a schedulability analysis tool. We are also working on the model transformations that will be used to create instances of the meta-model, using the information extracted from different methodologies and languages. Later, we will focus on model transformations from our framework to schedulability analysis tools.
References 1. W. SAE AADL. The SAE Architecture Analysis & Design Language Standard, volume 2008 (2008) 2. Audsley, N.C., Dd, Y.: Optimal priority assignment and feasibility of static priority tasks with arbitrary start times (1991) 3. Bartolini, C., Bertolino, A., De Angelis, G., Lipari, G.: A uml profile and a methodology for real-time systems design. In: EUROMICRO, pp. 108–117 (2006) 4. Bini, E., Di Natale, M., Buttazzo, G.: Sensitivity analysis for fixed-priority realtime systems. Real-Time Syst. 39, 5–30 (2008) 5. Davis, R.I., Burns, A.: Hierarchical fixed priority pre-emptive scheduling. In: Proceedings of the 26th IEEE International Real-Time Systems Symposium, pp. 389– 398. IEEE Computer Society, Washington, DC (2005) 6. Debruyne, V., Simonot-Lion, F., Trinquet, Y.: EAST-ADL: An architecture description language. In: Architecture Description Languages. IFIP, vol. ch. 12, pp. 181–195. Springer, Boston (2005) 7. Espinoza, H., Cancila, D., Selic, B., G´erard, S.: Challenges in combining sysml and marte for model-based design of embedded systems. In: Paige, R.F., Hartman, A., Rensink, A. (eds.) ECMDA-FA 2009. LNCS, vol. 5562, pp. 98–113. Springer, Heidelberg (2009) 8. Gu, Z., He, Z.: Real-time scheduling techniques for implementation synthesis from component-based software models. In: ACM SIGSOFT International Symposium on Component Based Software Engineering, CBSE (2005) 9. Hagner, M., Goltz, U.: Integration of scheduling analysis into uml based development processes through model transformation. In: Proceedings of International Multiconference on Computer Science and Information Technology - IMCSIT 2010, Wisla, Poland, October 18-20, pp. 797–804 (2010) 10. Henia, R., Hamann, A., Jersak, M., Racu, R., Richter, K., Ernst, R.: System level performance analysis - the symta/s approach. In: IEE Proceedings Computers and Digital Techniques (2005) 11. Joseph, M., Pandya, P.K.: Finding response times in a real-time system. Comput. J. 29(5), 390–395 (1986) 12. Kay, J., Lauder, P.: A fair share scheduler. Commun. ACM 31, 44–55 (1988) 13. Liu, C.L., Layland, J.W.: Scheduling algorithms for multiprogramming in a hardreal-time environment. J. ACM 20(1), 46–61 (1973)
14. Masse, J., Kim, S., Hong, S.: Tool set implementation for scenario-based multithreading of uml-rt models and experimental validation. In: Proceedings of the The 9th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2003, p. 70. IEEE Computer Society, Washington, DC (2003) ´ 15. Medina, J.L., Cuesta, A.G.: From composable design models to schedulability analysis with uml and the uml profile for marte. SIGBED Rev. 8, 64–68 (2011) 16. Pasaje, J.L.M., Harbour, M.G., Drake, J.M.: Mast real-time view: A graphic uml tool for modeling object-oriented real-time systems. In: Proceedings of the 22nd IEEE Real-Time Systems Symposium, RTSS 2001, p. 245. IEEE Computer Society, Washington, DC (2001) 17. Mishra, R., Rastogi, N., Zhu, D., Moss´e, D., Melhem, R.: Energy aware scheduling for distributed real-time systems. In: International Parallel and Distributed Processing Symposium, p. 21 (2003) 18. Mok, A.K., Chen, D.: A multiframe model for real-time tasks. IEEE Transactions on Software Engineering 23, 635–645 (1996) 19. Mraidha, C., Tucci-Piergiovanni, S., Gerard, S.: Optimum: a marte-based methodology for schedulability analysis at early design stages. SIGSOFT Softw. Eng. Notes 36, 1–8 (2011) 20. NGuyen, T.H.C., Richard, P., Bini, E.: Approximation techniques for responsetime analysis of static-priority tasks. Real-Time Systems 43, 147–176 (2009) 21. OMG. UML Profile for MARTE: Modeling and Analysis of Real-Time Embedded Systems (2009) 22. Richard, M., Richard, P., Grolleau, E., Cottet, F.: Contraintes de pr´ec´edences et ordonnancement mono-processeur. In: Teknea (ed.) Real Time and Embedded Systems, March 26-28, pp. 121–138 (2002) 23. Ridouard, F., Richard, P., Cottet, F.: Negative results for scheduling independent hard real-time tasks with self-suspensions. In: Proceedings of the 25th IEEE International Real-Time Systems Symposium, pp. 47–56. IEEE Computer Society, Washington, DC (2004) 24. Singhoff, F., Legrand, J., Nana, L., Marc´e, L.: Cheddar: a flexible real time scheduling framework. In: Proceedings of the 2004 Annual ACM SIGAda International Conference on Ada: The Engineering of Correct and Reliable Software for RealTime & Distributed Systems Using Ada and Related Technologies, SIGAda 2004, pp. 1–8. ACM, New York (2004) 25. SysML. OMG system modeling language (OMG SysML) V1.0 (2007) 26. Tindell, K., Clark, J.: Holistic schedulability analysis for distributed hard real-time systems. Microprocess. Microprogram. 40, 117–134 (1994) 27. Traore, K., Grolleau, E., Cottet, F.: Characterization and analysis of tasks with offsets: Monotonic transactions. In: Proceedings of the 12th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA 2006, pp. 10–16. IEEE Computer Society, Washington, DC (2006) 28. Vestal, S.: Fixed-priority sensitivity analysis for linear compute time models. IEEE Trans. Software Eng., 308–317 (1994) 29. Zhang, F., Burns, A., Baruah, S.: Sensitivity analysis of arbitrary deadline realtime systems with EDF scheduling. Real-Time Systems, 1–29 (April 2011)
Supporting Model Based Design
Rémi Delmas, David Doose, Anthony Fernandes Pires, and Thomas Polacsek
ONERA – The French Aerospace Lab, F-31055, Toulouse, France
{Remi.Delmas,David.Doose,Anthony.Fernandes_Pires,Thomas.Polacsek}@onera.fr
http://www.onera.fr
Abstract. In software systems engineering, the generally understood goal of verification is to assess the compliance of a software component with respect to the inputs and standards applying to a given phase in the design process. The goal of validation is to determine if the requirements are correct and complete, and validation is performed in the final system assessment phase. Nevertheless, the introduction of formal methods in model based engineering tends to blur the boundary between verification and validation, by allowing validation tasks to be performed early in the process, before the system has been fully designed and implemented. In particular, we consider recent work using constraint satisfaction techniques to perform formal verification and validation tasks at model level. The purpose of this article is twofold. First, we attempt to fit the existing methods and tools in a global design, verification and validation process. Second, we show that in addition to verification and validation, constraint based techniques can be used to automate part of the design activity itself, by synthesizing correct by construction and quantitatively optimal models from a specification. Keywords: MDE, formal methods, verification, validation, synthesis, optimization, constraint solvers.
1 Introduction
There is no formal definition of verification and validation that is uniformly agreed upon by the whole computer science community. If we refer to ISO/IEEE 15288, the role of validation is "to provide objective evidence that a system performs as expected by the stake-holders", and the role of verification is "to confirm that design requirements are fulfilled by the system". In [15], a thorough study of V&V in the context of numerical simulation is given, with a focus on how it is tightly linked to the origins of simulation: for mathematical models, validation amounts to making sure "we solve the right equations", whereas verification is checking that we "solve the equations right". In software systems engineering, we can safely consider that verification applies during the different creation phases, and that validation comes last in the process. The DO-178B
norm entitled Software considerations in airborne systems and equipment certification, specific to critical avionics software, defines verification as "the evaluation of the products of a design phase to ensure their correctness and consistency against inputs and standards applying to this phase", and validation as "ensuring that requirements are correct and complete". In practice, deciding whether a given task is verification or validation depends on one's role and position in the process, be it as the customer, the designer, the subcontractor, etc. Moreover, the introduction of formal methods in Model Driven Engineering (MDE) tends to blur the boundary even more, by introducing validation tasks early in the design process, supported by constraint satisfaction engines. The goal of this paper is to show that it is possible to use such techniques not only to support V&V activities, but also to provide automatic design assistance, by synthesizing correct models from specifications. In section 2, we review a family of constraint satisfaction techniques relevant to the MDE application framework, as well as the current state of the art of their integration in MDE tools. Section 3 presents our first contribution, feedback from experiments on using these existing techniques in a global design and V&V process for embedded software platforms. Section 4 introduces the main contribution of this paper, namely the use of constraint satisfaction techniques for automatic design generation, together with a description of a new model synthesis tool implementing these concepts. Section 5 illustrates the features and performance of this tool on a real-world-inspired application. Last, section 6 concludes the paper and draws some perspectives for this work.
2 Constraint Solvers and Modeling
2.1 Constraint Solvers
SAT Solvers. A SATisfiability solver is a decision procedure for propositional logic, operating on conjunctions of disjunctions of literals, a literal being a variable or its negation, which is able to find a satisfying assignment to a formula or prove that none exists. The SAT problem is the emblematic NP-complete problem. After years of relatively modest improvements over the original DPLL procedure [6], a performance breakthrough was achieved in the early 2000s with the introduction of Conflict Driven Clause Learning (CDCL) algorithms [14]. Since then, these algorithms have been continuously enhanced to obtain unprecedented performance and robustness levels, allowing a number of hard problems to be solved by compilation to SAT: formal verification of circuits or software, cryptography, product line configuration, dependency management, bio-informatics, etc. Every year, the SAT competition (http://www.satcompetition.org/) gathers a panel of SAT theorists and practitioners and allows the latest solver evolutions to be benchmarked. Pseudo-Boolean solvers. The pseudo-Boolean SAT problem can be seen as an extension of the SAT problem. A pseudo-Boolean problem is a conjunction of
constraints of the form Σi ai li ≤ k, where ai and k are integer coefficients and li a literal. A first approach to solving this problem is by compilation to SAT [2]. Other algorithms, based on CDCL principles extended with dedicated propagation and conflict learning rules [12], operate natively on the pseudo-Boolean structure. In addition, pseudo-Boolean solvers allow numerical criteria specified as linear, weighted combinations of literals to be optimized. The search for an optimal solution is done either through MAX-SAT-inspired algorithms, or through relaxation techniques based on UNSAT cores [13]. Every year the international PB-Eval competition (http://www.cril.univ-artois.fr/PB09/) allows the latest evolutions of the field to be benchmarked. CSP solvers. A constraint satisfaction problem is expressed at a higher level than a SAT or PB problem, by specifying a set of discrete or continuous variables, a domain for each variable, and a set of arbitrary constraints (arithmetic and logic constraints) over the variables. A solution to a CSP is a valuation of the variables in their domains that satisfies all constraints. A large number of CSP solvers exist today, based on various principles: arc-consistency algorithms, local search, linear relaxations and LP solving, but also SAT [20] or pseudo-Boolean compilation, etc. Again, an international solver competition (http://cpai.ucc.ie/09/) allows the latest evolutions to be benchmarked every year.
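As a toy illustration of the kind of problem these solvers handle (real pseudo-Boolean solvers rely on CDCL-style search and learning, not the exhaustive enumeration below), a set of constraints Σi ai li ≤ k together with a linear objective can be stated and solved as follows:

```python
from itertools import product

# A literal is (variable_index, polarity); a negated literal evaluates to 1 - x.
def lit_value(assignment, lit):
    var, positive = lit
    return assignment[var] if positive else 1 - assignment[var]

def satisfies(assignment, constraints):
    # Each constraint is (terms, k) with terms a list of (coefficient, literal), meaning sum <= k.
    return all(sum(a * lit_value(assignment, l) for a, l in terms) <= k
               for terms, k in constraints)

def minimize(n_vars, constraints, objective):
    """Exhaustively search for an assignment satisfying all constraints with minimal objective."""
    best = None
    for bits in product((0, 1), repeat=n_vars):
        if satisfies(bits, constraints):
            cost = sum(a * lit_value(bits, l) for a, l in objective)
            if best is None or cost < best[0]:
                best = (cost, bits)
    return best

# "x0 + x1 + x2 >= 2" written as "-x0 - x1 - x2 <= -2"; objective: minimize x0 + 2*x1 + 3*x2.
at_least_two = ([(-1, (0, True)), (-1, (1, True)), (-1, (2, True))], -2)
print(minimize(3, [at_least_two], [(1, (0, True)), (2, (1, True)), (3, (2, True))]))  # (3, (1, 1, 0))
```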
2.2 Validation of a Specification
In specification documents, the entities and constraints of a problem are often defined informally in natural language. Graphical languages such as the Unified Modeling Language (UML) or the Systems Modeling Language (SysML) allow the original problem to be formalized in a non-ambiguous manner. Moreover, by conveying visually intelligible cues, and by allowing several people from different backgrounds to cooperate on a design, they help master some of the design task's complexity. However, graphical-only specification languages lack the expressiveness needed to completely formalize the semantics of a design, and it becomes necessary to extend the graphical models with textual specifications written in the Object Constraint Language (OCL), for instance. In this situation constraint solvers can be valuable to analyze generic properties of a specification, to verify that the formalization of a problem matches the designer's intuition and takes into account all aspects of the original problem without contradiction or redundancy. Using a correct specification as a starting point is almost mandatory in the MDE context, where an increasingly large part of the implementation of a system can be generated from the specification. Specification errors which are not detected before implementation only increase the development costs. The tool UMLtoCSP [5] compiles a UML specification to a CSP and uses the ECLiPSe (http://eclipseclp.org/) solver to automatically generate a collection of instances
(objects, relations, attribute values) which allows the following generic properties (among others) of a specification S to be witnessed:
Consistency – S has at least one instance where each class has at least one object;
Weak consistency – S has at least one instance where at least one class has at least one object;
Class consistency for a class C – S has at least one instance where class C has at least one instance;
Constraints independence – S has several non-empty finite instances in which the different constraints can be satisfied or violated independently from one another. This allows showing that each constraint really brings extra information to the specification.
The tool UML2Alloy [1] allows similar analyses (consistency, independence, etc.) to be conducted, but proceeds by translating a UML/OCL specification to an Alloy specification [8]. The formal analysis is performed using the Alloy Analyzer or the SAT-based KodKod back-end [21]. An interesting feature of UML2Alloy is that it was itself implemented using a typical MDE approach, by generating a UML model of the Alloy language from its EBNF, by defining the model of a suitable UML subset in UML, and by using a model transformation to encode the translation rules from UML to Alloy. Last, the authors of [18] describe a tool allowing the analysis of generic properties of UML class diagrams and accompanying OCL constraints by compilation to SAT instances analyzed using the MiniSat solver. One should always keep in mind, however, that reasoning on class diagrams alone, without extra constraints, is already an EXPTIME-hard problem [4]. In the general case, with arbitrary constraints, it is impossible to know in a finite amount of time whether a satisfying instance exists. The problem becomes decidable when it is restricted to searching for satisfying instances of fixed and finite cardinality. However, the absence of satisfying instances of cardinality n does not prove the absence of satisfying instances of cardinality m > n.
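To picture what bounded instance finding means in practice (tools such as UMLtoCSP or the Alloy/KodKod back-ends do this with far better pruning than the brute force below; the Department/Employee specification is purely illustrative), consistency within a fixed cardinality can be checked by enumeration:

```python
from itertools import product

# Toy specification: classes Department and Employee linked by a mandatory single-valued
# "works_in" end, with the invariant "every Department has between 1 and 3 Employees".
def consistent_within_bound(n_departments, n_employees):
    """Does an instance with exactly these cardinalities satisfying the invariant exist?"""
    for links in product(range(n_departments), repeat=n_employees):
        # links[e] is the department employee e works in.
        sizes = [sum(1 for d in links if d == dep) for dep in range(n_departments)]
        if all(1 <= s <= 3 for s in sizes):
            return True
    return False

# A witness exists for (2 departments, 4 employees); none for (2, 1): a department stays empty.
# As noted above, failure at one cardinality says nothing about larger cardinalities.
print(consistent_within_bound(2, 4), consistent_within_bound(2, 1))  # True False
```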
2.3 Instance Verification
The verification of a completely determined instance against a model and OCL constraints is a well-studied problem. Currently, numerous tools allow OCL constraints to be verified through an exhaustive evaluation on the instance, most frequently using one of the following approaches:
Interpretation – A generic OCL interpreter, parametrized by the instance to verify and the OCL constraints, traverses the instance and evaluates the OCL expressions. Representatives of this technique are USE [7] or the now discontinued ROCLET [9].
Code generation – Executable code dedicated to the evaluation of OCL constraints (most often Java code) is generated from the OCL constraints, and runs directly on the data structures representing the instance in memory, bringing a
significant performance gain. The most advanced representative of this approach is the Dresden environment (http://www.reuseware.org/index.php/DresdenOCL). In [16], the authors address the scalability issue for constraint verification and compare two verification techniques. First, they generate executable code from OCL constraints to verify properties of an Ecore instance. In comparison to interpretation-based approaches, the performance gain is noticeable when model cardinalities become large. Second, the authors transform the design problem into a Constraint Satisfaction Problem, which is solved using the Choco solver [10] (http://www.emn.fr/z-info/choco-solver/index.html). The CSP-based approach, while not on par with executable code generation, still represents an improvement over the interpretation approach, thanks to the solver's pruning abilities. This observation is also reported in [18]. It should be noted, however, that constraint-based tools only offer partial support for OCL constructs. Tools based on interpretation and code generation, introduced earlier and hence more mature, provide almost full support for OCL. As will be seen in the rest of this paper, the added value of constraint satisfaction is not only its ability to verify instances, but first and foremost its ability to synthesize instances.
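To give a flavour of the code-generation approach (the tools cited above emit Java; the snippet below is only a hand-written Python analogue for one invented invariant, "a Department has at most 3 employees"), the generated checker is essentially a plain traversal of the in-memory instance:

```python
# Hand-written analogue of code a generator could emit for the (illustrative) OCL invariant
# "context Department inv: self.employees->size() <= 3".
class Department:
    def __init__(self, name):
        self.name = name
        self.employees = []

def check_inv_department_max_employees(departments):
    """Return the objects violating the invariant; an empty list means the instance is valid."""
    return [d for d in departments if len(d.employees) > 3]

depts = [Department("R&D"), Department("Sales")]
depts[0].employees.extend(["e1", "e2", "e3", "e4"])   # violates the invariant
print([d.name for d in check_inv_department_max_employees(depts)])  # ['R&D']
```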
3 A Design Process
As seen in section 2, the integration of constraint solvers in MDE has the following consequences: first, it opens new perspectives on the validation of specifications, and second, it allows the verification of instances against specifications to be scaled up. In this section we try to integrate these new technologies in a global modeling, design and V&V process, proposed in figure 1. This process originated in the context of embedded software architecture design (constrained resource allocation, placement, data routing). Its purpose is to allow handling models with complex properties, or large instances encountered in practice, which cannot be handled by manual V&V techniques. The process comprises the following four phases:
1 – Design of the Specification. It consists in formalizing the original problem from natural language into a class diagram or a DSL, in a notation such as UML, SysML or Ecore, extended with semantic constraints.
Fig. 1. A design, verification and validation process using constraint solving
2 – Validation of the Specification. The goal is to establish good confidence in the formalization of the specification, by using constraint-based analysis tools such as the ones described in section 2 to verify generic properties of the specification, as described in section 2.2. Even if verification techniques are used, this task is really an early validation task, done before any system implementation is available, because its purpose is to analyze properties of the set of requirements, and not the conformance of a particular implementation against the requirements.
3 – Solution Creation. In this step, the user attempts to build an instance (i.e. a model in the DSL defined in step 1) that fully satisfies the specification, either manually or using the support of some constraint-solver-based automatic tool.
4 – Verification of the Solution. In this step, the correctness of the instance built in step 3 is checked formally against the formal specification. If any problems are detected, phases 3 and 4 are iterated until completion.
Depending on the complexity of the constraints and on the size of the problem, step 3 can become infeasible by hand without automatic assistance from the computer. In [16] the authors show on an example that constraint solvers can be used not only to verify the correctness of an instance, but also to synthesize correct instances from partial instances. The CSP solver Choco is run on the CSP model of the design problem, generated from the specification and a partial instance, in order to generate an extended solution that satisfies all constraints of the specification. In the next section, we present our own proposal for automatic design support, in which the user builds a partial solution which gets automatically extended by the synthesis tool to become correct and optimal with respect to some quantitative design criterion.
4 Design Assistance through Constraint Solving
4.1 Correct Versus Optimal Designs
The requirements of computerized systems have become so complex that it is difficult, if not impossible, for a human to produce both correct and optimal designs. As a consequence, design assistance tools appear very valuable, and in some cases necessary, to achieve design goals. With this work, we are trying to bring automatic design assistance to domain-oriented applications. To do so, we have developed a tool able to read Ecore specifications and partial instances (also in Ecore format) representing seeds for potential solutions, in which all class objects are declared and only the relations between objects and some of their attribute values are unknown. This tool uses various constraint solvers as back-ends and attempts to generate an extension of the given seed that satisfies the specification. In practice, the partial instance used as a starting point is built manually, but recent work by [17] proposes automatic completion in a DSL editor through constraint solving as well. While constraint solvers allow partial instances to be extended to meet qualitative requirements, they also allow the produced solutions to be optimized with respect to
quantitative design metrics. When several instances satisfying a set of constraints exist, some of them can be considered better than others with respect to criteria that cannot be formalized as purely logical constraints, but rather through ranking functions. A ranking function takes a solution and returns a numerical value which reflects its quality according to some domain-specific criterion. It is only once quantitative criteria are taken into account that constraint solvers can be used not merely to probe the solution space randomly, as is done with tools like UMLtoCSP or UML2Alloy, but to actually generate designs which make sense from an engineering point of view. Our process and tools, currently in a prototype state, allow the user to specify qualitative and quantitative criteria on Ecore [19] models, in a simple logic described in section 4.2. Requirements and partial solutions are translated either to a Choco CSP, or to a pseudo-Boolean problem in OPB format, which can then be analyzed with a solver such as Sat4J-PB [11] or WBO [13], or any OPB-compatible solver.
4.2 A Simple Constraint Language
Instead of trying to fully support an OCL-like language right away, we decided to focus on the types of constraints on which the chosen set of solvers offer decent performance, and to bring them into MDE. Our language is less expressive than OCL, but in its current state it is sufficient to capture specifications routinely encountered in our application domain (aerospace embedded systems). It was meant to be simple to use by non-computer scientists. It allows the structural part of a design, partial instances and relatively complex constraints to be expressed while staying in the class of constraints suited to the targeted solvers, as well as numerical criteria capturing design preferences. Last, being based on a restricted set of constructs, it is relatively simple to build tool support for it. The language offers base types: boolean, integer, object and set.
M ::= [C] B* [D*]
O ::= o | O.attr
I ::= I + I | if B then I else I | Count(o in S, B)
B ::= F | T | not B | B and B | B or B | Q | I >= I | O in S | O nin S
S ::= O.attr | S.ref | classId | {O*}
Q ::= forall (v, B) | exists (v, B)
C ::= solve | minimize I | maximize I
D ::= complete S.ref

Fig. 2. EBNF of the simple constraint language (subset)
It allows logical expressions, set-theoretic expressions, cardinality constraints on sets and populations of booleans, and a restricted form of arithmetic expressions to be written. Figure 2 gives an extract of the EBNF of the language, necessary to understand the examples presented in section 5. A constraint specification is a collection of boolean expressions B representing qualitative constraints, together with an optional numerical criterion C representing a quantitative metric. The class declarations are imported from an Ecore file, and the partial instance from an XMI
file. The operator semantics is standard. Count(o in S, B) counts the number of objects o of S satisfying an expression B. in and nin represent membership and non-membership tests. The construct complete S.ref allows the user to specify that a class reference is fixed and should not be extended by synthesis; by default they can all be extended. The criterion solve asks for an arbitrary solution satisfying the constraints. The criterion minimize (resp. maximize) asks for a solution which minimizes (resp. maximizes) the value of an integer expression I. The tool translates this logic to pseudo-Boolean logic in two steps. First, quantified expressions are expanded on their domains, performing constant propagation and normalization at the same time. Then, the propositional part of the normalized formulas is translated to clauses using standard Tseitin rules [22]. For set-theoretic expressions, the membership o1 in o2.attr and equality o1 = o2.attr constructs are translated to fresh literals, and cardinality constraints are translated to pseudo-Boolean constraints over the corresponding propositions. Arithmetic expressions are normalized as c0 + Σi ci·li, using (1) and (2) below. Last, integer relational expressions are normalized using (3) to obtain native pseudo-Boolean constraints.

if c then i else j ⇝ j + i·c − j·c   (1)

Count(x ∈ {o1, ..., on}, x.a) ⇝ Σ_{x ∈ {o1,...,on}} x.a   (2)

e relop e ⇝ Σi ci·li − Σj cj·lj relop c′0 − c0   (3)
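To make rules (1) and (2) concrete, the sketch below transcribes them on a toy representation of linear pseudo-Boolean expressions (a constant plus a map from literal names to coefficients); this is only illustrative, not the tool's actual implementation:

```python
# A linear pseudo-Boolean expression: constant + sum(coefficient * literal).
def linear(const=0, **coeffs):
    return {"const": const, **coeffs}

def ite(c, i, j):
    """Rule (1): 'if c then i else j' normalizes to j + i*c - j*c for integer constants i, j."""
    return linear(const=j, **{c: i - j})

def count(objects, attr):
    """Rule (2): Count(x in {o1..on}, x.a) normalizes to the sum of the literals x.a."""
    return linear(**{f"{o}.{attr}": 1 for o in objects})

print(ite("x.enabled", 5, 2))             # {'const': 2, 'x.enabled': 3}, i.e. 2 + 3*x.enabled
print(count(["o1", "o2", "o3"], "eval"))  # {'const': 0, 'o1.eval': 1, 'o2.eval': 1, 'o3.eval': 1}
```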
5 Illustration
This section illustrates the proposed process and tool on a case study, described in section 5.1, taken from embedded software architecture design. Section 5.2 illustrates synthesis and quantitative optimization on a small, human-readable instance. Last, section 5.3 discusses the scalability of the tool on large models.
5.1 Case Study Description
In this example, data flows must be routed in an embedded network. The network architecture is modeled as a set of paths; each data flow must be allocated to a finite number of paths taken from a flow-dependent set of allowed paths. In addition, a set of minimal operating configurations and a set of failure scenarios are modeled. The allocation of data flows to network paths must guarantee that all minimal operating conditions are satisfied in each possible failure scenario. Failures disable some of the paths and hence the data flows they carry. A minimal operating condition indicates which combination of data flows must always be available. A simple solution would be to allocate each data flow to all paths. This solution makes no sense from an engineering point of view, so we introduce a quantitative criterion measuring the number of paths used by the mapping, and we ask for this number to be minimized.
The problem is modeled by the Ecore diagram in figure 3 using five classes:
(i) Flow: a data flow, characterized by the collection of Paths on which it can be mapped (allowedPaths), and on which it is actually mapped (mapping);
(ii) Path: a physical path in the embedded network, characterized by the collection of FailureScenarios it is affected by;
(iii) FailureScenario: a failure scenario, characterized by the collection of physical paths it affects;
(iv) DependencyTree: a minimal operating configuration, given as an N-ary tree of arbitrary depth where each Leaf represents the availability of a data flow, in a given failure scenario, as a function of its mapping, and each Node specifies the availability of at least nbChildrenMin out of its N children. These trees represent nested M-out-of-N combinations of data flows that must be available regardless of the active failure scenario;
(v) Model: a root container for all the entities above.
Fig. 3. Case study Ecore diagram
A mapping of Flows to Paths is safe if and only if, for each FailureScenario, all roots of the DependencyTrees of the Model evaluate to True. Listing 1.1 shows how this is expressed in our constraint language. Constraints 1 and 2 define the evaluation semantics of dependency trees, and constraint 3 specifies the top-level constraint that must hold in a safe allocation. Last, the minimization of the number of Paths used by the mapping is defined as shown on the fourth line of Listing 1.1.

constraint(forall l : Leaf, l.eval <=> (exists p : l.flow.mapping (p nin l.failure.impact))).
constraint(forall n : Node, n.eval <=> ((count k : n.children k.eval) >= n.nbChildrenMin)).
constraint(forall d : model.dependencies, d.eval).
minimize(count p : Path (exists f : Flow (p in f.mapping))).

Listing 1.1. Safe mapping specification
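Listing 1.1 is declarative; to make its meaning concrete, the safety condition it imposes on a candidate mapping can be phrased operationally as below. This is a hand-written evaluator over plain Python structures, with the dependency trees flattened to one level and every condition required in every scenario (not the tool's pseudo-Boolean encoding); the data used corresponds to the small example detailed in Sect. 5.2:

```python
def leaf_available(flow_paths, impacted_paths):
    # A leaf holds iff the flow keeps at least one mapped path outside the failure's impact.
    return any(p not in impacted_paths for p in flow_paths)

def mapping_is_safe(mapping, failure_scenarios, dependencies):
    """mapping: flow name -> set of paths; dependencies: list of (nbChildrenMin, [flow names])."""
    for impacted in failure_scenarios.values():
        for nb_min, flows in dependencies:
            available = sum(leaf_available(mapping[f], impacted) for f in flows)
            if available < nb_min:
                return False
    return True

# Optimized solution reported in Sect. 5.2: Flux_2 and Flux_3 mapped on BUS_2 and BUS_3.
mapping = {"Flux_1": set(), "Flux_2": {"BUS_2", "BUS_3"}, "Flux_3": {"BUS_2", "BUS_3"}}
scenarios = {"s1": {"BUS_1"}, "s2": {"BUS_2"}, "s3": {"BUS_3"}, "s12": {"BUS_1", "BUS_2"}}
dependencies = [(1, ["Flux_2"]), (1, ["Flux_1", "Flux_3"])]
print(mapping_is_safe(mapping, scenarios, dependencies))   # True
print(len(set().union(*mapping.values())))                 # 2 paths used; BUS_1 is not needed
```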
5.2 Synthesis and Quantitative Optimization
Let us consider an architecture composed of two computers (CP_A, CP_B) and three communication buses (BUS_1, BUS_2, BUS_3). Three Flow instances
(Flux_1, Flux_2, Flux_3) must be mapped on this architecture. All flows have CP_A as source and CP_B as destination. The set of Path instances is made of BUS_1, BUS_2 and BUS_3. The mapping of Flux_1 is allowed on either BUS_1 or BUS_2. The mapping of Flux_2 is allowed on any of the three buses. The mapping of Flux_3 is allowed on BUS_1 and BUS_3 (this defines the allowedPath attribute of the Flow instances in the model). FailureScenarios are defined as follows: BUS_1 fails alone; BUS_2 fails alone; BUS_3 fails alone; BUS_1 and BUS_2 fail at the same time. The two following minimal operating configurations are required to hold: Flux_2 must be available in all failure scenarios; at least one of Flux_1 and Flux_3 must be available in all failure scenarios. First, via a simple synthesis without optimization, we obtain a naive solution in which all flows are mapped to all allowed paths. It is not satisfying because it does not allow potentially useless structural redundancies in the data routing architecture to be discovered. By optimizing the criterion "number of paths used by the mapping" (cf. Listing 1.1), the solution obtained is as follows: Flux_1 is not mapped at all, Flux_2 is mapped on paths BUS_2 and BUS_3, and Flux_3 is mapped on paths BUS_2 and BUS_3. This solution reveals two important facts: first, that Flux_1 has no importance for the safe operation of the system; second, that BUS_1 is not needed to ensure safe operation.
5.3 Scalability
In order to show that the proposed approach has good scalability, we generated a large number of instances as follows: each instance has 1000 Path instances and 100 Flow instances; the number of allowedPaths per Flow is 10; the number of FailureScenarios is 100; each scenario impacts 10 Paths. DependencyTrees were generated as balanced 5-trees of depth 3, and their number varied between 100 and 1000. We generated several partial Ecore models for each configuration. The number of pseudo-Boolean variables and constraints varies explosively with the number of dependency trees of the model. However, OPB models generated in this way are often trivially unsatisfiable. For the benchmarks to be representative, we maximize the number of satisfied dependency trees, which is expressed as shown in Listing 1.2. In industrial practice, over-constrained specifications are frequently encountered. They are consistent in the sense of section 2.2, but the partial instances produced by hand render them unsatisfiable because of some hidden design problem. Being able to automatically identify the maximal subset of satisfiable requirements on a given partial instance, with good performance, using state-of-the-art MAX-SAT algorithms adds great value to such analyses, by allowing the root causes of design errors in the hand-built partial instances to be pinpointed and corrected in an iterative process.

maximize(count d : model.dependencies d.eval).

Listing 1.2. Maximization of the number of satisfied dependencies
Table 1. Performance analysis summary

Dep. Nb.   Variables Nb.   Constraints Nb.   Computation time (Min. / Max. / Avg.)
100        482 641         3 281 656         1.4s / 9.9s / 5.2s
500        2 354 642       16 273 752        22.0s / 164.6s / 56.7s
1000       4 694 638       32 508 466        42.5s / 409.6s / 177.6s
Experiments were conducted on a 3 GHz Intel Core 2 Duo computer using the WBO solver, and are summarized in Table 1. These results show the excellent performance and robustness of state-of-the-art pseudo-Boolean solvers, which can handle very large models. They validate our choice of back-end solving technology (CDCL pseudo-Boolean solvers) and show a good potential for scalability to large-scale industrial design problems.
6 Conclusion
The goal of this work was twofold: first, to identify the most appropriate ways to use emerging constraint solving technologies in a model-driven design and validation process; second, to propose new ways of using these technologies (and the accompanying tools) for automatic synthesis and quantitative design optimization that scale to industrial problems. A very positive feature of the MDE approach is the ability to easily communicate design data and requirements using formalisms well established in industry. The problem used as an illustration in this paper is frequently encountered in critical embedded systems design, and the results are very positive for this kind of application, in which formal methods can be used to debug a specification and to generate designs that are optimal according to some domain-specific metric. Even if it is relatively easy to reach the limits of the current solving technology, our experiments showed that a number of industrial design problems are amenable to formalization and analysis using the proposed process.
Future work aims at making this technology more mature and robust. The creation of the dedicated constraint language presented in Section 4.2 was guided by our case studies, but it would be valuable to clearly identify its expressiveness limits and to extend it so that it becomes usable on more applications, or so that it can accept OCL specifications. On the other hand, our tool chain is currently composed of ad-hoc translators; work is planned to make it more generic and better integrated in a major MDE editor platform. In addition, to make constraint solving more robust, we plan to implement a portfolio solver back-end, in which a variety of different solvers would be used in parallel (SAT, BP, CSP, SMT [3]), in order to benefit from the most efficient solver in each case. Finally, even greater added value for designers could be obtained by generating understandable explanations for the inconsistency of a specification or partial model. Some SMT solvers, for instance, are able to isolate minimal UNSAT cores, at a very low level. Work remains to be done to find useful ways of translating this low-level information into the MDE framework, and to use it to guide the evolution of designs or specifications.
References

1. Anastasakis, K., Bordbar, B., Georg, G., Ray, I.: UML2Alloy: A Challenging Model Transformation. In: Engels, G., Opdyke, B., Schmidt, D.C., Weil, F. (eds.) MODELS 2007. LNCS, vol. 4735, pp. 436–450. Springer, Heidelberg (2007)
2. Bailleux, O., Boufkhad, Y., Roussel, O.: New encodings of Pseudo-Boolean constraints into CNF. In: SAT (2009)
3. Barrett, C., Stump, A., Tinelli, C.: The SMT-LIB standard: Version 2.0. In: Proceedings of the 8th International Workshop on Satisfiability Modulo Theories, Edinburgh, England (2010)
4. Berardi, D., Calvanese, D., Giacomo, G.D.: Reasoning on UML class diagrams. Artificial Intelligence 168 (October 2005)
5. Cabot, J., Clarisó, R., Riera, D.: Verification of UML/OCL Class Diagrams using Constraint Programming. In: ICSTW 2008 (2008)
6. Davis, M., Logemann, G., Loveland, D.W.: A machine program for theorem-proving. Commun. ACM 5(7) (1962)
7. Gogolla, M., Büttner, F., Richters, M.: USE: A UML-based specification environment for validating UML and OCL. Sci. Comput. Program. 69(1-3) (2007)
8. Jackson, D.: Alloy: A logical modelling language. In: Bert, D., Bowen, J.P., King, S. (eds.) ZB 2003. LNCS, vol. 2651, p. 1. Springer, Heidelberg (2003)
9. Jeanneret, C., Eyer, L., Marković, S., Baar, T.: RoclET: Refactoring OCL Expressions by Transformations. In: ICSSEA (2006)
10. Jussien, N., Rochart, G., Lorca, X.: The CHOCO constraint programming solver. In: CPAIOR 2008 Workshop on Open-Source Software for Integer and Constraint Programming (OSSICP 2008), Paris, France (June 2008)
11. Leberre, D.: SAT4J, a SATisfiability library for Java (2004)
12. Leberre, D., Parrain, A.: À propos de l'extension d'un solveur SAT pour traiter des contraintes pseudo-booléennes. In: JFPC 2007 (2007)
13. Manquinho, V.M., Martins, R., Lynce, I.: Improving unsatisfiability-based algorithms for Boolean optimization. In: SAT (2010)
14. Moskewicz, M., Madigan, C.F., Zhao, Y., Zhang, L., Malik, S.: Chaff: Engineering an Efficient SAT Solver. In: DAC (2001)
15. Roache, P.J.: Verification and validation in computational science and engineering. Hermosa Publishers (1998)
16. de Roquemaurel, M., Polacsek, T., Rolland, J.F., Bodeveix, J.P., Filali, M.: Assistance à la conception de modèles à l'aide de contraintes. In: AFADL 2010 (2010)
17. Sen, S., Baudry, B., Vangheluwe, H.: Towards domain-specific model editors with automatic model completion. Simulation 86(2) (2010)
18. Soeken, M., Wille, R., Kuhlmann, M., Gogolla, M., Drechsler, R.: Verifying UML/OCL Models Using Boolean Satisfiability. In: Müller, W. (ed.) Proc. Design, Automation and Test in Europe, DATE 2010 (2010)
19. Steinberg, D., Budinsky, F., Paternostro, M., Merks, E.: EMF: Eclipse Modeling Framework 2.0. Addison-Wesley Professional, Reading (2009)
20. Tamura, N., Tanjo, T., Banbara, M.: Solving constraint satisfaction problems with SAT technology. In: Blume, M., Kobayashi, N., Vidal, G. (eds.) FLOPS 2010. LNCS, vol. 6009, pp. 19–23. Springer, Heidelberg (2010)
21. Torlak, E., Jackson, D.: Kodkod: A relational model finder. In: Grumberg, O., Huth, M. (eds.) TACAS 2007. LNCS, vol. 4424, pp. 632–647. Springer, Heidelberg (2007)
22. Tseitin, G.S.: On the complexity of derivations in the propositional calculus. Studies in Mathematics and Mathematical Logic II (1968)
Modeling Approach Using Goal Modeling and Enterprise Architecture for Business IT Alignment

Karim Doumi, Salah Baïna, and Karim Baïna

ENSIAS, Mohamed V - Souissi University, Morocco
[email protected], {sbaina,baina}@ensias.ma
Abstract. Nowadays, business IT alignment has become a priority in most large organizations. It is a question of aligning the information system with the business strategy of the organization. This step is aimed at increasing the practical value of the information system and making it a strategic asset for the organization. Many works have shown the importance of documenting, analyzing and evaluating business IT alignment, but few propose solutions applicable at both the strategic and functional levels. This paper aims to fill this gap by proposing a simple approach for modeling enterprise strategy in the context of strategic alignment. This approach is illustrated by a case study of a real project in a Moroccan public administration.

Keywords: strategic alignment, goal modeling, enterprise architecture, business process, information system.
1 Introduction

The strategy of the enterprise is to set up the long-term commitments needed to reach its explicit objectives. It is a question of studying, via real cases, how an enterprise can position itself in an international competitive environment. Aligning this strategy with the evolution of the information system requires perfect coherence of all actions and decisions with the strategic objectives of the enterprise. This alignment transforms strategic objectives into operational actions so that they are reflected in the information system.
Today, it is not enough to build powerful information systems. In order for the enterprise to perform well and be able to compete and evolve, its information systems and business processes must be permanently aligned and in perfect coherence with its strategy. Many authors have shown the importance of alignment in the evolution of the enterprise [1-2] and, according to [3-6], this alignment has a great influence on the performance of the organization; any rupture in the alignment process causes a drop in the organization's performance. Although the value of alignment is widely recognized, its implementation remains very limited. According to [1,7], few leaders consider that the strategy and the information systems are aligned. This implies that actors of the organization are not able to distinguish between alignment and non-alignment.
Moreover, the absence of methods for maintaining alignment makes the task extremely difficult at the decisional level. There exist a number of models of strategic alignment. A well-known one is Henderson and Venkatraman's Strategic Alignment Model, which gives a rather global vision of strategic alignment. However, this kind of model remains closely tied to the field of management. According to [17], an engineering step is necessary to analyze the strategic alignment of the information system. This vision is also supported by the approaches of enterprise architecture [10] as well as by information system leaders [9]. In the literature, several approaches have been developed to solve the alignment problem:
- Enterprise architecture approach (French school): urbanization of the information system [9]. This approach provides a guide to manage strategic alignment and to define the future information system. However, it does not say how to ensure a joint evolution of the enterprise strategy, its business processes and its information system, nor how to measure and improve the alignment between these elements.
- Approach for modeling and building requirement-oriented alignment [10]: Bleistein's approach is interesting in the sense that it takes the strategic level into account in the representation of alignment, but it is impractical and very complicated to master. It is an approach for building alignment, not for evaluating it or making it evolve.
- Approach for evaluating and evolving strategic alignment [2]: Luftman's approach gives guidance for the construction of alignment. It does not seek to change the aligned elements but to achieve a higher maturity level of alignment between strategic objectives and IT strategy.
- Approach for modeling and building alignment between the environment, the processes and the systems [11]: for example, the SEAM method uses the same notations at the different levels and thus between the different elements of alignment. The SEAM method does not take into account the particularity of each level of abstraction.
- Approach for evaluating the degree of alignment between business processes and the information system [12, 13]: this approach allows the alignment between business processes and the information system to be modeled and evolved, but it does not take the strategic level into consideration in the representation of alignment.
- Approach for evaluating the degree of alignment between the enterprise strategy and the couple <business process, information system> [14]: the INSTAL method takes the strategic level into consideration in representing strategic alignment, but it is impractical because it uses a map formalism that does not include all elements of the strategic level.
- Value-oriented approach [22]: the e3-value framework is particularly interested in the value stream, i.e., the creation, exchange and consumption of value objects in a multi-actor network that includes the company itself and its environment (customers, partners). According to Bleistein, a crucial point missing in e3-value is the distinction between value analysis and business strategy. Moreover, the link between the creation of economic value and low-level system goals is unclear. Tools and guidance for evolution are also not defined.
In all these approaches, the concept of information system alignment is traditionally treated through the results obtained after alignment. Thus, according to [25], alignment exists when the information system is consistent with the purposes and activities selected to position the enterprise in its market. [12] defines alignment as the set of links between elements of the business process model and elements of the model of the supporting information system. [26] defines alignment as the degree to which the mission, goals and plans contained in the competitive strategy are shared and supported by the IT strategy. According to the CIGREF 2002 report, the term "alignment" expresses the idea of the consistency of the information system strategy with the business strategy. This alignment requires constant maintenance throughout its life cycle. In other words, the classical vision of alignment involves two main areas between which consistency must be ensured: the business area (competitive strategy and activities of the organization) and the IT area (supporting information system). The issue of business IT alignment must necessarily pass through the alignment cycle: (1) identification of the elements that will contribute to the construction of the alignment, (2) its evaluation, and (3) the actions necessary to correct this alignment (figure 1).
Fig. 1. Cycle of strategic alignment
In this paper we propose a model-driven approach to (1) represent and (2) evaluate business IT alignment. This approach allows the alignment to be constructed from elements belonging to different abstraction levels (strategic and operational). The paper is organized as follows: Section 2.1 presents a brief introduction to our approach. Sections 2 and 3 present our approach at the strategic and functional levels. Finally, Section 4 presents the primary conclusions of the work presented in this paper and gives short-term perspectives for ongoing research work.
2 Modeling of Business IT Alignment

2.1 Related Work

One of the most recurrent problems lately is the lack of strategy in strategic alignment [26]; even when it is taken into account, it remains ambiguous and very difficult
to adapt. Indeed, in industry one can find a set of techniques dedicated to strategy. Each has its own concepts, methods and tools (e.g., the BCG matrix, the MACTOR method, SWOT analysis, the McKinsey 7S framework, internal value chains, etc.). These techniques are often used to plan and coordinate the business decision process with the business strategy. They are often used by business leaders and strategy consulting firms. They are thus based on measurements and performance values, but they are rarely used in a process of alignment with the operational level.
Most research approaches to alignment do not always specify explicitly which elements of the business are involved in strategic alignment. For example, Bleistein et al. [10] use the B-SCP requirements engineering method to link high-level (strategic) requirements with lower-level ones, focusing on the alignment of business strategy and information system components. Yu et al. [20] look at the reasons and contexts (including strategic goals) that lead to system requirements. The e3-value approach is interested in value exchanges between network actors. The e3-alignment approach focuses on the alignment within and between organizations with respect to: (1) business strategy, (2) values, (3) business processes, and (4) the information system. In all these approaches, there are few explicit links with the elements of the enterprise to be aligned (strategic and functional levels). These models use either intermediate models, dependencies between elements, or the decomposition of high-level goals into low-level goals.
The ACEM (Alignment and Evolution Correction Method) [12] and INSTAL (Intentional Strategic Alignment) [14] approaches fit into the type of methods that use an intermediate model to represent alignment. Note, however, that the first (ACEM) addresses the alignment of IT and business processes but does not take the strategy into account. Dependency approaches propose to define dependencies between high-level (strategic) goals and operational goals; approaches based on i* models [10], [20] and the urbanization approach of Longépé [9] fall into this category. Decomposition approaches propose to decompose high-level goals into lower-level (operational) goals; among them, we find KAOS or enterprise architecture approaches (e.g., Zachman).

2.2 Our Approach

The approach we propose for modeling strategic alignment is a model-oriented approach. It ensures that the models of the strategy are linked with the models of the functional level through a study of the alignment between these two levels. Modeling at the two levels is traditionally expressed in different languages and in separate documents. At the strategic level, one may find concepts like goal, task, actor, role and indicator, whereas at the functional level, one may find object, operation, function, application, etc. The concept of alignment that we adopt in our approach is defined as the set of links (impact of an element of one model on an element of another model) between the strategic model and the IS model. Thus, the degree of alignment is measured by comparing: (i) the set of links between elements of the IS model and elements of
the strategic model and (ii) the aggregate maximum possible links between these models (figure 2). For modeling the alignment, our approach allows us to:
- represent elements of the business field (enterprise strategy) and of the IT field (information system) by models;
- measure the degree of alignment by checking similarities between elements of these models.
Fig. 2. Framework of our approach
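The following minimal Python sketch illustrates the link-counting definition above under one possible reading of "maximum possible links" (the Cartesian product of the two element sets); the element names and the link set are hypothetical.

strategic_elements = {"Indicator1", "Indicator2", "Indicator3"}
is_elements = {"BlockA", "BlockB", "BlockC", "BlockD"}

# Links established between elements of the strategic model and of the IS model.
links = {("Indicator1", "BlockA"), ("Indicator2", "BlockC")}

max_links = len(strategic_elements) * len(is_elements)
degree_of_alignment = len(links) / max_links
covered = {s for (s, _) in links}
print("degree of alignment:", round(degree_of_alignment, 2))
print("strategic elements without any link:", strategic_elements - covered)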
2.3 Strategic Study

In this paper we consider the use of a goal model approach that supports the analysis of strategic business goals, such as i* [22] or the Business Motivation Model (BMM) [10]. The i* technique focuses on modeling strategic dependencies among business agents, goals, tasks and resources. The i* model adopts an agent-oriented approach to model information system requirements. The intentional elements are the tasks, the soft goals and the hard goals, and they can be related to one another with "means-ends" relations and task-decomposition relations. Figure 3 illustrates the elements of the adapted i* formalism:
Hard goal: represents an intentional desire of an actor; the specifics of how the goal is to be satisfied are not described by the goal itself. This can be described through task decomposition.
Soft goal: similar to a (hard) goal except that the criteria for the goal's satisfaction are not clear-cut; it is judged to be sufficiently satisfied from the point of view of the actor. The means to satisfy such goals are described via contribution links from other elements.
Task: the actor wants to accomplish some specific task, performed in a particular way. A description of the specifics of the task may be given by decomposing the task into further sub-elements.
Contribute to: a positive contribution strong enough to satisfy a soft goal.
Means-end link: these links indicate a relationship between an end and a means for attaining it. The "means" is expressed in the form of a task, since the notion of task embodies how to do something, while the "end" is expressed as a goal. In the graphical notation, the arrowhead points from the means to the end.
Target: a target, or indicator, is information that helps an actor, individual or collective, to drive action toward achieving a goal or to assess the result. It makes it possible to follow the objectives defined at the strategic level related to a high-level orientation.
Fig. 3. I* legend adapted
Several authors have used the i* formalism for strategic modeling due to its flexibility and the possibility of using it in different contexts. In order to make this formalism better suited to our approach, we added another element, the "target". Indeed, once the objectives are clearly defined, it is necessary to associate indicators (targets) with them for the regular follow-up of the actions implemented at the functional level. In figure 3, elements 1, 2, 3, 4 and 5 are fundamental elements of the i* formalism and element 6 (target) is the added element. This indicator is found at the strategic level (for example in a scorecard) and at the operational level through its execution. For this reason we chose approaches based on i* models [22] and enterprise architecture approaches [10], which belong to the approaches that define dependency links between goals.

2.4 Functional Study

At the functional level we have been inspired by the urbanization approach (enterprise architecture) for several reasons: (i) in the context of urbanization, the functional view is generally deduced from the business view; (ii) this functional view is designed to meet the needs of the strategy; (iii) the link between the two views is realized by evaluating their alignment.
This functional-level architecture uses metaphors to found its concept structures; in particular, the metaphor of the city is used as the basis of the information system [9]. Indeed, any functional architecture comprises several business areas. A business area is broken up into several neighborhoods (districts in the city notation). Each neighborhood is composed of several blocks. A block belongs to one and only one neighborhood. A block should never be duplicated, and two blocks should never exchange directly (figure 4).
Fig. 4. Structure of a business area
The problem thus consists in making the information system as reactive as possible (i.e., able to evolve quickly to answer new requests) while preserving the informational heritage of the enterprise. The urbanization of information systems aims at answering this need.

2.5 Alignment Study

This is the most important step in our approach, in which the strategic indicators are put in correspondence with the blocks of the urbanization plan. To do so, our approach proposes a projection of the strategic-level indicators onto the blocks of the urbanization plan. This confrontation will enable us to align the latter with the objectives of the organization (figure 5).
Fig. 5. Correspondence between the blocks and strategic indicators
At this level, possible alignment dysfunctions will be detected. For example, it may be noted that a given function is covered by several different applications, or that a strategic indicator is supported by no block.
3 Case Study: Project of the Ministry of Higher Education (Morocco)

The project we have chosen is very important for the Moroccan government and is part of a national program to improve the situation of higher education. The study of the alignment of this project will help actors decide whether the information system is aligned with it. The case study is inspired from a real project at Rabat University, Morocco.

3.1 Description of the Case

In the context of the reform of higher education in Morocco, a reorganization of the university cycles based on the LMD system (License - Master - Doctorate) took place. Also, important efforts were made to develop the technical and professional options in each university. The objectives of the studied project are:
• To improve the internal output of higher education and the employability of the graduates who arrive on the job market.
• To offer students good conditions of training and lodging.
Some of the expected results are:
• Creation of almost 124,000 places at the University;
• Multiplication by 2 of the reception capacity of the university;
• Registration of 2/3 of all students of higher education in technical, scientific and professional options;
• Creation of almost 10,000 places in the halls of residence.
3.2 Strategic Study

In our project we have two main strategic goals: (1) "to improve the internal output and the employability of the graduates who arrive on the job market" and (2) "to offer students good conditions of training and lodging". Instead of customers, the university aims to satisfy its users: students, teachers and administrative staff. The internal process axis is organized around four strategic topics:
• To extend the reception capacity;
• To define the university components of tomorrow;
• To accelerate the development of technical and vocational training;
• To set up an orientation system and counselling devices.
In order to apply our approach for strategic alignment to the Mohamed V University, the first step consists in translating all the objectives of the project into the goal model formalism (Figure 6).
Fig. 6. Strategic modeling of the project with the i* formalism
3.3 Functional Study

At this level, all applications and databases of the university are listed. After analyzing the existing system we identified three major areas: an area for the education activities, an area for the management of the library and the archive, and a last one for the management of human resources. For example, in the education area we identified three neighborhoods that correspond to three major information systems: an information system for student registration, another for the management of reviews and deliberations, and a last one for the management of the master cycle. In each area we identified a set of blocks that correspond to a set of applications. For example, in the neighborhood "Reviews & Deliberations" we identified two blocks: one for the management of reviews and the other for deliberations (Figure 7).
A_Student Affairs: N_Registration (B_new students, B_re-registration); N_Review & deliberation (B_review, B_deliberation); N_Documents management (B_license, B_Master)
A_Human Resources: N_Human resource management (B_human resource management); N_Training (B_Training)
A_Library & archives: N_Documentation Management & archiving (B_Library management, B_archives management)
Fig. 7. Cutting areas of the system of information
This step consists in reorganizing the information systems in order to make them modular (via the blocks). A block owns its data and treatments; it is in relation with different blocks. For example, in the student affairs area, the registration neighborhood comprises several blocks; the registration block, which is dedicated to the management of the registration procedure for new students, is in relation with the document management block (license).

3.4 Alignment Study

The aim of this step is to link the strategic objectives as defined in figure 6 to the existing information system as depicted by figure 7, i.e., to establish the relation between the indicators and the neighborhoods and blocks of the functional level. This step makes it possible to verify that the university meets its objectives, and to reorganize its business processes to meet the expected indicators. In the table below we present all the indicators related to our case study and their projections onto the areas and blocks of the functional level. This confrontation will therefore enable us to align the elements of the information system with the strategic objectives. At this level, some alignment failures will be detected: we might find that an indicator is covered by several different blocks, or that a strategic-level indicator is not supported by any block of the information system.
Indicator                                          | Target                                                                          | Area              | Bloc
Increase the number of students at the University | Creation of nearly 124,000 seats in the university                              | Unsupported       | Unsupported
% enrollment in technical and professional option | Registration of 2/3 of the students in technical options                        | A_Student Affairs | B_new students
Capacity in technical options                      | Multiplication by 2 of the capacity of the technical option                    | A_Student Affairs | B_new students
Number of students in technical options            | 10,000 Engineers and 3,300 Doctors per year                                    | A_Student Affairs | B_new students
Number of places in university cities              | Creation of nearly 10,000 places in 10 cities hosting a university             | Unsupported       | Unsupported
Number of training days for university staff       | Approximately 1.5 million days of training per year for the education staff    | A_Human Resources | B_human resource management
Fig. 8. Table of correspondence of the strategic indicators with the blocks
For example, the indicator "Number of places in university cities" is not supported by any block, which shows that there is an alignment problem between the two levels of abstraction, functional and strategic. After this step, corrections may be made to resolve the alignment problem. There are two distinct approaches: (1) adapting the objectives of the strategic level to the functional level ("top down") or (2) adapting elements of the functional level so as to cover the strategic objectives ("bottom up").
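The following Python sketch illustrates how such dysfunctions could be detected mechanically from the correspondence of figure 8; the dictionary encoding of the table is an assumption of the sketch.

correspondence = {
    "Increase the number of students at the University": [],
    "% enrollment in technical and professional option": [("A_Student Affairs", "B_new students")],
    "Capacity in technical options": [("A_Student Affairs", "B_new students")],
    "Number of students in technical options": [("A_Student Affairs", "B_new students")],
    "Number of places in university cities": [],
    "Number of training days for university staff": [("A_Human Resources", "B_human resource management")],
}

unsupported = [i for i, blocks in correspondence.items() if not blocks]
overloaded = [i for i, blocks in correspondence.items() if len(blocks) > 1]
print("indicators not supported by any block:", unsupported)
print("indicators covered by several blocks:", overloaded)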
4 Conclusion and Discussion

In this paper we presented an approach to strategic alignment that is convenient and easy to apply. It is an approach with two modeling levels: (1) a strategic level modeled in the i* formalism and (2) a functional level based on the enterprise architecture approach. The main contribution is to show that a strategic alignment process can be implemented in practice by adapting the models at two levels of abstraction. Our approach allows the alignment to be built from elements belonging to the two abstraction levels (strategic and functional). The correspondence between strategic indicators and the blocks of the information system has allowed us to assess the alignment. Our goal is then to improve the quality of the alignment assessment, to determine the degree of alignment and to locate the level of dysfunction. Another of our research objectives is to develop a procedure to correct the alignment, and, in this way, a procedure that covers the whole set of steps of construction, evaluation and correction of strategic alignment.
References [1] Luftman, J., Maclean, E.R.: Key issues for IT executives. MIS Quarterly Executive 3, 89–104 (2004) [2] Luftman, J.: Assessing business-IT alignment maturity. Communications of the Association for Information Systems 14(4), 1–50 (2000) [3] Baïna, S., Ansias, P., Petit, M., Castiaux, A.: Strategic Business/IT Alignment using Goal Models. In: Proceedings of the Third International Workshop on Business/IT Alignment and Interoperability (BUSITAL 2008) Held in Conjunction with CAISE 2008 Conference Montpellier, France, June 16-17 (2008) [4] Chan, Y., Huff, S., Barclay, D., Copeland, D.: Business Strategic Orientation: Information Systems Strategic Orientation and Strategic Alignment. Information Systems Research 8, 125–150 (1997) [5] Croteau, A.-M., Bergeron, F.: An Information Technology Trilogy: Business Strategy. Technological Deployment and Organizational Performance. Journal of Strategic Information Systems (2001) [6] Tallon, P.P., Kraemer, K.L.: Executives’ Perspectives on IT: Unraveling the Link between Business Strategy, Management Practices and IT Business Value. In: Americas Conference on Information Systems, ACIS 2002, Dallas, TX, USA (2002) [7] Renner, A.R., Latimore, D., Wong, D.: Business and IT operational models in financial services: Beyond strategic alignment. IBM Institute for Business Value study (2003) [8] Zachman, J.A.: A Framework for Information Systems Architecture. IBM Systems Journal 26, 276–292 (1987) [9] Longépé, C.: Le projet d’urbanisation du SI. Collection Informatique et Entreprise, Dunod (2001) [10] Bleistein, S.J.: B-SCP: an integrated approach for validating alignment of organizational IT requirements with competitive business strategy. The university of new south wales, phD thesis, Sydney Australia, January 3 (2006) [11] Wegmann, A., Regev, R., Loison, B.: Business and IT Alignment with SEAM. In: Proceedings of REBNITA Requirements Engineering for Business Need and IT Alignment, Paris (August 2005) [12] Etien, A.: L’ingénierie de l’alignement: Concepts, Modèles et Processus. La méthode ACEM pour la correction et l’évolution d’un système d’information aux processus d’entreprise, thèse de doctorat, Université Paris 1, March 13 (2006) [13] Etien, A., Salinesi, C.: Managing Requirements in a Co-evolution Context. In: Proceedings of the IEEE International Conference on Requirements Engineering, Paris, France (September 2005) [14] Thevenet, L.H., Rolland, C., Salinesi, C.: Alignement de la stratégie et de l’organisation: Présentation de la méthode INSTAL, Ingénierie des Systèmes d’Information (ISI). Revue Ingénierie des Systèmes d’Information Special Issue on IS Evolution, Hermès, 17–37 (June 2009) [15] Brown, T.: The Value of Enterprise Architecture. ZIFA report (2005) [16] Meersman, B.: The Commission Enterprise Architecture cadre. Presentation to European Commission Directorate Genral Informatics (2004) [17] Khory, R., Simoff, S.J.: Enterprise architecture modelling using elastic metaphors. In: Proceedings of the First Asian-Pacific Conference on Conceptual Modelling, vol. 31 (2004) [18] Bonne, J.C., Maddaloni, A.: Convaincre pour urbaniser le SI. Hermes, Lavoisier (2004)
[19] Jackson, M.: Problem Frames: Analyzing and Structuring Software Development Problem. Addison-Wesley Publishing Company, Reading (2001) [20] Yu, E.: Towards Modeling and Reasoning Support for Early-Phase Requirements Engineering. In: Proceedings of the 3rd IEEE International Symposium on Requirements Engineering, p. 226 (1997) [21] Rolland, C.: Capturing System Intentionality with Maps. In: Conceptual Modeling un information Systems Engineering, pp. 141–158. Springer, Heidelberg (2007) [22] Gordijn, J., Akkermans, J.: Value-based requirements engineering: Exploring innovative e-commerce ideas. Requirements Engineering 8(2), 114–134 (2003) [23] Gordijn, J., Petit, M., Wieringa, R.: Understanding business strategies of networked value constellations using goal- and value modeling. In: Glinz, M., Lutz, R. (eds.) Proceedings of the 14th IEEE International Requirements Engineering Conference, pp. 129–138. IEEE CS, Los Alamitos (2006) [24] Pijpers, V., Gordijn, J., Akkermans, H.: Exploring inter-organizational alignment wit e3alignment – An Aviation Case. In: 22nd Bled eConference eEnablement: Facilitating an Open, Effective and Representative eSociety, BLED 2009, Bled, Slovenia, June 14-17 (2009)
MDA Compliant Approach for Data Mart Schemas Generation

Hassene Choura and Jamel Feki

Faculty of Economics & Management, University of Sfax, P.O. Box 1088 - 3018 Sfax, Tunisia, Miracl Laboratory
[email protected], [email protected]
Abstract. Decision Support System (DSS) designers face the complexity of two data models: the OLTP (On-Line Transaction Processing) database model and the multidimensional model. These models rely on different concepts and force the designer to have a double skill set. The objective of this paper is to assist the DSS designer in rapidly producing Data Mart (DM) schemas starting from the Data Warehouse (DW) data model. More precisely, we aim to automate the generation of DM schemas from a relational DW schema according to the MDA (Model Driven Architecture) paradigm. To do so, we propose an approach based on a set of transformation rules that we define in the ATL language and illustrate with an example.

Keywords: ATL, Data mart, Data warehouse, MDA, Multidimensional, Transformation.
1 Introduction

Three main categories of DW design approaches produce DM schemas. Top-down approaches [1][2], which start from the user requirements of the future DSS, have a main shortcoming: they generate schemas that are not completely loadable from the operational information system of the organization. This occurs when users expect to analyze a business process for which not all the required data are available in the data source. The second category of design approaches, called bottom-up, starts by examining the static data model of the operational system. Generally, this model is graphical (e.g., E/R, UML class diagram) [3] [4] [5] [6], so decision makers can participate in the design phase. Bottom-up approaches produce a set of candidate DM schemas. Although candidate schemas are loadable, some of them may be uninteresting. As for the mixed approaches [7] [8] [9] [10] [11] [12], they have a twofold objective: first, they involve decision makers during the DSS design process and, secondly, they consider the data model of the operational system that will feed the DM with data. As a consequence, mixed approaches produce DM schemas that, first, respect decision maker requirements in terms of OLAP analyses and, secondly, consider the data that the operational system of the organization can supply for these analyses. Basically, all these approaches focus on the conceptual level and voluntarily neglect the other implementation levels of the DSS. Moreover, software tools emerging from
dedicated research, or even supplied by software editors, do not tackle simultaneously the three abstraction levels, namely conceptual, logical and physical. Recently, the scientific community has invested much effort in addressing the problem of how to articulate these three levels in a single automated process; this is the major promise of the MDA paradigm of the OMG [13]. MDA motivated DW researchers to study how to automate transformations between these design levels. In this context, we propose an MDA approach for the construction of DM schemas starting from a relational DW. This paper is organized as follows: Section 2 overviews DSS design approaches and MDA fundamentals. Section 3 addresses the construction of DM schemas and defines our transformation rules. Section 4 illustrates our experimentation. Finally, Section 5 summarizes our contribution and concludes the paper.
2 Context

In the following subsections we introduce the two models we use at the source and target levels of the transformations, as well as the MDA paradigm.

2.1 Abstraction Modeling Levels

Modeling a DSS follows three abstraction levels: conceptual, logical and physical. We briefly describe these levels in what follows. The conceptual level elaborates an abstract representation of the real world, independent of any technical aspects. Approaches used at this level differ in concepts and formalisms, but they converge in their goal of producing comprehensive diagrams. The logical level details the description in accordance with the specificities of the DBMS used for implementation. It is directly derived from the conceptual data model by adding technical details. For instance, for a relational DB, relations are refined with supported data types, constraints, etc., and for a multidimensional DB the logical description is refined according to the target technology (e.g., R-OLAP). The physical level is concerned with how data will be physically stored and then accessed. Naturally, it details the logical schema while taking into account the target technological platform. For example, the DW designer should decide which optimization techniques are to be used (e.g., materialized views, binary join indexes). To our knowledge, and so far, these transformation steps between the three design levels are neither considered completely and jointly by the existing design methods nor performed by software tools dedicated to DSS. In practice, these models are manually derived. Consequently, this may lead to an undesirable situation: alarming incoherence in the DSS development cycle. In this work, we suggest automating the derivation of DM schemas from the DW schema. To do so, the MDA paradigm is a promising way since it relies on models and on automatic transformations between models until the generation of the code. MDA enables us to reach many objectives: mainly, it considerably reduces development costs, improves the quality of software and enhances reuse capabilities [13] when moving between platforms.
2.2 Model Driven Architecture (MDA) Paradigm

MDA is a software design approach for the development of software systems. Its principle is based on structuring specifications as models at different levels of abstraction. Its aim is to perform automatic transformations of models until the generation of the code that implements the software. MDA expects to replace the slogan "Write once, Run anywhere" by "Model once, Generate anywhere". It is based on three model levels: the Computation Independent Model (CIM), the Platform Independent Model (PIM) and the Platform Specific Model (PSM). The CIM model describes the business process to be automated, free of technical details. It reflects the services that the application software should supply in accordance with user requirements. The PIM model describes the design solution for a given problem at a conceptual level; it is derived from the CIM model by a series of transformations. It defines the structure of the system and its behavior independently of the technical details of an implementation platform. Finally, the PSM model is an accurate description of the solution; it is derived from the PIM by transformations with respect to the technical details required by the target platform that implements the software. Since MDA uses models at different abstraction levels, it relies on meta-models. Meta-modeling has a twofold advantage: first, it enables models to be correctly defined, handled and used at different levels; secondly, it helps automate transformations. Naturally, model definition requires the use of concepts and relationships, all of which are defined at a higher level of abstraction, i.e., in a meta-model. MDA has three modeling levels: model, meta-model and meta-meta-model. The meta-meta-model is the highest abstraction level; it defines the specification language of the meta-model and its unique entities. The meta-model is the level where all instances of a meta-meta-model are defined; its definition language permits the specification of models. The model level is the lowest one, where instances of meta-models are defined.
3 Proposed Method for DM Schema Construction

Let us recall that our objective consists in deriving, in accordance with MDA, DM models starting from a given DW model. In our proposed method, the DW is a 3NF relational database. To perform this derivation, we transform the relational DW model into the target DM dimensional model. In order to graphically visualize the source and target data models, we use the EMF (Eclipse Modeling Framework) and GMF (Graphical Modeling Framework) editors. Let us underline here that the definition of the necessary transformation rules is problematic because of the absence of one-to-one correspondences between the concepts of the source and target models. To facilitate the control of this task, we have split the problem into two complementary parts: define rules that identify which elements of the source model stand for a dimensional concept (cf. Section 3.3), and then transform each identified element into a dimensional component for the DM schema under construction. To produce DM schemas, we have opted for the ATL language to define the transformation rules. ATL presents many advantages; in particular, ATL is composed of three layers (AMW: Atlas Model Weaving, ATL, and ATL VM: ATL Virtual
Machine). Moreover, ATL offers two kinds of constructs, namely imperative and declarative, for model construction. In order to illustrate the steps of our DM construction method by transformations according to MDA, we give in figure 1 the meta-model of the relational DW. This diagram was produced using EMF.
Fig. 1. Meta model of the relational DW
Fig. 2. Star schema multidimensional concepts
3.1 DW Source Meta Model

Being a relational database, the DW source model is a set of relational tables; each table is composed of a finite set of columns. Each column has a basic data type and may participate in the definition of a primary key. A column may reference another column to establish a navigational relationship between rows belonging to the same table or to different tables (foreign key concept).

3.2 Target Meta Model for DM

A DM schema is composed of facts and dimensions. The fact is the business process of interest for decision making; it is the subject of OLAP analyses [1] [2]. Conceptually speaking, a fact is composed of a finite set of attributes called measures. Measures represent indicators reflecting the business activity to be analyzed. In multidimensional modeling, a fact is an n-ary relationship between dimensions. Dimensions represent the axes according to which the fact measures are recorded and then analyzed. Each dimension is made up of a set of attributes. Some attributes of a dimension are often semantically ordered from the finest to the highest granularity; these ordered attributes are called parameters, and they define the concept of hierarchy in multidimensional modeling. Within hierarchies, parameters represent levels for aggregating measures.
Fig. 3. Multidimensional Meta model for data marts
Figure 2 is a star schema; it illustrates the multidimensional concepts. This schema has one central fact called SALES with three measures (Unit_Price…); this fact is linked to three dimensions (PRODUCT…). Figure 3 is a class diagram describing the multidimensional meta-model of DMs. It is the target meta-model of the transformations.

3.3 Transformations

To perform the transformations we have defined a set of ATL platform-independent rules. They produce DM schemas starting from a relational PIM model following two consecutive tasks: i) identification of multidimensional components from within the relational DW data model, and ii) multidimensional schema construction in accordance with the constraints of the target data model. The application of these rules generates a target multidimensional PIM. Below we give a textual explanation of our identification/transformation rules, which we have coded in the ATL language.
Fact rule. A source table transforms into a fact if it contains at least one non-key numeric attribute. We are not interested in empty facts, as they are rare in practice.
Measure rule. Within each source table identified as a fact, we look for non-key numerical attributes and transform them into measures, if any.
Dimension rule. Within each table T identified as a fact F, we find all its foreign keys and then we transform each table referenced by a foreign key into a dimension for F. In addition, each numeric column belonging to the primary key of table T transforms into a dimension for fact F.
Parameter rule. Each table transformed into a dimension is likely to provide parameters, which are its non-key columns. Since these parameters are extracted from the same table, we consider them at the same hierarchical level. In addition, when a foreign key references a table T, T becomes a source of parameters for the next
level. This step is reiterated until all possible navigation paths have been followed. More precisely, we operate according to the two following steps:
1. Create the first parameter. The finest parameter (i.e., identifier) of dimension d is the primary key of the table on which d is built.
2. Create the next parameters (of rank i>1) for dimension d. We define a recurrent rule that navigates between the tables linked via foreign keys, from which we extract parameters. Iteration i produces the ith parameter.
Weak attributes rule. This rule associates with each parameter issued from a table T all the non-key columns of T.
DATE dimension rule. The dynamics of the organization's activities are traced through its business processes, which generate transactional data over time. Therefore, the DATE dimension should be present in all DMs [1]; it enables the analysis of data evolution over time. It is generally built on a Date attribute via trivial transformations. Thus, we derive standard hierarchies having Day, Month, etc. as parameters. Other domain-specific hierarchies can be manually added for the DATE dimension (e.g., sale period).
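The Parameter and Weak attributes rules can be illustrated by the following Python sketch, which follows foreign keys to collect one parameter level per referenced table, using a fragment of the relational schema given in the next section (figure 4); the dictionary encoding of the schema is an assumption of the sketch.

schema = {
    "STUDENT":    {"pk": ["ID_STD"], "columns": ["Gender", "Birth_Date", "City"],
                   "fks": {"ID_FAC": "FACULTY"}},
    "FACULTY":    {"pk": ["ID_FAC"], "columns": ["Extended_Name", "Short_Name", "City"],
                   "fks": {"ID_Univ": "UNIVERSITY"}},
    "UNIVERSITY": {"pk": ["ID_Univ"], "columns": ["Extended_Name", "Short_Name"],
                   "fks": {}},
}

def hierarchy(table, level=1, visited=None):
    # Returns (level, parameter, weak attributes) triples for a dimension.
    visited = visited if visited is not None else set()
    if table in visited:       # stop if a navigation path loops back
        return []
    visited.add(table)
    t = schema[table]
    levels = [(level, t["pk"][0], t["columns"])]        # step 1 / iteration i
    for fk, target in t["fks"].items():                 # step 2: next levels via foreign keys
        levels += hierarchy(target, level + 1, visited)
    return levels

for lvl, param, weak in hierarchy("STUDENT"):
    print("level", lvl, "- parameter:", param, "- weak attributes:", weak)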
4 Experimentation

In order to validate the proposed approach, we have programmed our transformation rules in ATL (ATLAS Transformation Language) and then tested them on four relational DWs. In this section, we restrict our illustration to the DW schema of figure 4. To do so, we first transformed the DW schema according to the meta-model of figure 1. Secondly, we ran our transformation rules and obtained star schemas conforming to our target multidimensional model.

STUDENT (ID_STD, First_Name, Last_Name, City, Birth_Date, Gender, Housing, Tel, Nationality, ID_FAC#)
FACULTY (ID_FAC, Extended_Name, Short_Name, City, ID_Univ#)
COURSE (ID_CRS, Name, Curr_ID#, Semester_No)
CURRICULUM (ID_Curr, Designation, Study_Years)
COURSE-RESULT (ID_STD#, ID_CRS#, Year, Grade-Oral, Grade-Crs-Sess1, Grade-Crs-Sess2, Final-Grade-Crs)
ANNUAL-RESULT (ID_STD#, Univ_Year, Term, Final-Grade-Sess1, Final-Grade-Sess2, Final-Grade, Result)
UNIVERSITY (ID_Univ, Extended_Name, Short_Name)

Fig. 4. Relational Data Warehouse Model (Source model)
The application of rule 1 on the DW of figure 4 produces facts listed in Table 1 (left column). The second rule gives measures (second column), and the third rule gives dimensions. Hierarchy construction produces parameters ordered from the finest to the highest granularity (last column). Figure 5 depicts the multidimensional star schema built for the fact ANNUAL-RESULT that is linked to three dimensions. It is obtained with the GMF editor.
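A compact Python sketch of how the Fact, Measure and Dimension rules can be applied to part of the figure 4 schema is given below; which attributes are numeric is not stated in figure 4, so the numeric sets (and thus the exact rule outputs) are assumptions of the sketch.

schema = {
    "COURSE":        {"pk": ["ID_CRS"], "fks": {"Curr_ID": "CURRICULUM"},
                      "numeric": ["Semester_No"]},
    "ANNUAL-RESULT": {"pk": ["ID_STD", "Univ_Year"], "fks": {"ID_STD": "STUDENT"},
                      "numeric": ["Univ_Year", "Final-Grade-Sess1",
                                  "Final-Grade-Sess2", "Final-Grade"]},
}

def non_key_numeric(table):
    return [c for c in schema[table]["numeric"] if c not in schema[table]["pk"]]

facts = [t for t in schema if non_key_numeric(t)]            # Fact rule
for f in facts:
    measures = non_key_numeric(f)                            # Measure rule
    dims = set(schema[f]["fks"].values())                    # Dimension rule: referenced tables
    dims |= {c for c in schema[f]["pk"]                      # Dimension rule: numeric key columns
             if c in schema[f]["numeric"]}
    print(f, "- measures:", measures, "- dimensions:", sorted(dims))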
Table 1. Extracted facts, measures and dimensions

Fact name     | Measure name                                              | Dimension name | Parameter name
COURSE        | SemesterNo                                                | CURRICULUM     | Study_Years, …
ANNUAL-RESULT | Final-Grade-Sess1, Final-Grade-Sess2, Final-Grade, Result | Univ_Year      | Univ_Year
              |                                                           | STUDENT        | ID_STD, Gender, Birth_Date, Housing, Nationality, City, Tel, ID_FAC (of level 2), ID_Univ (of level 3), City
              |                                                           | FACULTY        | ID_FAC, ID_Univ, City
              |                                                           | UNIVERSITY     | …
Fig. 5. Data mart schema compliant to the multidimensional Meta model (Target model)
5 Conclusion

This paper presented an MDA-compliant approach for the construction of DM schemas starting from a relational DW. This construction relies on a set of rules expressed in the ATL language. First, we classify the DW source tables into two categories: relational tables that are candidates for the generation of facts, and tables likely to produce dimensions. Secondly, we transform relational concepts into multidimensional ones. Currently, our transformation rules are able to generate star schemas. We have since completed this work with code generation for DM implementation on a specific platform (model-to-text transformations). In addition, hierarchy detection may be improved by addressing the semantics of the table columns in the source model. Semantics helps to decide whether a column should be transformed into a parameter or a weak attribute. To do so, the use of a domain ontology is required.
References 1. Kimball, R.: The Data Warehouse Toolkit. John Wiley and Sons Inc., Chichester (1997) 2. Kimball, R., Revues, L., Ross, M., Thornthwaite, W.: Le data warehouse: Guide de conduite de projet, Eyrolles (2005) 3. Golfarelli, M., Maio, D., Rizzi, S.: Conceptual Design of Data Warehouses from E/R Schemas. In: Conference on System Sciences, Kona-Hawaii, vol. VII (1998) 4. Cabibbo, L., Torlone, R.: A Logical Approach to Multidimensional Databases. In: Conference on Extended Database Technology, Valencia, Spain, pp. 187–197 (1998) 5. Moody, L.D., Kortink, M.A.R.: From Enterprise Models to Dimensional Models: A Methodology for Data Warehouses and Data Mart Design. In: Proc. of the International Workshop on Design and Management of Data Warehouses, Stockholm, Sweden (2000) 6. Hüsemann, B., Lechtenbörger, J., Vossen, G.: Conceptual Data Warehouse Design. In: Proc. of the Int’l Workshop on Design and Management of Data Warehouses, Stockholm, Sweden, pp. 6.1–6.11 (2000) 7. Böhnlein, M., Ulbrich-vom Ende, A.: Deriving Initial Data Warehouse Structures from the Conceptual Data Models of the Underlying Operational Information Systems. In: Proc. Int. Workshop on Data Warehousing and OLAP, Kansas City, MO, USA, pp. 15–21 (1999) 8. Sinz, E.J.: Datenmodellierung im Strukturierten Entity-Relationship-Modell (SERM), Fachliche Analyse von Informations systemen. Addison-Wesley, Bonn (1992) 9. Phipps, C., Davis, K.: Automating data warehouse conceptual schema design and evaluation. In: DMDW 2002, Canada (2002) 10. Bonifati, A., Cattaneo, F., Ceri, S., Fuggetta, A., Paraboschi, S.: Designing Data Marts for Data Warehouse. ACM Transaction on Software Engineering and Methodology 10, 452– 483 (2001) 11. Soussi, A., Feki, J., Gargouri, F.: Approche semi-automatisée de conception de schémas multidimensionnels valides. Revue RNTI B(1), 71–90 (2005) 12. Prat, N., Akoka, J., Comyn-Wattiau, I.: A UML-based data warehouse design method. Decision Support Systems 42(3), 1449–1473 (2006) 13. OMG, Object Management Group, MDA Guide 1.0.1, http://www.omg.org/cgibin/doc?omg/03-06-01
A Methodology for Standards-Driven Metamodel Fusion

András Pataricza, László Gönczy, András Kövi, and Zoltán Szatmári

Budapest University of Technology and Economics, Department of Measurement and Information Systems, H-1117 Budapest, Magyar tudósok körútja 2
{pataric,gonczy,kovi,szatmari}@mit.bme.hu
Abstract. This paper describes a model-driven development methodology that supports the design, implementation and maintenance of complex, evolving systems. The key asset in this methodology is a domain-specific design ontology which incorporates multiple aspects of the system under design. The methodology considers different input data and metamodels during the information fusion and uniformization. The aim of this ontology building is threefold: i) it helps the early phase of system design and modeling, ii) it provides a basis for conformance checking and validation, and iii) it facilitates the generation of skeletons (data model, service interfaces, process structure, business rules and assertions, etc.) which conform to the standards and domain requirements. The methodology is illustrated on an example logistics system of the e-Freight European project.

Keywords: Model Driven Engineering, Role of Ontologies in Modelling Activities, Domain Specific Modeling.
1 Introduction
The paper presents a model-driven development methodology that supports the design, implementation and maintenance of complex systems facing rapid evolution of user and socio-economic requirements. The model-driven architecture-styled (MDA) methodology is built around a design ontology in order to assure a uniform handling of the large variety of target run-time platforms. Fig. 1 depicts the levels of information that are integrated in the design ontology, considering the design principles of MDA and ontology development [5] and the development of knowledge-based systems [7]. Evolving standards in a particular case study are collected, uniformized and merged into this design ontology, thus forming the equivalent of the metamodel of a domain, where all the entities of a given domain are described. Subsequently, the business entities and processes of the target application being developed are inserted as refinements into the design ontology, resulting in a skeleton of a common information model (CIM) equivalent.
This work was partially supported by the e-Freight EU FP7 project (233758).
Fig. 1. Ontology supported MDA for information fusion
Extra-functional requirements (aspects), like maintainability, confidentiality and access control to the objects, have their own ontologies, thus creating a similar but hierarchical marking system, comparable to stereotypes and tagged values in UML-based system design. Domain experts classify the individual components in the CIM according to their extra-functional properties by associating concepts from the aspect ontologies to the concepts in the CIM, thereby performing "aspect weaving" on ontologies. The resulting marked CIM provides the basis for the subsequent generation of process descriptors, business rules, database schemas, etc. Special care is given to the question of design for maintainability in the introduced development methodology, as it best supports the development of long-lived, evolving applications. Due to the nature of model integration, the method does not aim at full automation; rather, it supports the step-by-step review by expert users of a domain (e.g., maritime). The paper first presents the general modeling approach and introduces the case study (Sec. 2), then discusses the steps of the information fusion process (Sec. 3) and the role of non-functional aspects (Sec. 4). Then the modeling framework and the concrete development process in e-Freight are presented (Sec. 5), followed by the description of related work and the conclusion of the paper.
2 Modeling Approach and Application Domain
2.1 Modeling Approach
Industrial and scientific experience and best practice suggest the model-driven architecture (MDA) based approach to cope with all the requirements posed by a complex environment. MDA in the current respect is not confined to its traditional OMG-styled, UML-based implementation; it refers to the application of a combination of model-based design paradigms in general to explore and formalize domain-specific knowledge. In our framework the analysis and synthesis are addressed by applying the paradigm of aspect-oriented modeling. The design ontology (see Fig. 1), which is the core of our development methodology,
1. serves as a central, semi-formal concept and terminology repository;
2. provides a simple means for aspect weaving by creating associations (concept relations) between the concepts of the individual aspect ontologies.
The design ontology serves as the information fusion medium for different general-purpose standards and their specializations or other information sources. Note that the notion of "ontology" is used here in its traditional core sense, as a means to define concepts and their mutual relations. This design ontology is close to a metamodel; however, it contains all the concepts (classes) and attributes individually, in an unfolded form, and is thus less syntax oriented. The proposed design environment will employ many modeling approaches, like Business Entity Modeling [8] for platform-independent data and document modeling, database modeling and (XML-based) document modeling, use cases, business process models and business rules (branching conditions in BPMs). In conclusion, the overall requirements for the design methodology are formulated in three points:
1. It should support information fusion from virtually any source, including legacy documentation and standards as well.
2. The non-functional aspects need to be integrated into the design ontology, and this information is then used to make design decisions and to ease the analysis of change effects.
3. Planning for maintainability is an ultimate objective.
2.2 Case Study: Enhanced e-Logistics in the e-Freight Project
The e-Freight project [1] of the European Union aims at developing solutions that will facilitate the use of different transport modes (road, rail, waterborne) on their own and in combination, to obtain an optimal and sustainable utilization of the European freight transport resources. Services provided by the e-Freight platform need to address the continuous changes in business needs, user demands, regulations, technological advancements and other factors. Due to this evolving nature, only the initial set of basic functionalities can be addressed in the first phase of development; the system is then subject to the typical iterative e-business development process. A service-oriented run-time architecture is proposed in the project that solves the fundamental questions of maintainability, extensibility and interoperability at the level of information technology. Additionally, the platform design methodology has to cope with several challenges related to the same aspects at the business
logic level in order to ensure a proper level of augmentative maintenance and evolution of the entire platform. One of the main goals of the project is to simplify the processes and reduce the paperwork required for transportation. The so-called Single Window is a facility that supports the communication of authorities, regulatory bodies and transporters by providing all management functionality through a single access point. The project aims to provide solutions for the Next Generation Single Window (NGSW) for co-modal transportation that goes beyond current approaches by managing a Single Transport Document (STD) for various transport modes (road, rail, waterborne), and facilitating integration with EU Single Windows and Platforms (e-Customs, River Information Services, etc.) to support co-operation between administrations in security, safety and environmental risk management. In this paper NGSW is used as a case study for the development methodology.
3 Information Fusion
Service interoperability can be ensured only if the appropriate standards describing the business objects, their representation, the constraints on their lifecycle and the domain processes are considered during development. Standardization covers the following main aspects: (i) definition of the underlying notions, (ii) representation and interrelation of business objects, including (iii) their appearance in paper-based and electronic documents, and (iv) IT-related and legal constraints to which they are subject, like security (primarily confidentiality and integrity aspects). Our development methodology tries to consolidate all information that is available in standards and other sources (legacy documentation, source code and comments, etc.) in the design ontology, then enrich it with aspects and let experts review it. The first step of the fusion process is the import of the information into the ontology. The information is available in numerous formats: textual descriptions, XML, UML, with different semantics even in the same modeling notation. The ontology supports the representation of the semantics and the domain model, so the information becomes available in a semantically reasonable way. Transformations for this fusion can be automatic (XSLT, graph transformations), but manual steps and (computer-aided) semi-automatic transformations are also required in the initial phase due to the diversity of the source representations. The design ontology contains the associations between the basic notions from the standards and their typical use cases. This way the interdependence of business entities and of the services managing them on the standards is explicit. Management of the impacts of changes in the standards is therefore simplified, because by starting from the changed standard notion the developer can trace all the dependencies and estimate the change confinement region.
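As a rough illustration of this import step, the following sketch lifts data elements from an XML rendering of a standard into OWL classes of a design ontology. It is only a toy example: the element names, the namespace URI and the use of Python with rdflib are our own assumptions, whereas the actual e-Freight fusion chain relies on XSLT and graph transformations as stated above.

# Minimal sketch: lifting data elements of an XML-encoded standard into OWL
# classes of a design ontology. Element names and the namespace URI are
# hypothetical; the real fusion chain uses XSLT and graph transformations.
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

STANDARD_FRAGMENT = """
<Standard name="FAL">
  <BusinessObject name="CargoDeclaration">
    <Field name="Consignor"/>
    <Field name="PortOfLoading"/>
  </BusinessObject>
</Standard>
"""

ONT = Namespace("http://example.org/design-ontology#")  # hypothetical URI

def lift(xml_text):
    """Create one OWL class per business object and per field; each field
    class is linked to its owning business object via rdfs:seeAlso."""
    g = Graph()
    g.bind("ont", ONT)
    root = ET.fromstring(xml_text)
    for bo in root.iter("BusinessObject"):
        bo_cls = ONT[bo.get("name")]
        g.add((bo_cls, RDF.type, OWL.Class))
        g.add((bo_cls, RDFS.label, Literal(bo.get("name"))))
        for field in bo.iter("Field"):
            f_cls = ONT[field.get("name")]
            g.add((f_cls, RDF.type, OWL.Class))
            g.add((f_cls, RDFS.seeAlso, bo_cls))  # simple provenance-style link
    return g

print(lift(STANDARD_FRAGMENT).serialize(format="turtle"))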
4 Non-functional Aspects
A complex application like the e-Freight platform has several extra-functional requirements, which may be well captured by associating the business entity notions with concepts in the design ontology corresponding to the domain of the particular extra-functional requirement. Such associations provide a proper means to check potential contradictions in the requirements. For example, in the case of a FAL form business object, if the business requirements are extended with confidentiality (e.g., in special cases the consignor must not be revealed to unauthorized users), a conflict of requirements can be detected unless a fine-granular scheme is applied to the different data elements of the invoice. The proposed approach supports the specification of non-functional attributes of the individual business objects and process steps by domain experts in a way resembling aspect-oriented modeling. Thus, for each individual aspect a domain ontology is created, and the experts of the given domain have to associate the individual business elements (and potentially use cases) with the relevant values/concepts from each aspect ontology. Another important issue is the maintainability of the system. Changes are usually either performed in the standards or made in the business logic, due to changes of company policy or business environment. Even though standards are usually developed by keeping backward compatibility as top priority, non-backward-compatible changes are still possible, as in the case of business services. The ontology is also used to trace the effect of a change in the data model or the service interface and to create "wrappers". Also, the principle is followed to make the largest possible set of data available on the service and business rule interfaces (respecting security and business constraints) to minimize the interface-level changes. The main advantage of including such an extension is that all interdependencies are made explicit and the additional design objects can be enriched by further attributes. For instance, a simple role-based access control (RBAC) can be created by associating required access rights to the individual business objects in the general-purpose part of the ontology and classifying the actors and operations in the use cases according to the same schema, describing the services by their respective use cases in a framework-specific sub-ontology. Note that extensions can be physically collected into a separate sub-ontology referring to the main one. This separation keeps the general elements and the actual implementation in different ontologies and thus prevents confusion.
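This kind of marking can be pictured in the same toy rdflib setting as the import sketch in Sect. 3; the aspect ontology, the access levels and the property requiresAccessLevel below are invented for illustration and are not taken from the e-Freight ontologies.

# Sketch of "aspect weaving" on ontologies: concepts of the marked CIM are
# associated with concepts of a separate access-control aspect ontology.
# All URIs, access levels and the property name are hypothetical.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF

CIM = Namespace("http://example.org/cim#")
SEC = Namespace("http://example.org/aspects/security#")

g = Graph()
g.bind("cim", CIM)
g.bind("sec", SEC)

# Aspect ontology: a small taxonomy of access levels.
for level in ("Public", "Restricted", "Confidential"):
    g.add((SEC[level], RDF.type, OWL.Class))

# Business objects of the CIM, marked by the domain expert.
g.add((CIM.CargoDeclaration, RDF.type, OWL.Class))
g.add((CIM.Consignor, RDF.type, OWL.Class))
g.add((CIM.CargoDeclaration, SEC.requiresAccessLevel, SEC.Restricted))
g.add((CIM.Consignor, SEC.requiresAccessLevel, SEC.Confidential))

# A later generation step can read the marks back, e.g. to emit access rules.
for business_object, level in g.subject_objects(SEC.requiresAccessLevel):
    print(business_object.n3(g.namespace_manager), "->",
          level.n3(g.namespace_manager))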
5 The e-Freight Case Study Experience
Section 2.2 described the Next Generation Single Window (NGSW), which is one of the most important development efforts in the e-Freight project. One of the main goals of the work is to develop a new document that simplifies the processes and includes all the information that is currently scattered in many forms. The idea is to create a Single Transport Document (STD) based on existing standards and transportation industry stakeholder requirements.
Fig. 2. The current e-Freight design ontology
The e-Freight design ontology will be a key element for this effort since it allows the integration of various information sources into a common knowledge base and enables continuous development and reasoning over this knowledge. The work on the NGSW application started with the waterborne transport mode, and other modes will be integrated gradually. On the one hand, the preparation for the STD standardization began with creating the Common Regulatory Schema (CRS); on the other hand, a conceptual demonstrator was being developed. Both efforts are based on the FAL forms [6] of the International Maritime Organization, but a different subset of information is integrated in each of them according to its purpose. For example, the CRS contains parts of other standards as well, like the security measures of the Safe Sea Net (SSN). As a result, the data consistency of the two efforts needs to be maintained, which is where our proposed methodology can be applied. Automated transformations import the models from these two information sources into the design ontology. The transformations were designed so that the resulting knowledge base contains both the semantic items and the format information of the input source (XML or Excel). An example of the harmonized data model for the declaration forms is shown in Fig. 3. At the same time, the demonstrator described the FAL forms according to a tool-specific business entity descriptor schema, and also described the processes operating on these entities using a proprietary XML schema for encoding IDEF0 [9] diagrams. After merging the domain models into a common ontology, the following development support is given for the development workflow:
– Consistency check across different models.
– Support for runtime assertion checks implemented by validator rules.
– Generation of deployment artefacts (e.g., interfaces, Java object model).
– Generation of extracts of the business logic that correspond to business process patterns.
Prototype implementations for these facilities have been created and are under development at the moment.
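As a purely illustrative example of the first item in the list above (consistency checking), the following plain-Python fragment compares the field sets obtained for the same business object from two imported sources; the field names are invented, and the real checks operate on the ontology rather than on dictionaries.

# Toy consistency check between two imported descriptions of the same
# business object (e.g. a FAL form in the CRS and in the demonstrator).
# Field names are invented for illustration.
from typing import Dict, Set

def check_consistency(crs: Dict[str, Set[str]],
                      demo: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per business object, the fields present in one imported
    source but missing from the other (symmetric difference)."""
    report: Dict[str, Set[str]] = {}
    for name in crs.keys() | demo.keys():
        diff = crs.get(name, set()) ^ demo.get(name, set())
        if diff:
            report[name] = diff
    return report

crs_fields = {"CargoDeclaration": {"Consignor", "PortOfLoading", "IMONumber"}}
demo_fields = {"CargoDeclaration": {"Consignor", "PortOfLoading"}}
print(check_consistency(crs_fields, demo_fields))
# {'CargoDeclaration': {'IMONumber'}}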
Fig. 3. Extract of the ontology: declaration of a ship’s cargo
6 Related Work
The idea of capturing design information in ontologies is not a new one. Several ontologies have been created that describe specific aspects of a domain. For example, the transportation ontology from DARPA [11] is quite concise for describing transportation entities and processes. However, it does not provide enough notions for describing freight transportation, and it lacks any public documentation. Other works, like [2], focus on providing ways to describe transportation flows. The recent Transitioning Applications to Ontologies (TAO) and Ontorule projects place the ontology in the middle of the development process and use the information in it to partly automate application design. The TAO project provides a very concise description of the tools and methods that have been developed in the project. Several approaches to the information fusion phase have been developed, from database mining techniques [4] to text mining and other solutions. Best practices for building up the ontology have also been identified, together with a formal semantics for the OWL-S [13,12] ontology. These results will be very useful in later phases of the e-Freight project. Although many of the information fusion and ontology-based development techniques have been identified in the literature, maintenance and long-term evolution are less elaborated. Our approach to change effect tracking is very similar to those used in requirement tracking [10], but while previous solutions focus on defining a precise formalism, our methodology tries to follow a more natural way by allowing the users to define aspects and using the features of ontologies to handle them. The relation of business rules and ontologies has been thoroughly examined in the Ontorule project [3]. We mainly use the design ontology for generating the vocabulary of business rules.
7 Conclusions
In this paper we presented a model-driven development methodology that supports the design, implementation and maintenance of complex, evolving systems. The methodology is based on a design ontology that incorporates the domain knowledge connected to the application and contains additional aspects that describe specific features of the concepts stored in the design ontology. In our future work, we will further investigate means for semi-automated information fusion and marking to support a semantically enabled MDA for service integration.
References
1. e-Freight: European e-Freight capabilities for Co-modal transport, FP7 SST-2008TREN-1, Grant no.: 233758 (2011), http://www.efreightproject.eu/
2. Bendriss, S., Benabdelhafid, A., Boukachour, J.: Information system for freight traceability management in a multimodal transportation context. In: Int. Conf. Computer Systems and Applications, AICCSA 2009, pp. 869–873 (2009)
3. Bonnard, P., Citeau, H., Dehors, S., Heymans, S., Korf, R., Pührer, J., Eiter, T.: D3.1 state-of-the-art survey of issues. Tech. rep., ONTORULE project (2009)
4. Cerbah, F.: Mining the content of relational databases to learn ontologies with deeper taxonomies. In: Int. Conf. on Web Intelligence and Intelligent Agent Technology, WI-IAT 2008, vol. 1, pp. 553–557 (2008)
5. Gašević, D., Djurić, D., Devedžić, V., Selic, B.: Model Driven Architecture and Ontology Development. Springer-Verlag New York, Inc., Secaucus (2006)
6. International Maritime Organization: Convention on Facilitation of International Maritime Traffic, FAL (2011), http://goo.gl/oDGdW
7. Knublauch, H., Rose, T.: Round-trip engineering of ontologies for knowledge-based systems. In: Ruhe, G., Bomarius, F. (eds.) SEKE 1999. LNCS, vol. 1756. Springer, Heidelberg (2000)
8. Nandi, P., Koenig, D., Moser, S., Hull, R., Klicnik, V., Claussen, S., Kloppmann, M., Vergo, J.: Data4BPM, Part 1: Introducing Business Entities and the Business Entity Definition Language (BEDL). Tech. rep., IBM Developerworks (2010)
9. National Institute of Standards and Technology: Integrated DEFinition Methods IDEF-0 (2011), http://www.idef.com/IDEF0.htm
10. Soares, M., Vrancken, J.: Model-Driven User Requirements Specification using SysML. Journal of Software 3(6) (2008)
11. The DARPA Agent Markup Language Homepage (DAML): Transportation Ontology (2011), http://www.daml.org/ontologies/409
12. Wang, H., Payne, T., Gibbins, N., Saleh, A.: Formal Specification of OWL-S with Object-Z: The Dynamic Aspect. In: Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini, C., Sadiq, W., Godart, C. (eds.) WISE 2007. LNCS, vol. 4831, pp. 237–248. Springer, Heidelberg (2007)
13. Wang, H.H., Saleh, A., Payne, T., Gibbins, N.: Formal specification of OWL-S with Object-Z: the static aspect. In: Int. Conf. on Web Intelligence, pp. 431–434 (2007)
Metamodel Matching Techniques in MDA: Challenge, Issues and Comparison
Lamine Lafi1, Slimane Hammoudi2, and Jamel Feki3
1 ISSAT, Institut Sup. des Sciences Appliquées et de Technologie, University of Sousse, Tunisia
2 ESEO, Ecole Supérieure de l'Ouest, Angers, France
3 FSEG, Faculté des Sciences Economiques et de Gestion, University of Sfax, Tunisia
Abstract. Nowadays, it is well recognized that model transformation is at the heart of model-driven engineering (MDE) approaches and consequently represents one of the most important operations in MDE. However, despite the multitude of model transformation language proposals emerging from university and industry, these transformations are often created manually, which is a tedious and error-prone task, and therefore an expensive process. In this context, we argue that the semi-automatic generation of transformation rules is an important challenge for future MDE development, making the transformation process easier, faster and less costly. In this paper we discuss metamodel matching as a key technique for a semi-automatic transformation process. First, we review and discuss the main approaches that have been proposed in the state of the art for metamodel matching. Second, we compare three metamodel matching algorithms, namely "Similarity Flooding", SAMT4MDE+ and ModelCVS, using match quality measures proposed for schema matching in databases. A plug-in under the Eclipse framework has been developed to support our comparison using three couples of metamodels. Keywords: Metamodel matching, Model transformation, Semi-automatic transformation process, Comparison and Evaluation.
1 Introduction
Research and practice in Model Driven Engineering (MDE) have progressed significantly over the last decade, dealing with the increasing complexity of systems during their development and maintenance processes by raising the level of abstraction and using models as a core development artifact. Significant new approaches, mainly Model Driven Architecture (MDA) [1] defined at the OMG (Object Management Group), "Software Factories" proposed by Microsoft [2] and the Eclipse Modeling Framework (EMF) [3] from IBM, have emerged and been experimented with. In the literature, several issues around MDE have been studied and subjected to intensive research, e.g., modeling languages [4], [5], model transformation [6], [7], mapping between metamodels [8], [9] and design methodologies [10]. Among these issues, model transformation languages occupy a central place and allow the definition of how a set of elements from a source model is analyzed and transformed into a set of elements in a target model.
A formal semi-automation of the transformation process is a real challenge offering many advantages: mainly, it significantly reduces the development time of transformations and decreases the errors that may occur in a manual definition of transformations. In [9] and [11], the authors have initiated a first attempt towards this semi-automation. They introduced an approach separating mapping specifications from transformation definitions, and implemented this approach in a software tool called Mapping Modeling Tool (MMT). In [11] the authors proposed to push the semi-automation process one step further by using the matching techniques discussed in [12] to semi-automatically generate mappings between two metamodels. The produced mappings could then be adapted and validated by an expert for the automatic derivation of a transformation model, as a set of transformation rules. Thus, matching techniques between metamodels are the centerpiece of a semi-automatic transformation process in MDE and particularly in MDA. In fact, metamodel matching allows discovering mappings between two metamodels, and these mappings in turn allow generating transformation rules between them. However, there has been little research on metamodel matching, in contrast to the ontology [13] and database [14], [15] domains, where intensive research has been conducted. In this paper, we first review the main approaches that have been proposed for metamodel matching in the context of MDE/MDA. We then compare three recent metamodel matching algorithms, namely ModelCVS [16], Similarity Flooding [17] and the Extended Semi-automatic Matching Tool for Model Driven Engineering [18] (noted SAMT4MDE+), using match quality measures proposed for schema matching in databases. This paper is organized as follows: Section 2 reviews and presents the three algorithms for metamodel matching, and Section 3 discusses an experimental comparison between these three algorithms using eight pairs of metamodels. Section 4 highlights the lessons learned. Finally, Section 5 concludes our work and enumerates some final perspectives.
2 ModelCVS Versus Similarity Flooding Versus SAMT4MDE+
Our previous work [12] gave a summary of a theoretical comparison of five approaches to metamodel matching, using features inspired mainly from schema matching in databases. These approaches are: Similarity Flooding (SF), the ModelCVS project, the Semi-Automatic Matching Tool for MDE (SAMT4MDE), the Generic Model Weaver (AMW) and an extended SAMT4MDE (SAMT4MDE+). In this work we propose to compare, from an experimental point of view, the three more recent approaches ModelCVS, SF and SAMT4MDE+. Moreover, these three approaches have been chosen because they are sufficiently well defined for a comparison process.
2.1 ModelCVS
In [16], the authors propose an approach called "lifting", which transforms the source and target metamodels into equivalent ontologies. This approach proposes a framework for matching the metamodels thanks to a transition from ModelWare to OntoWare, using transformations of the Ecore-based metamodels into OWL-based ontologies. After this transition, one can reuse ontology matching tools that operate on the ontologies representing the metamodels. Once the matching task is over, the ontology mapping is turned into a weaving model; from this model, the transformation rules necessary to transform a model conforming to a metamodel A into a model conforming to a metamodel B can be deduced. In this work, the authors concentrate on evaluating schema-based matching tools, i.e., they do not consider instance-based matching techniques. This is due to the fact that they use the data provided by metamodels (schema level) and not data from models (instance level) to find equivalences between metamodel elements.
2.2 Similarity Flooding
Similarity Flooding (SF) [17] is a generic alignment algorithm that calculates the correspondences between the nodes of two labeled graphs. This algorithm is based on the following intuition: if two nodes stemming from two graphs have been determined to be similar, there is a strong probability that their neighboring nodes are similar too. More precisely, SF applies five successive phases on the labeled graphs provided as input. The algorithm is applied after a transformation phase that consists in transforming the MMsource and the MMtarget into the directed labeled graphs Gsource and Gtarget. For this phase, a set of six strategies to encode a metamodel into such a graph has been used. Each of these strategies has its own technique to transform the two metamodels into a graph; they are explained in [17]. In this paper we consider only three of these encoding strategies, namely Standard, Saturated and Flattened, since they gave the best quality measures in [17].
2.3 SAMT4MDE+
In [18], a new metamodel matching algorithm uses structural comparison between a class and its neighboring classes in order to select the equal or similar classes from the source and target metamodels. The proposed algorithm, called SAMT4MDE+, is an extension and enhancement of the algorithm presented in [19]; it is implemented in the Semi-Automatic Matching Tool for MDE (SAMT4MDE), which is capable of semi-automatically creating mapping specifications and making matching suggestions that can be evaluated by users. This provides more reliability to the system because mapping becomes less error-prone. The proposed algorithm can identify structural similarities between metamodel elements. However, sometimes elements are matched by their structures but do not share the same semantics. The lack of semantic analysis leads the tool to find false positive matches, i.e., derived correspondences that are not true. The function similarity(c1, c2) is a weighted mean whose parameters are basicSim(c1, c2) and structSim(c1, c2), with the weights coefBase and coefStruct, respectively. It returns continuous values representing the similarity level between c1 and c2. The weights coefBase and coefStruct sum to 100%, i.e., to 1.
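The weighted combination just described can be written down directly. The sketch below is only illustrative: the weights (0.5/0.5), the candidate pairs and their basicSim/structSim values are invented, and only the 0.6 selection threshold is taken from the experiments reported in Sect. 3.

# Sketch of the SAMT4MDE+ combination step: a weighted mean of a basic
# similarity and a structural similarity, followed by threshold selection.
# The weights and the basicSim/structSim values below are invented.
def similarity(basic_sim, struct_sim, coef_base=0.5, coef_struct=0.5):
    assert abs(coef_base + coef_struct - 1.0) < 1e-9  # weights sum to 1
    return coef_base * basic_sim + coef_struct * struct_sim

candidate_pairs = {  # (source class, target class): (basicSim, structSim)
    ("Class", "EClass"): (0.80, 0.70),
    ("Package", "EPackage"): (0.75, 0.40),
    ("Operation", "EAttribute"): (0.20, 0.30),
}

THRESHOLD = 0.6  # selection threshold used for SAMT4MDE+ in the experiments
matches = {pair: similarity(b, s)
           for pair, (b, s) in candidate_pairs.items()
           if similarity(b, s) >= THRESHOLD}
print(matches)  # only ('Class', 'EClass') is kept, with score 0.75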
3 Comparative Study and Experimental Results
To evaluate the three metamodel matching techniques ModelCVS, Similarity Flooding and SAMT4MDE+, we have developed a plug-in under the Eclipse environment. In order to carry out a comparative study between the above three approaches, we used quality metrics defined for database schema matching (precision, recall and F-measure; recalled in the sketch after the list of alignments below) and metamodels taken from the following list: Ecore, Minjava, Minjava_V2, UML, UML_V2, Webml, ODM, traceabilityToolMM, traceRepository, etrace, BibTeXA, and BibTeXB. These metamodels are detailed in [20]. Eight alignments have been considered for our comparison:
• Ecore_Minjava_V2
• Ecore_UML
• Webml_ODM
• traceabilityToolMM_traceRepository
• etrace_traceabilityToolMM
• Ecore_UML_V2
• BibTeXA_BibTeXB
• Ecore_Minjava
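For reference, precision is the fraction of returned correspondences that are correct, recall is the fraction of expected correspondences that are returned, and the F-measure is their harmonic mean; these are the usual schema-matching quality measures. A minimal sketch with invented correspondence pairs:

# Standard match quality measures computed from a returned set of
# correspondences and a reference (expected) set; the pairs are invented.
def quality(found, expected):
    correct = found & expected
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

expected = {("EClass", "Class"), ("EAttribute", "Attribute"),
            ("EReference", "Association"), ("EPackage", "Package")}
found = {("EClass", "Class"), ("EAttribute", "Attribute"),
         ("EOperation", "Attribute")}

print(quality(found, expected))  # precision 2/3, recall 1/2, F-measure ~0.57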
The SF algorithm is evaluated on the basis of the three configurations Standard, Flattened and Saturated, since they gave good quality measures in [17]. The results of the experimentation of the three algorithms ModelCVS, Similarity Flooding and SAMT4MDE+ are given in Fig. 3 to Fig. 7. These figures give a detailed overview of the results obtained with the three tools presented in Section 2, each illustrated as a star glyph with the default settings of the tools. Each axis of a glyph represents a mapping of two metamodels, and the three curves represent precision, recall and F-measure (in blue, red and green, respectively). The figures allow for an easy comparison of the matching quality of the three approaches and furthermore identify which algorithms deliver good or bad results. Some tools did not find any result for some matching tasks, therefore some axes of the star glyphs are empty. The inner gray ring in Fig. 3 to Fig. 7 marks the value 0.5, which is the most interesting value for the F-measure in particular. For ModelCVS, an F-measure higher than 0.5 indicates a positive benefit, a value lower than 0.5 a negative one. SAMT4MDE+ uses a threshold value equal to 0.6. The method we use to filter the produced multimapping for SF is called Select Threshold in [17].
Now, we discuss the best and worst cases for our three measures when using the three algorithms. The highest precision was achieved by the SAMT4MDE+ algorithm for the pair of metamodels Ecore_Minjava_V2 (precision = 0.75, recall = 0.85, F-measure = 0.79); the best F-measure value is also obtained by the same algorithm on the same pair of metamodels. The highest recall value (recall = 0.933) is returned by the two algorithms Similarity Flooding and ModelCVS for the pair BibTeXA_BibTeXB. According to the three figures Fig. 3 to Fig. 5, we notice that the size of the metamodels and the encoding principle of each configuration (Standard, Flattened, Saturated) have an influence on the quality measure values. Indeed, these measures vary from one configuration to another. From the experimental study we observed that the SF algorithm performs best on metamodels of small size. Examples are the pairs Webml_ODM (Precision = 0.62, Recall = 0.66, F-Measure = 0.64) and BibTeXA_BibTeXB (Precision = 0.5, Recall = 0.93, F-Measure = 0.65) with the Standard configuration, represented in Fig. 3.
Fig. 3. Similarity Flooding (Standard)
Fig. 4. Similarity Flooding (Flattened)
Fig. 5. Similarity Flooding (Saturated)
It is also important to note that for SF the quality measure values become very weak when matching two metamodels whose sizes are not similar. For instance, for the pair of metamodels Ecore_UML_V2 with the Flattened configuration (judged the best in our experimental study), illustrated in Fig. 4, we obtained the following values: Precision = 0.2, Recall = 0.52, F-Measure = 0.29. This characteristic (dissimilarity of size) also affects the ModelCVS approach (Fig. 6) for the same pair of metamodels Ecore_UML_V2, where the obtained results are very weak (Precision = 0.19, Recall = 0.39, F-Measure = 0.26). As for the Saturated configuration, it is also characterized by instability of the precision, recall and F-measure values: the bigger the metamodels become, the more the values diminish. Fig. 5 illustrates this; for example, for the pair Ecore_UML the measures become very weak with this configuration (Precision = 0.14, Recall = 0.32, F-Measure = 0.2). We have also noted that the Flattened configuration brings back the best measures compared to the other two configurations (Standard and Saturated). We give
as an example the values (Precision = 0.73, Recall = 0.62, F-Measure = 0.67) for the pair of metamodels Ecore_Minjava_V2, whereas the Standard configuration returned (Precision = 0.51, Recall = 0.69, F-Measure = 0.60) and the Saturated configuration gave (Precision = 0.52, Recall = 0.58, F-Measure = 0.55). ModelCVS, illustrated in Fig. 6, is not far from being effective, since its best quality measure values (Precision, Recall, F-Measure) reach but do not exceed 0.6 (staying in the neighborhood of 0.5); however, it cannot be more effective than Similarity Flooding or SAMT4MDE+. Let us recall here that for ModelCVS the F-Measure value is used to assess the performance of the algorithm: an F-Measure value above 0.5 means a positive effect and a value below 0.5 means the opposite. Compared to the other tools, the ModelCVS approach yields less weak quality measure values than the Similarity Flooding and SAMT4MDE+ algorithms on the same pairs of metamodels used in this experimental study, although all its quality measure values remain in the neighborhood of 0.5.
Fig. 6. ModelCVS
Fig. 7. SAMT4MDE+
Based on the different modifications and combinations of this matching technique, which produce higher precision values but at the same time lower recall values, the metamodels must fulfill some requirements in order for matching tools to prove worth using. The experimental results of the ModelCVS approach in our study showed that the metamodels must have a common terminology and taxonomy, which is the case when matching UML, UML_V2 and Ecore. These combinations lead to the best results despite their size, which obviously leads to a higher number of elements that have to be matched. Furthermore, good results are achieved when matching BibTeXA with BibTeXB or Webml with ODM. These two metamodels also have a common terminology, and neither heavily uses inheritance relationships. In contrast, matching Ecore with UML or UML_V2 results in very low precision and very poor recall, mostly below 0.38. These results lead to the conclusion that ontology matching tools are not always appropriate for matching metamodels. Instead, the metamodels must fulfill some common properties, which of course is not always the case when matching real-world metamodels. Contrary to the two algorithms Similarity Flooding and ModelCVS, the SAMT4MDE+ algorithm gives very good matching quality measures for any
size of the metamodels to be matched, whether the two metamodels are both large, or one is large and the other of average or even small size. We note that this approach (SAMT4MDE+) is the only one among the three subject to our experiments that returned good quality measures (Precision = 0.66, Recall = 0.79, F-Measure = 0.72) for the pair Ecore_UML, assessed to be the largest, while with ModelCVS these measures are (Precision = 0.145, Recall = 0.32, F-Measure = 0.2). The pair of metamodels BibTeXA_BibTeXB gave good quality measures with the three approaches, notably with SAMT4MDE+, where the three values of precision, recall and F-Measure are equal to 1 (meaning that all and only the expected mappings were found). This approach is therefore perfect when the correspondence is to be established between two small metamodels. If we now consider the extreme case where the metamodels are of very large size, we notice that this approach remains effective: for the pair Ecore_Minjava the measures are also high (Precision = 0.750, Recall = 0.85, F-Measure = 0.79). The same is true for the pair Ecore_UML (Precision = 0.66, Recall = 0.8, F-Measure = 0.72). Fig. 7 supports these conclusions.
4 Lessons Learned
In this section we present advice based on the experimental metamodel matching presented in the previous section.
• SAMT4MDE+ is more suitable for metamodels of large size, whereas SF is suitable only for metamodels of small size with a restricted number of elements.
• The user's intervention to validate the mapping suggestions determined by the algorithm has very positive consequences on the quality measures. This is due to the fact that the expert user tries to choose all the mappings that appear correct during the validation phase.
• In SAMT4MDE+ the use of enumerations is very beneficial: enumerations are used in metamodels to represent a set of constants (expressed as literals) that are used for attribute types. The matching scenarios showed that mappings between enumerations are very helpful for deriving model transformation rules, because it must be specified exactly how the data are processed.
• The use of the best versions or variants of the metamodels provides a high matching quality in terms of the measures used to evaluate the degree of similarity between the metamodels. In particular, most transformation rules can be automatically derived from ontology models of mappings or metamodels.
• There is no common classification for the design of models: the construction of manual mappings showed that there was no significant heterogeneity between the concrete classes of the metamodels, but rather between the abstract classes. In addition, the design of the hierarchy in the metamodels is not based on a common basis.
• For model transformations, the structural properties are more important, because they must comply with the transformation rules.
• Name equivalence does not necessarily imply conceptual equivalence and vice versa: when building our manual mappings, we found that in some cases metamodel elements have the same name, but their semantics are quite different and the elements should not be mapped by equivalence links. One prominent example is the class EnumLiteral of the Ecore metamodel. EnumLiteral has an attribute value and also an attribute name. The WebML metamodel contains a class DomainValue which is semantically equivalent to EnumLiteral and has an attribute value. However, the mapping between EnumLiteral.value and DomainValue.value is not correct; instead, EnumLiteral.name should be mapped to DomainValue.value. This is due to the fact that EnumLiteral.value is only a running counter for the literals, whereas EnumLiteral.name and DomainValue.value both represent constant values of an Enumeration. This case is not solvable without additional knowledge or without exploring instances of the metamodels, i.e., the models.
5 Conclusion and Future Work
The semi-automatic generation of transformation rules is an important challenge for future MDE development, making the transformation process easier, faster and less costly. The contribution of this work is twofold. First, we presented the main techniques and artifacts involved in the semi-automatic transformation process. Second, we reviewed the main approaches that have been proposed in the literature for metamodel matching, and we then studied, from an experimental point of view, the three most recent metamodel matching techniques ModelCVS, Similarity Flooding and SAMT4MDE+. This experimental comparison allowed us to obtain different values of the matching quality measures using different pairs of metamodels. We noticed that the SAMT4MDE+ algorithm gave more effective results than those given by the Similarity Flooding and ModelCVS algorithms. In our future work, we will concentrate on identifying further comparison criteria and on situating these three approaches with respect to other approaches in order to enhance the matching process. In addition, we will consider studying the optimization of mapping models, which seems to be another important issue in MDE.
References
1. OMG: Model Driven Architecture (MDA), document number ormsc/2001-07-01 (2001)
2. Dominguez, K., Pérez, P., Mendoza, L., Grimán, A.: Quality in Development Process for Software Factories According to ISO 15504. CLEI Electronic Journal 9(1), Pap. 3 (June 2006), http://www.clei.cl
3. Budinsky, F., Steinberg, D., Merks, E., Ellersick, R., Grose, T.J.: Eclipse Modeling Framework: A Developer's Guide, 1st edn. Addison-Wesley Pub. Co, Reading (2003)
4. Bézivin, J., Hammoudi, S., Lopes, D., Jouault, F.: Applying MDA Approach for Web Service Platform. In: 8th IEEE International Conference on EDOC, pp. 58–70 (2004)
5. Booch, G., Brown, A., Iyengar, S., Rumbaugh, J., Selic, B.: An MDA Manifesto. MDA Journal (May 2004)
6. Jouault, F.: Contribution à l'étude des langages de transformation de modèles. Ph.D. thesis (written in French), University of Nantes (2006)
7. OMG: MOF QVT Final Adopted Specification, OMG/2005-11-01 (2005)
8. Lopes, D.: Study and Applications of the MDA Approach in Web Service Platforms. Ph.D. thesis (written in French), University of Nantes, France (2005a)
9. Almeida, A.J.P.: Model-driven design of distributed applications. PhD thesis, University of Twente (2006) ISBN 90-75176-422
10. Hammoudi, S., Lopes, D.: From Mapping Specification to Model Transformation in MDA: Conceptualization and Prototyping. In: MDEIS, First International Workshop on Model Driven Development, Miami, USA, pp. 3–15 (2005)
11. Hammoudi, S., Alouini, W., Lopes, D., Huchard, M.: Towards a Semi-Automatic Transformation Process in MDA: Architecture, Methodology and First Experiments. International Journal IJISMD (2010)
12. Lafi, L., Alouini, W., Hammoudi, S., Gammoudi, M.: Metamodels Matching: Issue, techniques and comparison. In: 2nd International Workshop FTMMD, joint to International Conference ICEIS, Portugal (2010)
13. Feiyu, L.: State of the Art: Automatic Ontology Matching. Research Report, School of Engineering, Jönköping, Sweden (2007)
14. Do, H.H., Melnik, S., Rahm, E.: Comparison of schema matching evaluations. In: Chaudhri, A.B., Jeckle, M., Rahm, E., Unland, R. (eds.) NODe-WS 2002. LNCS, vol. 2593, pp. 221–237. Springer, Heidelberg (2003)
15. Rahm, E., Bernstein, P.: A survey of approaches to automatic schema matching. VLDB Journal 10(4), 334–350 (2001)
16. Kappel, G., Kargl, H., Kramler, G., Schauerhuber, A., Seidel, M., Strommer, M., Wimmer, M.: Matching Metamodels with Semantic Systems – An Experience Report. In: BTW, Datenbanksysteme in Business, Technologie und Web (2007)
17. Falleri, J.R., Huchard, M., Lafourcade, M., Nebut, C.: Metamodel matching for automatic model transformation generation. In: Busch, C., Ober, I., Bruel, J.-M., Uhl, A., Völter, M. (eds.) MODELS 2008. LNCS, vol. 5301, pp. 326–340. Springer, Heidelberg (2008)
18. de Sousa Jr., J., Lopes, D., Claro, D.B., Abdelouahab, Z.: A Step Forward in Semi-automatic Metamodel Matching: Algorithms and Tool. In: Filipe, J., Cordeiro, J. (eds.) Enterprise Information Systems. LNBIP, vol. 24, pp. 137–148. Springer, Heidelberg (2009)
19. Chukmol, U., Rifaiem, R., Benharkat, N.: EXSMAL: EDI/XML Semi-Automatic Schema Matching ALgorithm. In: Proceedings of the Seventh IEEE International Conference on E-Commerce Technology, pp. 422–425. IEEE Computer Society, Los Alamitos (2005)
20. Falleri, J.R.: Contributions à l'IDM: reconstruction et alignement de modèles de classes. Ph.D. thesis (written in French), University of Montpellier 2 (2009)
Author Index
Abdouli, Majed 85
Aboulsamh, Mohammed A. 214
Aït-Ameur, Yamine 200
Aït-Sadoune, Idir 200
Aleb, Nassima 186
Ali, Mouez 85
Amanton, Laurent 85
Amroune, Mohamed 122
Antonio do Prado, Hercules 143
Antunes, Mário 178
Baba, Takahiro 152
Baïna, Karim 249
Baïna, Salah 249
Balaniuk, Remis 143
Basiri, Javad 133
Ben-Abdallah, Hanene 71
Benslimane, Sidi Mohammed 50
Bogdan, Stepan 170
Bouaziz, Rafik 85
Boukhobza, Jalil 97
Bounour, Nora 110
Boussaid, Omar 71
Charrel, Pierre-Jean 122
Cherait, Hanene 110
Choura, Hassene 262
Cobbe, Paulo Roberto 143
Costa, Joana 178
Davies, Jim 214
Delmas, Rémi 237
Doose, David 237
Doumi, Karim 249
Ellouze, Nebrasse 42
Fathian, Mohammad 133
Feki, Jamel 262, 278
Ferneda, Edilson 143
Gago, Pedro 31
Gholamian, Mohammad Reza 133
Gönczy, László 270
Grolleau, Emmanuel 226
Guadagnin, Renato da Veiga 143
Guarda, Teresa 31
Hamani, Mohamed Said 162
Hammoudi, Slimane 278
Harbi, Nouria 71
Hirokawa, Sachio 152
Inglebert, Jean-Michel 122
Ivanov, Petko 18
Kaddes, Mourad 85
Kamel, Nadjet 186
Khetib, Ilyes 97
Kövi, András 270
Kudinov, Anton 170
Lafi, Lamine 278
Lammari, Nadira 42
Lopes, Antonia 3
Maamri, Ramdane 162
Markov, Nikolay 170
Mathew, Wesley 62
Mekour, Mansour 50
Métais, Elisabeth 42
Mohand-Oussaïd, Linda 200
Nakatoh, Tetsuya 152
Olivier, Pierre 97
Ouhammou, Yassine 226
Pataricza, András 270
Pinto, Filipe Mota 31, 62
Pires, Anthony Fernandes 237
Polacsek, Thomas 237
Ribeiro, Bernardete 178
Richard, Michael 226
Richard, Pascal 226
Sadeg, Bruno 85
Santos, Manuel 62
Sellis, Timos 1
Siami, Mohammad 133
Silva, Catarina 178
Simonet, Ana 4
Szatmári, Zoltán 270
Tamen, Zahia 186
Triki, Salah 71
Venkatachaliah, Girish 2
Voigt, Konrad 18
Zarour, Nacereddine 122