Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2814
Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo
Manfred A. Jeusfeld Óscar Pastor (Eds.)
Conceptual Modeling for Novel Application Domains
ER 2003 Workshops ECOMO, IWCMQ, AOIS, and XSDM
Chicago, IL, USA, October 13, 2003
Proceedings
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors
Manfred A. Jeusfeld
Tilburg University, Department of Information Systems and Management
P.O. Box 90153, Tilburg, 5000 LE, The Netherlands
E-mail: [email protected]
Óscar Pastor
Polytechnical University of Valencia
Camino de Vera s/n, Valencia, 46022, Spain
E-mail:
[email protected]

Cataloging-in-Publication Data applied for. A catalog record for this book is available from the Library of Congress.
Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet.
CR Subject Classification (1998): H.2, H.3, H.4, H.5, K.4.4, I.2
ISSN 0302-9743
ISBN 3-540-20257-9 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York, a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de
© Springer-Verlag Berlin Heidelberg 2003
Printed in Germany
Typesetting: Camera-ready by author, data conversion by PTP Berlin GmbH
Printed on acid-free paper
SPIN: 10950012 06/3142 543210
Preface
ER 2003, the 22nd International Conference on Conceptual Modeling in Chicago, Illinois, hosted four workshops on emerging and maturing aspects of conceptual modeling. While the entity-relationship approach is used to address data(base) modeling, the increasingly connected information infrastructure demands answers that can handle complexity and can develop models about systems that are maintainable. We received seven excellent proposals for workshops to be held at ER 2003, out of which we selected the following four based on peer reviews:
– Conceptual Modeling Approaches for E-Business (eCOMO 2003) brought together researchers and practitioners interested in conceptual modeling techniques for e-business.
– The International Workshop on Conceptual Modeling Quality (IWCMQ 2003) concentrated on approaches to quality assurance in the modeling process.
– The International Bi-Conference Workshop on Agent-Oriented Information Systems (AOIS 2003) was devoted to investigating the agent paradigm for information systems development.
– Finally, the International Workshop on XML Schema and Data Management (XSDM 2003) addressed the impact of XML on topics like data integration, change management, and the Semantic Web.
All four workshops highlighted relatively new viewpoints on conceptual modeling. Conceptual modeling as such has been greatly influenced and shaped by the entity-relationship model of Peter Chen. However, new developments like object-orientation and the World-Wide Web require adaptations and new techniques. No longer can developers assume that they can completely understand or model the information system. The new developments create challenges in various directions; some of these were discussed in detail in the four ER 2003 workshops:
E-Business and E-Commerce. The rise of the Internet has created new opportunities for defining and enacting business relations between partners. The question is how information systems can help in finding business partners, creating new services, and enacting those new services. Any lack of information about some business partners or their products and services needs to be compensated for using some kind of trust-building institution or mechanism. Moreover, services for e-business are not necessarily linked tightly together, as used to be the case for information systems developed for single enterprises. Can a service be modeled independently from the provider of the service who is selected at run time? Last but not least, one has to take into account different business (process) models, business contracts, and their monitoring. Hence, the field of e-business stresses the need for comprehensive modeling and analysis techniques.
Model Quality. Conceptual models are products of modeling processes undertaken by a group of human experts. Industrial quality management has shifted
from quality tests at the end of the production process to quality assurance over all product development steps, including the early stages of requirements analysis. The same idea is being applied to improving or at least assessing the quality of conceptual models and the related modeling processes that create them. The more that such models are abstracted from the final implementation, the more difficult it appears to be to assess and control their quality. What constitutes an error in a model? Can we distinguish useful parts of a conceptual model from not so useful parts? Certainly, a team of modelers who are aware of the quality of their products has better opportunities to improve than a team of modelers who are not assessing quality aspects at all. Still, the questions are: what aspects to measure, with which methods, and how frequently?
Agent Orientation. Object-orientation is a programming and modeling paradigm that aims at encapsulation (hiding internal details) and re-use (of code and models). While this paradigm is still successful and valid, the lack of information about some components of an information system makes it less applicable to loosely coupled systems, like Web services or complex factories that are under constant evolution. Agent orientation provides a promising approach to deal with the increased complexity by including a flavor of autonomy in the components of an agent-oriented system: the co-operating agents have goals and can choose among multiple possible strategies to achieve them. The challenge from a conceptual modeling perspective is to represent agent systems in a way that makes them subject to analysis. Suitable languages for agent communication, goal representation, etc., are still under development.
XML Data and Schema. The last, but not least, topic covered by the ER 2003 workshops is XML. XML was, after the revolutionary rise of the Internet, in particular the World-Wide Web, an attempt to bring some order into the Web by tagging data elements with labels that indicate their interpretation (or schema). In a way, it is the global representation of interoperable data and perhaps processes. But does XML solve the problems of data/schema integration or does it just shift the problem to a new (yet uniform) syntax? XML databases are already on the market, including XML-based query languages. So, what parts of the traditional data modeling theory can be translated for the XML case?
The ER 2003 workshops addressed these issues and created a forum for fruitful discussions. The fact that three of the four workshops already have a long history shows that such discussions are long-term, and convincing answers will only appear after some time.
We thank our colleagues in the ER 2003 organization committee for their support. In particular, we thank the organizing chairs of the four workshops who came up with the ideas and imagination that made the workshop program at ER 2003 possible. Last but not least, our special thanks go to the paper authors and the reviewers who created the content of this volume and ensured its high quality.
October 2003
Manfred Jeusfeld
Óscar Pastor
ER 2003 Workshop Organization
General ER 2003 Workshops Chairs
Manfred A. Jeusfeld, Tilburg University, The Netherlands
Óscar Pastor, Polytechnical University of Valencia, Spain
eCOMO 2003 Organization
Heinrich C. Mayr, University of Klagenfurt, Austria
Willem-Jan van den Heuvel, Tilburg University, The Netherlands
Christian Kop, University of Klagenfurt, Austria
IWCMQ 2003 Organization
Jim Nelson, Ohio State University, USA
Geert Poels, Ghent University, Belgium
Marcela Genero, Universidad de Castilla, Spain
Mario Piattini, Universidad de Castilla, Spain
AOIS 2003 Organization
Paolo Giorgini, University of Trento, Italy
Brian Henderson-Sellers, University of Technology, Sydney, Australia
XSDM 2003 Organization
Sanjay Madria, University of Missouri-Rolla, USA
eCOMO 2003 Program Committee
Fahim Akhter, Zayed University, United Arab Emirates
Boldur Barbat, Lucian Blaga University, Sibiu, Romania
Boualem Benatallah, University of New South Wales, Sydney, Australia
Anthony Bloesch, Microsoft Corporation, USA
Antonio di Leva, University of Torino, Italy
Vadim A. Ermolayev, Zaporozhye State University, Ukraine
Marcela Genero, University of Castilla-La Mancha, Ciudad Real, Spain
Martin Glinz, University of Zurich, Switzerland
József Györkös, University of Maribor, Slovenia
Bill Karakostas, City University, London, UK
Roland Kaschek, Massey University, New Zealand
Stephen Liddle, Brigham Young University, USA
Zakaria Maamar, Zayed University, United Arab Emirates
Norbert Mikula, Intel Labs, Hillsboro, USA
Óscar Pastor, University of Valencia, Spain
Barbara Pernici, Politecnico di Milano, Italy
Matti Rossi, Helsinki School of Economics, Finland
Michael Schrefl, University of Linz, Austria
Daniel Schwabe, PUC-Rio, Brazil
Il-Yeol Song, Drexel University, Philadelphia, USA
Bernhard Thalheim, BTU, Cottbus, Germany
Jos van Hillegersberg, Erasmus University, Rotterdam, The Netherlands
Ron Weber, University of Queensland, Australia
Carson Woo, UBC, Vancouver, Canada
Jian Yang, Tilburg University, The Netherlands
IWCMQ 2003 Program Committee
Deb Armstrong, University of Arkansas, USA
Sjaak Brinkkemper, Baan, The Netherlands
Giovanni Cantone, University of Rome, Italy
Guido Dedene, Katholieke Universiteit Leuven, Belgium
Brian Henderson-Sellers, University of Technology, Sydney, Australia
Paul Johannesson, Stockholm University, Sweden
Barbara Kitchenham, Keele University, UK
John Krogstie, Sintef, Norway
Heinrich Mayr, University of Klagenfurt, Austria
Daniel Moody, Norwegian University of Science and Technology, Norway
Jim Nelson, Ohio State University, USA
Jeff Parsons, Memorial University of Newfoundland, Canada
Óscar Pastor, University of Valencia, Spain
Gustavo Rossi, National University of La Plata, Argentina
Houari Sahraoui, Université de Montréal, Canada
Reinhard Schuette, University of Essen, Germany
Keng Siau, University of Nebraska-Lincoln, USA
Guttorm Sindre, Norwegian University of Science and Technology, Trondheim, Norway
Monique Snoeck, Katholieke Universiteit Leuven, Belgium
Bernhard Thalheim, Brandenburg University of Technology at Cottbus, Germany
AOIS 2003 Program Committee
B. Blake, Georgetown University, Washington, DC, USA
P. Bresciani, ITC-IRST, Italy
H.-D. Burkhard, Humboldt Univ., Germany
L. Cernuzzi, Universidad Católica Nuestra Señora de la Asunción, Paraguay
L. Cysneiros, York University, Toronto, Canada
F. Dignum, Univ. of Utrecht, The Netherlands
B. Espinasse, Domaine Universitaire de Saint-Jérôme, France
I.A. Ferguson, B2B Machines, USA
T. Finin, UMBC, USA
A. Gal, Technion, Israel Institute of Technology, Israel
U. Garimella, Andra Pradesh Govt., MSIT, India
A.K. Ghose, Univ. of Wollongong, Australia
G. Karakoulas, CIBC and Univ. Toronto, Canada
K. Karlapalem, Indian Inst. of Information Technology, India
L. Kendall, Monash University, Australia
D. Kinny, University of Melbourne
S. Kirn, Techn. Univ. Ilmenau, Germany
M. Kolp, Université catholique de Louvain, Belgium
N. Jennings, Southampton University, UK
G. Lakemeyer, RWTH Aachen, Germany
Y. Lespérance, York University, Canada
D.E. O'Leary, Univ. of Southern California, USA
F. Lin, Hong Kong Univ. of Science and Technology, Hong Kong
J.P. Mueller, Siemens, Germany
J. Odell, James Odell Associates, USA
O.F. Rana, Cardiff University, UK
M. Schroeder, City University London, UK
N. Szirbik, Technische Universiteit Eindhoven, The Netherlands
F. Zambonelli, University of Modena and Reggio Emilia, Italy
C. Woo, Univ. British Columbia, Canada
Y. Ye, IBM T.J. Watson Research Center, USA
B. Yu, North Carolina State University, USA
XSDM 2003 Program Committee
Elisa Bertino, Università di Milano, Italy
Bharat Bhargava, Purdue University, USA
Sourav Bhowmick, Nanyang Technological University, Singapore
Tiziana Catarci, Università degli Studi di Roma "La Sapienza," Italy
Qiming Chen, Commerce One, USA
Sharma Chakravarthy, University of Texas, Arlington, USA
Kajal Claypool, University of Massachusetts, Lowell, USA
Ee-Peng Lim, Nanyang Technological University, Singapore
David W. Embley, Brigham Young University, USA
Alberto H.F. Laender, UFMG, Brazil
Le Gruenwald, University of Oklahoma, USA
Mengchi Liu, Carleton University, Canada
Qing Li, City University of Hong Kong, China
Mukesh Mohania, IBM Research Lab, India
Wee-Keong Ng, Nanyang Technological University, Singapore
Stefano Paraboschi, University of Bergamo, Italy
Giuseppe Psaila, University of Bergamo, Italy
Elke A. Rundensteiner, Worcester Polytechnic Institute, USA
Kian-Lee Tan, National University of Singapore, Singapore
Katsumi Tanaka, Kyoto University, Japan
Christelle Vangenot, EPFL, Switzerland
Osmar R. Zaiane, University of Alberta, Canada
Xiaofang Zhou, University of Queensland, Australia
External Referees
Gajanan Chinchwadkar
Farshad Fotouhi
Lars Olsen
Muhammed Al-Muhammed
Table of Contents
Conceptual Modeling Approaches for E-Business at ER 2003 (eCOMO 2003)

Preface to eCOMO 2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Heinrich C. Mayr, Willem-Jan van den Heuvel

Managing Evolving Business Workflows through the Capture of Descriptive Information . . . . . . . . . . 5
Sébastien Gaspard, Florida Estrella, Richard McClatchey, Régis Dindeleux

The Benefits of Rapid Modelling for E-business System Development . . . . . . . . . . 17
Juan C. Augusto, Carla Ferreira, Andy M. Gravell, Michael A. Leuschel, Karen M.Y. Ng

Prediction of Consumer Preference through Bayesian Classification and Generating Profile . . . . . . . . . . 29
Su-Jeong Ko

Developing Web Applications from Conceptual Models. A Web Services Approach . . . . . . . . . . 40
Vicente Pelechano, Joan Fons, Manoli Albert, Óscar Pastor

A Framework for Business Rule Driven Web Service Composition . . . . . . . . . . 52
Bart Orriëns, Jian Yang, Mike P. Papazoglou

Virtual Integration of the Tile Industry (VITI) . . . . . . . . . . . . . . . . . . . 65
Ricardo Chalmeta, Reyes Grangel, Ángel Ortiz, Raúl Poler

Second International Workshop on Conceptual Modeling Quality at ER 2003 (IWCMQ 2003)

Preface to IWCMQ 2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Jim Nelson, Geert Poels, Marcela Genero, Mario Piattini

Multiperspective Evaluation of Reference Models – Towards a Framework . . . . . . . . . . 80
Peter Fettke, Peter Loos

On the Acceptability of Conceptual Design Models for Web Applications . . . . . . . . . . 92
Franca Garzotto, Vito Perrone
Consistency by Construction: The Case of MERODE . . . . . . . . . . . . . . . . . 105
Monique Snoeck, Cindy Michiels, Guido Dedene

Defining Metrics for UML Statechart Diagrams in a Methodological Way . . . . . . . . . . 118
Marcela Genero, David Miranda, Mario Piattini

Visual SQL – High-Quality ER-Based Query Treatment . . . . . . . . . . . . . . . . 129
Hannu Jaakkola, Bernhard Thalheim

Multidimensional Schemas Quality: Assessing and Balancing Analyzability and Simplicity . . . . . . . . . . 140
Samira Si-Said Cherfi, Nicolas Prat

Conceptual Modeling of Accounting Information Systems: A Comparative Study of REA and ER Diagrams . . . . . . . . . . 152
Geert Poels

Agent-Oriented Information Systems at ER 2003 (AOIS 2003)

Preface to AOIS 2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Paolo Giorgini, Brian Henderson-Sellers

Bringing Multi-agent Systems into Human Organizations: Application to a Multi-agent Information System . . . . . . . . . . 168
Emmanuel Adam, René Mandiau

Reconciling Physical, Communicative, and Social/Institutional Domains in Agent Oriented Information Systems – A Unified Framework . . . . . . . . . . 180
Maria Bergholtz, Prasad Jayaweera, Paul Johannesson, Petia Wohed

An Agent-Based Active Portal Framework . . . . . . . . . . . . . . . . . . . . . . . 195
Aizhong Lin, Igor T. Hawryszkiewycz, Brian Henderson-Sellers

Agent-Oriented Modeling and Agent-Based Simulation . . . . . . . . . . . . . . . . 205
Gerd Wagner, Florin Tulba

REF: A Practical Agent-Based Requirement Engineering Framework . . . . . . . . . 217
Paolo Bresciani, Paolo Donzelli

Patterns for Motivating an Agent-Based Approach . . . . . . . . . . . . . . . . . . 229
Michael Weiss

Using Scenarios for Contextual Design in Agent-Oriented Information Systems . . . . . . . . . . 241
Kibum Kim, John M. Carroll, Mary Beth Rosson
Dynamic Matchmaking between Messages and Services in Multi-agent Information Systems . . . . . . . . . . 244
Muhammed Al-Muhammed, David W. Embley

International Workshop on XSDM at ER 2003 (XSDM 2003)

Preface to XSDM 2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Sanjay Madria

A Sufficient and Necessary Condition for the Consistency of XML DTDs . . . . . . . . . . 250
Shiyong Lu, Yezhou Sun, Mustafa Atay, Farshad Fotouhi

Index Selection for Efficient XML Path Expression Processing . . . . . . . . . . . . 261
Zhimao Guo, Zhengchuan Xu, Shuigeng Zhou, Aoying Zhou, Ming Li

CX-DIFF: A Change Detection Algorithm for XML Content and Change Presentation Issues for WebVigiL . . . . . . . . . . 273
Jyoti Jacob, Alpa Sachde, Sharma Chakravarthy

Storing and Querying XML Documents Using a Path Table in Relational Databases . . . . . . . . . . 285
Byung-Joo Shin, Min Jin

Improving Query Performance Using Materialized XML Views: A Learning-Based Approach . . . . . . . . . . 297
Ashish Shah, Rada Chirkova

A Framework for Management of Concurrent XML Markup . . . . . . . . . . . . . . 311
Alex Dekhtyar, Ionut E. Iacob

Object Oriented XML Query by Example . . . . . . . . . . . . . . . . . . . . . . . . 323
Kathy Bohrer, Xuan Liu, Sean McLaughlin, Edith Schonberg, Moninder Singh

Automatic Generation of XML from Relations: The Nested Relation Approach . . . . . . . . . . 330
Antonio Badia

Toward the Automatic Derivation of XML Transformations . . . . . . . . . . . . . . 342
Martin Erwig

VACXENE: A User-Friendly Visual Synthetic XML Generator . . . . . . . . . . . . . 355
Khoo Boon Tian, Sourav S Bhowmick, Sanjay Madria

A New Inlining Algorithm for Mapping XML DTDs to Relational Schemas . . . . . . . . . . 366
Shiyong Lu, Yezhou Sun, Mustafa Atay, Farshad Fotouhi
From XML DTDs to Entity-Relationship Schemas . . . . . . . . . . . . . . . . . . . 378
Giuseppe Psaila

Extracting Relations from XML Documents . . . . . . . . . . . . . . . . . . . . . . 390
Eugene Agichtein, C.T. Howard Ho, Vanja Josifovski, Joerg Gerhardt

Extending XML Schema with Nonmonotonic Inheritance . . . . . . . . . . . . . . . . 402
Guoren Wang, Mengchi Liu
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
Preface to eCOMO 2003
Today's increasingly competitive and expanding global marketplace requires companies to cope more effectively with rapidly changing market conditions than ever before. In order to survive in these highly volatile business eco-systems, companies are organizing themselves into integrated (virtual) enterprises, e.g., according to an integrated value chain. Conceptual business, enterprise and process models, either at the level of isolated or integrated enterprises, are heralded as an important mechanism for planning and managing these changes and transitions as well as for designing and constructing the necessary enterprise information systems. However, effective modeling methods and tools are still an issue under research and development. In addition, research issues in the area of business policy specification and change management of enterprise component-based models are of paramount importance.
The eCOMO workshop series is devoted to these questions. It aims at bringing together experts from practice and academia who are working from several independent, but related, perspectives on the same research questions, such as business modeling, enterprise application integration, the semantic web, business meta-data and ontologies, process management, business reengineering, and business communication languages. eCOMO 2003 continues a series of three highly successful predecessor eCOMO workshops, which were held during ER'2000 in Salt Lake City, ER'2001 in Yokohama, and ER'2002 in Tampere.
The program of eCOMO 2003 is the result of a thorough review process in which each of the submitted papers was assessed by three experienced reviewers. At the end of the review process, more papers than the six printed herein were rated as worthy of publication and presentation at the workshop. However, the program committee had to make its final decision according to the rules of ER and LNCS. The selected six contributions mainly deal with business process modeling and management aspects in the context of agile web-application development, most of them adopting the emerging Service Oriented Computing (SOC) paradigm. This paradigm is nowadays principally manifested by web-service technology, and reflects the ongoing migration from object-oriented and component-based modeling and development to a novel way of conceptualizing, designing and constructing lightweight and web-enabled software components on an as-needed basis.
Many persons deserve appreciation and recognition for their contributions to making eCOMO 2003 a success. First of all we have to thank the authors for their valuable contributions. Similarly, we thank the members of the program committee, who spent a lot of time in assessing submitted papers and participating in the iterative discussions on acceptance or rejection. Special appreciation is due to Christian Kop, who organized and coordinated the whole preparation
process, including the composition of these proceedings. Last but not least, we thank the ER organizers and the ER workshop co-chairs Manfred Jeusfeld and Óscar Pastor for their support in integrating eCOMO 2003 into ER'2003.
October 2003
Heinrich C. Mayr Willem-Jan van den Heuvel
Managing Evolving Business Workflows through the Capture of Descriptive Information

Sébastien Gaspard 1,2,3, Florida Estrella 1, Richard McClatchey 1, and Régis Dindeleux 2,3

1 CCCS, University of the West of England, Frenchay, Bristol BS16 1QY, UK
[email protected]
2 LLP/ESIA, Université de Savoie, Annecy, 74016 CEDEX, France
[email protected]
3 Thésame Mécatronique et Management, Annecy, 74000, France
rd@thésame-innovation.com
Abstract. Business systems these days need to be agile to address the needs of a changing world. In particular the discipline of Enterprise Application Integration requires business process management to be highly reconfigurable with the ability to support dynamic workflows, inter-application integration and process reconfiguration. Basing EAI systems on model-resident or on a so-called description-driven approach enables aspects of flexibility, distribution, system evolution and integration to be addressed in a domain-independent manner. Such a system called CRISTAL is described in this paper with particular emphasis on its application to EAI problem domains. A practical example of the CRISTAL technology in the domain of manufacturing systems, called Agilium, is described to demonstrate the principles of model-driven system evolution and integration. The approach is compared to other model-driven development approaches such as the Model-Driven Architecture of the OMG and so-called Adaptive Object Models.
1 Background and Related Works

As the global marketplace becomes increasingly complex and intricately connected, organizations are constantly pressured to re-organize, re-structure, diversify, consolidate and slim down to provide a winning competitive edge. With the advent of the Internet and e-commerce, the need for coexistence and interoperation with legacy systems and for reduced 'times-to-market', the demand for the timely delivery of flexible software has increased. Coupled to this are the increasing complexity of systems and the requirement for systems to evolve over potentially extended timescales; clearly defined, extensible models as the basis of rapid systems design therefore become a pre-requisite to successful systems implementation.
One of the main drivers in the object-oriented design of information systems is the need for the reuse of design artefacts or models in handling systems evolution. To be able to cope with system volatility, systems must have the capability of reuse and to adapt as and when necessary to changes in requirements. The philosophy that has been investigated in the research reported in this paper is based on the systematic
capture of the description of systems elements covering multiple views of the system to be designed (including processes and workflows) using common techniques. Such a description-driven approach [1, 2] involves identifying and abstracting the crucial elements (such as items, processes, lifecycles, goals, agents and outcomes) in the system under design and creating high-level descriptions of these elements which are stored and managed separately from their instances.
Description-driven systems (DDS) make use of so-called meta-objects to store domain-specific system descriptions that control and manage the life cycles of meta-object instances or domain objects. The separation of descriptions from their instances allows them to be specified and managed and to evolve independently and asynchronously. This separation is essential in handling the complexity issues facing many web-computing applications and allows the realization of inter-operability, reusability and system evolution as it gives a clear boundary between the application's basic functionalities and its representations and controls. In a description-driven system as we define it, process descriptions are separated from their instances and managed independently to allow the process descriptions to be specified and to evolve asynchronously from particular instantiations of those descriptions. Separating descriptions from their instantiations allows new versions of elements (or element descriptions) to coexist with older versions.
In this paper the development of Enterprise Resource Planning (ERP) in flexible business systems is considered and the need for business process modelling in Enterprise Application Integration (EAI) [3] is established. Workflow systems are considered as vehicles in which dynamic system change in EAI can be catered for as part of handling system evolution through the capture of system description. A description-driven approach is proposed to enable the management of workflow descriptions and an example is given of an application of the CRISTAL description-driven system developed at CERN [4] to handle dynamic system change in workflows. This approach today has two parallel implementations, called CRISTAL for CMS and Agilium. The two applications are based on the same kernel, called the CRISTAL KERNEL, which provides the DDS functionalities. Both applications inherit these functionalities even if the goals and the specifics of each application are radically different.
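To make this separation concrete, the sketch below (Java, with invented class names; it illustrates the general idea rather than the actual CRISTAL or Agilium code) keeps process descriptions as versioned meta-objects in a repository, while each enacting instance stays bound to the description version it was started with, so old and new versions coexist.

```java
// Sketch of description/instance separation with versioning (invented names).
import java.util.*;

class ProcessDescription {                         // meta-object: a versioned process description
    final String name;
    final int version;
    final List<String> steps;
    ProcessDescription(String name, int version, List<String> steps) {
        this.name = name;
        this.version = version;
        this.steps = List.copyOf(steps);
    }
}

class ProcessInstance {                            // domain object: enacts one description version
    final ProcessDescription description;          // stays bound to the version it started with
    private int currentStep = 0;
    ProcessInstance(ProcessDescription description) { this.description = description; }
    String nextStep() { return description.steps.get(currentStep++); }
}

class DescriptionRepository {                      // descriptions evolve independently of instances
    private final Map<String, List<ProcessDescription>> versions = new HashMap<>();
    void register(ProcessDescription d) {
        versions.computeIfAbsent(d.name, k -> new ArrayList<>()).add(d);
    }
    ProcessDescription latest(String name) {
        List<ProcessDescription> all = versions.get(name);
        return all.get(all.size() - 1);
    }
}

class CoexistenceDemo {
    public static void main(String[] args) {
        DescriptionRepository repo = new DescriptionRepository();
        repo.register(new ProcessDescription("Order", 1, List.of("receive", "ship")));
        ProcessInstance oldOrder = new ProcessInstance(repo.latest("Order"));   // runs version 1
        repo.register(new ProcessDescription("Order", 2, List.of("receive", "check-credit", "ship")));
        ProcessInstance newOrder = new ProcessInstance(repo.latest("Order"));   // runs version 2
        // oldOrder continues with version 1 while newOrder follows version 2.
        System.out.println(oldOrder.nextStep() + " / " + newOrder.nextStep());
    }
}
```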
2 The Need for Integrated Business Process Modelling in EAI

In recent years, enterprises have been moving from a traditional function-led organisation, addressing the needs of a "vertical" market, to a "horizontal" organisation based on business processes. The emergence of new norms such as the ISO 9001 V2000 and the development of inter-enterprise exchanges are further drivers towards process-led reorganisation. However, currently available information systems and production software are still organised following function models. Consequently, they are not well adapted to the exchange of information between enterprises nor to coping with evolving process descriptions. In modern enterprises organised following a horizontal structure, industrial EAI solutions are very dependent on process performance and on the ability of the underlying enterprise management to execute and automate the business processes. Furthermore, the requirement for the support of enterprise activities is not only for the execution of
Fig. 1. The three basic layers of an Enterprise Application Integration (EAI) system
internal processes but also for external processes, as in the support of supplier-customer relationships, especially in supply chain management. Enterprise processes have to integrate increasingly complex business environments including domain-dependent processes for managing both inter-application operation and inter-organisation operation, where efficient communication is crucial. Integration sources across enterprises are numerous and multi-technological and can include ERP, human resource management, Customer Relationship Management (CRM), administration software, Intranet/Internet, proprietary software and a plethora of office tools.
The first step that an enterprise must make in order to move from a standard vertical organisation to a horizontal organisation is to chart its existing business processes and the interactions between these processes. Following this it must update and manage its internal processes based on existing information systems. For that, the enterprise may be confronted by a collection of different production software systems and their abilities to interact. Most of the software offerings that support ERP deal with the description of enterprises through their organisation by function, and examples of these products include systems for purchase service, stock management, production management, etc. However, individual systems need to synchronise with each other and each normally has its own process management model. Most commercial software does not provide tools to aid in process description and evolution. Even when workflow engines (which can provide synchronisation between systems) are integrated within ERP systems, they are for the most part not synchronised with external applications of the ERP system.
EAI [3] systems concentrate on an architecture centred on software which is dedicated to the interconnection of heterogeneous applications that manage information flows. The heart of EAI software is normally based on the concept of interface references where transformation, routing and domain-dependent rules are centralised. Standard EAI architecture is normally built on three layers: processes, routing and transport layers, as shown in figure 1. At the lowest layer are the so-called "connectors" which allow external applications to connect to the EAI platform. It is this level that manages the transport of messages between applications. This Transport layer can be based on basic technologies such as Message Oriented MiddleWare (MOM) [5], and on file reading,
email and technologies such as HTTP. The middle layer of standard EAI software (the Routing layer) manages the transformation of data and its routing between internal systems. More evolved technologies such as XML/XSLT/SOAP, Electronic Data Interchange (EDI), "home-made" connectors and a transition database are used to provide the routing capabilities in this layer. The function of this layer is to apply transformation rules to source application data and to route the new information to the required target application. The third layer of an EAI system is dedicated to system modelling. At this layer, a workflow engine managing domain-dependent specific processes is often employed (when available). Technically this EAI model suffers from a number of problems:
• The management and modelling of processes needs specific development. Where a workflow engine is used for this purpose, workflows are often based on a monolithic architecture using a matrix definition of workflows that is fixed for the lifecycle of the system.
• Specific technologies are used. In MOM solutions, data transformations are normally based not on generic tools but on internal developments. Even if XML is used, in most cases the data dictionary is not defined.
• Guidelines for implementing Connectors do not exist. Connectors have to be fully specified, developed and adapted to the connected application. Any change in the EAI software or the connected software requires redevelopment of the Connectors.
• Most of the time, the EAI software has to be placed on a single server, which manages all the processes and has to support three different applications (one for each layer of the EAI model) that have more or less been created to be used together.
• An expensive application server or database management system (DBMS) needs to be already installed and maintained.
As expressed by Joeris in 1999 [6], the support of heterogeneous processes, flexibility, reuse, and distribution are great challenges for the design of the next generation of process modelling languages and their enactment mechanisms. These process modelling technologies are important for business and can make the management of systems more reactive and efficient. However, it is not sufficient to concentrate solely on process management and enactment to solve all the problems identified in EAI in the previous section. If workflow systems are not coupled to a comprehensive data management model, optimum functionalities cannot be realised [7]. Most recent workflow research has focused on the enactment and modelling of processes based on Petri nets [8], CORBA [9] or UML concepts [10]; however, when issues of reconfiguration¹ are considered, these research solutions only provide a high level of workflow control that does not completely address the enterprise problems listed above. The research outlined in this paper proposes an approach that deals with a high level of management of processes with the ability to manage complex data. It is based on distributed technologies and allows the relative autonomy of activity execution, with an enactment model that is relatively similar to that of Joeris [11]. Coupling this technology with some abstraction of process description that can provide generic
¹ Reconfiguration: the ability of a system to dynamically change executing instances of processes in line with a change in its description.
Fig. 2. The four-layer architecture of the OMG
workflow models [12] is a suitable alternative to standard EAI solutions and more closely addresses the problems listed earlier.
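As a rough illustration of the routing layer discussed in this section, the following sketch (Java, using the standard javax.xml.transform API; the OrderRoutingStep and TargetConnector names are invented and do not refer to any particular EAI product) applies a transformation rule to source application data and passes the result to a transport-layer connector.

```java
// Sketch of a routing-layer step: apply an XSLT transformation rule to source data
// and route the result through a transport-layer connector (invented interfaces).
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;
import java.io.InputStream;
import java.io.StringWriter;

interface TargetConnector {                 // assumed transport abstraction (HTTP, MOM queue, file...)
    void send(String message);
}

public class OrderRoutingStep {
    private final Transformer transformer;  // the transformation rule, expressed as XSLT
    private final TargetConnector target;

    public OrderRoutingStep(File xsltRule, TargetConnector target) throws TransformerException {
        this.transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(xsltRule));
        this.target = target;
    }

    /** Transform the source application's XML and route the result to the target application. */
    public void route(InputStream sourceXml) throws TransformerException {
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(sourceXml), new StreamResult(out));
        target.send(out.toString());
    }
}
```

In a full EAI platform the transformation rule and the target connector would themselves be configuration rather than code, which is exactly the gap the description-driven approach discussed next aims to close.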
3 Handling Evolution via System Description

Approaches for handling system evolution through reuse of design artefacts have led to the study of reusable classes, design patterns, frameworks and model-driven development. Emerging and future information systems however require more powerful data modelling techniques that are sufficiently expressive to capture a broader class of applications. Compelling evidence suggests that the data model must be OO, since that is the model that currently maximises generality. The data model needs to be an open OO model, thereby coping with different domains having different requirements on the data model [13]. We have realised that object meta-modelling allows systems to have the ability to model and describe both the static properties of data and their dynamic relationships, and to address issues regarding complexity explosion, the need to cope with evolving requirements, and the systematic application of software reuse. To be able to describe system and data properties, object meta-modelling makes use of meta-data. Meta-data are information defining other data.
Figure 2 shows the familiar four-layer model of the Object Management Group, OMG, embodying the principles of meta-modelling. Each layer provides a service to the layer above it and serves as a client to the layer below it. The meta-meta-model layer defines the language for specifying meta-models. Typically more compact than the meta-model it describes, a meta-meta-model defines a model at a higher level of abstraction than a meta-model. The meta-model layer defines the language for specifying models, a meta-model being an instance of a meta-meta-model. The model layer defines the language for specifying information domains. In this case, a model is an instance of a meta-model. The bottom layer contains user objects and user data, the instance layer describing a specific information domain. The OMG standards group has a similar
architecture based on model abstraction, with the Meta-Object Facility (MOF) model and the UML [14] model defining the language for the meta-meta-model and meta-model layers, respectively. The judicious use of meta-data can lead to heterogeneous, extensible and open systems. Meta-data make use of a meta-model to describe domains. Our recent research has shown that meta-modelling creates a flexible system offering the following: reusability, complexity handling, version handling, system evolution and inter-operability. Promotion of reuse, separation of design and implementation and reification are some further reasons for using meta-models. As such, meta-modelling is a powerful and useful technique in designing domains and developing dynamic systems.
A reflective system utilizes an open architecture where implicit system aspects are reified to become explicit first-class meta-objects [15]. The advantage of reifying system descriptions as objects is that operations can be carried out on them, like composing and editing, storing and retrieving, organizing and reading. Since these meta-objects can represent system descriptions, their manipulation can result in changes in the overall system behaviour. As such, reified system descriptions are mechanisms that can lead to dynamically evolvable and reusable systems. Meta-objects, as used in the current work, are the self-representations of the system describing how its internal elements can be accessed and manipulated. These self-representations are causally connected to the internal structures they represent, i.e. changes to these self-representations immediately affect the underlying system. The ability to dynamically augment, extend and redefine system specifications can result in a considerable improvement in flexibility. This leads to dynamically modifiable systems, which can adapt and cope with evolving requirements.
There are a number of OO design techniques that encourage the design and development of reusable objects. In particular, design patterns are useful for creating reusable OO designs [16]. Design patterns for structural, behavioural and architectural modelling have been well documented elsewhere and have provided software engineers with rules and guidelines that they can (re-)use in software development. Reflective architectures that can dynamically adapt to new user requirements by storing descriptive information which can be interpreted at runtime have led to so-called Adaptive Object Models [17]. These are models that provide meta-information about domains that can be changed on the fly. Such an approach, proposed by Yoder, is very similar to the approach adopted in this paper.
A Description-Driven System (DDS) architecture [1, 2], as advocated in this paper, is an example of a reflective meta-layer (i.e. meta-level and multi-layered) architecture. It makes use of meta-objects to store domain-specific system descriptions, which control and manage the life cycles of meta-object instances, i.e. domain objects. The separation of descriptions from their instances allows them to be specified and managed and to evolve independently and asynchronously. This separation is essential in handling the complexity issues facing many web-computing applications and allows the realization of inter-operability, reusability and system evolution as it gives a clear boundary between the application's basic functionalities and its representations and controls.
As objects, reified system descriptions of DDSs can be organized into libraries or frameworks dedicated to the modelling of languages in general, and to customizing their use for specific domains in particular. As a practical example of our approach, the next section describes the DDS architecture developed in
the context of research carried out in the CRISTAL project at CERN and the Agilium project at Thésame.
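The fragment below (Java; the FieldDescription and DescribedRecord names are invented for this illustration and are not CRISTAL classes) sketches the causal connection described above: base-level objects interpret their reified description at runtime, so editing the meta-object immediately changes what the objects accept, with no recompilation.

```java
// Sketch of a causally connected meta-object: base-level records consult their
// description at runtime, so editing the description changes behaviour immediately.
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class FieldDescription {                          // reified, editable description (meta-object)
    final Map<String, String> fieldTypes = new LinkedHashMap<>();
    void addField(String name, String type) { fieldTypes.put(name, type); }
}

class DescribedRecord {                           // base-level object, interpreted at runtime
    final FieldDescription description;           // the self-representation it is causally connected to
    final Map<String, Object> values = new HashMap<>();
    DescribedRecord(FieldDescription description) { this.description = description; }

    void set(String field, Object value) {
        if (!description.fieldTypes.containsKey(field))
            throw new IllegalArgumentException("unknown field: " + field);
        values.put(field, value);
    }
}

class ReflectionDemo {
    public static void main(String[] args) {
        FieldDescription partLayout = new FieldDescription();
        partLayout.addField("material", "string");
        DescribedRecord part = new DescribedRecord(partLayout);
        part.set("material", "copper");
        partLayout.addField("length", "number");  // change the description at runtime...
        part.set("length", 42);                   // ...and existing objects accept the new field
    }
}
```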
4 CRISTAL – A Description-Driven System (DDS)

The Compact Muon Solenoid (CMS) is a general-purpose experiment at CERN that will be constructed from around a million parts and will be produced and assembled in the next decade by specialized centres distributed worldwide. As such, the construction process is very data-intensive, highly distributed and ultimately requires a computer-based system to manage the production and assembly of detector components. In constructing detectors like CMS, scientists require data management systems that are able to cope with complexity, with system evolution over time (primarily as a consequence of changing user requirements and extended development timescales) and with system scalability, distribution and interoperation. No commercial products provide the capabilities required by CMS. Consequently, a research project, entitled CRISTAL (Cooperating Repositories and an Information System for Tracking Assembly Lifecycles [4]) has been initiated to facilitate the management of the engineering data collected at each stage of production of CMS. CRISTAL is a distributed product data and workflow management system, which makes use of an OO database for its repository, a multi-layered architecture for its component abstraction and dynamic object modelling for the design of the objects and components of the system. CRISTAL is based on a DDS architecture using meta-objects. The DDS approach has been followed to handle the complexity of such a data-intensive system and to provide the flexibility to adapt to the changing scenarios found at CERN that are typical of any research production system. In addition CRISTAL offers domain-independence in that the model is generic in concept. Lack of space prohibits further discussion of CRISTAL; detail can be found in [1, 2 & 4].
The design of the CRISTAL prototype was dictated by the requirements for adaptability over extended timescales, for system evolution, for interoperability, for complexity handling and for reusability. In adopting a description-driven design approach to address these requirements, the separation of object instances from object description instances was needed. This abstraction resulted in the delivery of a three-layer description-driven architecture. The model abstraction (of instance layer, model layer, meta-model layer) has been adapted from the OMG MOF specification [18], and the need to provide descriptive information, i.e. meta-data, has been identified to address the issues of adaptability, complexity handling and evolvability. Figure 3 illustrates the CRISTAL architecture. The CRISTAL model layer is comprised of class specifications for CRISTAL type descriptions (e.g. PartDescription) and class specifications for CRISTAL classes (e.g. Part). The instance layer is comprised of object instances of these classes (e.g. PartType#1 for PartDescription and Part#1212 for Part). The model and instance layer abstraction is based on model abstraction and on the Is an instance of relationship. The abstraction based on meta-data abstraction and the Is described by relationship leads to two levels - the meta-level and the base level. The meta-level is comprised of meta-objects and the meta-level model that defines them (e.g. PartDescription is the meta-level model of PartType#1 meta-object). The base level is comprised of base objects and the base level model that defines them (Part is the base-level model of Part#1212 object).
Fig. 3. The CRISTAL description-driven architecture.
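Using the PartDescription/Part example from the text, the following minimal sketch (Java; the fields and classes are illustrative assumptions, not the CRISTAL implementation) shows the two orthogonal relationships: model-layer classes are instantiated at the instance layer, while base-level objects hold an explicit is described by link to their meta-object.

```java
// Sketch of the two relationships: "is an instance of" (class to object) and
// "is described by" (base object to meta-object), using the Part example from the text.
class PartDescription {                     // meta-level model element (PartType#1 is an instance of it)
    final String typeName;
    final String[] characteristics;
    PartDescription(String typeName, String... characteristics) {
        this.typeName = typeName;
        this.characteristics = characteristics;
    }
}

class Part {                                // base-level model element (Part#1212 is an instance of it)
    final String id;
    final PartDescription describedBy;      // explicit "is described by" link to the meta-object
    Part(String id, PartDescription describedBy) {
        this.id = id;
        this.describedBy = describedBy;
    }
}

class LayersDemo {
    public static void main(String[] args) {
        PartDescription partType1 = new PartDescription("PartType#1", "material", "length");
        Part part1212 = new Part("Part#1212", partType1);
        System.out.println(part1212.id + " is described by " + part1212.describedBy.typeName);
    }
}
```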
The approach of reifying a set of simple design patterns as the basis of the description-driven architecture for CRISTAL has provided the capability of catering for the evolution of a rapidly changing research data model. In the two years of operation of CRISTAL it has gathered over 25 Gbytes of data and been able to cope with more than 30 evolutions of the underlying data schema without code or schema recompilations.
5 Agilium – A Description-Driven Workflow System

5.1 Agilium Functionality

In order to address the deficiencies in current EAI systems, a research system entitled Agilium, based on CRISTAL technologies, has been developed by a collaboration of three research partners: CERN (the European Organisation for Nuclear Research, in Geneva, Switzerland), UWE (the University of the West of England, Bristol, UK) and Thésame (an innovation network for technology companies in mechatronics, computer-integrated manufacturing and management, based in Annecy, France). The model and technologies used in Agilium make EAI tools accessible to middle-sized enterprises and to software houses and integrators that target the EAI market.
The CRISTAL architecture provides a coherent way to replace the three application layers of EAI (as shown in figure 1) by a single generic layer, based on Items and using common tools, processes, routing and transport. In order to provide an effective EAI architecture, Agilium combines Items described using the DDS philosophy. This approach provides a development-free way to deliver EAI functionality whilst managing behaviour through workflows. Items can be connectors or conceptual domain-specific objects such as order forms, supplies, commands, etc.
Fig. 4. The CRISTAL approach to integrated EAI employed in Agilium.
In the Agilium system, a connector is managed as one single Item described with a specific, graphically represented behaviour. A connector can transform data (using scripting or XML technologies) and is coupled to a communication method that can be any appropriate enabling technology. In this way, it is easy to connect applications that have any arbitrary communication interface. It is sufficient simply to describe a communication mode (CORBA, HTTP, SOAP, Web Service, text file, email...), a data format (which will be converted to XML) and a behaviour (with the workflow graphical interface). Based on the concept of the DDS, connectors are easily maintainable and modifiable and make the Agilium system easy to integrate and adapt to the evolving environments prevalent in enterprise information systems. By combining the Items describing domain-specific functionality and those that can connect external applications, the EAI architecture is complete and presents all the functionalities of the external architectures, and more.
Using the facilities for description and dynamic modification in CRISTAL, Agilium is able to provide a modifiable and reconfigurable workflow. The workflow description and enactment elements of CRISTAL are correlated, and each instance stores any modifications that have been carried out on it. With an efficient model of verification at both levels (i.e. description and enactment), it is possible to validate whether the migration from one description to another within an instance is possible, to detect any modifications and changes, and therefore to apply the migration. Ongoing research is being conducted to mathematically model the workflow concepts that could be directly applied to the CRISTAL technologies so as to complete this modification ability.
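The kind of migration check mentioned above can be pictured with the following sketch (Java; the prefix-based rule is an assumed, simplified criterion, not Agilium's actual validation model): an executing instance may migrate to a new workflow description only if the steps it has already completed are preserved, in order, by the new description.

```java
// Sketch of an instance-migration check: an executing instance may adopt a new
// description only if its already-completed steps form a prefix of the new description.
import java.util.ArrayList;
import java.util.List;

class WorkflowDescription {
    final String name;
    final int version;
    final List<String> steps;
    WorkflowDescription(String name, int version, List<String> steps) {
        this.name = name;
        this.version = version;
        this.steps = List.copyOf(steps);
    }
}

class WorkflowInstance {
    private WorkflowDescription description;
    private final List<String> completedSteps = new ArrayList<>();

    WorkflowInstance(WorkflowDescription description) { this.description = description; }

    void completeNextStep() { completedSteps.add(description.steps.get(completedSteps.size())); }

    /** True if everything executed so far is preserved, in order, by the new description. */
    boolean canMigrateTo(WorkflowDescription candidate) {
        return candidate.steps.size() >= completedSteps.size()
            && candidate.steps.subList(0, completedSteps.size()).equals(completedSteps);
    }

    void migrateTo(WorkflowDescription candidate) {
        if (!canMigrateTo(candidate))
            throw new IllegalStateException("instance cannot follow the new description");
        description = candidate;            // remaining steps now come from the new version
    }
}
```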
5.2 Advantages and Limitations of Agilium

Innovative technologies used in the kernel of CRISTAL provide Agilium with significant advantages when compared to standard EAI solutions:
• Flexibility. Architecture independence allows Agilium to adapt to new domains and/or new enterprises without any specific development. This is an essential factor in helping to reduce maintenance costs and to minimise conversion costs, thus providing high levels of flexibility.
• Platform independence. Use of JAVA/CORBA/XML/LDAP technologies allows CRISTAL to work on any preinstalled operating system (Linux, Windows, UNIX, Mac OS) and on any machine (Mac, PC, Sun, IBM...).
• Database independence. XML storage with LDAP referencing makes CRISTAL autonomous and independent of any underlying database and of any application server.
• Simplified integration of applications. XML is becoming the standard for interfacing applications within an enterprise. It presents a solution that supports multiple heterogeneous environments. Apart from the development of a translation/transport layer, connectors are based on a generic model.
• Fully distributed. This functionality provides web usability through the Internet or an Intranet. It also makes data accessible from multiple databases via a unique interface.
• CRISTAL's powerful workflow management facilities provide the ability to model and execute processes of any type that can also evolve by dynamic modification or by the application of any new description.
But there are some limitations:
• Because it is based on graphical descriptions, it is not always simple to determine how to code CRISTAL actions (i.e. how to describe a workflow). Using a high level of abstraction can render simple things difficult to represent in code.
• Providing complete flexibility to users to define elements in the system can compromise the integrity of enactment, and the implementation is not yet sufficiently advanced to provide a fully secured system without requiring human intervention.
• As connectivity technologies such as BPML and BPEL4WS [19] become more complex, complete, normative and numerous, the Agilium tool has to provide and maintain many connectors that cannot be defined solely by the user.

5.3 Future Work

Ongoing research is being conducted into the mathematical approach to process modelling in CRISTAL, which may ultimately include a decision-making model based on agent technology. It is planned that these agents would verify changes to the model and dynamically modify the instances of the workflow processes, basing their calculations and decisions on a base of user-predefined constraints that must be respected. Another aspect that may be explored is the use of an Architecture Definition Language to model Items and their interactions. This would provide an efficient and secure way to create descriptions for new domains to model. Another area that could be explored is the connector aspects of the Agilium EAI. A mathematical approach to connector specifications could be envisaged and defined which would enable connector development to be automated in CRISTAL.
6 Conclusions

The combination of a multi-layered meta-modelling architecture and a reflective meta-level architecture resulted in what has been referred to in this paper as a DDS architecture. A DDS architecture is an example of a reflective meta-layer architecture. The CRISTAL DDS architecture was shown to have two abstractions. The vertical abstraction is based on the is an instance of relationship from the OMG meta-modelling standard, and has three layers: instance layer, model layer and meta-model layer. This paper has proposed an orthogonal horizontal abstraction mechanism that complements this OMG approach. The horizontal abstraction is based on the meta-level architecture approach, encompasses the is described by relationship and has two layers: meta-level and base level. This description-driven philosophy facilitated the design and implementation of the CRISTAL project with mechanisms for handling and managing reuse in its evolving system requirements and served as the basis of the Agilium description-driven workflow system.
The model-driven philosophy expounded in this paper is similar to that expounded in the Model Driven Architecture (MDA [18]) of the OMG. The OMG's goal is to provide reusable, easily integrated, easy to use, scalable and extensible components built around the MDA. While DDS architectures establish those patterns which are required for exploiting data appearing at different modelling abstraction layers, the MDA approaches integration and interoperability problems by standardizing interoperability specifications at each layer (i.e. standards like XML, CORBA, .NET, J2EE). The MDA integration approach is similar to the Reference Model for Open Distributed Processing (RM-ODP) [20] strategy of interoperating heterogeneous distributed processes using a standard interaction model. In addition, the Common Warehouse Metamodel (CWM) specification [21] has recently been adopted by the OMG. The CWM enables companies to manage their enterprise data better, and makes use of UML, XML and the MOF. The specification provides a common meta-model for warehousing and acts as a standard translation for structured and unstructured data in enterprise repositories, irrespective of proprietary database platforms.
Likewise, the contributions of this work complement the ongoing research on Adaptive Object Models (AOM) espoused in [17] and [22], where a system with an AOM (also called a Dynamic Object Model) is stated to have an explicit object model that is stored in the database and interpreted at runtime. Objects are generated dynamically from the AOM schema meta-data that represent data descriptions. The AOM approach also uses reflection in reifying implicit data aspects (e.g. database schema, data structures, maps of layouts of data objects, references to methods or code). The description-driven philosophy has demonstrably facilitated the design and implementation of the CRISTAL and Agilium projects with mechanisms for handling and managing reuse in their evolving system requirements.

Acknowledgments. The authors take this opportunity to acknowledge the support of their home institutes and numerous colleagues responsible for the CRISTAL & Agilium software.
References

1. Kovacs Z., "The Integration of Product Data with Workflow Management Systems", PhD Thesis, University of the West of England, Bristol, England, April 1999.
2. Estrella F., "Objects, Patterns and Descriptions in Data Management", PhD Thesis, University of the West of England, Bristol, England, December 2000.
3. Mann J., "Workflow and Enterprise Application Integration", 2001, analyst in Middleware.
4. Estrella F. et al., "Handling Evolving Data Through the Use of a Description Driven Systems Architecture", Lecture Notes in Computer Science Vol. 1727, pp. 1–11, ISBN 3-540-66653-2, Springer-Verlag, 1999.
5. Rao B.R., "Making the Most of Middleware", Data Communications International 24, 12 (September 1995): 89–96.
6. Joeris G., "Toward Flexible and High-level Modeling and Enactment of Processes", University of Bremen, 1999.
7. Sheth A.P., van der Aalst W.M.P., "Processes Driving the Networked Economy", University of Georgia, 1999.
8. van der Aalst W.M.P., "Making Work Flow: On the Application of Petri Nets to Business Process Management", Eindhoven University of Technology, 2001.
9. Tari Z., Pande V., "Dynamic Workflow Management in CORBA Distributed Object Systems", RMIT University, IBMG GSA, 2000.
10. Torchiano M., Bruno G., "Domain-Specific Instance Models in UML", IDI NTNU, 2002.
11. Joeris G., "Decentralized and Flexible Workflow Based on Task Coordination Agents", University of Bremen, 2000.
12. van der Aalst W.M.P., "Generic Workflow Models: How to Handle Dynamic Change and Capture Management Information?", Eindhoven University of Technology, 1999.
13. Klas W. and Schrefl M., "Metaclasses and their Application. Data Model Tailoring and Database Integration", Lecture Notes in Computer Science 943, Springer, 1995.
14. The Unified Modelling Language (UML) Specification. URL: http://www.omg.org/technology/uml/
15. Kiczales G., "Metaobject Protocols: Why We Want Them and What Else Can They Do?", chapter in Object-Oriented Programming: The CLOS Perspective, pp. 101–118, MIT Press, 1993.
16. Gamma E., Helm R., Johnson R. and Vlissides J., "Design Patterns: Elements of Reusable Object-Oriented Software", Addison-Wesley, 1995.
17. Yoder J., Balaguer F. and Johnson R., "Architecture and Design of Adaptive Object-Models", Proc. of OOPSLA 2001, Intriguing Technology Talk, Tampa, Florida, October 2001.
18. OMG Publications, "Model Driven Architectures – The Architecture of Choice for a Changing World". See http://www.omg.org/mda/index.htm
19. BPML: Business Process Modeling Language. See http://www.bpmi.org/. BPEL4W: Business Process Execution Language for Web Services. See http://www.ibm.com/developerworks
20. RM-ODP: A Reference Model for Open Distributed Processing. See http://www.dstc.edu.au/Research/Projects/ODP/ref_model.html
21. Common Warehouse Metamodel & Meta-Object Facility. See http://www.omg.org/technology/cwm/
22. Foote B. and Yoder J., "Meta-data and Active Object-Models", Proc. of the Int. Conference on Pattern Languages Of Programs, Monticello, Illinois, USA, August 1998.
The Benefits of Rapid Modelling for E-business System Development

Juan C. Augusto, Carla Ferreira, Andy M. Gravell, Michael A. Leuschel, and Karen M.Y. Ng

Department of Electronics and Computer Science, University of Southampton
{jca,cf,amg,mal,myn00r}@ecs.soton.ac.uk
Abstract. There are considerable difficulties modelling new business processes. One approach is to adapt existing models, but this leads to the difficult problem of maintaining consistency between model and code. This work reports an investigation into creating quick models that are nonetheless useful in providing insight into proposed designs.
1 Introduction
There are considerable difficulties modelling new business processes. One approach is to adapt existing models, but this leads to the difficult problem of maintaining consistency between model and code. In eXtreme Programming [Bec00], for example, we are advised to "travel light" – questions are answered by examining the code rather than trusting written designs which may be out of date. This work reports an investigation into creating quick "throw away" models that are nonetheless useful in providing insight into proposed designs. These models are not merely pictures, but can be "executed" through animation/simulation, and can be comprehensively checked, at least for specific configurations, by model checking. This answers the criticism that pictures "can't give concrete feedback" [Bec00]. In sections 2 and 3 we first provide a brief description of two of the modelling frameworks we considered, Promela/SPIN and CSP(LP)/CIA. Then in section 4 the case study that we used as the basis for our experiment is introduced. In section 5 we give some details about the modelling approach we followed when using Promela and CSP(LP) as modelling languages. Section 6 explains how tool-assisted development can provide the basis for rapid modelling with important expected benefits, and section 7 explains the extent to which we experienced those advantages while applying the above-mentioned tools to our case study. Later, in section 8, we consider a more specific modelling language for business modelling, called StAC. An analysis of work in progress is given in section 9, while an account of some lessons learnt and the final conclusions are provided in section 10.
2 Promela/SPIN
SPIN [Hol97] has been a particularly successful tool that has been widely adopted to perform automatic verification of software specifications. SPIN offers both simulation and verification. Through these two modalities the verifier can check for the absence of deadlocks and unexecutable code, check the correctness of system invariants, find non-progress execution cycles, and verify correctness properties expressed as propositional linear temporal logic formulae. Promela is the specification language of SPIN. It is a C-like language enriched with a set of primitives allowing the creation and synchronization of processes, including the possibility to use both synchronous and asynchronous communication channels. We refer the reader to the extensive literature on the subject as well as the documentation of the system at the Bell Labs web site for more details: http://netlib.bell-labs.com/netlib/spin/whatispin.html. We assume some degree of familiarity with this framework from now on.
3 CSP(LP)/CIA
CSP(LP) [Leu01] unifies CSP [Hoa85] with concurrent (constraint) logic programming. Elementary CSP, without datatypes, functions, or other advanced operators, was extended in CSP-FDR [Ros99] to incorporate these features, which we want for modelling business systems. Some of the remaining limitations on pattern matching were overcome in CSP(LP) (see [Leu01], Section 2.2, for a more detailed account). A basic introduction to CSP(LP) syntax [Leu01] follows:

Operator                  Syntax                   ASCII Syntax
stop                      STOP                     STOP
skip                      SKIP                     SKIP
prefix                    a → P                    a->P
conditional prefix        a?x : x > 1 → P          a?x:x>1->P
external choice           P □ Q                    P [] Q
internal choice           P ⊓ Q                    P || Q
interleaving              P ||| Q                  P ||| Q
parallel composition      P [|A|] Q                P [| A |] Q
sequential composition    P ; Q                    P ->> Q
hiding                    P \ A                    P \\ A
renaming                  P [R]                    P [[ R ]]
timeout                   P ▷ Q                    P [> Q
interrupt                 P △ Q                    P / Q
if then else              if t then P else Q       if T then P else Q
let expressions           let v = e in P           let V=E in P
agent definition          A = P                    A = P;
The CSP(LP) Interpreter and Animator, CIA [LAB+01], can be used to animate and detect deadlocks in a CSP(LP) specification.
4 The Travel Agency Case Study
As an example of an e-business system involving collaboration across organizations, consider a travel agent [LR00]. A travel agent gets requests from users who log into the travel agency system using a browser. After selecting an operation (book or unbook) and a service (car or room), the operation is submitted to the travel agent. The travel agent decides which service provider (Car Rental or Hotel) to contact on the basis of previous requests made by the user. The request is passed on to one of the service providers, which decides whether or not the operation can be accomplished. For example, it could be that the user requests to book a service that is not available, or to unbook a service that was not previously booked for him/her. The shop contacts the travel agent to make explicit whether or not the operation was successful, and the travel agent passes this information on to the user. If the operation was successful, the shop and the travel agent keep records of it in their own databases. A sketch of a typical session can be seen as an appendix in [AFG+03]. We have built a prototype of this system using J2EE technology to experiment with the expected functionality and to uncover the basic operations and communications demanded by such e-business systems. In addition, we built different models to experiment with different modelling languages and different tools, and to compare the support they offer to a development team.
5 Modelling Approaches
Our models are in widely used notations that have defined semantics and tool support. These notations are capable of dealing with notions essential for e-business applications, like concurrency and synchronous/asynchronous message passing. These frameworks allow the creation of simple and abstract models that can be simulated and rigorously checked. Due to space constraints we cannot offer complete models, but we provide a brief description of them in Appendixes A and B to give the reader a flavor of what they look like. The complete, fully documented models can be seen in the appendixes given in [AFG+03]. Next we provide a sketch of the basic structures we need and the functionality we expect from each major part of the system. Communication between the user, the travel agent, and the shops in the prototype is accomplished through sessions and the underlying web connectivity message system. In our models this was modelled via synchronous channels. We considered (1) a channel to pass requests from the user to the travel agent, (2) channels to pass requests from the travel agent to each shop, and (3) channels to get feedback from the shops about whether or not the operation was successful. Another important aspect has to do with the side effects of the interactions in the system. For example, as a result of a successful operation each shop will have to register a change in its database to remember that a resource was taken or released, so we need in the models some structures to mimic the databases implemented in the prototype using JDBC technology. The travel agent has
its own database, where all the operations are recorded, and its content has to be consistent with all the shops' databases, except for the intermediate state where a record has been made in a shop database but has not yet been transferred to the travel agent database. But, because the communication is assumed to be synchronous, that transfer will eventually occur, and because decisions in the system are based only on the shops' database contents, this does not cause any harm in the system. Of course, the travel agent will know that if a request has not been answered then the information cannot be considered an up-to-date account of the system.
6 Checking Techniques
After running this experiment we were able to collect some interesting experiences. At a higher level, we can say that by building the models we were forced to revise and double-check the relationships between all the important parts of the system. While the prototype involved several weeks of work from a team of three programmers, each model took about one and a half weeks of effort for one person. This suggests that a realistic expectation is that modelling a system is about four times quicker than prototyping it. In all cases the people involved had the same level of expertise required to use the necessary technology during both the prototyping and the modelling stages of the development. We do not, of course, propose that developers should construct multiple models; we did so ourselves only to compare notations and tools. Both tools assisted the modelling stage with syntax and type checking, basic model checking (e.g. infinite loops and deadlock detection), and animation facilities. After no more basic errors were found, some simulations were carried out to compare the behavior of the model with the behavior of the prototype and the one expected from the system. By building these models of the system we have been able to check behavioral properties that allowed us to pinpoint some interesting aspects of the system:

Example 1 (Credit card loop). Part of the user interaction with the system involves providing an authorized credit card brand. The initial prototype allowed users an unbounded number of attempts to provide their credit card brand. Both tools, SPIN and CIA, allowed us to detect that.

Example 2 (Deadlocks). Communication between user, travel agent and shops was implemented via synchronous channels. During the construction of the model, the interaction of the different processes was very important for detecting how interdependent the different parts of the system were. This was especially well supported in SPIN, where there is a graphical interface focused on channel communication.

Example 3 (Detecting subtler errors). An error was introduced on purpose during the construction of the prototype, to see whether we would be able to detect it at modelling time. The error is related to the strategy that the travel agency has for handling second reservations. This strategy was left unfinished, so that when the
travel agency is asked to book a room in a hotel for a second time by the same user, it tries to book the room in the same hotel used for the first booking. If no room is available there, the travel agency will not try to book the room in another hotel; instead it will consider the operation unsuccessful. We were able to detect this potential anomaly during simulation and then confirm it by model checking.
7 Relating Both Modelling Experiments
Some results emerged from this comparison between Promela/SPIN and CSP(LP)/CIA as tools to guide the first stages of modelling:

1. Both demanded almost the same level of knowledge and effort.
2. CSP(LP) is more declarative and hence allows shorter models to be written.
3. Although Promela allows asynchronous channels, CSP(LP) has extra expressiveness due to its logic programming extension (see for example the database implementation provided in Appendix B). The concept of a queue can be implemented, allowing for asynchronous messaging in CSP(LP).
4. SPIN currently offers more support for building the model.
5. Channel handling demands more work in CSP or CSP(LP) specifications, which also has the positive side effect of forcing the user to have a more detailed knowledge of that important side of the system.
6. Trace extraction is currently easier with SPIN.
7. CSP(LP) allows CSP to be complemented with logic programming features, which considerably extends the flexibility of the specification language. Evidence of the importance of this can be seen in [ALBF03], where the flexibility of the input language was a key feature in allowing model checking of a business specification language.

In summary, both tools proved to be very useful for building a simplified version of the system, with a slight advantage for SPIN, which has been developed over more than a decade and consequently offers a better interface and more information to the user. On the other hand, there is no impediment to the CSP(LP)/CIA combination evolving in the same direction.
8 StAC, a More Specific Business Modelling Language
StAC (Structured Activity Compensation) is a language that, in addition to CSP-like operators [Hoa85], offers a set of operators to handle the notion of compensation. In StAC it is possible to associate with an action a set of compensation actions, providing a way to repair an undesired situation. Compensations are expressed as pairs of the form P ÷ Q, meaning that Q is the compensation planned in case the effect of P needs to be compensated at a later stage. As the system evolves, compensations are remembered. If all the activities are successfully accomplished, then the accept operator releases the compensations. If any activity fails, then the reverse operator orders the system to
apply all the recorded compensations for the current scope. In some contexts the failure to accomplish an activity can be so critical that it demands the abortion of a process; that is the role of the early termination operator. Both compensation and termination operators can be bound to a scope of application.

Definition 1. Let A represent an activity, b a boolean condition, P and Q two generic processes, x a variable and X a set of values. Then we can define as follows the set of well-formed formulas in StAC:

Process ::= A                 (activity label)
          | 0                 (skip)
          | b → P             (condition)
          | rec(P)            (recursion)
          | P ; Q             (sequence)
          | P || Q            (parallel)
          | ||x ∈ X . Px      (generalised parallel)
          | P [] Q            (choice)
          | []x ∈ X . Px      (generalised choice)
          | ⊘                 (early termination)
          | {P}               (termination scoping)
          | P ÷ Q             (compensation pair)
          | [P]               (compensation scoping)
          | ⊠                 (reverse)
          | ☑                 (accept)
In the example below, processes written in boldface are intended to be basic activities. Each StAC specification is coupled with a B machine [Abr96] describing the state of the system and its basic activities. Basically, a B machine is composed of a declaration of sets, variables, invariants, an initialisation, and operations over those structures. Each StAC activity in a specification has an associated operation in the corresponding B machine explaining how that activity is implemented in logical terms. We refer the reader who wants a more detailed account of StAC to [CGV+02] and [BF00].
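To give an operational intuition for the compensation operators above, the following minimal Python sketch (our illustration, not part of the StAC tool set or its formal semantics) keeps a compensation task as a stack: each completed primary activity records its compensation, accept discards the recorded compensations, and reverse applies them in reverse order. All names in it are illustrative.

# Illustrative sketch of StAC-style compensation bookkeeping (not the StAC semantics).
class CompensationTask:
    def __init__(self, name):
        self.name = name
        self._recorded = []                 # compensations recorded so far, newest last

    def record(self, compensation):
        # Compensation pair P ÷ Q: after P completes, remember Q.
        self._recorded.append(compensation)

    def accept(self):
        # Accept: the activities become final, so the compensations are released.
        self._recorded.clear()

    def reverse(self):
        # Reverse: apply the recorded compensations in reverse order of recording.
        while self._recorded:
            self._recorded.pop()()

# Example: two activities succeed, then the whole scope is reversed.
log = []
task = CompensationTask("trip")
for service in ["car", "hotel"]:
    log.append("reserve " + service)                           # primary activity
    task.record(lambda s=service: log.append("cancel " + s))   # its compensation
task.reverse()
print(log)   # ['reserve car', 'reserve hotel', 'cancel hotel', 'cancel car']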
8.1 Travel Agency Example
The travel agency example presented in this section extends the previous travel agency example. In this version the user requests a collection of services instead of a single service, and the travel agency then tries to provide all the requested services. In the StAC model we associate a compensation activity with each service reservation, as the recovery mechanism if any reservation fails or the client decides to cancel his/her requests. A trip is arranged by getting an itinerary, followed by verifying the client's credit card; depending on whether the card is accepted or rejected, the reservation is continued or abandoned:

Trip = GetItinerary; VerifyCreditCard;
       (accepted → ContinueReservation [] ¬accepted → clearItinerary)

Getting an itinerary involves repeatedly offering the client the choice of selecting a car or a hotel, "until" (an operator defined by using recursion [Fer03]) EndSelection is invoked:

GetItinerary = (SelectCar [] SelectHotel) until EndSelection
ContinueReservation starts by making the reservations on the client's itinerary. If some of the reservations failed, the client is contacted; otherwise, the process ends. The car and hotel reservations are made concurrently.

ContinueReservation = MakeReservations;
                      (okReservations → EndTrip [] ¬okReservations → ContactClient)
MakeReservations = CarReservations || HotelReservations
CarReservations = ||c ∈ CAR . CarReservation(c)
HotelReservations = ||h ∈ HOTEL . HotelReservation(h)

The CarReservation process reserves a single car using the ReserveCar activity. The travel agency uses two compensation tasks: compensation task S, representing compensation for reservations that have been booked successfully, and compensation task F, representing compensation for reservations that have failed. The choice of which task to add the compensation to is determined by the outcome of the ReserveCar activity. Since we use two compensation tasks, instead of having a compensation pair we have a compensation triple, with a primary process P and two compensations Q1 and Q2. We model this triple with a construction of the form:

P; ((c → (null ÷1 Q1)) [] (¬c → (null ÷2 Q2)))

If P makes c true, this is equivalent to P ÷1 Q1, with Q1 being added to compensation task 1. If P makes c false, this is equivalent to P ÷2 Q2, with Q2 being added to compensation task 2. With this construction it is possible to organize the compensation information into several compensation tasks, where each of those tasks can later be reversed or accepted independently. All the car reservations are made concurrently. The car reservation and its compensations are defined as follows:

CarReservation(c) = ReserveCar(c);
                    ((carIsReserved(c) → (null ÷S (CancelCar(c) || RemoveCar(c))))
                 [] (¬carIsReserved(c) → (null ÷F RemoveCar(c))))

The RemoveCar activity removes car c from the client's itinerary, while the CancelCar activity cancels the reservation of car c with the car rental. If the activity ReserveCar is successful, then to compensate it one has to cancel the reservation with the car rental and also remove that car from the client's itinerary. Otherwise, if the car reservation fails, it is only necessary to remove the car from the client's itinerary in order to compensate; it is not necessary to cancel the car reservation. The hotel reservations are defined similarly and are omitted here. The ContactClient process is called if some reservations failed. In this process the client is offered the choice between continuing or quitting:

ContactClient = (Continue; ⊠F; GetItinerary; ContinueReservation)
             [] (Quit; (⊠S ⊠F))

In the case that the client decides to continue, reverse is invoked on compensation task F, the failed reservations. This has the effect of removing all failed
reservations from the client's itinerary. Compensation task S is preserved, as the successful reservations may need to be compensated at a later stage. The client continues by adding more items to the itinerary, which are then reserved. In the case that the client decides to quit, reversal is invoked on both compensation tasks. This has the effect of removing all reservations from the client's itinerary and cancelling all successful reservations. Finally, a successful trip reservation is ended by accepting both compensation tasks: EndTrip = ☑S ☑F.
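The indexed compensation tasks S and F used above can be mimicked in the same illustrative Python sketch given earlier by keeping one CompensationTask object per index: reversing F alone corresponds to the Continue branch, and accepting both corresponds to EndTrip. Again, all names and data here are ours, not taken from the StAC model.

# Two compensation tasks, in the spirit of ÷S and ÷F above (illustrative only).
S = CompensationTask("successful")    # compensations for reservations that succeeded
F = CompensationTask("failed")        # compensations for reservations that failed
itinerary = []

def reserve(item, available):
    itinerary.append(item)
    if available:
        def undo(i=item):
            print("cancel", i)        # CancelCar-style activity
            itinerary.remove(i)       # RemoveCar-style activity
        S.record(undo)
    else:
        F.record(lambda i=item: itinerary.remove(i))

reserve("car-1", available=True)
reserve("hotel-9", available=False)
F.reverse()                 # Continue branch: drop only the failed reservations
S.accept(); F.accept()      # EndTrip: accept both compensation tasks
print(itinerary)            # ['car-1']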
8.2 Executable Semantics
One benefit of using StAC is that it captures advanced aspects of the system, such as compensation, that could not be captured with Promela (see [ABF03]) or CSP. Modelling with StAC focuses on the higher levels of the system; any of the previously considered languages can be a good complement, used to model some of the more low-level features of the system such as the communication between processes. During modelling we have used an animator for StAC processes [LAB+01] based on the CSP(LP) animator described in [Leu01]. At the moment it supports step-by-step animation and backtracking of StAC processes, and it can also detect deadlocks. Animation has helped in the verification of the travel agency: just by comparing the animation execution traces with the expected behavior of the specification, several errors were found:

1. There is a potential infinite loop if any of the services requested by the client fails. In this case the client can then start choosing a new itinerary that may again lead to some of his/her requested services failing.
2. The use of two independent compensation threads for the successful and failed reservations involves a complex notation that is difficult to understand. This is overcome by animating the model, because the user can observe the evolution of the compensation threads.
3. The initial StAC model did not have the EndTrip process, but the animation showed that without EndTrip the compensation information would still be available after the client's logout.
9 Future Work
The XTL model checker allows the user to model check a wide range of system specifications (see for example [LM99] and [LM02]), the only requirement being that the specification is given by means of high-level Prolog predicates describing how the system makes transitions between its different states. In this section we describe some basic aspects of XTL and exemplify how to use it to model check StAC specifications. XTL has been implemented using XSB Prolog (http://xsb.sourceforge.net/). Expressiveness and performance indicators are very encouraging for XTL, in the sense that it has been able to model check case studies where other tools like SPIN failed, and has solved problems at similar performance levels. Some domains where XTL has been applied successfully are CSP and
B [LAB+01], Petri nets [LM02] and StAC [ALBF03]. The second phase of this research involves model checking both models by contrasting them against behavioral properties expressed in a formal language: LTL (Linear Temporal Logic) for SPIN and CTL (Computation Tree Logic) for XTL. Some properties have been checked using SPIN, and the next step will be to check equivalent or closely related properties in XTL. The comparison also highlights that part of SPIN's success derives from a nice interface, which can be profitable even for non-experts in model checking. Some of these services are available in the animators for CSP(LP) and for StAC, while the others could be added relatively easily.
10 Conclusions
We conducted an experiment of modelling a prototype using different languages that have tool support available. We considered Promela/SPIN and CSP(LP)/CIA, which share many features, as well as more specific modelling languages like StAC. We left several details out of the models: e.g., all the web-based communication was replaced by synchronous channels, the relation sessions/logins was simplified to a userID, and the communication with the travel agency was simplified to a request and a response when in reality it is a two-step dialogue. The models can be expanded in any of those directions as needed. A quick summary of our experience follows. It is also worth mentioning that we have applied these methodologies to other e-business related case studies: order fulfillment, e-bookstore and mortgage broker.

Benefits of animation/simulation include a) demonstrating the flow of information through the system, b) exploring the interaction between components, and c) extraction of traces that could be used for generating test cases. In general, however, the animations produced by these tools are not of sufficient visual quality to be useful in end-user or customer demonstrations. Benefits of model checking include a) easy discovery of concurrency flaws, e.g. deadlock, b) in-depth understanding of protocols (process/object interactions), and c) discovery of invariants (database consistency constraints). By comparison, benefits of prototyping include a) more realistic user interfaces, b) evolution of a class structure that, we believe, would closely approximate that of the actual implementation, and c) the opportunity to gain knowledge of the actual implementation technologies.

Still, our experiments show that rapid modelling is possible (one or two weeks to develop a model, about four times faster than prototyping). Mature notations and tools such as Promela/SPIN provide better automated support for modelling, animation, and model checking. However, the higher-level constructs in CSP(LP) allow more faithful modelling of, for example, database tables. Tool support for this notation is sufficiently mature to provide useful insight, but further improvements would be welcomed. Finally, application-specific notations such as StAC allow the most rapid modelling of all. Given that long-running transactions are likely to be the basis of future e-business systems, we believe that it is worthwhile to further develop such notations and tools to support them.
References

[ABF03] Juan C. Augusto, Michael Butler, and Carla Ferreira. Using SPIN and STeP to verify StAC specifications. In Proceedings of PSI'03, 5th International A.P. Ershov Conference on Perspectives of System Informatics (to be published), Novosibirsk (Russia), 2003.
[Abr96] J. Abrial. The B-Book: Assigning Programs to Meanings. Cambridge University Press, 1996.
[AFG+03] J. Augusto, Carla Ferreira, Andy Gravell, Michael Leuschel, and Karen M. Y. Ng. Exploring different approaches to modelling in enterprise information systems. Technical report, Electronics and Computer Science Department, University of Southampton, 2003. http://www.ecs.soton.ac.uk/~jca/rm.pdf.
[ALBF03] Juan C. Augusto, Michael Leuschel, Michael Butler, and Carla Ferreira. Using the extensible model checker XTL to verify StAC business specifications. In Pre-proceedings of the 3rd Workshop on Automated Verification of Critical Systems (AVoCS 2003), Southampton (UK), pages 253–266, 2003.
[Bec00] Kent Beck. Extreme Programming Explained. Addison-Wesley, 2000.
[BF00] M. Butler and C. Ferreira. A process compensation language. In IFM'2000 – Integrated Formal Methods, volume 1945 of LNCS, pages 61–76. Springer-Verlag, 2000.
[CGV+02] M. Chessell, C. Griffin, D. Vines, M. Butler, C. Ferreira, and P. Henderson. Extending the concept of transaction compensation. IBM Journal of Systems and Development, 41(4):743–758, 2002.
[Fer03] C. Ferreira. Precise modelling of business processes with compensation. PhD Thesis (submitted), Electronics and Computer Science Department, University of Southampton, 2003.
[Hoa85] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, 1985.
[Hol97] Gerard Holzmann. The SPIN model checker. IEEE Transactions on Software Engineering, 23(5):279–295, 1997.
[LAB+01] M. Leuschel, L. Adhianto, M. Butler, C. Ferreira, and L. Mikhailov. Animation and model checking of CSP and B using Prolog technology. In Proceedings of the ACM Sigplan Workshop on Verification and Computational Logic, VCL'2001, pages 97–109, 2001.
[Leu01] M. Leuschel. Design and implementation of the high-level specification language CSP(LP) in Prolog. In Proceedings of PADL'01 (I. V. Ramakrishnan, editor), LNCS 1990, pages 14–28. Springer-Verlag, 2001.
[LM99] M. Leuschel and T. Massart. Infinite state model checking by abstract interpretation and program specialisation. In Proceedings of Logic-Based Program Synthesis and Transformation (LOPSTR'99) (Annalisa Bossi, editor), Venice, Italy, LNCS 1817, pages 63–82, 1999.
[LM02] Michael Leuschel and Thierry Massart. Logic programming and partial deduction for the verification of reactive systems: An experimental evaluation. In Proceedings of the 2nd Workshop on Automated Verification of Critical Systems (AVoCS'02), Birmingham (UK), pages 143–150, 2002.
[LR00] F. Leymann and D. Roller. Production Workflow: Concepts and Techniques. Prentice Hall PTR, 2000.
[Ros99] A. W. Roscoe. The Theory and Practice of Concurrency. Prentice-Hall, 1999.
A Fragment of Promela Model
/* channels for communication between processes */
chan ch_ta = [queue_length] of {bit, byte, bit};
chan ch_car00 = [queue_length] of {bit, byte};
...
chan ch_car00_2_ta = [queue_length] of {bit, bit};
...
/* databases */
byte cars00[resources_length];
byte cars01[resources_length];
byte rooms10[resources_length];
byte rooms11[resources_length];
DBrecord taDB[ta_records];

proctype user()
{
  i = 0;
  do  /* repeat choices */
  :: (i < logins) ->
       if  /* choose a user ID in {1..3} */
       :: loginID = 1
       :: loginID = 2
       :: loginID = 3
       fi;
       i++;
       checkCreditCard(ccbit1, ccbit2);
       if
       :: correctCreditCard ->
            if
            :: ch_ta!0, loginID, 0   /* unbook a car  */
            :: ch_ta!1, loginID, 0   /* book a car    */
            :: ch_ta!0, loginID, 1   /* unbook a room */
            :: ch_ta!1, loginID, 1   /* book a room   */
            fi
       :: else -> atomic{ printf("Incorrect credit card !!"); }
       fi
  :: (i >= logins) -> break
  od
}
...
proctype ta()
{
end:
  do
  :: ch_ta?0, userid, 0 -> ch_car00!0, userid; CUnbooking(0, 0)
  :: ch_ta?1, userid, 0 -> ch_car00!1, userid; CBooking(0, 0)
  :: ch_ta?0, userid, 0 -> ch_car01!0, userid; CUnbooking(0, 1)
  :: ch_ta?1, userid, 0 -> ch_car01!1, userid; CBooking(0, 1)
  ... (idem for Hotels)
  od
}

init {
  run user(); run ta();
  run car00(); run car01();
  run hotel10(); run hotel11()
}
B Fragment of CSP(LP) Model
agent User(integer) : {tadb, h11db};
User(_logins) =
  if (_logins > 5) then STOP
  else (   (CheckCreditCard(1, _logins))
        [] (CheckCreditCard(2, _logins))
        [] (CheckCreditCard(3, _logins)) );
...
agent TA : {ch_ta, ch_car00, ch_car01, ch_hotel10, ch_hotel11};
TA = ch_ta?0?_userID?0 ->
     (   ((ch_car00!0!_userID -> SKIP) [| {ch_car00} |] CarRental00)
      [] ((ch_car01!0!_userID -> SKIP) [| {ch_car01} |] CarRental01) );
TA = ch_ta?1?_userID?0 ->
     (   ((ch_car00!1!_userID -> SKIP) [| {ch_car00} |] CarRental00)
      [] ((ch_car01!1!_userID -> SKIP) [| {ch_car01} |] CarRental01) );
... (idem for Hotels)

-- Travel Agent database
agent TADB(multiset) : {tadb};
TADB(nil)    = tadb!empty -> TADB(nil);
TADB(_State) = tadb?member._x : (_x in _State) -> TADB(_State);
TADB(_State) = tadb?add?_x -> TADB(cons(_x,_State));
TADB(_State) = tadb?rem?_x : _x in _State -> TADB(rem(_State,_x));
TADB(_State) = tadb?nexists?_x : not(_x in _State) -> TADB(_State);

agent MAIN : {};
MAIN = (TADB(nil) [| {tadb} |]
         (C00DB(nil) [| {c00db} |]
           (C01DB(nil) [| {c01db} |]
             (H10DB(nil) [| {h10db} |]
               (H11DB(nil) [| {h11db} |] User(1) )))));
Prediction of Consumer Preference through Bayesian Classification and Generating Profile

Su-Jeong Ko

Department of Computer Science, University of Illinois at Urbana-Champaign
1304 West Springfield Ave., Urbana, Illinois 61801, U.S.A.
[email protected]
Abstract. Collaborative filtering systems overlook the fact that most consumers do not rate their preferences; because of this the consumer-product matrix is very sparse. A memory-based filtering system has storage problems and hence proves inefficient when applied on a large scale, where tens of thousands of consumers and thousands of products are represented in the matrix. Clustering consumers into groups based on the web documents they have retrieved allows accurate recommendation of new web documents by addressing the sparsity problem. A variety of algorithms have previously been reported in the literature and their promising performance has been evaluated empirically. We identify the shortcomings of current algorithms for clustering consumers and propose the use of a Naïve Bayes classifier to classify consumers into groups. To classify consumers into groups, this paper uses an association word mining method with weighted words that reflects not only the preference ratings of products but also information about them. The data expressed by the mined features are not expressed as a string of words but as an association word vector. A collaborative consumer profile is then generated based on the extracted features, and the Naïve Bayes classifier classifies consumers into groups based on the association words in the collaborative consumer profile. As a result, the dimension of the consumer-product matrix is decreased. We evaluate our method on a database of consumer ratings for computer-related web documents and show that it significantly outperforms previously proposed methods.
1 Introduction

As the Web and its related technologies have developed, various kinds of information are broadcast through the Web, but the retrieval tools on the Web often supply useless information, so consumers need to consider how to get their target information efficiently. If the problem is that the consumer is swamped by too much information, the solution seems to lie in developing better tools to filter the information so that only interesting, relevant information gets through to the consumer [6]. Many present filtering systems are based on building a consumer profile [12]. These systems attempt to extract patterns from the observed behavior of the consumer to predict which products would be selected or rejected. However, these systems all suffer from a "cold-start" problem: new consumers start off with nothing in their profile and must train a profile from scratch. A collaborative filtering system
overlooks the fact that most consumers do not rate their preferences; because of this the consumer-product matrix is very sparse [7]. A memory-based filtering system has storage problems and hence proves inefficient when applied on a large scale, where tens of thousands of consumers and thousands of products are represented in the matrix [16]. Clustering consumers into groups based on the web documents they have retrieved allows accurate recommendation of new web documents by addressing the sparsity problem [3,4]. A variety of algorithms have previously been reported in the literature and their promising performance has been evaluated empirically [13,17]. EM is an obvious method for grouping consumers, but it does not work well here because it cannot be efficiently constructed to recognize the constraint that web documents two consumers like must be in the same class each time. K-means clustering is fast but ad hoc. Gibbs sampling works well and has the virtue of being easily extended to much more complex models, but it is computationally expensive. We identify the shortcomings of current algorithms for clustering consumers and propose the use of a Naïve Bayes classifier to classify consumers into groups. To classify consumers into groups, this paper uses an association word mining method with weighted words that reflects not only the preference ratings of products but also information about them. The data expressed by the mined features are not expressed as a string of words but as an association word vector. A collaborative consumer profile is then generated based on the extracted features, and the Naïve Bayes classifier classifies consumers into groups based on the association words in the collaborative consumer profile. As a result, the dimension of the consumer-product matrix is decreased. The proposed method is tested on a database of consumer-evaluated web documents, and the test results demonstrate that the proposed method is more effective than previous methods for recommendation.
2 Expression of Document Features

In this paper, we use a more effective feature extraction method, applying association word mining [8], to express the features of a document as a bag of associated words rather than as a bag of words [14]. The association word mining method, using the Apriori algorithm [1,2], represents the features of a document not as single words but as association-word vectors. Since the feature extraction method using association word mining does not use a profile, it need not update a profile, and it automatically generates noun phrases by using confidence and support in the Apriori algorithm, without calculating a probability for each index term. Besides, since this method represents a document as a set of association words, it prevents consumers from being confused by word sense ambiguity and thus has the advantage of representing a document in detail. However, because this feature extraction method is based on word sets of association words, it can erroneously judge different documents to be identical, which decreases the accuracy of document classification. In addition, when a new document is inserted into the database, the database has to be updated each time. This paper therefore proposes a method of giving a weight to each word that belongs to an association word by using TF•IDF [11]. TF•IDF is defined to be the weight of a word in a document. We select the word that has the largest TF•IDF in each
association word. Both the association word and its typical word are selected as features, which solves the problem caused by using only association words. The Apriori algorithm is used to mine associated data from the words extracted by morphological analysis; it finds the associative rules of products out of a set of transactions. The mined data, i.e. the set of associated words for each document, are represented as an association-word vector. As a result, documents are represented as in Table 1.

Table 1. An example of features extracted from web documents

Web document   Features
document1      game&participation&popularity, operation&selection&match, game&rank&name, user&access&event
document2      data&program&music, figure&data&program, game&explanation&provision, game&utilization&technology
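As a rough illustration of this mining step (our own simplification, not the Apriori implementation of [1,2]), the Python sketch below keeps co-occurring noun sets as association words when their support and confidence exceed chosen thresholds; morphological analysis and the full candidate-generation strategy of Apriori are omitted.

from itertools import combinations

def association_words(transactions, min_support=0.5, min_confidence=0.8):
    # transactions: lists of nouns extracted from a document (e.g. one list per sentence).
    sets = [set(t) for t in transactions]
    n = len(sets)

    def support(itemset):
        return sum(1 for s in sets if itemset <= s) / n

    vocabulary = sorted(set().union(*sets))
    mined = []
    for size in (2, 3):                               # word pairs and triples only, for brevity
        for combo in combinations(vocabulary, size):
            sup = support(set(combo))
            if sup < min_support:
                continue
            conf = sup / support(set(combo[:-1]))     # confidence of {rest} -> {last word}
            if conf >= min_confidence:
                mined.append("&".join(combo))
    return mined

docs = [["game", "rank", "name"], ["game", "rank", "user"], ["game", "rank", "name"]]
print(association_words(docs))   # e.g. ['game&rank', 'name&rank', 'game&name&rank']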
The words that belong to an association word in Table 1 are weighted using TF•IDF. First, feature selection using TF•IDF performs a morphological analysis of the document to extract its features, and then extracts only the nouns from the outcome. The TF•IDF of each extracted noun can be obtained through Equation (1) [15]:

W_nk = f_nk · [ log2( n / DF ) + 1 ]    (1)

where f_nk is the relative frequency of word nk against all words within the document, n is the number of study documents, and DF is the number of training documents in which word nk appears. Only the higher-frequency words are extracted as features, by ranking them from higher TF•IDF to lower. If the feature set of a test document D is {n1, n2, …, nk, …, nm}, it is compared with the words that belong to the association words in Table 1. As a result, the words that belong to an association word are weighted by TF•IDF, and the word with the highest weight is selected as the typical word of the association word. For example, if the typical word of (data&program&music) in Table 1 is 'data', the association word is written as (data&program&music) with its typical word in italics. Equation (2) defines the features of a document dj that is composed of p association words:
dj = { AWj1, AWj2, …, AWjk, …, AWjp }    (2)
In Equation (2), each of AWj1, AWj2, …, AWjk, …, AWjp denotes an association word extracted from document dj. For the best results in extracting the association words, the data must have a confidence of over 85 and a support of less than 20 [8].
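A small sketch of Equations (1) and (2) follows (function and variable names are ours); it computes the TF•IDF weight of each noun of a document and marks, for every association word, the member with the largest weight as its typical word.

import math

def tfidf_weights(doc_nouns, training_docs):
    # Equation (1): W_nk = f_nk * (log2(n / DF_nk) + 1)
    n = len(training_docs)
    total = len(doc_nouns)
    weights = {}
    for noun in set(doc_nouns):
        f = doc_nouns.count(noun) / total                      # relative frequency f_nk
        df = sum(1 for d in training_docs if noun in d) or 1   # document frequency DF
        weights[noun] = f * (math.log2(n / df) + 1)
    return weights

def document_features(association_words, weights):
    # Equation (2): the document is the list of its association words,
    # each annotated with its typical word (largest TF-IDF among its members).
    features = []
    for aw in association_words:
        members = aw.split("&")
        typical = max(members, key=lambda w: weights.get(w, 0.0))
        features.append((aw, typical))
    return features

training = [["data", "program", "music"], ["figure", "data", "program"], ["game", "rank"]]
weights = tfidf_weights(["data", "program", "music", "data"], training)
print(document_features(["data&program&music"], weights))   # [('data&program&music', 'data')]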
3 Collaborative Consumer Profile

A collaborative filtering system based on web documents recommends documents to consumers according to a {consumer-product} matrix. Consumers in collaborative filtering systems do not rate their preference for every document, so missing values occur in the {consumer-product} matrix, and these missing values cause its sparsity. This section describes the generation of collaborative consumer profiles, which reduces the sparsity of the {consumer-product} matrix caused by the missing values.

3.1 The Composition of the {consumer-product} Matrix

Given m products, each composed of p feature vectors, and a group of n consumers, the consumer group is expressed as U = {cui} (i = 1,2,…,n) and the document group as I = {dj} (j = 1,2,…,m). We call the consumers in the collaborative filtering database 'collaborative consumers'. R = {rij} (i = 1,2,…,n; j = 1,2,…,m) is the {consumer-product} matrix; its element rij denotes consumer cui's preference for document dj. Table 2 shows the {consumer-product} matrix of the collaborative filtering system.

Table 2. {consumer-product} matrix in the collaborative filtering system
       d1    d2    d3    d4    …    dj    …    dm
cu1    r11   r12   r13   r14   …    r1j   …    r1m
cu2    r21   r22   r23   r24   …    r2j   …    r2m
cui    ri1   ri2   ri3   ri4   …    rij   …    rim
…      …     …     …     …     …    …     …    …
cun    rn1   rn2   rn3   rn4   …    rnj   …    rnm
The collaborative filtering system uses the preference ratings that consumers give to web pages. Preference levels are represented on a scale from 0 to 1.0 in increments of 0.2, a total of six degrees; only when the value is higher than 0.5 is the consumer classified as showing interest. The web documents used in this paper are computer-related documents gleaned by an HTTP downloader, and their features are extracted by the association word mining described in Section 2. The element rij of Table 2 is defined by Equation (3); that is, rij takes one of the six degrees, or is missing when no evaluation was made:
rij ∈ {φ, 0, 0.2, 0.4, 0.6, 0.8, 1}    (i = 1,2,…,n; j = 1,2,…,m)    (3)
In Equation (3), φ means that collaborative filtering consumer i has not rated document j. Table 3 shows consumers' preferences for web documents in the collaborative filtering system; the features of the documents are composed of the association words produced by the method described in Section 2. In Table 3, '?' means that automatic preference rating is required.
Table 3. Consumers' preference ratings on web documents expressed in features

       d1    …    dj    …    dm
cu1    0.2   …    1     …    0.4
cu2    ?     …    0.8   …    0.6
cu3    0.4   …    0.6   …    ?
cun    0.4   …    ?     …    ?
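For concreteness, the rating matrix of Equation (3) and Table 3 can be held as a sparse mapping, with None playing the role of φ (the '?' entries that the system has to predict). This is only an illustrative data-structure sketch, not code from the system described here.

SCALE = (0, 0.2, 0.4, 0.6, 0.8, 1.0)       # the six preference degrees of Equation (3)

ratings = {                                 # sparse {consumer-product} matrix
    ("cu1", "d1"): 0.2, ("cu1", "dj"): 1.0, ("cu1", "dm"): 0.4,
    ("cu3", "d1"): 0.4, ("cu3", "dj"): 0.6,
}

def preference(consumer, document):
    return ratings.get((consumer, document))    # None stands for the missing value φ

def shows_interest(consumer, document):
    r = preference(consumer, document)
    return r is not None and r > 0.5            # interest threshold used in the text

print(preference("cu2", "d1"), shows_interest("cu1", "dj"))   # None True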
3.2 Generating the Collaborative Consumer Profile

The profile of a collaborative filtering consumer cui is generated from the document features. When a collaborative consumer gives a low preference rating, the rated document receives a low weight; when the rating is high, the weight is high. Therefore, the association words that make up the features receive various preference values according to this weight. When a collaborative consumer cui gives the preference rating rij to document dj, each association word extracted from document dj is weighted with rij. The weight of association word AWijk is denoted c_wijk, where AWijk is the k-th association word from Equation (2) for the document dj rated by consumer cui. Equation (4) defines the initial weights of the AWijk, the structural elements used to generate consumer cui's profile. The initial weight c_wijk of an association word AWijk is defined as the initial preference, i.e. the corresponding element of the {consumer-product} matrix; the preference that a consumer rates directly is the most correct and important data for automatic preference rating.
c_wijk = Preference(AWijk) = rij    (consumer: cui, 1 ≤ j ≤ m, 1 ≤ k ≤ p)    (4)
Table 4 shows in detail how the initial weight c_wijk defined by Equation (4) is obtained. In Table 4, documents d1, dj and dm are rated 0.2, 0.8 and 1 by a collaborative consumer cui; the italicised word in an association word is its typical word. Based on Equation (4), the weights of {AWi11, …}, {AWij1, …} and {AWim1, …} are defined to be 0.2, 0.8, and 1, respectively. Although AWij1, AWim2 and AWi1p in Table 4 are the same association word, their initial weights differ (0.2, 0.8 and 1). These different initial weights need to be combined to generate the collaborative consumer profile: after retrieving all association words, the weights of identical association words are multiplied together. Table 5 shows the detailed weighting, with examples based on Table 4. For example, the final weights of {AWi12, AWijk}, c_w'i12 and c_w'ijk, are both c_wi12 × c_wijk because the two association words are the same. Equation (5), based on Table 5, changes a weight according to the occurrences of the association word among all documents rated by the consumer, after the initial weights have been assigned by Equation (4) as in Table 4. All association words extracted from the documents rated by collaborative consumer cui are saved in a database (AWDB). When, after retrieving AWDB, an association word AWijk is found to be the same as another association word AWij'k', c_wijk is multiplied by c_wij'k'. The final weight of association word AWijk is denoted c_w'ijk. In Equation (5), the condition j ≠ j' or k ≠ k' means that comparing an association word with itself (e.g. AW111 = AW111) is excluded from the computation.
c_w'ijk = ∏ { c_wijk · c_wij'k' | AWijk = AWij'k', AWijk, AWij'k' ∈ AWDB, 1 ≤ j, j' ≤ m, 1 ≤ k, k' ≤ p, j ≠ j' or k ≠ k' }    (consumer: cui)    (5)
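The profile construction of Equations (4)–(5) can be summarised by the following sketch (our reading of Table 5: the final weight of an association word is the product of the ratings of all the documents in which it occurs; names and data are illustrative only).

from collections import defaultdict

def collaborative_profile(rated_documents):
    # rated_documents: (preference r_ij, [association words of document d_j]) pairs
    # for one collaborative consumer cu_i.
    initial = defaultdict(list)
    for rating, association_words in rated_documents:
        for aw in association_words:
            initial[aw].append(rating)          # Equation (4): initial weight c_w_ijk = r_ij

    profile = {}
    for aw, weights in initial.items():
        final = 1.0
        for w in weights:                       # Equation (5), as read from Table 5:
            final *= w                          # multiply the weights of identical words
        profile[aw] = final
    return profile

rated = [
    (0.2, ["game&configuration&user&selection", "utilization&technology&development"]),
    (0.8, ["utilization&technology&development", "game&organization&selection&rank"]),
    (1.0, ["utilization&technology&development"]),
]
print(round(collaborative_profile(rated)["utilization&technology&development"], 3))   # 0.16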
Table 6 defines the structure of the collaborative consumer profile CUi based on Table 5 and Equation (5): by the definition in Equation (5), the final weight c_w'ijk is attached to association word AWijk.

Table 4. Giving initial weights for profile generation

Document                    Initial weight    Association word
d1 (preference ri1 = 0.2)   c_wi11 (0.2)      AWi11 game&configuration&user&selection
                            c_wi12 (0.2)      AWi12 interior&newest&technology&installation
                            …                 …
                            c_wi1k (0.2)      AWi1k figure&popularity&service&music
                            …                 …
                            c_wi1p (0.2)      AWi1p utilization&technology&development
dj (preference rij = 0.8)   c_wij1 (0.8)      AWij1 utilization&technology&development
                            c_wij2 (0.8)      AWij2 game&organization&selection&rank
                            …                 …
                            c_wijk (0.8)      AWijk interior&newest&technology&installation
                            …                 …
                            c_wijp (0.8)      AWijp organization&user&rank
dm (preference rim = 1.0)   c_wim1 (1.0)      AWim1 provision&illustration&explanation
                            c_wim2 (1.0)      AWim2 utilization&technology&development
                            …                 …
                            c_wimk (1.0)      AWimk development&rank&sports
                            …                 …
                            c_wimp (1.0)      AWimp figure&data&service&engine
Table 5. The final weight given to association words

Association word          Weight of association word
AWij1, AWim2, AWi1p       c_w'ij1 <= c_wij1 × c_wim2 × c_wi1p
                          c_w'im2 <= c_wij1 × c_wim2 × c_wi1p
                          c_w'i1p <= c_wij1 × c_wim2 × c_wi1p
AWi12, AWijk              c_w'i12 <= c_wi12 × c_wijk
                          c_w'ijk <= c_wi12 × c_wijk
Table 6. The structure of a collaborative consumer profile CUi

Consumer ID                                 Weight    Association word   …   Weight    Association word   …   Weight    Association word
CUi (consumer: cui, 1 ≤ j ≤ m, 1 ≤ k ≤ p)   c_w'i11   AWi11              …   c_w'ijk   AWijk              …   c_w'imp   AWimp
Table 7. Giving the weight to association words in the game class

Association word                                            Estimated value
(1) game&organization&athlete&match&sports&participation   0.093750
(2) domestic&newest&technology&installation                 0.012700
(3) game&participation&popularity&consumer&access           0.100386
(4) operation&selection&match&rank&rule                     0.016878
(5) game&rank&name                                          0.089494
(6) operation&sports&committee&athlete                      0.100386
(7) game&organization&selection&rank                        0.086614
(8) game&schedule&athlete&participation&operation           0.093023
4 Classification Using the Naïve Bayes Classifier

The Naïve Bayes classifier goes through a training phase before it reaches the classification phase [9,10]. In the training phase, an estimated value is assigned to the association words in the collaborative consumer profiles acquired by the Apriori algorithm and TF•IDF. The consumers in the training set are classified into classes, and the association words in the training consumer profiles are sorted and saved in an association word set. Equation (6) assigns the estimated value within class classID of the association word set to the k-th association word AWjk:
P(AWjk | classID) = (nk + 1) / (n + |AWS|)    (6)
where P(AWjk | classID) is the estimated value for the k-th association word in class classID, n is the total number of association words mined from the training collaborative consumers, and nk is the number of those n words that match the given word in the association word set. The classID serves as a label for the classes in the association word set, and |AWS| is the total number of elements in the association word set. The training phase is divided into an accumulative phase and a phase for assigning the estimated values. In the accumulative phase, the number of occurrences of each word from the association word set is recorded. In the estimation phase, the estimated value of each association word is derived by applying Equation (6) to the counts from the accumulative phase. Table 7 shows association words extracted from a training collaborative consumer profile and the estimated values computed using Equation (6); the italicised word is the typical word of each association word. In the classifying phase, the Naïve Bayes classifier uses the association word set with the assigned values to classify a test collaborative consumer into a group. The Apriori algorithm extracts association words from the consumer profile as shown in Table 6. If {c_w'i11 AWi11, …, c_w'ijk AWijk, …, c_w'imp AWimp} represents the features of test consumer cui in Table 6, the Naïve Bayes classifier uses Equation (7) to assign the test consumer to a class:
class = argmax_{classID = 1..N} P(classID) · ∏_{k=1}^{m} c_w'ijk · P(AWijk | classID)    (7)
Of the total number of classes N in Equation (7), test consumer cui is assigned to class. P(AWijk | classID) is the estimated value obtained from Equation (6), P(classID) is the probability that a consumer belongs to class classID, and c_w'ijk is the final weight of association word AWijk in Table 6.
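The training and classification steps of Equations (6) and (7) can be sketched as follows. This is an illustration under our own assumptions: class priors are supplied explicitly, and association words unseen in a class fall back to the Laplace term 1/(n + |AWS|); all names and data are invented for the example.

import math
from collections import Counter

def train(class_words):
    # class_words: {classID: list of association words mined from that class's training consumers}
    model = {}
    for class_id, words in class_words.items():
        counts, n, aws = Counter(words), len(words), len(set(words))
        model[class_id] = {
            "default": 1.0 / (n + aws),                                    # unseen-word fallback
            "probs": {w: (c + 1) / (n + aws) for w, c in counts.items()},  # Equation (6)
        }
    return model

def classify(model, weighted_features, priors):
    # weighted_features: {AW_ijk: c_w'_ijk} taken from the consumer profile (Table 6).
    def score(class_id):                                   # log of Equation (7)
        m = model[class_id]
        s = math.log(priors[class_id])
        for aw, weight in weighted_features.items():
            s += math.log(weight * m["probs"].get(aw, m["default"]))
        return s
    return max(model, key=score)

model = train({"game": ["game&rank&name", "game&rank&name", "operation&sports&committee&athlete"],
               "graphics": ["figure&data&program", "figure&data&program"]})
profile = {"game&rank&name": 0.16, "operation&sports&committee&athlete": 0.08}
print(classify(model, profile, priors={"game": 0.5, "graphics": 0.5}))   # game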
5 Performance Evaluation

The database for collaborative filtering recommendations was created from the data of 200 consumers and 1600 web documents; each consumer evaluated at least 10 of the 1600 web documents. The database for content-based filtering recommendations was created from the 1600 web documents, which were collected from computer-related URLs by an HTTP downloader and then hand-classified into eight areas of computer information: Games, Graphics, News and media, Semiconductors, Security, Internet, Electronic publishing, and Hardware. The basis for this classification comes from search engines such as AltaVista and Yahoo, which have statistically analyzed and classified computer-related web documents. Of the 200 consumers, 100 were used as the training group, and the remaining consumers were used as the test group. In this paper, the Mean Absolute Error (MAE) and the Rank Score Measure (RSM), both suggested by Breese et al. [5], are used to gauge performance. The MAE is used to evaluate single-product recommendation systems, while the RSM is used to evaluate the performance of systems that recommend products as ranked lists. The accuracy in terms of MAE, expressed as Equation (8), is determined by the absolute value of the difference between the predicted and real values of the consumer rating:
s_a = (1 / m_a) · ∑_{j ∈ P_a} | p_{a,j} − v_{a,j} |    (8)
In Equation (8), p_{a,j} is the predicted preference, v_{a,j} the real preference, and m_a the number of products that have been evaluated by the new consumer. The RSM of a product in a ranked list is determined by consumer ratings or consumer visits. The RSM is measured under the premise that the probability of choosing a product lower in the list decreases exponentially. Suppose that the products are put in decreasing order of value j, based on the weight of consumer preference. Equation (9) calculates the expected utility of consumer Ua's ranked product list:
R_a = ∑_j max(v_{a,j} − d, 0) / 2^{(j−1)/(α−1)}    (9)
(9)
Where d is the mid-average value of the product, and is the halflife. The halflife is the number of products in a list that has a 50/50 chance of either review or visit. In the rating phase of this paper, the halflife value of 5 shall be used. In Equation (10), the RSM is used to measure the accuracy of predictions about the new consumer.
R = 100 ×
∑ R ∑ R u
u
u
max u
(10)
In Equation (10), if the consumer has rated or visited a product that ranks highly in a is the maximum expected utility of the RSM. ranked list, R max
u
Prediction of Consumer Preference
37
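The three evaluation measures can be sketched directly from Equations (8)–(10); computing R_u^max as the score of the ideally reordered list is our assumption of the standard reading of [5], and the data below are illustrative.

def mae(predicted, actual):
    # Equation (8): mean absolute error over the products rated by one consumer.
    return sum(abs(predicted[j] - actual[j]) for j in actual) / len(actual)

def rank_score(ranked_ratings, d=0.5, halflife=5):
    # Equation (9): utility of a ranked list; position j is discounted by 2^((j-1)/(halflife-1)).
    return sum(max(v - d, 0) / 2 ** ((j - 1) / (halflife - 1))
               for j, v in enumerate(ranked_ratings, start=1))

def rank_score_measure(per_consumer_lists, d=0.5, halflife=5):
    # Equation (10): 100 * sum_u R_u / sum_u R_u^max.
    r = sum(rank_score(lst, d, halflife) for lst in per_consumer_lists)
    r_max = sum(rank_score(sorted(lst, reverse=True), d, halflife) for lst in per_consumer_lists)
    return 100 * r / r_max

print(round(mae({"d1": 0.8, "d2": 0.4}, {"d1": 1.0, "d2": 0.6}), 3))    # 0.2
print(round(rank_score_measure([[0.6, 1.0, 0.2], [0.8, 0.0, 0.6]]), 1))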
For evaluation, this paper compares the following methods: the proposed method using Bayesian classification and profile generation (P_B), the former memory-based method using the Pearson correlation coefficient (P_Corr), the recommendation method using clustering by the EM algorithm (EM), and the recommendation method using K-means clustering (K_means). They were compared by varying the number of clustered consumers. Table 8 shows the MAE and RSM of P_B, P_Corr, EM, and K_means as the number of consumers changes, computed using Equations (8) and (10). Fig. 1 and Fig. 2 plot the MAE and RSM against the number of consumers, based on Table 8. In Fig. 1 and Fig. 2, as the number of consumers increases, the performance of P_B, EM, and K_means improves, whereas the P_Corr method shows no notable change in performance. In terms of prediction accuracy, it is evident that the P_B method, which clusters consumers based on their profiles, is superior to the others.

Table 8. The MAE and the RSM based on the change of consumer numbers
                          MAE                                 RSM
Consumers' number   P_Corr  K_means  EM     P_B        P_Corr  K_means  EM   P_B
10                  0.210   0.210    0.230  0.190      56      52       55   56
20                  0.210   0.206    0.221  0.180      56.1    53       54   57
30                  0.208   0.200    0.221  0.183      56.2    55       54   58
40                  0.207   0.198    0.219  0.182      56.8    56       55   59
50                  0.204   0.191    0.219  0.181      57      57       54   61
60                  0.203   0.188    0.218  0.178      57.1    58       55   62
70                  0.202   0.185    0.218  0.175      58      58.5     55   64
80                  0.201   0.180    0.218  0.170      58.1    59       53   65
90                  0.201   0.175    0.218  0.165      58.1    59.5     54   65.5
100                 0.200   0.170    0.218  0.160      58.1    60       54   66
Fig. 1. The MAE based on the change of consumer numbers (line chart; x-axis: users' number, y-axis: MAE; series: P_Corr, K_means, EM, P_B)
Fig. 2. The RSM based on the change of consumer numbers (line chart; x-axis: users' number, y-axis: rank scoring; series: P_Corr, K_means, EM, P_B)
6 Conclusion

This paper proposed the use of a Naïve Bayes classifier to classify consumers into groups. To extract the features of products, this paper uses an association word mining method with weighted words that reflects not only the preference ratings of products but also information about them. The data expressed by the mined features are not expressed as a string of words but as an association word vector. A collaborative consumer profile is then generated based on the extracted features, and the Naïve Bayes classifier classifies consumers into groups based on the association words in the collaborative consumer profile. As a result, the dimension of the consumer-product matrix is decreased. In addition, the proposed method allows accurate recommendation of new web documents by addressing the sparsity problem. We evaluated our algorithm on a database of consumer ratings for computer-related web documents and showed that it significantly outperforms previously proposed algorithms. In the future, if the proposed method is used with data from several other fields, it is expected that accuracy will be further enhanced.
Developing Web Applications from Conceptual Models. A Web Services Approach

Vicente Pelechano, Joan Fons, Manoli Albert, and Óscar Pastor

Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Camí de Vera s/n, E-46022, Spain
{pele,jjfons,malbert,opastor}@dsic.upv.es
Abstract. In this work, an object-oriented software production method is discussed for the development of "web applications" based on Web services. The method takes as input functional, navigational and presentation requirements and provides a systematic way to develop dynamic web applications. It proposes a multi-tier architectural style that takes into account the nature of Web services, and defines a set of correspondences between the conceptual abstractions and the software elements that implement each tier of the architecture, making intensive use of design patterns and giving support to some of the characteristics provided by Web services.
1 Introduction
The advance of the Internet and the emerging technologies associated with the web are universalizing information systems, allowing access by any connected potential user. The term "Web Application" refers to the new family of software applications specially designed to be executed on the web. Similarly, the term "Web Engineering" [1] refers to the methods, techniques and tools that should be used to undertake the development of such applications. In this context, it becomes indispensable to have development methods that provide solutions to the problem of modelling and implementing web applications in order to assure the quality of the final product. Approaches of this kind introduce new models and abstraction mechanisms to capture the essentials of web applications, and give support for the full development of a web solution. Following this strategy, many different approaches have appeared to deal with the web application development process using so-called model-driven application generators [2]. Basically, the specification mechanisms that a web development method should provide are: requirements specification, conceptual modelling of system structure and behavior, navigation and presentation specification, personalization issues, user models, etc. Some representative efforts to introduce web features into classical conceptual modelling approaches are OOHDM [3], WebML [4], UWE [5] and WSDM [6].
This work has been supported by the MCYT Project with ref. TIC2001-3530-C02-01.
Fig. 1. Methodological Approach (Conceptual Modelling: requirements elicitation with use cases and scenarios, followed by the structural, dynamic, functional, navigational and presentational models that make up the system specification; Solution Development: a Presentation Tier in web environments such as HTML and XML/XSLT, an Application Tier of XML Web services in Java, .NET, EJB or COM+, and a Persistence Tier on SQL Server or Oracle, arranged in the proposed architecture)
The proposal discussed in this paper provides specific contributions to the implementation area. We present a model-driven code generation approach that enriches the OOWS approach [7] and provides systematic code generation. Taking conceptual models as input, a precise guide is defined for going systematically from the problem space to the solution space. This is possible by defining a set of mappings between the conceptual modelling abstractions and the final software components in a Web service-based architecture [8]. The structure of the work is the following: Section 2 briefly introduces our methodological approach, explaining each step of the development process and the techniques that are applied. Section 3 proposes a multi-tier architectural style based on Web services. Section 4 provides a design pattern-based strategy that defines a set of mappings between the conceptual abstractions and the software elements that implement each tier of the architecture. In order to better understand our approach, a case study of a University Research Groups Management web application (URGM) has been developed. Finally, Section 5 presents conclusions and future work.
2 Methodological Approach
A software production method for developing "web applications" based on Web services is introduced in this section. This proposal defines the three essential steps (and their associated techniques) to follow in order to develop applications of this kind: (1) Conceptual Modelling (developed in OOWS [7]), (2) Architectural Design and (3) Implementation (see Figure 1). This proposal is built on the OO-Method [9], an OO software production method that includes model-based code generation capabilities and integrates formal specification techniques with conventional notations like UML. The real contribution of the method is a precise methodological guide for systematically going from conceptual models to final web applications. The steps and their associated techniques are the following (see Figure 1):
1. Conceptual Modelling: Conceptual models are built to appropriately capture the requirements of web applications. The modelling tools used in this step allow us to specify the functional, navigational and presentation requirements of dynamic web applications. The modelling process is divided into two steps:
   a) Functional requirements elicitation. Techniques based on use cases and scenarios [10] are applied to build a conceptual schema.
   b) Navigation and presentation modelling. A navigational model is built in order to model navigational requirements from the class diagram. Once the navigational model is built, presentation requirements are specified using a presentation model which is based on the navigational model.
2. Architectural Design: A multi-tier architectural style is proposed taking into account the nature of Web services. Each tier of the architecture is determined, and the essential characteristics that each tier should provide are identified.
3. Implementation: A set of correspondences (transformation rules) between the conceptual abstractions and the software elements that implement each tier of the architecture is defined, making intensive use of design patterns. This implementation strategy gives support to some of the new characteristics that Web services provide.

This paper is focused on the Architectural Design and Implementation steps, which are introduced in the following sections. Each section describes how the corresponding step should be carried out, making use of a real case study modelled following the OOWS approach.
3 Architecting Web Service Applications
The first step in our design/implementation process consists of determining the architectural style for implementing web applications. This section analyzes the nature of Web services and their associated technologies, and identifies how specific characteristics should be implemented by systems based on Web services. In a generic way, a Web service can be considered a piece of software functionality that can be accessed through standard web technology.

3.1 Design Constraints for Web Services Architectures
In terms of interaction models [11], the middleware that provides the communication infrastructure to Web services is supported by service-oriented architectures (SOA) [8] based on SOAP, WSDL and UDDI. The design of service-oriented architectures should take into account the nature and the purpose for which web services were conceived. Their interaction model is message-based, but not necessarily connection-oriented. The implementation strategy uses distributed objects and encapsulates them through web services. It should take into account certain design aspects: (1) Loose coupling between the client and the service requests,
(2) Managing the object life-cycle. Client applications must interact with an object factory to create the session object, which the client should later destroy or release (explicit control of the object life-cycle on behalf of the client), and (3) Stateless objects. For scalability and performance reasons it is recommended to create stateless objects that encapsulate the session-oriented interaction.

3.2 The Proposed Architectural Style
A multi-tier architectural style has been selected for the implementation of web applications. Web services technology promotes the adoption of architectures of this kind. Multi-tier architectures clearly structure web applications, are easily adaptable and extensible, and provide greater scalability. This architectural style is advisable for applications of great size with a high number of concurrent users, weighing scalability against performance requirements. The tiers of the selected architectural style are the following:
– Presentation Tier. This includes the graphical user interface components (in this case, web pages and visual widgets) for interacting with the user (visualizing the data, providing access to services and facilitating navigation).
– Application Tier. This implements the business logic in the form of Web services. This tier is divided into two subtiers: (1) Business Facade, which gives support to the implementation of the web services, providing simple and loosely coupled interfaces to the presentation tier, and (2) Business Logic, which implements the structure and the functionality of the classes in the conceptual schema.
– Persistence Tier. This implements the persistence and the access to persistent data in order to hide the details of data repositories from the upper tiers. This tier is constituted by two subtiers: the Data Access Tier, implemented by software components responsible for accessing the database to read and write the business objects' state, making the business logic tier independent of the specific software components that development environments provide to access databases, and the Data Tier, responsible for storing and retrieving the application data in a specific DBMS.

The following section presents how to map (1) the modelling information specified in the conceptual schema, (2) the navigational model and (3) the presentation model onto this multi-tier architecture using Web services.
4 Implementing Web Applications
Once the architectural style has been selected and the responsibilities of each tier in the architecture clearly determined, the final step of the method is to obtain the detailed design and the implementation of each tier. This can be done in a structured way by defining precise correspondences between the conceptual abstractions and the software structures that implement each tier, obtaining a completely operative web application. Design patterns are used intensively to obtain quality designs and to facilitate the transformation of the conceptual structures into software elements. In the following subsections we present, step by step (following the case study), how each tier of the architecture can be systematically obtained.

Fig. 2. URGM system User Diagram and Navigational Map of the Member user (user kinds Anonymous/Guest, Member and Administrator; navigational contexts for Activities, Groups, Resources, Projects, ResearchLines, Publications, Members and Guests)

Fig. 3. Home page of the web application for the Member kind of user

4.1 Presentation Tier
Starting from the navigational and presentation models, a group of connected web pages for each kind of user can be obtained in a systematic way. These web pages define the web application user interface for navigating, visualizing data and accessing the web application functionality. Let us show an example of how web pages are obtained for the Member (of a Research Group) kind of user. A user diagram is specified (see Figure 2) to express which kinds of users can interact with the system. Figure 2 also presents the navigational map specification for this kind of user. This map structures the user's access to the system by providing a group of nodes (navigational contexts) to manage and query information about projects, publications, activities, resources, etc. A web page is created for each navigational context in the map. This web page is responsible for retrieving the information specified in its navigational context. Marking one exploration context as home causes the user to navigate automatically to that web page when logging into the system. Otherwise, a web page that provides a link to each navigational exploration context (page) is created to play the role of the home page. Figure 3 shows an example of a generated home page. Clicking on one of these links, the application navigates to the web page that represents the target context. For instance, if we select the Publications link in the home page, we reach the Publications web page. As this context defines an index, when we reach the web page (by "exploration") the index is activated, creating a list of instances with the information specified in the index. Figure 4 shows the effect of this indexation. Selecting one of these indexed instances (using the name of the publication as the link attribute), the web page shows all the information specified in the context (see Figure 5). The web page of Figure 7 implements the Publications context. It shows the specified information about publications, their signers and other information depending on the subtype of publication.

Fig. 4. Index for the web page of the Publications context

The strategy that we use to generate web pages (see Figures 4 and 7 as examples) divides them into two logical areas:
– The information area presents the specific system view defined by a context. This area is located at the right side of the web pages (see box number 1). The presentation model specification is applied to obtain the layout of this area in the following way: the instances of the manager class are shown as their layout pattern determines, applying (if defined) the ordering criteria and the information paging. The instances of navigational classes related by a navigational relationship follow the same strategy. Figure 7 shows the selected instance of a publication. The manager class and all the navigational relationships are presented applying the register layout pattern. The context relationship between the Publication and ResearchLine classes defines a navigation capability to the ResearchLine page using the name attribute of the line of research as the anchor.
– The navigation area provides navigation meta-information to the user:
Fig. 5. Publications Navigational Context for the Member user (views of the Publication manager class and of the Signer, Member and ResearchLine classes, together with the publication subtypes Thesis, Book, BookChapter, Journal, InJournalArticle, InConferenceArticle, JournalVolume, ConferenceEdition, Conference, TechnicalReport, MasterThesis and Proceedings; the context also defines the attribute indexes References and Abstracts and the filters Title (approximate) and Year (range))

Fig. 6. Publications Context Presentation Model (layout patterns — register and master-detail (tabular/register) — with ordering criteria such as signerPosition ascending and publicationYear descending, and static pagination for the Publication view and the Abstracts index)
• Where the user is (see box number 2). It states which web page (context) is currently being shown to the user.
• How the user reached here (see box number 3). It shows the navigational path that has been followed to reach that page.
• Where the user can go to (see box number 4). A link to every exploration context appears in this area.
• Which filter and index mechanisms can be used by the user (see boxes number 5 and 6, respectively). If the context defines any filter or index mechanism, it is shown in this logical area. The figure shows the two specified filters (search members by name and publications between years) and the two available indexes (References and Abstracts).
Fig. 7. Web page of the Publications context
• Applicational links (see box number 7). Additional applicational links are provided to navigate to the home page and to log into the system.

A sketch of this page-generation strategy follows.
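The sketch below illustrates, under our own simplifying assumptions, how a web page with the two logical areas could be generated from a navigational context description. It is written in Python for brevity rather than in the OOWS tool's target environments, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Index:
    name: str               # e.g. "References"
    attributes: List[str]   # attributes shown in the index
    link_attribute: str     # attribute used as the anchor text

@dataclass
class NavigationalContext:
    name: str                                             # e.g. "Publications"
    indexes: List[Index] = field(default_factory=list)
    filters: List[str] = field(default_factory=list)
    reachable_contexts: List[str] = field(default_factory=list)

def render_page(ctx: NavigationalContext, instances: List[dict]) -> str:
    """Tiny HTML generator: a navigation area (links, indexes, filters)
    plus an information area (one register per instance)."""
    nav = [f'<a href="{c}.html">{c}</a>' for c in ctx.reachable_contexts]
    nav += [f'<a href="#{i.name}">{i.name}</a>' for i in ctx.indexes]
    nav += [f'<form>{f}: <input name="{f}"/></form>' for f in ctx.filters]
    info = []
    for inst in instances:
        rows = "".join(f"<li>{k}: {v}</li>" for k, v in inst.items())
        info.append(f"<ul>{rows}</ul>")
    return (f"<html><body><h1>{ctx.name}</h1>"
            f"<div id='navigation'>{' | '.join(nav)}</div>"
            f"<div id='information'>{''.join(info)}</div></body></html>")

publications = NavigationalContext(
    "Publications",
    indexes=[Index("References", ["title", "publicationYear"], "title")],
    filters=["Title", "Year"],
    reachable_contexts=["ResearchLines", "Members"])
print(render_page(publications, [{"title": "An example paper", "publicationYear": 2003}]))
```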
4.2 Application Tier
Starting from the classes of the conceptual schema, mappings are defined between domain classes and the design classes that implement the web services at the application tier. Two techniques have been jointly applied to design application web services: (1) functional decomposition (one class per web service) and (2) entity-oriented design (one class per domain entity). This simple strategy assigns all the functionality necessary to manage the objects of a class of the problem domain to a web service. The implementation of web services in the application tier should take into account the design aspects previously discussed. To give appropriate support to these aspects, we divide this tier into two subtiers: (1) Business Facade and (2) Business Logic.

Business Facade and Web Services. In this subtier, we create a web service for each domain class. This service is implemented by a class that combines a set of design patterns (facade, singleton and factory [12]). These patterns give support to the following design requirements:
– Loose coupling. The facade pattern provides a simple interface to complex subsystems, making them easier to use, reducing the complexity of the Business Facade tier and minimizing dependencies (decoupling the system from its clients). These classes have knowledge of the classes that implement their immediate lower tier and delegate client requests to their objects.
– User Control. The facade pattern allows us to filter client requests. Depending on the kind of user, the facade allows or denies access to some services of the business logic objects.
– Object Creation. The factory pattern provides a simple interface to create and destroy objects, hiding any detail of the creation process. The facade and the factory patterns are usually singletons, assuring that only one instance exists.

In order to explain the implementation of each tier of the proposed architecture, we take as an example the Publication domain class and its related classes (see Figure 8). In this case, a web service called Publication is defined (using the same name as the domain class). A class called PublicationService (ClassName¹ + Service) implements the operations associated with the web service, providing the functionality of the Publication class in the conceptual schema. The methods of the PublicationService class can be obtained automatically from the corresponding domain class specification and are the following:
– create: creates a business object, initializing its state
– destroy: destroys a business object
– modify: modifies the state (attribute values) of a business object
– retrieveClassObject: retrieves a business object given an identifier (in the example, this method is called retrievePublication)
– retrievePopulation: retrieves all the objects of the class
– find: finds all the objects that hold a specific condition that can optionally be passed as an argument. It is used to implement the search mechanisms and ordering criteria specified in the navigational contexts
– retrieveRelated: given an object of the class and a relationship or role name, it retrieves those objects that are related to it. In the example, a publication is related to the signer and the research line of a group
– assignRelated: relates (creates a link between) an object of the class and another object (whose identifier is passed as an argument) that belongs to a class related through an association or aggregation relationship (the relationship name or role is passed as an argument)
– releaseRelated: the inverse of the previous one. It releases an object (destroys the link) from the relationship
A method is generated for each operation of the domain class; it implements the specified functionality and updates the state of the object. All the operations of the web service have an extra argument, which is the kind of user that is currently requesting the provided services. This is because the business facade tier checks whether the user has permission to use the requested service (implementing the access control specified for each user). The set of methods that we propose in order to implement the functionality of the web service is intended to be simple and powerful at the same time.

¹ ClassName denotes the name of the class in the conceptual schema.
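A sketch of how such a facade service class might look is given below. It is illustrative only, written in Python rather than a concrete Web service toolkit; the permission table and the in-memory store standing in for the Business Logic tier are hypothetical, not the generated code itself.

```python
class PublicationService:
    """Business Facade for the Publication web service (facade + singleton + factory)."""
    _instance = None
    _permissions = {"Member": {"create", "modify", "find", "retrievePublication"},
                    "Anonymous": {"find", "retrievePublication"}}  # assumed access rules

    def __new__(cls):
        # Singleton: only one facade instance exists.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._store = {}     # stand-in for the Business Logic tier
            cls._instance._next_id = 1
        return cls._instance

    def _check(self, user_kind, operation):
        # Every operation carries the kind of user as an extra argument.
        if operation not in self._permissions.get(user_kind, set()):
            raise PermissionError(f"{user_kind} may not invoke {operation}")

    def create(self, user_kind, **attributes):
        """Factory behaviour: creates and registers a new business object."""
        self._check(user_kind, "create")
        oid, self._next_id = self._next_id, self._next_id + 1
        self._store[oid] = dict(attributes)
        return oid

    def retrievePublication(self, user_kind, oid):
        self._check(user_kind, "retrievePublication")
        return self._store[oid]

    def find(self, user_kind, condition=lambda p: True):
        self._check(user_kind, "find")
        return [p for p in self._store.values() if condition(p)]

service = PublicationService()
pid = service.create("Member", title="An example paper", publicationYear=2003)
print(service.retrievePublication("Anonymous", pid))
```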
Fig. 8. Aggregation relationship between Publication and Signer (a Publication — with attributes such as title, publicationYear, publicationMonth, publicationFile, abstract, series, validated, status and bibtex — is linked through the Signs aggregation, with cardinality 1 to *, to Signer objects carrying signerPosition and signAs; both classes offer create() and destroy() operations)
Business Logic and Domain Classes. The Business Logic subtier implements the classes of the conceptual schema following the Domain Model pattern [13] and its variations. These classes implement the functionality specified in the conceptual schema and delegate the saving, retrieving and updating of their objects' state to the data access objects of the Persistence tier. A class of the conceptual schema (e.g., Publication) is implemented by a class in the business logic tier called PublicationDomain (ClassName + Domain). This class has the attributes specified in the schema. It also has an object-valued attribute for each univalued association or aggregation and a collection of objects for each multivalued association or aggregation. Its methods are the following:
– create: creates a new business object, initializing its state
– destroy: destroys this object

Read and write methods are generated for each attribute of the class in the conceptual schema (readTitle and writeTitle methods for the Title attribute of a PublicationDomain). These methods are called by the modify method of the business facade. A method that implements the specified functionality is generated for each operation of the domain class. The following class methods are also included (considering a class as an object factory):
– retrieveClassObject: retrieves a business object given an identifier (in the example of Member this method is called retrieveMember)
– retrievePopulation: retrieves the population of the class
– find: finds all the objects that hold a specific condition that can optionally be passed as an argument. It is used to implement the search mechanisms and ordering criteria specified in the navigational contexts

For each association and/or aggregation relationship related to a domain class, the following methods are generated (a sketch of such a domain class follows this list):
– retrieve[ RelationshipName | RoleName ] RelatedClassName: retrieves those objects of the RelatedClassName class that are related to the object through RelationshipName | RoleName. In the example (see Figure 8), one of the relationships of the Publication class is the Signs aggregation, which relates the Publication and Signer classes. In this context, the retrieveSignsSigner method is generated following this step
– assign[ RelationshipName | RoleName ] RelatedClassName: relates the object to an object of the RelatedClassName class through the relationship (association or aggregation) whose name is RelationshipName | RoleName. In the previous example, the assignSignsSigner method is generated to assign a Signer to a Publication
– release[ RelationshipName | RoleName ] RelatedClassName: releases an existing relationship (RelationshipName | RoleName) between the object and an object of the class RelatedClassName. In the previous example, the releaseSignsSigner method is generated to release a Signer that Signs a Publication
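A minimal sketch of the corresponding domain class follows. The collection handling and the in-memory population are our own simplification of the generated Business Logic code, not the actual generated artifact.

```python
class SignerDomain:
    def __init__(self, sign_as, signer_position):
        self.signAs, self.signerPosition = sign_as, signer_position

class PublicationDomain:
    """Business Logic class generated for Publication (Domain Model pattern)."""
    _population = []                       # the class acts as an object factory

    def __init__(self, title, publication_year):
        self.title, self.publicationYear = title, publication_year
        self.signs = []                    # multivalued Signs aggregation to Signer
        PublicationDomain._population.append(self)

    # read/write methods generated per attribute
    def readTitle(self): return self.title
    def writeTitle(self, value): self.title = value

    # relationship methods generated for the Signs aggregation (see Figure 8)
    def retrieveSignsSigner(self): return list(self.signs)
    def assignSignsSigner(self, signer): self.signs.append(signer)
    def releaseSignsSigner(self, signer): self.signs.remove(signer)

    @classmethod
    def retrievePopulation(cls): return list(cls._population)

    @classmethod
    def find(cls, condition=lambda p: True):
        return [p for p in cls._population if condition(p)]

pub = PublicationDomain("An example paper", 2003)
pub.assignSignsSigner(SignerDomain("J. Doe", 1))
print([s.signAs for s in pub.retrieveSignsSigner()])
```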
4.3 Persistence Tier
The following software artifacts are generated from the conceptual schema: (1) classes that encapsulate the access to the data server and the specific technology (classes) of the target development environment, and (2) the database.

Data Access Classes. This subtier implements the classes that allow the business logic objects to access and manipulate the relational repository (in our case) where the application data is stored. To implement these classes, we follow the patterns Table Data Gateway, Data Mapper, Association Table and Foreign Key Mapping, Single/Concrete/Class Table Hierarchy [13], and State and Role [14], adapted to implement classes that have a direct correspondence to a table of the database, or that are related through an association, aggregation or specialization/generalization relationship. We create a class for each conceptual schema class that we call ClassName + DataAccess (we generate the class PublicationDataAccess for the Publication class). Next, the following methods are generated (a sketch of such a class follows this list):
– insert: inserts a tuple in the database with the attribute values that are passed as arguments
– update: modifies a tuple with the values that are passed as arguments
– delete: deletes a database tuple whose identifier is passed as an argument
– find: retrieves the tuple that corresponds to a specific identifier. It is overloaded to retrieve all the tuples that hold a certain filtering condition

Data. In this tier, the logical and physical design of the database is carried out by defining a series of correspondences between the classes of the conceptual schema (and their relationships) and the database tables.
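The sketch below illustrates a possible Table Data Gateway-style data access class using SQLite. The table layout and method signatures are assumptions made for the example, not the generated persistence code.

```python
import sqlite3

class PublicationDataAccess:
    """Data access class for Publication (Table Data Gateway style)."""
    def __init__(self, connection):
        self.conn = connection
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS Publication "
            "(id INTEGER PRIMARY KEY, title TEXT, publicationYear INTEGER)")

    def insert(self, title, publication_year):
        cur = self.conn.execute(
            "INSERT INTO Publication (title, publicationYear) VALUES (?, ?)",
            (title, publication_year))
        return cur.lastrowid

    def update(self, oid, title, publication_year):
        self.conn.execute(
            "UPDATE Publication SET title = ?, publicationYear = ? WHERE id = ?",
            (title, publication_year, oid))

    def delete(self, oid):
        self.conn.execute("DELETE FROM Publication WHERE id = ?", (oid,))

    def find(self, oid=None, condition="1=1", params=()):
        # Overloaded: by identifier, or by a filtering condition (trusted, generated text).
        if oid is not None:
            return self.conn.execute(
                "SELECT * FROM Publication WHERE id = ?", (oid,)).fetchone()
        return self.conn.execute(
            f"SELECT * FROM Publication WHERE {condition}", params).fetchall()

dao = PublicationDataAccess(sqlite3.connect(":memory:"))
oid = dao.insert("An example paper", 2003)
print(dao.find(oid))
```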
5 Conclusions and Future Work
This paper has presented a model-driven code generation strategy that provides systematic code generation of web applications modelled in OOWS. The method proposes a multi-tier Web service architecture to build web solutions in a systematic way. This is done by defining correspondences
between the conceptual abstractions and the software elements that implement each tier of the architecture using design patterns. Currently, we are applying this method to several web projects. These experiences are providing us with the feedback needed to extend the conceptual modelling primitives related to web applications (personalization and adaptation features, security requirements, specification reuse mechanisms) and to define more complex and coarse-grained web services that will provide sophisticated services and minimize communication between clients and servers, which improves the performance of our generated web applications.
References
1. Murugesan, S., Deshpande, Y.: Web Engineering. Software Engineering and Web Application Development. Springer LNCS – Hot Topics (2001)
2. Fraternali, P.: Tools and approaches for developing data-intensive Web applications: a survey. ACM Computing Surveys 31 (1999) 227–263
3. Schwabe, D., Rossi, G., Barbosa, S.: Systematic Hypermedia Design with OOHDM. In: ACM Conference on Hypertext, Washington, USA (1996)
4. Ceri, S., Fraternali, P., Bongio, A.: Web Modeling Language (WebML): a Modeling Language for Designing Web Sites. In: Proc. of WWW9, Elsevier (2000) 137–157
5. Koch, N., Wirsing, M.: Software Engineering for Adaptive Hypermedia Applications. In: 3rd Workshop on Adaptive Hypertext and Hypermedia (2001)
6. De Troyer, O., Leune, C.: WSDM: A User-centered Design Method for Web Sites. In: Proc. of the 7th International World Wide Web Conference (1997) 85–94
7. Pastor, O., Abrahão, S., Fons, J.: An Object-Oriented Approach to Automate Web Applications Development. In: Proc. of EC-Web'01. Volume 2115 of Lecture Notes in Computer Science, Munich, Germany, Springer (2001) 16–28
8. Austin, D., Barbir, A., Garg, S.: Web Services Architecture Requirements. W3C Working Draft. Technical report, W3C Consortium (2002)
9. Pastor, O., Gómez, J., Insfrán, E., Pelechano, V.: The OO-Method Approach for Information Systems Modelling: From Object-Oriented Conceptual Modeling to Automated Programming. Information Systems 26 (2001) 507–534
10. Insfrán, E., Pastor, O., Wieringa, R.: Requirements Engineering-Based Conceptual Modelling. Requirements Engineering 7 (2002) 61–72
11. Vinoski, S.: Where is Middleware? IEEE Internet Computing 6 (2002) 83–85
12. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Reading, MA (1994)
13. Fowler, M.: Patterns of Enterprise Application Architecture. Addison-Wesley (2003)
14. Pelechano, V., Pastor, O., Insfrán, E.: Automated Code Generation of Dynamic Specializations. An Approach Based on Design Patterns and Formal Techniques. Data and Knowledge Engineering 40 (2002) 315–354
A Framework for Business Rule Driven Web Service Composition

Bart Orriëns, Jian Yang, and Mike P. Papazoglou

Tilburg University, Infolab, PO Box 90153, 5000 LE Tilburg, The Netherlands
{b.orriens,jian,mikep}@uvt.nl
Abstract. With web services emerging as a promising technology for supporting open and dynamic business processes, standards for business process specification in the context of web service composition have developed rapidly in recent years, e.g. WSFL, XLANG and BPEL. However, none of the proposed specifications really addresses the issues of dynamic business process creation, e.g. a vast service space to search, a variety of services to compare and match, and different ways to construct business processes. One of the assumptions these standards make is that the business process is pre-defined. Obviously this assumption does not hold if the business needs to accommodate changes in applications, technology, and organizational policies. We believe business processes can be built dynamically by composing web services if they are constructed based on and governed by business rules. In this paper we analyze the basic elements of business modelling and how they relate to the web service composition process. As a result, a rule driven mechanism is developed to govern and guide the process of service composition in terms of five broad composition phases spanning abstract definition, scheduling, construction, execution, and evolution, to support on-demand and on-the-fly business process building.
1 Introduction
The Web has become the means for organizations to deliver goods and services and for customers to search and retrieve services that match their needs. Web services are self-contained, Internet-enabled applications capable not only of performing business activities on their own, but also of engaging other web services in order to complete higher-order business transactions. The platform-neutral nature of web services creates the opportunity for building composite services by combining existing elementary or complex services, possibly offered by different enterprises. For example, a travel plan service can be developed by combining several elementary services such as hotel reservation, ticket booking, car rental, sightseeing package, etc., based on their WSDL descriptions. We use the term composite service to signify a service that employs and synthesizes other services. The services that are used in the context of a composite
service are called its constituent services. Some standards are emerging, e.g., BPEL [3], which specifies business processes and business interaction protocols in the context of web service composition. Unfortunately, composite web service development and management is currently a manual activity that requires specific knowledge about composing web services in advance and takes a lot of time and effort. This even applies to the applications that are currently being developed on the basis of available standards, such as BPEL. The difficulty is that service composition is simply too complex and too dynamic to handle manually. With a vast service space to search, a variety of services to compare and match, and different ways to construct composed services, the only alternative capable of facilitating dynamic service composition development and management is an automated process of service composition governed by rules and administered by rule engines. In this paper we investigate a rule-based approach to service composition that combines best practices from rule-based systems and software engineering to support parameterization, dynamic binding, and flexible service composition. Rules are logical statements about how a system operates. Some of these rules may be expressed in the language of the business, referring to real-world business entities, and are therefore called Business Rules. Business rules can represent, among other things, typical business situations such as escalation ("send this document to a supervisor for approval") and managing exceptions ("make sure that we deal with this within 30 min or as specified in the customer's service-level agreement"). Our conviction is that business rules can be used in the context of service composition to determine how the composition should be structured and scheduled, how the services and their providers should be selected, and how run-time service binding should be conducted. In our framework a rule driven mechanism is used to steer the process of service composition in terms of five broad composition phases spanning definition, scheduling, construction, execution, and evolution. Based on these phases we analyze and classify business rules and determine how they impact service composition. Although previous work on business rules, such as that in [14], introduces a simple classification scheme for business processes that classifies rules as relationship, constraint, authorization, choice, and action rules, these are of a general nature and do not consider service composition requirements. It is not clear, for example, how we can use these types of rules for the specification of constraints on scheduling, the criteria and conditions of task and resource selection, run-time constraints for service execution, and time, cost and quality concerns regarding the selection of service providers. Nevertheless, they can form a sound basis for extension and application to the service composition life cycle phases. Service composition and business rule evolution can also be handled in accordance with the phased approach we are developing. The paper is structured as follows: we begin in Section 2 by analyzing how we can realize business processes using service composition. Next, in Section 3
we introduce a business rule driven framework for service composition. We define a classification for business rules in the context of service composition in Section 4. We present our conclusions and discuss future research in Section 5.
2 Service Composition for Business Processes
We mentioned in Section 1 that service composition can be used to describe and realize business processes. In this section we identify the basic elements of a service composition. Subsequently we introduce an architecture to realize business processes through web service composition.

2.1 Service Composition Elements
The nature of business processes varies in terms of complexity and scope. Several definitions have been developed in the past to analyze business processes [6,9,7,10,1]. However, none of them provides full coverage of a business process. In our view the best way to design a business process is to analyze its role in the business. Business can be viewed from various perspectives, as is done for example in [12], where the data, function, organization and resource views are distinguished. In [13] a similar distinction is made, although the data view is referred to as the informational perspective. [4] differentiates between a functional, organizational and informational view; it also includes a behavioral perspective. To find the basic elements for defining and analyzing a business process we adopt the framework used in [18], which includes the how, why, when, who, what and where aspects. The how aspect explains how things are done in the business, the why aspect provides the rationale, the when aspect tells us when things happen, the who aspect gives information about the people and resources involved, the what aspect addresses the impact on the informational structures, whereas the where aspect describes the geographical location of the departments involved. By studying the current standard business process specification languages (e.g., BPEL, BPML) we identify the following composition elements representing these aspects: activity, condition, event, flow, message, provider and role elements. We briefly discuss each composition element as follows:

Activity. An activity represents a well-defined business function and is part of the how aspect. For example, the booking of a flight is an activity in the travel plan business process. Each activity is associated with a role that is responsible for its execution. Also, an activity may be related to messages, defining its data prerequisites.

Condition. The behavior of business processes is governed by business rules. They are statements that define or constrain some aspect of the business, which are intended to assert business structure or to control or influence the behavior
Fig. 1. Architecture for a service composition
of the business [2]. These rules as such provide the rationale behind the business process, representing the why aspect. Business rules are expressed in a service composition in the form of conditions. Examples include pre- and post-conditions for activities, and message integrity constraints.

Event. Events in a service composition represent business events, thus originating from the when aspect. A business event is an occurrence of some sort. An example of an event is "No seat available on flight". Events have an impact on business processes, influencing their behavior. They can trigger new activities or change the result of running activities.

Flow. Flow expresses the how aspect and is used to express the choreography of complex activities (such as business processes). Possible types include sequential, parallel, conditional and iterative flow patterns. Depending on the style of control flow modeling, e.g. block-based [1] or flow-transition-based ([3], [15]), a service composition can contain one or more control flow elements respectively. In both cases control flows can be nested at different levels of granularity, allowing the specification of arbitrarily complex structures.
Message. To represent the information-exchange behavior of a business process in a composition we utilize messages. Messages represent the what aspect and are associated with the composition as a whole, expressing the interactions of the business process with the outside world. They are also linked to activities to model the distribution of information within the process. Finally, messages may be correlated to express data dependencies between activities.

Provider. The people and resources participating in a business process are depicted in a composition as providers. A provider belongs to the who and where aspects and describes a concrete web service.

Role. Roles are part of the who aspect and define the expected behavior of participants in the business in an abstract manner, i.e. without specifying the resource or person responsible for actually performing the task. In the context of service composition these provide abstract descriptions of the services involved in the composition. (A sketch of these elements as simple data structures is given below.)
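To make the element set concrete, the sketch below models the seven composition elements as simple data structures. The fields are our own reading of the descriptions above, not a normative schema from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Role:                 # who aspect: abstract participant description
    name: str

@dataclass
class Provider:             # who/where aspect: concrete web service
    name: str
    endpoint: str
    plays: Role

@dataclass
class Message:              # what aspect
    name: str
    parts: List[str] = field(default_factory=list)

@dataclass
class Condition:            # why aspect: business rule expressed on the composition
    expression: str

@dataclass
class Event:                # when aspect, e.g. "No seat available on flight"
    name: str

@dataclass
class Activity:             # how aspect: well-defined business function
    name: str
    role: Optional[Role] = None
    inputs: List[Message] = field(default_factory=list)
    pre: List[Condition] = field(default_factory=list)
    post: List[Condition] = field(default_factory=list)

@dataclass
class Flow:                 # how aspect: choreography of activities
    kind: str               # sequential | parallel | conditional | iterative
    activities: List[Activity] = field(default_factory=list)

# Hypothetical travel-plan fragment built from the elements above.
airline = Role("airline")
booking = Activity("FlightBooking", role=airline,
                   inputs=[Message("BookingRequest",
                                   ["departureDate", "arrivalDate", "from", "to"])])
plan = Flow("sequential", [booking, Activity("HotelBooking", role=Role("hotel"))])
```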
2.2 Business Process Realization with Service Composition
In the previous subsection we identified and discussed a set of elements with which we can represent business processes in terms of service composition. Another important aspect of service composition is the question of how we can use it to realize business processes. For this purpose we adopt a phased approach to service composition, developed in [17] and [16]. The purpose of these phases is to first describe services in the abstract and then generate executable service processes from these abstract specifications using business rules. Five broad phases are distinguished, spanning composition definition, scheduling, construction, execution, and evolution. These phases, which together constitute the service composition life cycle, are supported by the architecture depicted in Fig. 2. There are four main components in the architecture, which are addressed as follows (a minimal sketch of this pipeline is given after the list):
– Definer: it focuses on the specification of service composition definitions in an abstract manner. These definitions employ WSDL in conjunction with an orchestration language for web services to express business processes, e.g. the earlier-mentioned BPEL. The tasks of the Definer include activity, constraint, event, message and role specification.
– Scheduler: the result of the Definer, the abstract composition, is passed on to the Scheduler. The task of the Scheduler is to concretize it by replacing the defined roles with concrete providers. For this purpose, potentially available and matching service providers are located for each role through interaction with a UDDI registry via the Enquiry API. Subsequently, the user
Fig. 2. Service composition life cycle architecture
selects a provider from the retrieved set. Note that it is possible to derive multiple implementations, allowing the user to choose between alternatives.
– Constructor: once the user has selected an alternative, it is passed on to the Constructor. The Constructor uses this composition to set up the execution environment. This is done by generating executable software. Optionally, the composite service can be published as a new service in a UDDI registry to make it available to others. Alternatively, it may also be stored in the Service Composition Repository for reuse.
– Executor: the Executor is responsible for executing and managing the constructed composition. Management means monitoring the composition behavior during execution as well as dealing with changes, e.g. when the governing rules or constituent services are subject to change. In the latter case the composition needs to be transformed to incorporate the change.

The above clearly demonstrates the idea behind the phased service composition development approach, which is to start with an abstract definition and gradually make it concrete and executable.
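The sketch below strings the four components together as plain functions over a composition dictionary. It is a toy rendering of the life cycle made for illustration: provider lookup is reduced to a dictionary standing in for a UDDI Enquiry API call, and code generation is reduced to a simple list of planned invocations.

```python
def define(request):
    """Definer: build an abstract composition (roles instead of providers)."""
    return {"name": request, "activities": ["FlightBooking", "HotelBooking"],
            "roles": {"FlightBooking": "airline", "HotelBooking": "hotel"},
            "providers": {}}

def schedule(composition, registry):
    """Scheduler: concretize the composition by binding each role to a provider
    (the registry dict stands in for a UDDI lookup)."""
    for activity, role in composition["roles"].items():
        composition["providers"][activity] = registry[role]
    return composition

def construct(composition):
    """Constructor: 'generate' an executable plan (here, a list of calls)."""
    return [(a, composition["providers"][a]) for a in composition["activities"]]

def execute(plan):
    """Executor: run and monitor the constructed composition."""
    return [f"invoked {provider} for {activity}" for activity, provider in plan]

registry = {"airline": "http://example.org/airline-ws", "hotel": "http://example.org/hotel-ws"}
print(execute(construct(schedule(define("travel plan"), registry))))
```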
3 Business Rule Driven Service Composition
In the previous section we analyzed the influential aspects of business processes and how they can be expressed in the context of service composition. As a result we identified a set of composition elements. It is our conviction that interactions between these elements can be expressed in business rules. Examples include rules specifying how activities should be structured and scheduled, which roles are played by which providers, and how run-time service binding should be conducted. The notion of using rules (referred to as composition rules) to link composition elements enables us to define a rule mechanism to drive service composition. This is shown in Fig. 3, which gives an overview of our framework for business rule driven service composition. The advantage of the framework is that it makes the definition, scheduling, construction and execution of compositions flexible and controllable, since composition elements can be combined in a plug-and-play manner by using composition rules. This makes it possible to generate compositions on demand and on the fly in response to a user request. Furthermore, the framework allows service composition to become more dynamic, because changes in the business can be easily incorporated and effectuated through the redefinition of the appropriate composition elements and rules. Additionally, plug-and-play service composition significantly reduces the time and effort involved in composition development and management, while at the same time increasing the quality of the service composition process (e.g. in terms of syntactic and semantic composition correctness). As can be seen in Fig. 3, the framework consists of two main components:
– Service Composition Manager (SCM): it is responsible for assisting the user in developing, executing and managing service compositions. Its internal structure covers the service composition life cycle architecture discussed in Section 2.
– Service Composition Repository (SCR): it is responsible for maintaining composition elements and rules. The Composition Engine (CE) facilitates storage and retrieval of these elements and rules, which are contained in the Composition Element Repository (CER) and the Composition Rule Repository (CRR) respectively.

We briefly outline how these components interact with one another to drive the service composition life cycle.
1. The User sends a request to the SCM. This request gives a brief description of the business activities that the User wants to perform, for example to arrange a travel plan.
2. The SCM receives the User request, determines what functionality is required and subsequently enters the service composition life cycle. The Definer begins with the definition of an abstract composition. We illustrate how this is done using the specification of activities for a travel plan as an example.
Fig. 3. Framework for business rule driven service composition
a) The Definer sends a check request to the CE. The purpose of this request is to find out whether there is any information available on activities for a travel plan. The CE receives this request and searches the CRR to see whether there are rules specifying travel activities. If such a composition rule exists, the CE uses it to retrieve the specified activities from the CER, which are then sent to the Definer. Otherwise, no activity elements are returned.
b) In case relevant activities are found, the Definer will ask the User whether to use them in the composition. The User can then make a selection of which activities should be included in the composition. Optionally the User can also define and add other activities to the composition. In this case the Definer will not only add these activities, but also store them in the CER and update the composition rules in the CRR concerning travel plan activities.
c) When the User does not want to use the proposed activities (or if no relevant activity elements were found), the User is obliged to indicate which activities should be included in the travel plan. The Definer subsequently uses the provided information to add activities to the composition. As in the previous step, this new information is stored in the SCR.

The other activities in the Definition phase, such as the specification of task constraints, message exchanges, runtime behavior and resource use, are performed in a similar fashion as described above.
3. When the Definer has developed an abstract composition, it is passed on to the Scheduler. The task of the Scheduler is, as explained in Section 2, to concretize the composition. This is done in a similar manner to how the Definer performs its tasks. For each role in the composition the Scheduler contacts the CE to locate relevant rules regarding the use of existing provider elements. If they are found, the specified providers are retrieved and proposed to the User. In case none are suitable according to the User, or when no providers were found, the Scheduler initiates a search in, e.g., a UDDI registry to locate suitable service providers. The User can then select a provider, which is subsequently added to the composition. The provider element is also stored in the SCR.
4. After the Scheduler has developed a set of concrete alternatives, the user is asked to select one. The selected composition is passed on to the Constructor, which generates executable software to enable execution. The composition is then executed by the Executor. The Executor monitors the running composition until it has completed its activities. The SCM subsequently presents the results to the User, for example a flight ticket and a hotel room reservation for a travel plan.
4 Business Rule Classification for Service Composition
In Section 3 we outlined a business rule driven framework for service composition. In the context of this framework it is useful to determine the types of business rules that will be required to facilitate the different phases of the service composition life cycle. There has been substantial work on business rule classification, e.g. [14], [2], [5]. The scheme in [8] provides a high-level classification, distinguishing between terms, facts and rules. Terms and facts are statements that contain sensible, business-relevant observations, whereas rules are statements used to discover new information or guide the decision-making process. The problem with this (and other) classifications is that they are generic and cannot be directly applied to service composition. It is not clear, for example, how the distinguished types of business rules can be used to specify scheduling constraints, criteria and conditions for task and resource selection, run-time constraints for service execution, and time, cost and quality concerns regarding the selection of service providers. Therefore, we introduce our own classification scheme for business rules. In this scheme we classify business rules along two dimensions. The first dimension specifies whether we are dealing with a composition element or a composition rule, analogous to the distinction between terms and facts, and rules. The second dimension positions the elements and rules in terms of the aspect of a business process to which they belong, resulting in structure, role, message, event and constraint related business rules.

Structure Related Rules. These rules address the how aspect of a business process and facilitate the specification of the way in which the service composition is to be carried out in terms
of activities. As discussed in Section 2, activity and flow elements are used to express this information. To combine these elements we identify flow grouping and activity dependency rules as relevant composition rules. These rules indicate, respectively, how activities are to be grouped in a composition and what dependencies exist between activities. To illustrate, let us look at the following activity dependency rule (in pseudo code):

if (FlightBookingActivity and HotelActivity depended)
then (HotelActivity performedAfter FlightBookingActivity)
which specifies a prioritization rule relevant for a travel plan, stating that a flight must have been booked before an attempt is made to reserve a hotel room.

Role Related Rules. These rules govern the participants that are to be involved in the service composition, thus controlling the who and where aspects of the business process. These aspects are represented using role and provider elements (see Section 2). The interactions of these elements with other composition elements include the assignment of a role to an activity, the raising of an event and the binding of a role to a provider. We can use role assignment, event raiser and role binding rules respectively to create these interactions. For example, suppose we have an activity for flight ticket booking. Then we may specify the role assignment rule

if (FlightBookingActivity is performed)
then (Role type is airline)
depicting that only airlines are capable of performing this activity. This will ensure that when concrete providers are selected, only airlines are taken into consideration.
Message Related Rules
These rules regulate the use of information in the service composition, as such governing the what aspect of a business process. To express this aspect we defined message elements in Section 2. Recall that messages are associated with a composition as a whole, depicting its use of information. The information is then internally distributed to the different activities. We can govern this distribution using message distribution rules, while the actual assignment of messages to an activity is done through message assignment rules. To derive message dependencies we utilize message dependency rules. We can illustrate this with the following example:
if (FlightBookingActivity has Input) then (Message contains departureDate,arrivalDate,from,to)
showing a message assignment rule for the input of the flight booking activity, expressing that the input of the activity must include a departure date, an arrival date, and the origin and destination of the flight.
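To make the classification above more tangible, the following Python sketch shows one possible way of encoding the structure, role and message related rules given so far as plain data and retrieving them per aspect. It is only an illustration under our own assumptions; the dictionary layout and the function rules_for are invented for this sketch and are not a notation prescribed by the framework.

# Illustrative encoding of composition rules along the classification dimensions;
# the concrete values mirror the travel-plan examples above.
rules = [
    {"aspect": "structure", "kind": "activity_dependency",
     "if": ("FlightBookingActivity", "HotelActivity"),
     "then": ("HotelActivity", "performedAfter", "FlightBookingActivity")},
    {"aspect": "role", "kind": "role_assignment",
     "if": "FlightBookingActivity", "then": {"role_type": "airline"}},
    {"aspect": "message", "kind": "message_assignment",
     "if": ("FlightBookingActivity", "Input"),
     "then": ["departureDate", "arrivalDate", "from", "to"]},
]

def rules_for(aspect):
    # Return the composition rules governing one aspect of the business process.
    return [rule for rule in rules if rule["aspect"] == aspect]

# For example, all message related rules could be retrieved like this:
print(rules_for("message"))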
Event Related Rules
These rules govern the behavior of service compositions in reaction to (un)expected events. They thus regulate the when aspect of a business process. Which events affect which activities is regulated by activity influence rules. To handle an event, usually some sort of activity is performed; exactly which activity can be depicted in event handler rules, which specify the activities that should be performed when events occur. An example is the following:
if (SeatUnavailableEvent occurs) then (Stop the composition)
indicating that if there is no seat available (something which can occur, e.g., in the context of a flight booking activity), the composition should be stopped.
Constraint Related Rules
These rules steer the use of constraints in a business process, represented by conditions in service composition. Conditions are associated with activities, specifying pre- and/or post-conditions. To specify these we utilize pre-condition assignment and post-condition assignment rules, respectively. They can also control event occurrences and effectuate integrity constraints using event control and message constraint rules. The following example illustrates a post-condition assignment rule:
if (FlightBookingActivity completed) then (Seat must be reserved)
constraining the result of the flight booking activity by specifying that after the activity has been completed a seat must have been reserved. Otherwise, the performance of the activity cannot be considered to have been successful.
In the preceding we have briefly outlined how business rules can be used to govern the different aspects of a business process in the context of service composition. It should be noted that the above is preliminary and that further work is required to identify the exact rules needed to steer the service composition development process.
5 Conclusions and Future Research
It is clear that current standards in service composition, such as BPEL, are not capable of dealing with the complex and dynamic nature of developing and managing composite web services. The challenge is therefore to provide a solution in which dynamic service composition development and management is facilitated in an automated fashion. In this paper we showed how business processes can be expressed in terms of service composition through the use of various types of composition elements. Subsequently, we explained our phased approach to service composition, in which five broad phases are distinguished to realize a business process, spanning composition definition, scheduling, construction, execution, and evolution. In order to cater for the need for flexible and dynamic service composition, we introduced
a rule driven approach that describes how business rules, referred to as composition rules, can be used to steer these five composition phases, e.g. to specify scheduling constraints for activities, criteria and conditions for task and resource selection, run-time constraints for service execution, and time, cost and quality concerns regarding the selection of service providers. We also defined a classification scheme for business rules that details how they can be applied in the context of service composition, an issue that has not been addressed in previous work on business rule classification. We argue that the approach presented in this paper not only makes service composition more flexible and dynamic compared to current standards, but also reduces the time and effort involved in composition development and management. Nevertheless, the work reported in this paper is at a very early stage. Future work will focus on the following issues:
– the specification and formalization of composition elements and rules;
– the design of a rule mechanism to manage and apply composition rules, e.g. rules as components, as services or as specifications;
– the architecture for the rule framework, e.g. a centralized versus a decentralized architecture; and
– the development of a change management system to manage the evolution of composition elements and rules and of defined service compositions.
References
1. Business Process Modelling Initiative; “Business Process Modeling Language”, June 24, 2002, http://www.bpmi.org
2. Business Rules Group; “Defining business rules, what are they really?”, July 2000, http://www.brcommunity.com
3. F. Curbera, Y. Goland, J. Klein, F. Leymann, D. Roller, S. Thatte, S. Weerawarana; “Business Process Execution Language for Web Services”, July 31, 2002, http://www-106.ibm.com/developerworks/webservices/library/ws-bpel/
4. B. Curtis, M.I. Kellner, J. Over; “Process Modeling”, Communications of the ACM, Vol. 35, No. 9, pp. 75–90, 1992
5. C.J. Date; “What Not How: The Business Rule Approach to Application Development”, Addison-Wesley Longman Inc., 2000
6. T.H. Davenport, J.E. Short; “The new industrial engineering: Information technology and business process redesign”, Sloan Management Review, Vol. 31, No. 4, pp. 11–27, 1990
7. U. Dayal, M. Hsu, R. Ladin; “Business Process Coordination: State of the Art, Trends, and Open Issues”, Proceedings of the 27th VLDB Conference, Rome, Italy, 2001
8. B. von Halle; “Business Rules Applied: Building Better Systems Using the Business Rule Approach”, Wiley & Sons, 2002
9. H.J. Harrington; “Business Process Improvement: The Breakthrough Strategy for Total Quality, Productivity, and Competitiveness”, McGraw-Hill, New York, NY, USA, 1991
10. M. Koubarakis, D. Plexousakis; “A Formal Framework for Business Process Modeling and Design”, Information Systems, Vol. 27, pp. 299–319, 2002
11. T. Morgan; “Business Rules and Information Systems: Aligning IT with Business Goals”, Addison-Wesley, 2002
12. A.W. Scheer; “Architecture for Integrated Information Systems, Foundations of Enterprise Modeling”, Berlin, Germany, 1992
13. F. Vernadat; “CIMOSA, A European Development for Enterprise Integration Part 2, Enterprise Modelling”, Proceedings of the First International Conference on Enterprise Integration Modelling, Austin, TX, USA, 1992
14. R. Veryard; “Rule Based Development”, CBDi Journal, July/August 2002
15. F. Leymann; “Web Service Flow Language”, May 2001, http://www.ibm.com/software/solutions/webservices/pdf/WSFL.pdf
16. J. Yang, M.P. Papazoglou; “Service Component for Managing Service Composition Life-Cycle”, Information Systems, June, Elsevier, 2003 (forthcoming)
17. J. Yang, M.P. Papazoglou; “Web Components: A Substrate for Web Service Reuse and Composition”, Proceedings of the 14th International Conference on Advanced Information Systems Engineering, Toronto, Canada, 2002
18. J.A. Zachman; “A framework for information systems architecture”, IBM Systems Journal, Vol. 26, No. 3, pp. 276–292, 1987
Virtual Integration of the Tile Industry (VITI)
Ricardo Chalmeta1, Reyes Grangel1, Ángel Ortiz2, and Raúl Poler2
1 Grupo IRIS, Dpto. Lenguajes y Sistemas Informáticos, Universitat Jaume I, Campus Riu Sec s/n, 12071 Castellón, España
{rchalmet,grangel}@uji.es
2 Centro de investigación (CIGIP), Universidad Politécnica de Valencia, Camino Vera s/n, 46022 Valencia, España
{aortiz,rpoler}@cigip.upv.es
Abstract. A virtual enterprise (VE) can be considered a temporary alliance between enterprises located in different parts of the world that intervene in the diverse phases of the life cycle of a product or service, and work to share skills, resources and costs. However, designing and managing an efficient and integrated virtual enterprise that presents the semblance of a single enterprise to customers is a very complex task. To support this, new methods enabling the integration of virtual enterprises must be developed and their use must be popularised through examples and application experiences. This paper reports the practical results of the Virtual Integration of the Tile Industry (VITI) research project. These results consist primarily of a systematic methodology, a set of models capturing best work practices, and a set of software applications for business integration adapted to the particularities of virtual enterprises. The collection of data and the validation and application of the results were made possible thanks to the collaboration of one of the leading virtual tile enterprises.
1 Introduction
A virtual enterprise, also known as an extended enterprise, consists of a temporary alliance of independent, globally distributed companies that share a view of what is being demanded by the market in the different phases of the life cycle of a product or service, and which operate in a cooperative manner, sharing skills and resources, in order to successfully accomplish a corporate strategy [1]. As a consequence of this high-level characterization, a virtual enterprise should also be an integrated enterprise, which means that changes in the internal or external environment should be dynamically reflected in its objectives, its actions, and its own composition as early as possible, while ensuring that the activities of all the components contribute to the overall objective in a coordinated way. The design and construction of an integrated organisation and functioning system in a virtual enterprise, using this approach, is an extremely complex process that involves different technological, human [2], and organisational elements. Hence, to achieve this objective, it is necessary to use methodologies, reference models of best business practices, information infrastructures, and computer enterprise engineering tools that help throughout the life cycle of a virtual enterprise.
2 VITI Project Different standards and technologies can be used in different phases of the life cycle of a virtual enterprise [3]. For example STEP for product data, SGML/XML for documents and EDI for electronic commerce are the three major data exchange standards in VE environments. On the other hand, there are also examples of use of Java agents, CORBA ORB, and XML for the development of a web interface of metadata for sharing product data [4]. However, there is a shortage of tools to aid the integration of a virtual enterprise during all its life cycle, since the main methodologies that exist [5], [6], [7] focus on the problem of integrating a single enterprise and are still not completely adapted to the needs of virtual enterprises [8], [9]. Finally, there are no examples of real projects involving the integration of virtual enterprises in the literature, which hinders their becoming more widespread. In this context, the IRIS Group of Universitat Jaume I in Castellón (Spain) has been working on the VITI (Virtual Integration of the Tile Industry) project since 1999. The objective is to develop and validate a step forward in the state of the art of Enterprise Integration by developing a methodology and a set of techniques, reference models, and software applications that enable all the elements (organisational, technological, human resources, etc.) of a virtual enterprise to be coordinated and integrated. One of the leading virtual tile enterprises in its sector collaborated in data collection, validation and application of results. To achieve its goal, the VITI project set out the following objectives: • Methodology for Virtual Enterprise Integration. The design of a methodology for directing the process of constructing a generic virtual enterprise, from the transactions between possible partners as part of the strategy management up to the design and implementation of the business processes. • Application of the Methodology to integrate the virtual tile enterprise that collaborated in the project in order to enhance its competitiveness, while at the same time the results were validated and a case study was obtained. • Development of a Reference Model that described the best work practices in the internal processes of the different enterprises forming part of the virtual tile enterprise that collaborated in the project, such as Warehouse, Production, Administration, Human Resources, etc. The model also had to include, among other aspects, the activities that are carried out, the decisions that are made, the role of human resources, the technology used and the information required. • Development of a Reference Model that enabled the dynamic relations within the whole virtual tile chain to be represented. The model had to show (1) the management system of the virtual enterprise, and (2) the cross-organisational business processes, that is, the processes in which the inputs and outputs are generated in other enterprises (for example, Sales, Purchases, Complaints, etc.) and (3) the supporting information system. • Technological Infrastructure. Application of information technologies to integrate and optimise the functioning of the relations among the members of the virtual tile enterprise. Information technology developments would be integrated within the Enterprise Resource Planning (ERP) of each enterprise and would be oriented towards Knowledge management, workflow and e-commerce. 
Tables 1, 2 and 3 present and justify the scientific and technological approaches selected to accomplish the goals of the project.
Table 1. Goal 1: Methodology for Virtual Enterprise Integration
Objective: Methodology for Virtual Enterprise Integration
Approach: GERAM (result of an IFIP/IFAC Task Force and the ISO working group tc184/sc5/wg1).
Justification: Some of the benefits of using GERAM were: (1) it unifies the concepts used by researchers concerning the methods, models and tools needed to construct an integrated enterprise; (2) it organises the different components needed for the integration of the enterprise; and (3) it defines the requirements that a reference architecture must fulfill in order to be considered complete.
Approach: Disciplines for the planning and implementation of VE.
Justification: Such as: national and international business relations, systems theory, knowledge management and techniques enabling the representation of tacit knowledge, holistic systems, etc.
Table 2. Goal 2: Reference Models
Objective: Reference Models
Approach: Benchmarking.
Justification: To identify best work practices.
Approach: Process reengineering.
Justification: The business process should be used as the structural unit underlying the four basic aspects that constitute enterprise activity (i.e. functional, resources, informational and decisional).
Approach: ISO 9000, ISO 14000 standards.
Justification: These standards must be borne in mind when developing the models of internal and cross-organisational business processes.
Approach: Framework for enterprise modelling: IDEF0; GRAI-GIM; UML/Agents; Entity/Relationship (E/R); Workflow model of the Workflow Management Coalition (WMC).
Justification: To structure and represent the reference models, the IDEF0 technique should be used for the operating processes and the GRAI-GIM technique should be applied in the case of the management system. These techniques were chosen because they allow us to represent the different components of the processes (activities, decisions, information and resources) with an adequate level of detail, while at the same time their graphic representation is simple and easy to understand. The modelling technique oriented towards UML objects and agents should be used in the modelling of the flow of the information system. The virtual enterprise database should be represented with the E/R model, due to the ease with which it can be matched to the relational databases the enterprises currently work with. The WMC model must serve as a guide in the correct organisation and structuring of the workflow system in the VE (roles, rules, routes, tracking, integration with IT, system management, metrics and statistics, etc.).
Table 3. Goal 3: Technological Infrastructure
Objective: Technological Infrastructure for Knowledge Management, Workflow, and e-business
Approach: Computer tools for document management and control.
Justification: Documents contain a large proportion of the explicit knowledge in the enterprise, and also enable the person who possesses that knowledge to be identified. Thus, software applications had to be developed which allowed all stages of the life cycle of the documents to be managed, i.e. their creation, revision, approval, distribution, and storage.
Approach: WMC development software.
Justification: A workflow model must be implemented in computerised form if it is to be of any use. To do so, computer applications integrated with the ERP system had to be developed [10].
Approach: Development software and integration of e-business applications.
Justification: The development of software for the automation of the processes and the information flow among the enterprises involved in the virtual tile logistics chain. The software must be multiplatform so as to allow connection between the diverse computer systems.
3 Methodology for Virtual Enterprise Integration
The starting point of the project was the determination of the methodology, which would allow us to define the master plan to execute an integration programme in a virtual enterprise. From a practical point of view, the result can be summarised as follows (see Fig. 1):
1. Definition of the conceptual aspects of the virtual enterprise (search for partners, tender formation, negotiation/agreements, contract awarding and management, definition of the virtual enterprise, mission, vision and values, virtual enterprise strategy, objectives, and general policies) and of each single enterprise (mission, vision, strategy, policies, and enterprise values).
2. Redesign of the new process map (internal business processes and cross-organisational business processes that are affected by changes), according to the previously defined concepts.
3. Implementation of the new VE process map, organising and managing human resources according to this map.
4. Extension of the information system (and the technological infrastructure) to support the process map of the virtual enterprise, considering the different levels of decision and the support technology.
In the following section each of the steps of the methodology is described together with the results of their application to a virtual tile enterprise.
[Figure 1 outlines the methodology as a flow from the conceptual aspects of the virtual enterprise (partner transactions, VE strategy, VE objectives) and of the single enterprises (strategy, objectives), through the redesign of cross-organisational and internal business processes (AS-IS modelling and analysis, diagnosis and proposals of improvements, TO-BE process map), to the implementation of the virtual enterprise (implementation of improvements, human structure, knowledge management, cultural change, continuous improvement), supported by an information system spanning the strategic, tactical and operative levels for both cross-organisational and internal business processes.]
Fig. 1. Methodology for Virtual Enterprise Integration
4 Application of the Methodology to a Virtual Tile Enterprise
Once the methodology for the integration of a generic virtual enterprise had been defined, it was used as a guide for the integration of a virtual tile enterprise. This is a kind of VE known as Core Business, where an organisation (tile manufacturers) concentrates on the main activities of the value chain of a product and looks for temporary complementary enterprises such as suppliers, customers, distributors, etc. The results obtained are described below according to the different phases of the methodology.
4.1 Definition of the Conceptual Aspects of the Virtual Enterprise and of Each Single Enterprise
In accordance with the above methodology, a broad definition of the principal objectives of the virtual tile enterprise was established, beginning with the identification of the business, mission, vision, values, and an outline of the customer groups and market segments it was focused on. The next task was to define the strategy of the virtual enterprise, not in terms of the current situation of the company, but as an opportunity to improve. To do so it was necessary to (1) analyse the strategic problems the virtual enterprise faced with respect to its competitors; (2) define the main deficiencies with respect to the
formulation of strategies, proposing improvements, and (3) establish the strategic objectives for the virtual enterprise. The definition of the conceptual aspects of the virtual enterprise is closely related to the conceptual aspects of each individual component company. In this case, the tile manufacturing company occupied a dominant position in the chain. Hence, the strategy and vision of this company determined those of the virtual enterprise; this in turn conditioned the conceptual aspects of the other companies in the chain, which had to modify their strategy and objectives.
4.2 Redesign of the Process Map
The participation of an enterprise in a virtual enterprise involves redesigning the way it traditionally carries out its processes, since the possibilities of enhancing efficiency through collaboration with other enterprises will also be taken into account. In this way, an enterprise’s processes are performed in synchrony with the other participants, with more frequent and better exchanges of information and knowledge. Tables 4 and 5 show two of the critical macroprocesses of the virtual tile enterprise. The microprocesses that go to make up each macroprocess, the different enterprises that take part in each of them and the traditional business processes (internal and cross-organisational) that are affected in each enterprise are also indicated. The symbol ∅ refers to new processes that appear in an enterprise as a result of the need to carry out new exchanges of information and knowledge, due to its participation in the virtual enterprise. To define the best work practices in the critical business processes of the virtual tile enterprise, the current situation (AS-IS) was analysed and redesigned (TO-BE) in order to achieve the strategic objectives. This phase of the methodology could be associated with a reengineering project if significant changes based on information technologies have been identified. If small changes are obtained, it may be associated with a continual improvement project. These improvement approaches are clearly complementary and not at all exclusive. As an example, the description of the virtual logistics chain management is shown below. This process can be defined as all the activities related to the management of the final customer’s orders. It covers all the stages from obtaining the raw materials, planning production, the manufacture of the tile products and their storage, up to transport and distribution to the final customer. The integration of this process was performed on two levels:
• Integration of the planning and management system. This involves coordinating the planning of production and services in all the enterprises involved in the logistics chain and is based on the forecasts of sales to final customers. The benefits to be obtained in this area are increased reliability at the different levels of planning (aggregated, programming and sequencing), which in turn leads to reductions in stock, costs, breakages, etc., and better customer service.
• Integration of the process of dealing with orders, using e-commerce technologies. The benefits in this case lie in the reductions obtained in the time and costs involved in collecting and processing customers’ orders and in submitting orders to suppliers, as well as all the other commercial documents.
Table 4. Sales business process
Macroprocess: Sales
Microprocess (virtual enterprise): Virtual logistics chain. Participation and traditional processes affected (individual enterprises):
– Distributor: Sales order management; Purchases order management
– Manufacturer: Sales order management; Production; Purchases order management
– Supplier: Sales order management; Production; Purchases order management
– Carrier: Service order management; Transport service; Purchases order management
Table 5. Virtual Enterprise management business process
Macroprocess: Management of the Virtual Enterprise
Microprocesses (virtual enterprise), participation and traditional processes affected (individual enterprises):
– Coordination of the enterprises: Distributor, Manufacturer, Supplier, Carrier – ∅
– Inter-enterprise work teams (HR): Distributor, Manufacturer, Supplier, Carrier – Human Resources Management
– Technological infrastructure: Distributor, Manufacturer, Supplier, Carrier – Information Systems Management
– Quality Management: Distributor, Manufacturer, Supplier, Carrier – Quality Management
– Administration: Distributor, Manufacturer, Supplier, Carrier – Customer accounting; Supplier accounting; Financial management; Tax management
One of the most important requirements for implementing this new process was the interconnection of the computer systems so as to be able to transfer the information generated in the different enterprises from the proposals for the manufacture of end
products to the needs for materials and services of each supplier, for each level of planning and, in addition, the use of electronic documents to replace the flow of documents and telephone calls associated with these orders. Implementing the best practices in this microprocess gave rise to important changes in the traditional processes of purchase, manufacture, storage, and sales management used by the different participants in the virtual enterprise. The most notable changes can be seen in Table 6. For the analysis and detailed description of the best work-practices of each (macro or micro) process, different modelling techniques were used and the results of benchmarking with other enterprises were also taken into account. Thus, the activities that should be carried out, the decisions that were to be taken, the resources used, the information that was needed and generated, the sequence logic and the roles and the knowledge of the human resources involved were all identified in the models of each of the processes. It must be highlighted that the difficulty in generating the reference models of the virtual enterprise does not lie in the modelling languages, which are essentially the same as those used in a single enterprise, but in reflecting in the models the possibilities of improving efficiency through inter-enterprise cooperation. Fig. 2 shows the different modelling techniques used according to the view of the process that was sought. Table 6. Changes in the microprocess: Virtual logistics chain management
Suppliers: These obtain information about the sales forecasts and about the production planning of the manufacturers through an extranet. The documentation used throughout the order-invoice-bill cycle is exchanged by computer using EDI messages and documents in XML format.
Manufacturers: The computer system integrates all the structured data of the enterprise (transactional systems: legacy data, ERP, spreadsheets, etc.), as well as all the unstructured information (emails, videos, external databases, etc.), through a data warehouse system and sends this information on to the end users (staff, customers, suppliers) through a Web format and in real time, so they can analyse it and are able to generate knowledge. Use of a Customer Relationship Management (CRM) system, in which the following aspects are reinforced: complaints management, enquiries about tile products placement, tailor-made project design.
Carriers: These use EDI messages so that the planning of delivery orders is managed by the manufacturer and not imposed at random by customers. In addition the use of EDI messages also allows more information about consignment arrangements to be transferred from the manufacturer to the carrier.
Distributors and final customers: The exchange of commercial documents (orders, delivery notes, invoices, pro forma invoices) is performed from computer to computer with the subsequent saving in time, mistakes and costs. Information is also supplied through an extranet so that customers can always check the state of their orders and other matters, such as the availability of material for future orders.
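As Table 6 notes, commercial documents are exchanged as EDI messages and as documents in XML format. Purely as an illustration, and assuming invented element names that do not reproduce the project's actual message formats, an order document could be assembled and serialised in Python as follows:

# Minimal sketch: building an XML order document for computer-to-computer exchange.
# All element and attribute names are illustrative, not the VITI message formats.
import xml.etree.ElementTree as ET

order = ET.Element("order", id="2003-0042")
ET.SubElement(order, "customer").text = "Distributor-01"
item = ET.SubElement(order, "item", product="tile-30x30-white")
ET.SubElement(item, "quantity").text = "1200"
ET.SubElement(item, "deliveryDate").text = "2003-11-15"

print(ET.tostring(order, encoding="unicode"))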
[Figure 2 tabulates the modelling technique used for each view of a process: IDEF diagrams for the flow of activities, information, resources and decisions and for the whole definition of the sequence of the process (workflow); a finite state automaton for the states; an organisation chart for the organisational structure; and a UML temporal sequence diagram for the temporal evolution of the process. The example column shows fragments of the Spanish-language complaints-management (gestión de reclamaciones) process models.]
Fig. 2. Techniques used to model process
4.3 Implementation Once the best work practices models had been defined, the processes currently used by the virtual tile enterprise taking part in the project were checked and, following the methodology and taking advantage of the possibilities offered by the software, a set of improvement projects were defined. The projects were prioritised according to a feasibility study and, within the VITI Project, those that were feasible in the short term and were related to the software and the inter-enterprise process were implemented. The users of the virtual enterprise were also trained so that (1) they were capable of implementing medium and long-term improvement processes, with which the integration programme would have concluded, and (2) they could define a continuous improvement procedure, in order to allow the VE to evolve with its environment. Therefore, communication and training play an important role in this process of change, which affects inter-enterprise relations and changes the traditional workpractices between companies. 4.4 Human Resources All the workers and managers have to know what their activities and responsibilities are, not only for the management of each enterprise but also for the management
system among enterprises (who, what, where, how and when). This forced them to create a less departmentalised company with a less hierarchical structure which was more oriented towards cross-organisational business process management. Furthermore, the models that were developed identify the roles and capabilities of the employees involved and this, with the help of a computer system, will enable the identification, extraction, processing, storage, and distribution of the knowledge of all the component companies, thus allowing them to share the experience and skills of the human resources.
4.5 Information Systems
The models with the best work practices in the virtual tile enterprise determined the requirements of the technological infrastructure supporting the internal and cross-organisational processes of each company [11]. This mainly affected the ERPs and the CRM systems, and meant using technologies that allowed the efficient exchange of information in electronic form, and the development of the software needed to support the models developed in phase 2 of the methodology. This infrastructure takes into account the different technologies that can be utilised to:
• Distribute and exchange among enterprises the information needed for the execution of activities and decision making, which includes the knowledge and abilities of human resources. Human knowledge can refer to the company employees (EKM, Employment Knowledge Management), other companies (BKM, Business Knowledge Management), and customers (CKM, Customer Knowledge Management).
• Establish a workflow system which allows the automated exchange of information, guaranteeing both the quality of the processes that take place within the virtual company and ensuring that information flows in the right circuits.
• Establish e-commerce relationships in three fields: B2C (Business to Customer), B2B (Business to Business) and B2E (Business to Employee).
Moreover, from the point of view of the virtual enterprise, the technological infrastructure was based on interconnection and integration. On the one hand, it was necessary to connect the distinct intranets of the participating enterprises so as to create the extranet of the virtual tile enterprise with suitable communication protocols. On the other hand, the software used by the different enterprises taking part was also integrated. While interconnection meant overcoming a technical problem, integration was a far more complex task, as it involved determining which output streams from the different external processes would become the input of others, and what format they would have. The following are the solutions that were adopted:
• Knowledge management: explicit knowledge was automated by the use of document management software and work is currently being done to identify, process and store tacit knowledge.
• Workflow: this was implemented using Lotus Notes. With this tool it was possible to execute both the modelling and the implementation. Implementation involves drawing up a series of forms which, through a set of warnings and appended documents, guide company employees in the execution of the time sequence of the process (see the sketch below).
• E-commerce: the exchange of documents was carried out by exchanging EDI messages across the virtual enterprise’s extranet.
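As indicated in the workflow bullet above, the forms essentially encode who must act on each step of an inter-enterprise process and which step follows. The sketch below expresses that routing logic in generic Python; the step and role names are invented for illustration and the code does not reproduce the Lotus Notes implementation used in the project.

# Generic sketch of workflow routing: each step names the responsible role and the
# next step, so a form (with its warnings and appended documents) can be passed on.
WORKFLOW = {
    "register_order":    {"role": "distributor",  "next": "plan_production"},
    "plan_production":   {"role": "manufacturer", "next": "arrange_transport"},
    "arrange_transport": {"role": "carrier",      "next": "confirm_delivery"},
    "confirm_delivery":  {"role": "distributor",  "next": None},
}

def route(step):
    # Return who must act on the current step and which step follows it.
    entry = WORKFLOW[step]
    return entry["role"], entry["next"]

step = "register_order"
while step is not None:
    role, step = route(step)
    print("notify", role)  # in practice: send the form and warnings to this role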
5 Results / Discussion
Some of the lessons learned from the VITI project are:
• A cultural transformation must be brought about in companies in order to overcome resistance to sharing information and knowledge.
• The coordination of companies with differing objectives, strategies and resources in order to manage them as though they were one company presents huge difficulties. However, the companies taking part in the project gradually began to see that this change enabled them to achieve a competitive advantage over the rest of their competitors, and their resistance became weaker.
The technical results can be summed up as follows:
• The proposal of a formal methodology that can be used as a guide during the design and implementation of an organisation and functioning system in a virtual enterprise. The methodology can be broken down into a series of stages and for each of these stages it indicates which support techniques, models, procedures and computer tools are to be used.
• A real case of the application of the concepts of business integration in a virtual enterprise. In this way it is possible to study the problems involved in introducing these new organisational methods in the enterprises.
• Enhanced competitiveness in the virtual tile enterprise that was participating in the project, by (1) designing and implementing high quality internal and external business processes, and defining the activities and flow of information and associated materials in a suitable fashion, (2) setting up an integrated management system which has all the levels properly configured with the aim of managing the virtual tile enterprise efficiently, (3) putting to good use the possibilities offered by the information technologies in order to optimise work procedures (knowledge management, workflow, e-commerce, etc.) and (4) improving the collaboration among the members of the chain, so that every member becomes more involved in the place of the others. This builds the virtual tile enterprise culture and makes it easier to generate a common front that works jointly to increase market share against tile substitute materials, and on research into innovations in the production process and in the applicability of the products.
6 Conclusions The VITI Project provides guidelines for (1) the flexible and pro-active cooperation between the organisations involved in the value chain of a product or service in order to obtain an efficient and quick response to or anticipation of market changes, (2) the design of high performance business processes, supported by the new information technologies, and (3) the reorganisation of a VE structure, providing tools that enable new organisations to enter or leave the value chain in a dynamic manner. Lastly, these results enable the state of the art in business integration to take a leap forward, as no formal, systematic methodologies for business integration or literature about the practical execution of virtual enterprise projects existed.
Acknowledgements. The VITI project was funded by CICYT and several other enterprises.
References
1. Bernus, P.: Business Evolution and Enterprise Integration. In: Kosanke, K., Nell, J. (eds.): Proceedings ICEIMT97. Springer Verlag, Berlin (1997) 140–151
2. Hawa, M., Ortiz, A., Lario, F., Ros, L.: Improving the role played by humans in the development of enterprise engineering and integration projects. International Journal of Computer Integrated Manufacturing. 15–4 (2002) 335–344
3. Camarinha-Matos, L.M., Afsarmanesh, H.: Elements of a base VE infrastructure. Computers in Industry. (Accepted to be published in 2003)
4. Yoo, S.B., Kim, Y.: Web-based knowledge management for sharing product data in virtual enterprises. International Journal of Production Economics. 75 (2002) 173–183
5. AMICE: CIMOSA: Open System Architecture for CIM. ESPRIT Consortium AMICE. Springer-Verlag, Berlin (1993)
6. Doumeingts, G., Vallespir, B., Zanettin, M., Chen, D.: GIM, GRAI Integrated Methodology, A Methodology for Designing CIM Systems. A technical report of the IFAC/IFIP Task Force on Architectures For Integrating Manufacturing Activities And Enterprises, GRAI/LAP, Version 1.0, Bordeaux (1992)
7. Williams, T.: The Purdue Enterprise Reference Architecture. Proceedings of the Workshop on Design of Information Infrastructure Systems for Manufacturing. Elsevier Science, Tokyo (1993)
8. Poler, R., Lario, F.C.: Dynamic Model of Decision Systems (DMDS). Computers in Industry. 49 (2002) 175–193
9. Ortiz, A., Lario, F., Ros, L.: Enterprise Integration-Business Processes Integrated Management: a proposal for a methodology to develop Enterprise Integration Programs. Computers in Industry. 40 (1999) 155–171
10. Ortiz, A., Lario, F., Ros, L., Hawa, M.: Building a production planning process using an approach based on CIMOSA and workflow management systems. Computers in Industry. 40 (1999) 207–219
11. Chalmeta, R., Grangel, R.: ARDIN Extension for Virtual Enterprise Integration. Journal of Systems and Software. (Accepted to be published in 2003)
Preface to IWCMQ 2003
This section of the ER’03 workshop proceedings includes the papers accepted for the Second International Workshop on Conceptual Modeling Quality (IWCMQ’03) held in Chicago, Illinois, USA, on October 16, 2003. Conceptual modeling has been recognized as a key task that lays the foundation of all later design and implementation work. The early focus on conceptual modeling may help to build better systems, without unnecessary rework at later stages of development when changes are more expensive and more difficult to perform. Quality in conceptual modeling has been a topic of research since the early nineties, but recently a stronger emphasis has been given to the assessment, evaluation, and improvement of the models produced in the early phases of the system development life cycle. IWCMQ’03 invited papers that explored the foundations of conceptual modeling quality, and methods for assessing, evaluating, and improving conceptual modeling quality. The workshop provided a forum for researchers and practitioners working in these and other areas to meet to discuss and push the boundaries of this very important and active area of research. We received a total of sixteen papers, each of which was reviewed by at least three program committee members. After a rigorous review process, seven papers were accepted. These papers, which are included in this section, fall into two broad categories. The first category explores theoretical aspects of conceptual modeling quality with explorations into web conceptual design models (Franca Garzotto and Vito Perrone of the Politecnico di Milano, Italy), reference models (Peter Fettke and Peter Loos of the Johannes Gutenberg-University Mainz, Germany), UML behavioral diagrams (Marcela Genero, David Miranda, and Mario Piattini of the University of Castilla-La Mancha, Spain), and multidimensional schemas (Samira Si-Said Cherfi of CEDRIC-CNAM and Nicolas Prat of ESSEC, France). The second category describes conceptual model quality in practice in the fields of model consistency checking (Monique Snoeck, Cindy Michiels, and Guido Dedene of the Katholieke Universiteit Leuven, Belgium), query formulation (Hannu Jaakkola of Tampere University of Technology, Finland, and Bernhard Thalheim of the Brandenburg University of Technology at Cottbus, Germany), and the modeling of accounting systems (Geert Poels of Ghent University, Belgium). We would like to express our thanks to the program committee members for their rigorous reviews of the papers, the ER’03 organizing committee (especially the Conference Chair and the Workshop Chairs) for their help and support, all the authors who submitted papers, the invited speaker, and the attendees.
October 2003
Jim Nelson Geert Poels
Multiperspective Evaluation of Reference Models – Towards a Framework
Peter Fettke and Peter Loos
Johannes Gutenberg-University Mainz, Lehrstuhl für Wirtschaftsinformatik und Betriebswirtschaftslehre, ISYM – Information Systems & Management, D-55099 Mainz, Germany
{fettke,loos}@isym.bwl.uni-mainz.de
http://www.isym.bwl.uni-mainz.de
Abstract. Within the information systems field, reference models have been known for many years. Despite the relevance of reference model quality, little research has been done on their systematic evaluation. Based on an analysis of prior work on reference model quality, we propose a framework for the multiperspective evaluation of reference models. The framework comprises 15 perspectives. Each perspective is discussed with respect to its strengths and limitations. As well, we provide examples of the types of research that have already been undertaken on each perspective.
1 Introduction
Within the information systems field, information modeling is a vital instrument to analyze, design, implement, and deploy information systems [1, 2, 3, 4]. However, the modeling process is often resource-consuming and faulty. The concept of reference modeling has been introduced as a way to improve and accelerate the modeling process [5, 6, 7]. There is a great deal of terminological confusion in the modeling literature. For example, the term “model” is often used for different purposes. To avoid confusion, we use the following definitions: A grammar “provides a set of constructs and rules that show how to combine the constructs to model real-world domains” [3, p. 364]. In the remainder of this paper, we always refer to analysis grammars, e. g. the Entity Relationship Model (ERM) or the Unified Modeling Language (UML). And while a modeling method “provides procedures by which a grammar can be used” [3, p. 364], scripts are the product of the modeling process. “Each script is a statement in the language generated by the grammar” [3, p. 364]. A script is a representation of a real-world domain using a particular grammar. A reference model is a script representing a class of domains (e. g. Scheer’s reference model for production planning and control systems [8]). It is a conceptual framework which can be used as a blueprint for information system development. Reference models are also called universal models, generic models, or model patterns. To use reference models, they must be adapted to the requirements of a specific enterprise. We refer to such an adapted model as an application model.
We assume that the effectiveness and efficiency of the application of a reference model are strongly determined by the quality of the model. However, the quality of a reference model comprises several aspects: From a teaching- and learning-oriented point of view, the facts about enterprises represented in reference models should be understandable, so that the knowledge about enterprise systems can be communicated easily. From a user-oriented point of view, a reference model should be flexible and adaptable, so that its fitness for the intended application is very high. From an enterprise-oriented point of view, the purchase and usage of a reference model should make the development of enterprise systems more efficient. From an economic point of view, reference models should increase the productivity of the national economy. From a science-oriented point of view, a reference model should represent a specific industry-type completely, precisely, consistently and correctly. The purpose of an evaluation is to determine the value and usefulness of a reference model. The evaluation of a reference model validates and checks both its suitability and the claims made for the model. Results of known approaches to reference model and script evaluation show that an objective evaluation is problematic for several reasons [9]; e. g. a reference model can be used in various application areas such as software development or business process reengineering, and these application areas have different quality requirements. Such problems should not tempt us to refrain from the evaluation of reference models altogether, leaving reference models to be assessed merely intuitively with respect to their “plausibility”. Rather, the evaluation of a reference model should be undertaken systematically because of its high relevance: First, an evaluation leads to a better understanding of the nature and characteristics of a reference model. Second, an evaluation of reference models makes it possible to identify similarities and differences between reference models. Such investigations prevent redundant work and can reveal new application areas of reference models. Third, since no reference model is suitable for all situations, we need to know which reference model should be chosen in which situation. Evaluation and comparison of reference models provide a viable means to gather this information. Fourth, an evaluation makes it possible to assess the quality of the research outcome “reference model”, which can be understood as a theory in the information systems field. In doing so, the evaluation of reference models guarantees the quality of research. Although several reference models are known [10, pp. 619-707, 11] and their quality is of high theoretical and practical relevance, little research has been undertaken on their systematic evaluation [6]. This critical finding leads to the following research question: How can known reference models be prepared, described and assessed in a comparable manner? This paper contributes to this research question by proposing a research framework. The framework comprises 15 perspectives for the evaluation of reference models. Each perspective is discussed with respect to its strengths and limitations. As well, we provide examples of the types of research that have already been undertaken on each perspective. The examples given investigate reference models primarily, although prominent approaches to the evaluation of scripts are referenced too. For reasons of brevity, we cannot discuss each perspective in detail.
Instead the discussion in this paper is broad and provides a starting point for future research.
The remainder of this paper is structured as follows: An overview of the proposed framework is given in Section 2. The evaluation perspectives of the framework are described in Section 3. Finally, section 4 presents conclusions and directions for further research.
2 Overview of the Framework To systemize approaches to model evaluation we use two main criteria which are derived from a scientific philosophical point of view (Fig. 1). On the one hand, the research method can be distinguished by analytical and empirical evaluation approaches. Analytical approaches are based on logical conclusions, while empirical approaches are based on experiences. On the other hand, the way in which the quality criteria used by an evaluation approach is introduced can be either ad hoc or theorydriven. Theory-driven quality criteria are derived from and founded on a specific theory, a so-called reference theory [12]. Whereas ad hoc quality criteria are just introduced for the purpose of the evaluation approach without referring to a specific reference theory. The two proposed criteria constitute four different groups of evaluation approaches which in the following will be further refined. First, the two empirical groups can be refined with respect to the applied research method. In the following we distinguish survey, laboratory experiment, field study, case study and action research. Because there exist just a few empirical approaches on reference model evaluation, we do not separate empirical evaluation perspectives with respect to the referred reference theory. However, we admit that a more sophisticated analysis of empirical research on reference model evaluation is possible. Second, we refer to analytical and ad hoc approaches to the evaluation of reference models as descriptive evaluation perspectives. Descriptive evaluation perspectives are further refined with respect to the degree of their operationalization. Here, we distinguish plain text-based, featurebased and metric-based evaluation. Third, the group of analytical and theory-driven approaches are separated in terms of scientific origin of the reference theory used. We differentiate Information Systems (IS) theory-based and non-Information Systems (non-IS) theory-based perspectives. IS theory-based perspectives are meta modelbased, master reference model-based and paradigmatic evaluation. Non-IS theorybased perspectives are contingency theory-, ontology-, cognitive psychology- and economic-based evaluation. The resulting framework consists of 15 perspectives which are structured in four groups.1 We refer to this systematization as a framework for the multiperspective evaluation of reference models. Note, that each perspective of the framework is introduced by definition. The term “multiperspective evaluation” points out that reference models can be evaluated with respect to different perspectives. The need to examine research results from different perspectives is pointed out by [14, pp. 646647] too. The next Section describes each perspective in more detail.
1 The proposed framework is inspired by the survey of approaches to evaluation of modeling methods described in [13].
[Figure 1 arranges the evaluation perspectives along two dimensions: the research method (analytical vs. empirical) and the foundation of the quality criteria (ad hoc vs. theory-driven). Analytical/ad hoc: descriptive perspectives (plain text-based, feature-based, metric-based evaluation). Analytical/theory-driven: IS theory-based perspectives (meta model-based, master reference model-based, paradigmatic evaluation) and non-IS theory-based perspectives (contingency theory-based, ontology-based, cognitive psychology-based, economic-based evaluation). Empirical: survey, laboratory experiment, field study, case study, action research.]
Fig. 1. Framework for multiperspective evaluation of reference models
3 Multiperspective Evaluation
3.1 Descriptive Perspectives
3.1.1 Plain Text-Based Evaluation
A plain text-based evaluation discusses the characteristics, strengths and weaknesses of a reference model verbally. The evaluation is more or less structured and does not use a specific set of evaluation characteristics. Furthermore, no specific reference theory is referred to. Examples of plain text-based evaluations are [10, pp. 615-707, 15, 16, pp. 111-209]. This evaluation approach is subjective. It is often unclear why specific aspects are discussed and others are left out, so the evaluation is not very systematic. The strength of this approach is that it can be performed easily. It is simple to point out specific characteristics of a reference model. Furthermore, this approach can be undertaken even if no generally accepted quality criteria have been agreed upon.
3.1.2 Feature-Based Evaluation
Feature-based evaluation approaches define a specific set of features which can be used to describe and characterize reference models. The feature set is introduced ad hoc and not derived from a theory. Evaluation criteria may comprise the modeling grammar used, the industry-type represented or the purpose of the reference model. Examples of this approach are [6, 11, 17]. Furthermore, we point to the
guidelines for reference modeling (GoM) proposed in [18].2 We characterize these guidelines as a feature-based approach because they are not founded on a sound theory, e. g. it cannot be argued why the proposed six guidelines are used and not others such as the “guideline of stability” or the “guideline of flexibility”. We are unaware of any existing approaches to evaluate reference models based on these GoM. Prominent feature-based approaches to evaluate scripts are [20, 21]. However, these two approaches do not address the evaluation of reference models. The problem with the feature-based evaluation is that the development and selection of a specific feature set is often a subjective issue. Furthermore, the criteria used are often ambiguously defined. Like the plain text-based perspective, the strength of this approach is that it is relatively easy to perform.
3.1.3 Metric-Based Evaluation
Following the concept of software metrics, metrics for reference models define how to measure characteristics of a reference model. Normally, we can distinguish between metrics for the modeling process and metrics for the product of the modeling process, e. g. the reference model and the application model respectively. Unlike a feature-based evaluation, this approach assumes that each metric is operationalized and can be determined objectively. We are not aware of any existing metric-based evaluation approaches. However, the author of [18, pp. 134-137] describes the potential to operationalize the GoM, although he rejects this approach. General metrics for process and data models are proposed by [22, 23]. The problem with this kind of evaluation approach is the objective measurement of the introduced metrics. While determining some metrics is easy (e. g. the number of entity-types as a measure of model size), the measurement of other aspects is difficult (e. g. the adaptability or usability of a reference model). Further problems can arise when interpreting metrics because they normally allow only relative rather than absolute conclusions.
3.2 IS Theory-Based Evaluation
3.2.1 Meta Model-Based Evaluation
An evaluation of reference models can be based on meta models. A meta model is, loosely speaking, a model of a model and can be conceptualized in two ways [24]. A meta grammar model can be defined as a model of the modeling grammar used for representing the reference model. A meta grammar model-based evaluation analyzes the set of modeling constructs and the rules that show how to combine the constructs of a grammar. As a result, it is possible to check whether the reference model is syntactically correct. Another conceptualization understands the term meta model as a model of the process of constructing the reference model or the application model, respectively (meta process model). An evaluation based on this process-based conceptualization makes it possible to analyze the structure of the development and application process of reference models.
2 This PhD thesis is written in German; an overview of this approach in English is given in [19].
We are not aware of any existing meta model-based evaluations. However, we believe the strength of this approach is that it is based on clear quality criteria, though it allows only a technical evaluation. If it is presumed that product quality is positively correlated with process quality, then a meta process model-based approach makes it possible to evaluate the semantic quality of reference models indirectly (cf. quality approaches in software engineering such as the Capability Maturity Model (CMM) [25]).

3.2.2 Master Reference Model-Based Evaluation
Whereas a reference model can be used to evaluate application models, a master reference model can be used to evaluate reference models. We define a master reference model as a reference model that does not represent a specific class of domains, but all classes of enterprise domains. As such, a master reference model represents all industry types. This evaluation approach analyzes whether the facts represented in a reference model are equivalent to their representation in the master reference model. Furthermore, a reference model can be classified according to the master reference model. The main limitation of this approach is that currently no master reference model for enterprise systems is known. In the meantime, this practical problem can be addressed by arbitrarily declaring a specific reference model to be a master reference model. A further problem of this approach is how the quality of the master reference model itself is ensured. Nevertheless, this approach allows a systematic and well-founded evaluation of reference models.

3.2.3 Paradigmatic Evaluation
A paradigmatic evaluation analyzes the meta-theoretical assumptions of a reference model. Meta-theoretical assumptions comprise four aspects [26, 27]: (1) ontological aspects, (2) epistemological aspects, (3) linguistic aspects, and (4) contextual aspects. We are not aware of any existing paradigmatic evaluations of reference models. However, different meta-theoretical assumptions of quality frameworks for information models are discussed in [9]. On the one hand, we expect that a paradigmatic evaluation can reveal important theoretical differences between reference models. On the other hand, from a practitioner's point of view, these differences are minor and reveal little information about the application domain of a reference model. In addition, we point out that no specific set of meta-theoretical assumptions is inherently superior [13].

3.3 Non-IS Theory-Based Evaluation

3.3.1 Contingency Theory-Based Evaluation
This evaluation approach is based on contingency theory. One main idea of contingency theory is that there is no universal or single best way to manage an organization. Instead, the design of an organization and its subsystems must 'fit' with the environment. This evaluation approach analyzes the characteristics of factors in the organization or its environment that determine the suitability of a reference model. Today, we are aware of only sporadic approaches of this type of evaluation [28, 29]. However, we think this approach is of high relevance because its outcomes can be used directly by practitioners. Admittedly, the evaluation results have a hypothetical character and must be tested empirically.
3.3.2 Ontology-Based Evaluation
We refer to the term ontology as a branch of philosophy that deals with what exists, or is assumed to exist, in the world [30, pp. 5-6]. Hence, ontology provides a foundation for conceptual modeling if it is presumed that reference models or scripts represent some real-world system [31]. Reference models can be analyzed with respect to their ontological correctness. To do this, it is necessary to map the constructs of a reference model onto the constructs of an ontological model. This mapping is called an ontological normalization [32]. An example of an ontological evaluation is given in [32]. Furthermore, there exists some work on the ontological evaluation of modeling grammars (e.g., [33]) and modeling practices (e.g., [34]). This evaluation approach allows a strong theoretical foundation. However, from a theoretical point of view it is not strictly comprehensible why Bunge's ontology should be chosen as a foundation and not, e.g., the ontology proposed by Chisholm [35].

3.3.3 Cognitive Psychology-Based Evaluation
A cognitive psychology-based evaluation analyzes to what degree reference models support or impede processes of human information processing. This approach stresses that reference models are used to communicate knowledge of the problem domain among project team members [36]. The evaluation of reference models therefore has to consider cognitive aspects of modeling, e.g., the number and color of modeling elements and their positioning in a diagram. Thus, theories from cognitive psychology can be used as a foundation for the evaluation of reference models. We are unaware of any existing cognitive psychology-based evaluations of reference models, though some authors use cognitive theories to evaluate modeling scripts [36, 37]. We argue that this evaluation approach leads to a strong foundation and is important for overcoming the intuitive design and layout of reference models. However, it is necessary to test and verify the referenced cognitive theories in the information systems field intensively, because, e.g., the information processing of experienced practitioners may be conditioned in a specific way.

3.3.4 Economic-Based Evaluation
The use of a reference model should make the development of enterprise systems more efficient. Hence, we postulate an economic-based evaluation of reference models. Such an evaluation can be undertaken both from a business management and from a national economic standpoint; e.g., investment theory or market theory may be used as an evaluation foundation. We are not aware of any existing economic-based evaluations of reference models. From our point of view, this approach leads to few concrete results because the assessment basis is deficient; e.g., a priori, the benefits of a reference model can only be estimated and not exactly determined.

3.4 Empirical Perspectives

3.4.1 Survey
The purpose of a survey is to poll the opinions of human subjects on different aspects of the construction and application of reference models via questionnaires. The structure and
process of a survey can vary with respect to the degree of standardization of the questionnaires and to the kind of contact between the researcher and the surveyed human subjects. Some authors use surveys to gather information on the usage of reference models [18, pp. 75-80 & pp. 367-402, 38, pp. 262-263]. The main problem of using surveys is the typically low response rate; e.g., the response rates of the cited surveys are 5.6 and 15.21 percent. Further problems arise because surveys measure only subjective (quality) characteristics of reference models, and especially because new reference models are not well-known in practice. However, surveys are a good tool for evaluating reference model quality if it can be presumed that a rich base of experience with these models exists in practice.

3.4.2 Laboratory Experiment
The objective of a laboratory experiment is to analyze specific aspects of the construction and application of reference models in an environment in which it is possible to eliminate intervening and confounding influences - the laboratory. To this end, the researcher measures the influence of independent variables on dependent variables. For instance, a laboratory experiment makes it possible to investigate the influence of using different modeling grammars to represent the same modeling domain on performance measures. Another example of this kind of evaluation is to test whether modeling problems can be solved more efficiently or effectively if reference models are used. The authors of [39] performed a laboratory experiment on the use of design patterns (design patterns are comparable to reference models). Furthermore, some authors use laboratory experiments to evaluate modeling scripts in general [34, 37, 40, 41, 42, 43]. On the one hand, the strength of this evaluation perspective is its high internal validity. On the other hand, laboratory experiments are often artificial because, e.g., the investigated modeling problem is rather simple or the human subjects performing the experimental tasks are not experienced.

3.4.3 Field Study
Field studies try to overcome the limitations of laboratory experiments by investigating modeling situations in real-world settings. As in a laboratory experiment, the objective of this kind of evaluation is to measure the influence of independent variables on dependent variables. The researcher has to manage and control intervening and confounding influences while he/she should not manipulate the research setting. We are unaware of any existing field studies evaluating reference models. The strength of this approach is its high external validity. However, the internal validity of the research may be low if the researcher does not succeed in controlling intervening and confounding variables satisfactorily. Further problems may arise because the researcher may not gain access to enterprises to conduct the research, e.g., for information security reasons.

3.4.4 Case Study
The objective of a case study is to investigate a specific reference modeling situation in an organization at one point in time. Unlike in the laboratory experiment and the field study, the researcher does not try to measure influences of independent variables
or to control intervening and confounding variables. Instead, the practical solving of a specific reference modeling problem is observed. The case study describes the addressed reference modeling problem, the proposed solution, and the results obtained. Examples of case studies evaluating reference models are [44, 45, 46]. The main limitations of the case study are its limited objectivity and weak generalizability, because the investigated cases are often not representative and the results of the case study are interpretations of the researcher and therefore subjective. However, this evaluation approach makes it possible to gain a lot of useful information on reference model quality, e.g., information on the consistency and applicability of a new reference model.

3.4.5 Action Research
Action research is a qualitative research approach [47] that can be applied to evaluate reference models. One characteristic of this approach is that the research object and the research subject (the researcher) are not strictly separated, so the researcher, as a part of the research object, influences the research object and is him-/herself influenced by it. The role of the researcher is comparable to that of a consultant. The authors of [48] conducted action research to test and refine their data modeling quality framework. Furthermore, we think that some of the known reference models, e.g., [28], were constructed by applying action research, although this is not stated explicitly. The main problems of this approach are its limited objectivity and the unknown effect of the researcher's involvement. On the other hand, this evaluation approach can provide fruitful findings that cannot be achieved by other perspectives.
4 Conclusions and Further Work

Reference models have been used for many years in the information systems field. Despite the relevance of reference model quality, little research has been done on their systematic evaluation. In this paper we propose a framework for the evaluation of reference models. The structure of the framework is not driven by theoretical arguments; it is similar to a feature-based evaluation on a meta-level. However, we believe that the framework is useful from a pragmatic point of view because it provides some criteria for structuring approaches in this research field. Further investigations should justify the perspectives of the framework in a deeper way. None of the introduced evaluation perspectives is inherently superior to the others, because each perspective has its specific strengths and limitations. Hence, we argue that reference models should be evaluated from different points of view. Further work can investigate the evaluation perspectives in more detail and identify reasonable combinations of different perspectives. For instance, we point out that metric-based and cognitive psychology-based evaluations should be complemented by empirical perspectives. Furthermore, the proposed evaluation methods should be applied to evaluate known reference models, so that their quality is systematically examined and assured. This work can be guided by the proposed framework. We believe it is reasonable to start the evaluation of reference models with descriptive perspectives. This approach can be easily undertaken and the results give an overview of known reference
models. Further investigations can examine particular reference models based on a specific theory. Empirical approaches should start with explorative research designs; surveys, case studies, or action research seem appropriate research methods for such early studies. The verification of specific hypotheses on the quality of reference models can be conducted with laboratory experiments or field studies. To conclude, we think that the work outlined here provides a better understanding of reference model quality and insights that lead, in the long term, to a theory of enterprise modeling.
References
1. Frank, U.: Conceptual Modelling as the Core of the Information Systems Discipline – Perspectives and Epistemological Challenges. In: Proceedings of the Fifth Americas Conference on Information Systems (AMCIS 1999), August 13–15, 1999 (1999) 695–697
2. Mylopoulos, J.: Information Modeling in the Time of the Revolution. Information Systems 23 (1998) 127–155
3. Wand, Y., Weber, R.: Research Commentary: Information Systems and Conceptual Modelling – A Research Agenda. Information Systems Research 13 (2002) 363–377
4. Scheer, A.-W., Hars, A.: Extending Data Modeling to Cover the Whole Enterprise. Communications of the ACM 35 (1992) 166–172
5. Mertins, K., Bernus, P.: Reference Models. In: P. Bernus, K. Mertins, and G. Schmidt (eds.): Handbook on Architectures of Information Systems. Springer (1998) 615–617
6. Mišic, V. B., Zhao, J. L.: Evaluating the Quality of Reference Models. In: A. H. F. Laender, S. W. Liddle, and V. C. Storey (eds.): Conceptual Modeling – ER 2000 – 19th International Conference on Conceptual Modeling, Salt Lake City, Utah, USA, October 9–12, 2000, Proceedings. Springer (2000) 484–498
7. Scheer, A.-W., Nüttgens, M.: ARIS Architecture and Reference Models for Business Process Management. In: W. v. d. Aalst, J. Desel, and A. Oberweis (eds.): Business Process Management – Models, Techniques, and Empirical Studies. Springer (2000) 376–389
8. Scheer, A.-W.: Business Process Engineering – Reference Models for Industrial Companies. 2nd edn. Springer, Berlin et al. (1994)
9. Schütte, R.: Architectures for Evaluating the Quality of Information Models – A Meta and Object Level Comparison. In: J. Akoka, M. Bouzeghoub, I. Comyn-Wattiau, and E. Métais (eds.): Conceptual Modeling – ER '99 – 18th International Conference on Conceptual Modeling, Paris, France, November 15–18, 1999, Proceedings. Springer (1999) 490–505
10. Bernus, P., Mertins, K., Schmidt, G.: Handbook on Architectures of Information Systems. Springer (1998)
11. Fettke, P., Loos, P.: Classification of reference models – a methodology and its application. Information Systems and e-Business Management 1 (2003) 35–53
12. Vessey, I., Ramesh, V., Glass, R. L.: A Unified Classification System for Research in the Computing Disciplines. Available: http://www.bus.indiana.edu/ardennis/wp/tr107–1.doc
13. Siau, K., Rossi, M.: Evaluation of Information Modeling Methods – A Review. In: Proceedings of the 31st Hawaii International Conference on Systems Science (HICSS '98) (1998)
14. Moody, D. L., Shanks, G. G.: Improving the quality of data models: empirical validation of a quality management framework. Information Systems 28 (2003) 619–650
15. Mišic, V. B., Zhao, J. L.: Reference Models for Electronic Commerce (2003). Available: http://www.bm.ust.hk/~zhao/HKDC-misiczhao.pdf
16. Wisse, P.: Metapattern – Context and Time in Information Models. Addison-Wesley, Boston et al. (2001)
17. Rising, L.: The Pattern Almanac 2000. Addison-Wesley, Boston et al. (2000)
18. Schütte, R.: Grundsätze ordnungsmäßiger Referenzmodellierung – Konstruktion konfigurations- und anpassungsorientierter Modelle. Gabler, Wiesbaden (1998)
19. Schuette, R., Rotthowe, T.: The Guidelines of Modeling – An Approach to Enhance the Quality in Information Models. In: T. W. Ling, S. Ram, and M. L. Lee (eds.): Conceptual Modeling – ER '98 – 17th International Conference on Conceptual Modeling, Singapore, November 16–19, 1998, Proceedings. Springer (1998) 240–254
20. Lindland, O. I., Sindre, G., Sølvberg, A.: Understanding Quality in Conceptual Modeling. IEEE Software (1994) 42–49
21. Krogstie, J.: Conceptual Modeling for Computerized Information Systems Support in Organizations. University of Trondheim (1995)
22. Daneva, M., Scheer, A.-W.: Benchmarking Business Process Models. Institut für Wirtschaftsinformatik, Universität des Saarlandes, Arbeitsbericht 136, Saarbrücken (1996)
23. Moody, D. L.: Metrics for Evaluating the Quality of Entity Relationship Models. In: T. W. Ling, S. Ram, and M. L. Lee (eds.): Conceptual Modeling – ER '98 – 17th International Conference on Conceptual Modeling, Singapore, November 16–19, 1998, Proceedings. Springer (1998) 211–225
24. zur Mühlen, M.: Evaluation of Workflow Management Systems Using Meta Models. In: Proceedings of the 32nd Hawaii International Conference on Systems Science (HICSS '99) (1999)
25. Paulk, M. C., Curtis, B., Chrissis, M. B., Weber, C. V.: Capability Maturity Model for Software, Version 1.1. Software Engineering Institute – Carnegie Mellon University, CMU/SEI-93-TR-024, Pittsburgh, Pennsylvania (1993)
26. Klein, H. K., Lyytinen, K.: Towards a New Understanding of Data Modelling. In: C. Floyd, H. Züllighoven, R. Budde, and R. Keil-Slawik (eds.): Software Development and Reality Construction. Springer (1992) 203–219
27. Hirschheim, R., Klein, H. K., Lyytinen, K.: Information Systems Development and Data Modeling – Conceptual and Philosophical Foundations. Press Syndicate of the University of Cambridge, Cambridge (1995)
28. Loos, P.: Produktionslogistik in der chemischen Industrie – Betriebstypologische Merkmale und Informationsstrukturen. Gabler, Wiesbaden (1997)
29. Malone, T. W., Crowston, K., Lee, J., Pentland, B., Dellarocas, C., Wyner, G., Quimby, J., Osborn, C. S., Bernstein, A., Herman, G., Klein, M., O'Donnell, E.: Tools for inventing organizations: Toward a handbook of organizational processes. Management Science 45 (1999) 425–443
30. Bunge, M.: Ontology I: The Furniture of the World. D. Reidel, Dordrecht, Holland (1977)
31. Wand, Y., Monarchi, D. E., Parsons, J., Woo, C. C.: Theoretical foundations for conceptual modelling in information systems development. Decision Support Systems 15 (1995) 285–304
32. Fettke, P., Loos, P.: Ontological evaluation of reference models using the Bunge-Wand-Weber model. In: Americas Conference on Information Systems (accepted) (2003)
33. Opdahl, A. L., Henderson-Sellers, B.: Ontological Evaluation of the UML Using the Bunge-Wand-Weber Model. Software and Systems Modeling 1 (2002) 43–67
34. Bodart, F., Patel, A., Sim, M., Weber, R.: Should Optional Properties Be Used in Conceptual Modelling? A Theory and Three Empirical Tests. Information Systems Research 12 (2001) 384–405
35. Milton, S., Kazmierczak, E., Thomas, L.: Ontological Foundations of Data Modeling in Information Systems. In: Proceedings of the Sixth Americas Conference on Information Systems (AMCIS 2000), August 10–13, 2000 (2000) 1537–1543
36. Siau, K.: Information Modeling and Method Engineering: A Psychological Perspective. Journal of Database Management 10 (1999) 44–50
37. Kim, J., Hahn, J., Hahn, H.: How Do We Understand a System with (So) Many Diagrams? Cognitive Integration Processes in Diagrammatic Reasoning. Information Systems Research 11 (2000) 284–303
38. Maier, R.: Qualität von Datenmodellen. Gabler, Wiesbaden (1996)
39. Prechelt, L., Unger, B., Tichy, W. F., Brössler, P., Votta, L. G.: A Controlled Experiment in Maintenance Comparing Design Patterns to Simpler Solutions. IEEE Transactions on Software Engineering 27 (2001) 1134–1144
40. Burton-Jones, A., Meso, P.: How Good are these UML Diagrams? An Empirical Test of the Wand and Weber Good Decomposition Model. In: Twenty-Third International Conference on Information Systems (2002) 101–114
41. Kim, Y.-G., March, S. T.: Comparing data modeling formalisms. Communications of the ACM 38 (1995) 103–115
42. Shanks, G., Tansley, E., Nuredini, J., Tobin, D., Weber, R.: Representing Part-Whole Relationships in Conceptual Modeling: An Empirical Evaluation. In: Twenty-Third International Conference on Information Systems (2002) 89–100
43. Weber, R.: Are Attributes Entities? A Study of Database Designer's Memory Structures. Information Systems Research 7 (1996) 137–162
44. Schwegmann, A.: Objektorientierte Referenzmodellierung – Theoretische Grundlagen und praktische Anwendung. DUV, Wiesbaden (1999)
45. Buchwalter, J.: Elektronische Ausschreibungen in der Beschaffung – Referenzprozeßmodell und prototypische Realisierung. Eul, Lohmar, Köln (2002)
46. Becker, J., Kugeler, M., Rosemann, M.: Process Management. Springer (2003)
47. Avison, D. E., Lau, F., Myers, M. D., Nielsen, P. A.: Action Research. Communications of the ACM 42 (1999) 94–97
48. Moody, D. L., Shanks, G. G.: Improving the Quality of Entity Relationship Models: An Action Research Programme. The Australian Computer Journal 30 (1998) 129–138
On the Acceptability of Conceptual Design Models for Web Applications

Franca Garzotto and Vito Perrone

HOC – Hypermedia Open Center
Department of Electronics and Information, Politecnico di Milano (Italy)
Piazza Leonardo da Vinci 32, I-20133 Milano, Italy
{garzotto,perrone}@elet.polimi.it
Abstract. A possible measure of quality for any model or methodology is the degree of acceptance and usage. This paper discusses the factors that contribute to the industrial acceptability of conceptual models for web application design. We present an empirical study that examined 62 companies or institutions (in America and Europe) involved in large-scale web application development. By investigating the "desiderata" of industrial "practitioners" (developers, designers, or project managers of web applications), we aimed at identifying the requirements that a web design model should satisfy in order to be accepted and used at industry level. The paper describes the design of the study and its main results.

Keywords: web conceptual design, quality, acceptability, user requirements, questionnaire-based study
1 Introduction

"Quality, like beauty, is very much in the eyes of the beholder" [1]

In an arena where large-scale, in-house, information-intensive web applications dominate the field, industry needs comprehensive, well-structured development methods and techniques. Conceptual design models have the potential to play an important role in this scenario. They enable developers to describe a web application at the proper level of abstraction, and they support a systematic approach to design. They promote the evolution of web practice from a craft to a structured discipline, and improve the quality and cost effectiveness of the entire development process. Unfortunately, in spite of the proliferation of conceptual design models for the web produced in the academic world since the mid 1990s [5-10], a number of studies [2-4] highlight that, with very few exceptions, practitioners in industry are not using them. Perhaps it is time to face the fact that, except for our students and the partners of our research projects, the rest of the "real" world does not adopt our methods. Why did we fail? What went wrong? One obvious answer might be that our models simply did not meet the requirements and expectations of industrial users. If "fitness to requirements" is a quality indicator (as suggested by N. Fenton [1] and Dix et al. [13]), we must admit that our models do not have the appropriate level of quality from
an industrial perspective (although perhaps they are good and successful from an academic viewpoint). Understanding industrial user requirements therefore represents the first step towards improving the quality of our conceptual models for the web. The goal of the empirical study reported in this paper goes in this direction: investigating industrial user needs and identifying some properties that a web conceptual model should have in order to be acceptable in the real world, and potentially used in practice. The rest of this paper describes the design of our study (section 2) and discusses its main findings (section 3).
2 Approach of Our Study

Our work is carried out in the context of the EC (European Commission) funded project UWA - "Ubiquitous Web Applications" - IST 2000-25131. The purpose of UWA is the development of models and tools to support the design of multichannel web applications. Within this project, the EC explicitly requires a three-step validation activity: 1) to identify some factors for the industrial acceptability of the UWA models and tools; 2) to compare the UWA "products" against these factors; and 3) to identify guidelines to improve them. The questionnaire study was developed to implement step 1. We therefore carried out a questionnaire-based study involving companies and organizations which carry out large-scale web application development. To design the questionnaire, our approach is to hypothesize a set of potentially important requirements for a design model and to ask users to judge their relevance. These requirements arise both from our experience in building and using design models in many (over 25) industrial and academic development projects, and from some studies reported in the literature [2-4]. We adopt a "holistic" view of conceptual design models, looking at them within "the organizational context in which they have to work" [2]. Our general assumption is that, in order to be accepted and used in an industrial environment, a design model alone is not enough. Even if of excellent intrinsic quality, a model should be supported by a number of complementary features, including a proper methodology, accurate documentation, and a set of support tools.1 A methodology defines how to use the model. It identifies the design process that helps designers structure the design activity and carry out the different design tasks in a systematic way. Acceptability factors that we want to verify include the availability of methodologies that are flexible, adaptable to the specific needs of a company, integrated with the whole development process, and able to support, to some degree, the human task of translating design choices into implementation solutions.2 In addition, we want to explore whether a methodology should take into account the managerial aspects of the design process, assisting project management – a task that is crucial to any commercial production.
1 Among other works, the results of the survey reported in [2] highlight that industry needs models coupled with methodologies and with support for learning and using them.
2 The entity-relationship model, for example, addresses this aspect by providing a set of "rules" or guidelines for mapping ER schemas onto relational schemas.
High quality documentation about the model and the design process is crucial for learnability, which in turn is a fundamental factor for the usability of any complex method. Software tools are needed to assist the design activity, relieving designers from all the tedious tasks involved in producing design specifications and delivering good quality design documentation. Ideally, the tools should also provide some (semi-)automated support for translating design specifications into implementation structures. To verify the above assumptions, our questionnaire includes various types of questions: questions addressing the requirements for a design model per se, questions addressing the requirements for a methodology and a design process, questions addressing the requirements for documentation, and questions about support tools. The questionnaire is organized in three main sections, discussed in the rest of this section: "Requirements for the Model and the Design Process", "Requirements for Documentation", and "Requirements for Support Tools".3
The first section ("Requirements for the Model and the Design Process") considers a conceptual design model both "in isolation" and in the context of the design process, and focuses on some general characteristics of both a model and its complementary methodology. We first introduce some general, potentially relevant characteristics of a design model and a design methodology, as described in the table of figure 1. Respondents are asked to fill in the table by marking with an X the degree of relevance of each characteristic. The table also includes a generic question concerning software tools, to verify, at a very general level, the assumption that a CASE tool is perceived as important for model acceptability – an issue which is investigated in depth in section three. Additional sets of questions explore in detail each specific characteristic of the model and the design methodology, as exemplified in figure 2.
Characteristic (to be rated as: Not relevant at all / Relevant / Strongly relevant / Absolutely necessary):
- Ease to learn
- Ease to use
- Being a standard
- Documentation support
- Process Customisation
- Support for Iterative and Incremental Design Lifecycle
- Project Management Support
- Fast prototyping
- CASE tools support

Fig. 1. Investigating general requirements on a design model
3 Each section includes questions, their explanation, and a brief definition of the terminology used (when needed). The questionnaire also includes a section "General Overview of Methodologies Usage" (not discussed in this paper), which investigates the current industrial practice of web design methods and the adoption of the different approaches proposed in the literature.
For a design model "alone", the table in figure 1 and the complementary detailed questions aim to identify the relevance of the following factors: being easy to learn and to use, being a standard, and being effective for project documentation purposes.4

4 Ease to learn and to use, and being a standard, are generally acknowledged usability principles (usability in turn is a fundamental acceptability factor) [13-14]. Effectiveness for design documentation is a requirement explicitly addressed by many successful software engineering methodologies, such as UML [15].

Ease to learn: "Facility in learning the proposed model and notation, composed of primitives, concepts and graphical elements". Our objective is to identify how crucial learnability is (table 1) and which constraints are imposed by industry on spending time and resources in learning a design approach (detailed questions in figure 2). This information, combined with the findings of the second section of the questionnaire, "Requirements for Documentation", is important for understanding which type of "training" and didactical documentation is required as a prerequisite for the acceptance and adoption of a model.

Ease to use: "Facility in rapidly applying the concepts and notation in order to produce design specifications for the application under design". The objective of this set of questions is to identify the degree of relevance of ease of use with respect to other characteristics, and to investigate the attributes that, in the industry's expectation, contribute to making a model easy to use. Some aspects we suggest in the detailed questions are: i) the provision of design patterns as high-level modeling primitives; ii) flexibility and customizability, i.e., the possibility of using the model in multiple ways, according to the different practices and styles which may be in use in an organization.

Being a standard: "The need (or commitment) in the company to use either an officially standard method (e.g., an IEEE or OMG standard), or a de-facto standard (e.g., UML)". The objective of these questions is to find out to what degree the use of a standard methodology is important, or even mandatory, in the industrial field.

Effectiveness for Documentation Purposes, or "Documentation Support": "The effectiveness of documenting the design choices using the design model concepts and notations". The objective of this set of questions is to verify the relevance of the communication power of a model - how important it is to use the model to document the design choices and to communicate them among the various members of the design and development team.

The questions concerning the properties of a design model in the context of the design process address the following characteristics related to a design methodology:

Process customization: These questions investigate the relevance of "being able to adapt and to customize the design process with respect to different situations of use induced by different application fields or different design and development practices used in the company".

Support for Iterative and Incremental Design Lifecycle: The objective of this set of questions is to verify whether an iterative and incremental process model (which "defines a set of design steps that can be applied iteratively, in
order to produce incremental versions of the application design specifications, and to help designers improve their design solutions in a progressive, incremental way") is the preferred one for a design methodology or, alternatively, which one is desired (or currently used in the company).

Fast prototyping: "The support offered by the methodology for producing an application prototype in a rapid way, in order to come to the client with fast results and to obtain early feedback". This set of questions (see also figure 2) aims at investigating the need for deriving early prototypes of the application once various versions of the design are produced, at understanding why fast prototyping is required, and at identifying the desired "type" of prototype.

Project Management Support: "The managerial features that a methodology should support for planning, communication, resources and client management, configuration management, etc.". The objective is to verify the need for project management support, and to identify the key aspects of project management that industry people consider important in a methodology in order to improve the control and effectiveness of the design process.

Regarding Project Management
a. Which of the following activities concerning project management are considered important to be supported in a design methodology?
   i. Time planning.
   ii. Assignment of workers to specific work activities.
   iii. Change management.
   iv. Client management.
   v. Stakeholder management.
   vi. Configuration management.
Please put here specific comments and suggestions about project management support.

Regarding Ease to Learn
a. Which is the time expected to be spent in order to learn how to use a methodology?
   i. At most 1 week.
   ii. 2 to 4 weeks.
   iii. More than 4 weeks.
b. Which type of training is preferred in order to learn a methodology?
   i. On-line courses.
   ii. Mentoring.
   iii. Theory/Practice courses.

Regarding Fast Prototyping
a. Which of the following aspects are considered a motivation for fast prototyping?
   i. Requirements validation.
   ii. Rapid client satisfaction.
   iii. Design validation.
...
Fig. 2. Detailed questions about model and methodology requirements
The second section of the questionnaire focuses on "Requirements for Documentation". It aims at establishing the role of good quality explanatory and self-training documents in making a model and a methodology easier to learn. Other objectives of this section are to identify the most useful types of documentation
required by industry people, their format and structure, and their intended use. We propose different types of documentation (Handbook, User Guide or Manual, Cookbook, A book on the methodology, On-line hypermedia documentation) and ask for a four-level ranking (Not Desired at all, Desired, Strongly Desired, Absolutely Necessary). For each documentation type, we also require respondents to provide an estimation of the size.
In section three – "Requirements for Support Tools" – we focus on the requirements for the tools that support the design activity and their efficient integration with the entire development process. We ask the respondents to evaluate the following characteristics for design tools (using four values: Not Desired at all, Desired, Strongly Desired, Absolutely Necessary):

Flexible models management. For example, the flow of design activities is not strictly sequential. A designer may need to switch to navigation design before completing information design, to define some presentation solutions, then to return to complete the navigation specifications, and similar. The objective of this set of questions is to verify whether it is important that the tool allows designers to easily switch back and forth among the different design tasks, and what the expected degree of flexibility is.

Model Versioning. This set of questions aims at verifying the need for support in managing different versions of the design specifications of an application, which are produced by different authors or at different design stages, and at identifying the best way to meet this requirement.

Code derivation. This expression denotes the tool's ability to generate source code fragments in a specific implementation language from the design specifications (e.g., class templates, method templates, and similar). The objective of these questions is to understand several aspects: how crucial is this feature for industrial production? In which cases is it more valuable and in which ones is it less important? Which trade-offs is the designer ready to accept to get some source code, in terms of effort during the design specification phase?5

Semi-automatic generation of prototype. By this we mean the provision of special tool features for creating an application prototype from the design models.

Integration with design methodology. This set of questions explores the expectations concerning the degree of adherence of the tool to the specific design model and methodology. We want to understand whether the interviewees need a tool which is strongly tailored to the model and methodology, or rather prefer a general-purpose CASE tool which can be personalized to the specific features of the model. We also explore the degree of model-tailoring respondents require, and what amount of personalization effort they can accept.

Multiple views of the same design artifact. These questions investigate the need for view features that allow designers to dynamically restructure a set of design artifacts according to different perspectives: for example, to view the different portions of the design specification that address the needs of different user categories or the constraints of different devices; to view the design specification (in terms of information structures, navigation and publishing structures) for a set of content objects, etc.
5 Derivation of source code requires in fact a very detailed and formal design specification.
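To make the idea of code derivation more concrete, the following is a minimal, hypothetical sketch (in Python, not tied to any of the models or tools discussed here) of how a CASE tool might turn a small design-level specification into a class template; the EntitySpec structure and the Java-like target syntax are illustrative assumptions only.

# Hypothetical sketch of "code derivation": turning a small design
# specification into a class template. Names and the target syntax are
# illustrative assumptions, not features of any tool mentioned in the paper.
from dataclasses import dataclass, field
from typing import List

@dataclass
class EntitySpec:
    name: str                                            # e.g. "Customer"
    attributes: List[str] = field(default_factory=list)  # e.g. ["name", "email"]

def derive_class_template(spec: EntitySpec) -> str:
    """Generate a Java-like class skeleton from a design-level entity."""
    lines = [f"public class {spec.name} {{"]
    for attr in spec.attributes:
        lines.append(f"    private String {attr};")
    for attr in spec.attributes:
        cap = attr[0].upper() + attr[1:]
        lines.append(f"    public String get{cap}() {{ return {attr}; }}")
        lines.append(f"    public void set{cap}(String value) {{ {attr} = value; }}")
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(derive_class_template(EntitySpec("Customer", ["name", "email"])))

Even this toy generator illustrates the point of the footnote above: useful code can only be derived if the design specification is detailed and formal enough to name every entity and attribute explicitly.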
MS-Windows look and feel. This part of the questionnaire aims at discovering
whether industry people desire a standard MS Windows-like look and feel, or rather prefer different interface paradigms for the design tool's interface.

Semi-automatic derivation of documentation. This part of the questionnaire addresses the requirements on the production of design documentation. In our experience, good quality design documentation is crucial both for managerial reasons (being sometimes the contractual basis for discussing the development follow-up with the customer) and for implementation (to avoid misunderstandings with the implementers). Through these questions, we want to understand whether companies share our point of view on the role of design documentation, and to verify how much they expect a design tool to support the (possibly partial) generation of well-structured design reports from design specifications built using the tool.

Consistency Check. These questions verify the relevance of tool features for checking the consistency of the design specification, and for reporting consistency violations (such as a missing cardinality in a relation, a missing attribute in an information structure, a dangling or partially defined link, and similar).
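As an illustration of the kind of checks meant here, the following sketch (hypothetical Python, with a deliberately simplified in-memory representation of a design specification; none of the names come from the questionnaire or from any specific tool) reports exactly the three classes of violation listed above.

# Hypothetical sketch of a design-specification consistency checker.
# The data layout (entities, relations, links) is a simplifying assumption
# made for illustration; it is not the representation used by any real tool.

def check_consistency(spec: dict) -> list:
    """Return human-readable descriptions of consistency violations."""
    violations = []
    entity_names = {e["name"] for e in spec.get("entities", [])}

    # Missing attributes in an information structure.
    for entity in spec.get("entities", []):
        if not entity.get("attributes"):
            violations.append(f"Entity '{entity['name']}' has no attributes.")

    # Missing cardinalities in a relation.
    for relation in spec.get("relations", []):
        if "cardinality" not in relation:
            violations.append(f"Relation '{relation['name']}' has no cardinality.")

    # Dangling or partially defined links.
    for link in spec.get("links", []):
        for end in ("source", "target"):
            if link.get(end) not in entity_names:
                violations.append(
                    f"Link '{link.get('name', '?')}' has undefined {end} "
                    f"'{link.get(end)}'.")
    return violations

if __name__ == "__main__":
    spec = {
        "entities": [{"name": "Customer", "attributes": ["name"]},
                     {"name": "Order", "attributes": []}],
        "relations": [{"name": "places"}],
        "links": [{"name": "views", "source": "Customer", "target": "Page"}],
    }
    for v in check_consistency(spec):
        print(v)

A real checker would of course work on the tool's own metamodel rather than on dictionaries, but this reporting style is what the respondents were asked to judge.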
3 Findings Analysis

The questionnaire was sent via e-mail to 62 companies involved in large-scale web application development: 11 organizations in North and South America, and 51 in Europe (from 8 different countries). We purposefully excluded academic and research institutions from the sample of inspected subjects. The questionnaire was filled in by web project managers, developers, or designers. We had a 44% response rate. The findings, based on the statistical analysis of the answers and cross-tabulations, are discussed in the rest of this section.

3.1 Requirements for the Model and the Design Process

Figure 3 summarizes the answers to the questions presented in figure 1.
Fig. 3. Summary results concerning general requirements for a design model and design process (see questions of figure 1)
The most evident results highlighted by the above diagram are that:
- 40.74% mark it as absolutely necessary to provide Support for an Iterative and Incremental Design Lifecycle;
- 48.15% mark as strongly relevant the fact that the methodology should be Easy to Use;
- 44.44% consider it relevant that the methodology is Easy to Learn;
- 66.67% say that they do not care about the methodology being a standard.
Other interesting findings emerging from our analysis are:
- Project Management Support is relevant for 37.04% of the respondents;
- Process Customization is relevant for 48.15% of the respondents, and absolutely necessary for 25.93% of the respondents;
- Fast Prototyping is relevant for 33.33% of the respondents, while 40.74% consider it strongly relevant and 22.22% consider it absolutely necessary;
- for CASE Tools Support, the answers are uniformly distributed over the different measurement values, but we can remark that only 25.93% of the respondents mark this characteristic as Not Relevant;
- Documentation Support is marked as relevant by 33.33% of our sample, strongly relevant by 37.04% of the respondents, and absolutely necessary by 29.63%; most importantly, nobody marks it as Not Relevant.
The analysis of the detailed answers (see examples in the previous figure 2) provides an additional set of useful data:
Support for Iterative and Incremental Design Lifecycle and Fast Prototyping. As mentioned above, both characteristics are considered absolutely necessary or relevant by a significant portion of our sample. A significant 85% of the respondents prefer an evolutionary prototype rather than a throwaway prototype, which is preferred by only 15%. This high preference for a prototype that evolves until it becomes the final application can be justified by the fact that industry people do not want to waste resources working on a system (the throwaway prototype) that they will have to discard later. Concerning fast prototyping, almost all (92.59%) say that requirements validation is the main reason for fast prototyping, while 51.85% indicate design validation, and 48.15% choose rapid client satisfaction.
Easiness of learning. The results show that 70% expect to spend between two and four weeks learning a model and a methodology, 19% prefer spending at most one week, and only 11% can spend more than four weeks. The results on the preferred type of training highlight that 48.15% desire courses with theory and practical information, 40.74% mentoring courses (with expert side-by-side support), and 33.33% online web courses.
Easiness of use. Among the characteristics that make a model easy to use, 51% indicate customizability (the ability to adapt its use to different contingent and organizational situations for design), 66% flexibility (the ability to provide different ways in which the designer can use the model), and more than 70% the presence of guidelines and patterns. The latter result empirically confirms a generally acknowledged principle of software engineering - the utility of design patterns - highlighting that patterns are largely perceived as useful by industry to improve the usability of a design model and to make the design activity easier and more effective.
Project management support. To the question "Which are the project management activities that are considered as important to be supported by a design methodology?", the most frequently selected answers are Time Planning (70.37%), Change Management (74.07%), and Configuration Management (77.78%).

3.2 Requirements for Documentation Support

Figure 4 summarizes the main findings regarding the section on documentation support.
[Figure omitted: bar chart of responses (Not Desired / Desired / Strongly Desired / Absolutely Necessary) for each documentation type: Handbook, User Guide or Manual, Cookbook, Book, On-line Hypermedia.]

Fig. 4. Requirements on documentation types.
The above diagram highlights that online hypermedia is the most required form of documentation, being marked as strongly desired by 44.44% of the respondents, and as absolutely necessary by 32%. In contrast, the cookbook receives the highest percentage (22%) of not-desired answers, and has slightly lower values for strongly desired and absolutely necessary than the other forms of documentation. The results about the book show that this form of documentation does not have any dominant attribute; the respondents' opinions seem evenly dispersed across all choices.
Regarding the desired size, in terms of number of pages, of each proposed documentation type, the main results are:
- Handbook: 5-10 pages: 19%; 10-20 pages: 70%; more than 20 pages: 11%
- User Guide or Manual: 40-50 pages: 33%; 50-80 pages: 37%; more than 80 pages: 30%
- Cookbook: 10-20 pages: 35%; 20-40 pages: 42%; more than 40 pages: 42%
- Book: 70-100 pages: 33%; more than 100 pages: 67%
- On-line Hypermedia: 20-40 pages: 59%; more than 40 pages: 41%

3.3 Requirements for Design Tools

The most interesting result of this section is that 70.37% of the respondents answer "YES" to the general question on the utility of software tools for supporting the design process. Figure 5 summarizes the main findings regarding the different characteristics desired for support tools.
The most important results to observe are: 51.85% mark Consistency Check as Absolutely Necessary, while Model Versioning gets the same vote from 29.63% of the sample, followed by Semi-automatic derivation of documentation (26%). All other characteristics get this vote from less than 15% of the sample.
[Figure omitted: bar chart of responses (Not Desired / Desired / Strongly Desired / Absolutely Necessary) for each desired design tool characteristic.]
Fig. 5. Requirements for a design tool
Five characteristics are indicated as Strongly Desired with relatively similar percentages of votes (between 40% and 55% of the respondents): Flexible Models Management - 48.15%; Code Derivation - 55.56%; Model Versioning and Integration with Methodology Design Activities - 48.15%; Semi-automatic Derivation of Documentation - 40.74%. Multiple views of the same Design Artifacts is considered desired by 40.74%. Regarding Semi-automatic Generation of Prototypes, it is interesting to note that although it is signaled as strongly desired and desired by 37.04% and 40.74% of the votes respectively, it is also signaled as not desired by 22.22% of the responses. The most important conclusion regarding what people do not want concerns the MS-Windows Look and Feel feature, which gets a not desired from 44.44% of the respondents.
4 Lessons Learned and Conclusions

What does "quality" mean for a conceptual design model? Quality is a very broad and generic term, which can be defined along many different perspectives. In this paper, we suggest that a possible measure of quality for a conceptual design model is its degree of acceptance in the practitioners' world. Our research investigates some
factors that contribute to the industrial acceptability of web conceptual design models, by examining the requirements of a significant sample of software companies, internet service providers, and organizations which are moving their business towards the web. The results of our research validate our general hypothesis: being a "good" conceptual model for web application design is not the only relevant factor for industrial acceptability. A good set of modeling primitives and notations should be delivered to industry together with a number of complementary features: a proper methodology and design process, effective documentation about the model and the methodology, and a kit of support tools. For the model per se, and for each complementary feature - the methodology, the documentation, and the tools - we summarize the main lessons learned from our study.

Acceptability features for a conceptual model "per se". The two important characteristics that industry people have identified as most relevant for a model per se are ease of use and learnability (crucial factors for the usability of any human artefact). Our findings on learnability suggest that the model (and the companion methodology - see below) has to be learnable in no more than 4 weeks, and that theory and practical courses combined with mentoring courses (better if including hands-on activities carried out side by side with an expert) are the preferred learning mechanisms. We may conclude that, to improve both ease of use and learnability, our models should find a compromise between richness and simplicity, and should try to balance completeness and expressive power of the modelling primitives with intuitiveness and with evidence of their utility. A possible way to achieve this compromise might be to deliver "multi-version" models, made of a "basic kit" and an "advanced kit" of modelling concepts and notations. The basic kit can be understood and learnt relatively easily (2-5 days) and can be almost immediately applied to the design of relatively simple applications. The advanced version can address more sophisticated modelling needs, can be learnt after the basic modelling features are fully digested, and can be used to design complex application features.

Acceptability features for a methodology. A model should be integrated with a proper methodology, which identifies a systematic design process and provides a clear set of guidelines to help designers use the model. The design process should be flexible, incremental, and iterative. It should be customisable to different scenarios of use (i.e., to the needs of each specific application and to the individual industrial practice). The design process should be integrated with the whole development process. In particular, it should provide some support for the human translation of conceptual design solutions into implementation solutions, and for fast prototype production. Prototypes should be evolutionary and should basically help designers to validate user requirements. In addition, the methodology and the process should look at application design and development within the organizational context: they should address project management issues, to help managers monitor the project lifecycle in terms of time and resources.

Acceptability features for documentation. Good quality documentation is crucial for making a model and a methodology easy to learn and to use in an industrial setting.
Our study points out that the preferred documentation support is online hypermedia documentation, followed by user guides and manuals. Richness of examples, case studies, and lessons learned is perceived as important content to be provided in all forms of documentation.
Acceptability features for support tools. An important highlight of our study is that the availability of CASE tools and fast prototyping tools has a very high priority for the world of practitioners. The most desired feature for a CASE tool is a consistency checker (supporting a design task where humans are less effective than machines). Additional requirements concern facilities for multiple views of the specification schemas (e.g., at different levels of detail, along different design perspectives), support for versioning, and the possibility of switching among different design schemas. Concerning fast prototyping, there is a need for support for code derivation and for a strong integration of (semi-automatic) prototyping facilities with the representation tools. It is interesting to note that the main reason for fast prototypes is requirements and design validation, which suggests that requirement traceability support is an additional useful feature for CASE tools.
Even if the findings of our research reflect the needs and inclinations of a specific industrial sector, they have a general validity. In principle, they may offer interesting insights into the industrial requirements of any design model, also in fields other than the web. In the short term, our future work includes a refinement of our study and an accurate validation of its results. We will enlarge the answer set and adopt more sophisticated evaluation procedures, for checking and correcting errors in the results, and for approximating missing answers. In the mid term, we are using the research findings to improve the features of the web design model (W2000 [12]), documentation, and toolkit developed within the UWA project, in order to make them more acceptable and potentially usable in industrial practice.

Acknowledgments. The authors are grateful to Mauricio Sansano, who carried out the preliminary version of this study, to the UWA project partners and reviewers for their helpful suggestions, and to the anonymous reviewers for their valuable feedback.
References
1. Norman E. Fenton: Software Metrics: A Rigorous Approach. London: Chapman & Hall, 1991.
2. Barry and Lang: A Survey of Multimedia and Web Development Techniques and Methodology Usage. IEEE Multimedia, April–June, 2001.
3. C. Britton et al.: A Survey of Current Practice in the Development of Multimedia Systems. Information and Software Technology, vol. 39, no. 10, 1997, pp. 695–705.
4. B. Fitzgerald: An Investigation of the Use of Systems Development Methodologies in Practice. Proc. Fourth European Conf. Information Systems, Lisbon, Portugal, 1996.
5. F. Garzotto, P. Paolini, D. Schwabe: HDM – A Model-Based Approach to Hypertext Application Design. ACM Transactions on Information Systems, Vol. 11, No. 1, January 1995.
6. S. Ceri, P. Fraternali, A. Bongio: Web Modeling Language (WebML): a modeling language for designing Web sites. Proc. of the 9th World Wide Web Conference (WWW9), Amsterdam, 2000.
7. T. Isakowitz, E. Stohr, P. Balasubramanian: RMM: A Methodology for Structured Hypermedia Design. CACM (1995), 38(8), pp. 34–44.
8. D. Schwabe, G. Rossi: An Object Oriented Approach to Web-Based Application Design. Theory and Practice of Object Systems, 4 (4), J. Wiley, 1998.
9. Rolf Hennicker and Nora Koch: A UML-based Methodology for Hypermedia Design. In volume 1939 of Lecture Notes in Computer Science, York, England, October 2000. Springer Verlag.
10. J. Conallen: Modeling Web Application Architectures with UML. Communications of the ACM, 42:10, pp. 63–70.
11. UWA ("Ubiquitous Web Applications") project – IST 2000–25131 –, technical annex 1 "Description of Work". Official Web site: www.uwa-project.org
12. L. Baresi, F. Garzotto, P. Paolini, V. Perrone: UWA Deliverable D7, "Hypermedia and Operation Design". www.uwa-project.org
13. Alan Dix et al.: Human-Computer Interaction. Prentice Hall, 1998.
14. Jakob Nielsen: Designing Web Usability: The Practice of Simplicity. New Riders Publishing, Indianapolis, 2000.
15. G. Booch, I. Jacobson, and J. Rumbaugh: The Unified Modeling Language User Guide. The Addison-Wesley Object Technology Series, 1998.
Consistency by Construction: The Case of MERODE
Monique Snoeck, Cindy Michiels, and Guido Dedene
Department of Applied Economic Sciences, Katholieke Universiteit Leuven, Naamsestraat 69, 3000 Leuven, Belgium
{monique.snoeck,cindy.michiels,guido.dedene}@econ.kuleuven.ac.be
Abstract. Modeling languages such as UML offer a set of basic models to describe a software system from different views and at different levels of abstraction. Tools supporting an unrestricted usage of these UML models cannot guarantee the consistency between multiple models/views, due to the lack of a formal definition of the semantics of UML diagrams. A better alternative that does allow for automatic consistency checking is modeling according to the single model principle. This approach is based on the conception of a single model, for which different views are constructed, with automatic or semi-automatic generation of, or consistency checking among, these views. Three basic approaches to consistency checking are consistency by analysis, consistency by monitoring and consistency by construction. In this paper we illustrate the consistency by construction approach by means of the conceptual domain modeling approach MERODE and its associated case-tool MERMAID. We also illustrate how consistency by construction improves the validity and completeness of the conceptual model.
1 The Single Model Principle
The framework of Lindland, Sindre and Sølvberg for quality improvement of conceptual models distinguishes itself from previous attempts by not only identifying major quality goals for conceptual models, but also the means for achieving them [1]. As such, the framework contains a core set of quality goals and means, subdivided according to syntactic, semantic and pragmatic quality. With respect to semantic quality, two goals are put forward, i.e. feasible validity and feasible completeness. Validity means that all statements made by the model are correct and relevant to the problem, whereas completeness means that the model contains all the statements about the domain that are correct and relevant. To achieve a feasible level of validity, consistency checking is considered an important semantic means: it allows the internal correctness of specifications to be verified.1 In order to do automatic consistency checking, the model must be captured in a formal language. Modeling languages such as UML offer a set of basic models to describe a software system from different views and at different levels of abstraction [2]. Examples of models included in UML are Use Cases for the functional requirements, class diagrams for the static view, interaction diagrams for the dynamic view, etc.
1 As opposed to external correctness, meaning that a specification should meet the user requirements.
Tools supporting an unrestricted usage of these UML models cannot guarantee the consistency between multiple models/views of the same system if these are constructed independently. The reason why automatic consistency checking cannot be supported is that UML lacks formal rules to enforce a consistent mapping between the models it defines. A better alternative that does allow for automatic consistency checking is modeling according to the single model principle [3]. This approach is based on the conception of a single model, for which different views are constructed, with automatic or semi-automatic generation of, or consistency checking among, these views.2
2 Notice that in UML, each view is called a "model".
For the verification of view consistency three basic approaches can be distinguished. A first approach is consistency by analysis, meaning that an algorithm is used to detect all inconsistencies between two deliverables, and a report is generated thereafter for the developers. In this kind of approach the requirements engineer can freely construct the different views. At the end of the specification process, or at regular intervals, the algorithm is run against the models to spot errors and/or incompleteness in the various views. The verification can be done manually, but obviously building the algorithm into a case-tool will substantially facilitate the consistency checking procedure.
The second approach can be denoted as consistency by monitoring, meaning that a tool has a monitoring facility that checks every new specification against the already existing specifications in the case-tool's repository. Whenever an attempt is made to enter a specification that is inconsistent with some previously entered specification, the new specification is rejected. The advantage of this approach is that the model is constantly consistent. Whereas the first approach puts the burden of correcting inconsistencies on the requirements engineer, the second approach avoids the input of inconsistencies. At the end of the specification process, the model must still be verified for completeness. The possible disadvantage of this approach is that a too stringent verification procedure will turn the input of specifications into a frustrating activity. The two approaches can be compared to two spelling and grammar checking strategies in word processing: the first is the equivalent of running the spelling and grammar checker periodically, whereas the second is the equivalent of the option "check spelling and grammar as you type".
A third approach is consistency by construction, meaning that a tool generates one deliverable from another and guarantees semantic consistency. Whenever specifications are defined in one view, those elements in other views that can automatically be inferred are included in those views. Also in this approach, the requirements engineer can only define consistent models. The major advantage, however, is that the specifications are to a large extent constructed in an automated way: everything that can automatically be inferred is generated by the case-tool. This saves a lot of input effort. In addition, whereas the monitoring approach leads to a case-tool that generates error messages at every attempt to enter inconsistent specifications, the self-constructing approach avoids the input of inconsistent specifications by completing the entered specifications with their consistent consequences. The result is a much more user-friendly environment.
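The practical difference between the three strategies can be made concrete with a small sketch. The fragment below is purely illustrative (the class and function names are ours, not MERMAID's): a toy repository either checks the whole model in batch (analysis), rejects an inconsistent entry at input time (monitoring), or completes an entry with its derivable consequences (construction).

```python
# Toy repository illustrating the three consistency strategies (names are ours).

class SpecRepository:
    def __init__(self, constraints=(), derivations=()):
        self.facts = set()
        self.constraints = list(constraints)   # each: (new_fact, existing_facts) -> bool
        self.derivations = list(derivations)   # each: fact -> iterable of implied facts

    # Consistency by analysis: run a batch check over the whole model afterwards.
    def analyse(self):
        return [f for f in self.facts
                if not all(ok(f, self.facts) for ok in self.constraints)]

    # Consistency by monitoring: reject an inconsistent specification at input time.
    def add_monitored(self, fact):
        if not all(ok(fact, self.facts) for ok in self.constraints):
            raise ValueError(f"rejected: {fact} is inconsistent with the model")
        self.facts.add(fact)

    # Consistency by construction: accept the input and generate its consequences.
    def add_constructed(self, fact):
        self.facts.add(fact)
        for derive in self.derivations:
            self.facts.update(derive(fact))
```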
Moreover, the automated generation of specifications offers the major advantage of improved completeness of the model. In the remainder of the paper, we discuss the integration of the consistency by construction approach in MERMAID, a modeling tool based on the object-oriented
analysis method MERODE. This methodology offers three basic views on a business model –static, dynamic and interaction view– and is formalized in a set of rules managing all mappings between these views. Since the aim of the paper is to illustrate the modeling gains of a tool supporting the single model principle by consistency by construction, we kindly ask the reader to take the methodology “as is”. The paper is organized as follows. Section 2 introduces the three views that are supported in MERODE. Section 3 then briefly presents the consistency checking rules as they have been elaborated in [4][5] and discusses consistency by construction in MERMAID (due to space limitations, inheritance will not be discussed). Section 4 illustrates how consistency by construction improves the validity and completeness of the conceptual model. Finally section 5 presents some conclusions.
2 Overview of the Static, Dynamic, and Interaction View
MERODE stands for Model-driven Existence dependency Relation, Object-oriented DEvelopment. It is a methodology for object-oriented enterprise modeling that has grown out of research on semantic modeling approaches, Jackson System Development [6] and object-oriented analysis. The most distinguishing features of this methodology are its specific orientation to domain modeling, the use of Existence Dependency to model the static aspects of the domain model, and the event-driven approach to behavior modeling. Relevant concepts will be explained in subsequent sections. By means of an example it will be illustrated how a specification can be self-completing to a certain extent and how this automated consistency by construction contributes to the validity and completeness of specifications.
A MERODE model consists of three subviews:
- an existence dependency graph (EDG) that organizes enterprise object types according to existence dependency and inheritance,
- an object-event table (OET), which identifies business event types and relates those to the enterprise object types,
- a behavioral model, consisting of one finite state machine (FSM) per enterprise object type.
The semantics of the EDG, the OET and the FSMs have been defined by means of process algebra, and view consistency has been defined at the same time [4][5]. As a result, a set of consistency checking rules is available for this method, which also provides for a basic completeness check. Fig. 1 gives an overview of the views and the rules.
2.1 The Existence Dependency Graph
Let us consider the UML class diagram in Fig. 2. It represents a situation where customers can place zero to many orders for projects. Each project is ordered by exactly one customer. Employees work on projects: each employee works on exactly one project at a time and each project has zero to many employees working on it.
Fig. 1. Views and consistency checking rules in MERODE (Existence Dependency Graph, Object-Event Table and Finite State Machines, linked by the Alphabet Rule, Propagation Rule, Type of Involvement Rule, detection of possible redundant paths, Alphabet Rule bis and Default Lifecycle Rule)
Fig. 2. Project management (UML class diagram: CUSTOMER "orders" PROJECT, EMPLOYEE "works for" PROJECT)
Although the two associations look identical in their graphical representation, there is a substantial difference in the semantics of each association. Indeed, every employee works on one project at a time, but over time employees can work on several projects consecutively. In other words, the association "works for" is modifiable. The "orders" association, however, is not modifiable: a project is ordered by one customer, but this customer remains the same over time. Consequently the diagram in Fig. 2 can be considered to be semantically incomplete: some relevant statements about the domain have not been expressed. Therefore, in MERODE, it is required to transform a class diagram into an existence dependency graph (EDG). In such a graph, all object types are only related through associations that express existence dependency. According to the formal definitions in MERODE, a class D is existence dependent of a class M if and only if the life of each occurrence of class D is embedded in the life of one single and always the same occurrence of class M. D is called the dependent class and is existence dependent of M, called the master class. A more informal way of defining existence dependency is as follows: if each object of a class D always refers to minimum one, maximum one and always the same occurrence of class M, then D is existence dependent of M. Notice that existence dependency is equivalent to the notion of weak entity as defined by Chen [7][4].
To avoid confusion with a standard UML class diagram, MERODE uses a proprietary notation with dots and arrows to define the cardinality of the existence dependency relationship. This cardinality defines how many occurrences of the dependent object type can be dependent of one master object at one point in time. As the cardinality of the master class is always exactly one (every dependent is associated to exactly one master), only the cardinality for the dependent needs to be specified. An arrowhead means that the master can have several dependents simultaneously, whereas a straight line limits the maximum cardinality to one. A white dot means that having a dependent is optional for the master, whereas a black dot imposes a minimum constraint of one (the master has at least one dependent at any time). The transformation of the class diagram of Fig. 2 results in the EDG of Fig. 3. The "orders" association expresses existence dependency: each project can only exist within the context of a customer and refers to exactly one and always the same
customer for the whole duration of its life. A customer, on the contrary, can exist on its own. It need not have a project in order to exist (optionality indicated by the white dot) and it can have many ongoing projects (arrowhead). The "works for" relationship does not represent existence dependency. An employee can exist outside of the context of a project and a project can exist outside of the context of an employee. When an association does not express existence dependency, the association is turned into an object type that is existence dependent of all the object types participating in the association. In this case this means that the "works for" association is turned into an object type ASSIGNMENT, which is existence dependent of PROJECT and EMPLOYEE. MERODE calls this type of intermediate class a "contract" class: it models what can happen during the period of time that a project and an employee are related to each other. Since a project can have zero to many employees, each project has zero to many assignments (white dot, arrow). And as each employee is assigned to exactly one project at a time, each employee has exactly one assignment at a time (black dot, straight line).
Fig. 3. Existence dependency graph for the project management example (CUSTOMER is the master of PROJECT; PROJECT and EMPLOYEE are the masters of ASSIGNMENT)
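Under the assumption of a simple in-memory representation (our own, not the MERODE formalization or the MERMAID data model), the EDG of Fig. 3 could be encoded as follows; only the information discussed above (masters, dependents, white/black dots, arrowheads) is captured.

```python
# Hypothetical in-memory representation of an EDG (not MERMAID's actual data model).
from dataclasses import dataclass, field

@dataclass
class ExistenceDependency:
    dependent: str          # the dependent class D
    master: str             # the master class M (its cardinality is always exactly one)
    min_dependents: int     # 0 = white dot (optional), 1 = black dot (mandatory)
    many_dependents: bool   # True = arrowhead (many), False = straight line (at most one)

@dataclass
class EDG:
    object_types: set = field(default_factory=set)
    dependencies: list = field(default_factory=list)

    def add_object_type(self, name):
        self.object_types.add(name)

    def add_dependency(self, dependent, master, min_dependents, many_dependents):
        self.dependencies.append(
            ExistenceDependency(dependent, master, min_dependents, many_dependents))

# The EDG of Fig. 3:
edg = EDG()
for ot in ("CUSTOMER", "PROJECT", "ASSIGNMENT", "EMPLOYEE"):
    edg.add_object_type(ot)
edg.add_dependency("PROJECT", "CUSTOMER", min_dependents=0, many_dependents=True)      # white dot, arrow
edg.add_dependency("ASSIGNMENT", "PROJECT", min_dependents=0, many_dependents=True)    # white dot, arrow
edg.add_dependency("ASSIGNMENT", "EMPLOYEE", min_dependents=1, many_dependents=False)  # black dot, straight line
```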
2.2 The Object-Event Table
In the case of object-oriented conceptual modeling, domain requirements will be formulated in terms of business or enterprise object types, associations between these object types and the behavior of business object types. The definition of desired object behavior is an essential part in the specification process. On the one hand, we have to consider the behavior of individual objects. This type of behavior will be specified as methods and statecharts for object classes. On the other hand, objects have to collaborate and interact. Typical techniques for modeling object interaction aspects are interaction diagrams or sequence charts, and collaboration diagrams. In most object-oriented approaches events are considered as subordinate to objects, because they only serve as a trigger for an object’s method. The object interactions themselves are modeled by means of sequence and/or collaboration diagrams. In contrast, MERODE follows an event-driven approach that raises events to the same level of importance as objects, and recognizes them as a fundamental part of the structure of experience [8]. A business event is now defined as an atomic unit of action that represents something that happens in the real world, such as the creation of a new customer, an order placement, etc. The business events reflect how domain objects come into existence (the creating events), how domain objects are modified (the modifying events), and how they disappear from the universe of discourse (the ending events). Object interaction can now be modeled by defining which objects are concurrently involved in which events. Object-event participations are denoted by means of an object-event table (OET). When an object participates in an event, it implements a method that defines the effect of the event on the object. On occurrence of the event all corresponding methods in the participating objects are executed in parallel. Thus, instead of modeling a complex sequence of method invocations, it is
now assumed that all methods are executed concurrently. The OET for the project management example is given in Table 1. The rules that govern the construction of this table are described in the next section.
2.3 The Finite State Machines
Finally, the life cycle of every enterprise object class is modeled by means of a finite state machine (FSM). The events of the object-event table are used as triggers for the transitions in the finite state machine. As an example, Fig. 4 shows the FSM for EMPLOYEE. Similarly, an FSM can be defined for the classes PROJECT, ASSIGNMENT and CUSTOMER.
Table 1. Object-event table for project management
Event           Customer   Project   Employee   Assignment
cr_customer     C
mod_customer    M
end_customer    E
cr_project      M          C
mod_project     M          M
end_project     M          E
cr_employee                          C
mod_employee                         M
end_employee                         E
assign          M          M         M          C
remove          M          M         M          E
Fig. 4. Finite state machine for Employee (states "exists" and "assigned"; transitions cr_employee, assign, remove and end_employee)
3 Consistency by Construction
The construction of the OET is governed by a number of rules that ensure the consistency of the OET with the EDG. An algorithmic approach to consistency checking would verify the consistency after entering the specification. In this section we illustrate how many of the consistency rules make it possible to automatically generate some parts of the requirements, preventing in this way inconsistencies and incompleteness.
3.1 Alphabet Rule
The alphabet of an object class is defined as the set of all event types that are marked for this object type in the OET. The Alphabet Rule states that each event can have only one effect on objects of a class: the event either creates, modifies or deletes objects. In addition, the rule states that each object class needs at least one event to create occurrences and one event to destroy occurrences in this class.
Rather than verifying post factum whether there is at least one creating and one ending event for each enterprise object type, the case-tool will automatically generate two business events when an object type is added to the EDG. The default names are the name of the object type preceded by "cr_" and "end_", but as shown in Fig. 5, the user can overwrite the names and decide not to generate one or both event types. Simultaneously, the OET is completed accordingly: a column is added for the object type, two rows are added for the event types and the participations are marked (see Fig. 6).
Fig. 5. Existence dependency graph
Fig. 6. Object-event table
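As a sketch of how this generation step can work (illustrative code, not taken from MERMAID), adding a column to the OET can itself create the default creating and ending events and mark their participations, so that the alphabet rule holds at every moment:

```python
# Sketch of the alphabet rule applied "by construction" (names are illustrative only).

class ObjectEventTable:
    def __init__(self):
        self.object_types = []          # columns
        self.events = []                # rows
        self.cells = {}                 # (event, object_type) -> 'C' | 'M' | 'E'

    def add_event(self, name):
        if name not in self.events:
            self.events.append(name)

    def mark(self, event, object_type, effect):
        assert effect in ("C", "M", "E")
        self.cells[(event, object_type)] = effect

    def add_object_type(self, name, create_event=None, end_event=None):
        """Adding a column also generates a default creating and ending event."""
        self.object_types.append(name)
        create_event = create_event or f"cr_{name.lower()}"
        end_event = end_event or f"end_{name.lower()}"
        self.add_event(create_event)
        self.add_event(end_event)
        self.mark(create_event, name, "C")
        self.mark(end_event, name, "E")

oet = ObjectEventTable()
oet.add_object_type("CUSTOMER")   # generates cr_customer (C) and end_customer (E)
oet.add_object_type("PROJECT")    # generates cr_project (C) and end_project (E)
```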
3.2 Propagation Rule and Type of Involvement Rule
A second rule in the construction of the OET is the propagation rule. The propagation rule states that when an object type D is existence dependent of an object type M, the latter is by default also involved in all event types D is involved in. This means that if an involvement is marked for an event type in the column of a dependent object type D, it must also be marked in the column of the master object type M. In addition, the type of involvement rule states that, since an existence dependent object type cannot start to exist before its master, a creating event type for a dependent class is a creating or a modifying event type for the master class. A modifying event type for a dependent class is also a modifying event type for its master class. And finally, since a dependent cannot outlive its master, an ending event type for a dependent is an ending or modifying event type for its master. To discern the participations the master acquired from its dependents through the propagation rule from the event type participations that are proprietary to the master class, the former are preceded by an 'A/' (from Acquired) and the latter by an 'O/' (from Owned).
Performing and verifying the propagation by hand is a time-consuming task, especially for larger projects. A case-tool, however, can easily generate all the propagated participations. For the project management example, the resulting OET after entering the four object types and the existence dependency relations is shown in Fig. 7.
The OET can be modified independently from the EDG, but also in this case consistency is automatically enforced whenever possible. Adding an object type in the OET will add the object type to the EDG as well, although it will not be related to other object types already in the EDG. Events can be added in the OET, and for these events we can add owned methods, which will be automatically propagated. Acquired methods cannot be added or removed. The type of involvement can be modified, provided it follows the type of involvement rule.
Fig. 7. OET with propagated object-event participations
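A minimal sketch of the propagation and type-of-involvement rules is given below; it assumes the EDG of Fig. 3, uses our own names, and applies the default choice of a modifying ('M') acquired participation on the masters:

```python
# Sketch of the propagation and type-of-involvement rules (illustrative only).
# masters[d] lists the master object types of dependent d (EDG of Fig. 3).
masters = {"PROJECT": ["CUSTOMER"], "ASSIGNMENT": ["PROJECT", "EMPLOYEE"]}

# Owned participations ('O/'), as entered by the requirements engineer.
owned = {
    ("cr_project", "PROJECT"): "C", ("mod_project", "PROJECT"): "M",
    ("end_project", "PROJECT"): "E",
    ("assign", "ASSIGNMENT"): "C", ("remove", "ASSIGNMENT"): "E",
}

def propagate(owned, masters):
    """Return acquired participations ('A/'): every event of a dependent is
    propagated, transitively, to all its masters.  By default the acquired
    participation is a modification; the type-of-involvement rule also allows
    'C' (for creating) and 'E' (for ending) events to stay 'C' resp. 'E'."""
    acquired = {}
    for (event, obj), _effect in owned.items():
        stack = list(masters.get(obj, []))
        while stack:
            m = stack.pop()
            acquired.setdefault((event, m), []).append("M")
            stack.extend(masters.get(m, []))
    return acquired

for (event, obj), effects in sorted(propagate(owned, masters).items()):
    print(event, obj, "A/" + "/".join(effects))   # e.g. "assign CUSTOMER A/M"
```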
3.3 Detection of Possible Redundant Paths
Joining paths in the EDG occur when a master can be reached from a dependent by following two different existence dependency paths transitively from dependent to master. Assume that the project management example is extended with invoicing as in Fig. 8. During his/her assignment to a project, each employee can register the hours performed for the project. This time registration is included on an invoice at the end of the month as an invoice line.
Fig. 8. Extended EDG for Project Management (object types CUSTOMER, PROJECT, ASSIGNMENT, EMPLOYEE, TIME REGISTRATION, INVOICE and INVOICE LINE)
Going along the existence dependency relations from dependent to master, the object type CUSTOMER can be reached in two ways from the class INVOICE LINE:
INVOICE LINE → INVOICE → CUSTOMER
and
INVOICE LINE → TIME REGISTRATION → ASSIGNMENT → PROJECT → CUSTOMER
Applying the propagation rule in the OET automatically identifies this kind of path join: path joins lead to multiple propagations in the OET. In the extended project management example the
object type CUSTOMER will acquire the event types from INVOICE LINE two times, once through each path (see Table 2). Identifying path joins is important since one must answer the question whether one or two customers are involved in an invoice line. In other words: is the customer for whom the work was done (that is to say, the customer connected to the project connected to the invoice line via assignment and time registration) the same person as the one to whom we send the invoice? If this is the case, the double participation is replaced by a single participation and a constraint (an invariant) is added to the class INVOICE LINE:
self.INVOICE.CUSTOMER = self.TIME_REGISTRATION.ASSIGNMENT.PROJECT.CUSTOMER
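Detecting path joins amounts to counting the distinct dependent-to-master paths in the EDG. The following sketch (our own illustration, using the object types of Fig. 8) flags every pair that is reachable through more than one path and would therefore receive a double propagation in the OET:

```python
# Sketch: detecting path joins by counting existence dependency paths (illustrative).
# masters[d] lists the master object types of d, following the extended EDG of Fig. 8.
masters = {
    "PROJECT": ["CUSTOMER"],
    "ASSIGNMENT": ["PROJECT", "EMPLOYEE"],
    "TIME REGISTRATION": ["ASSIGNMENT"],
    "INVOICE": ["CUSTOMER"],
    "INVOICE LINE": ["TIME REGISTRATION", "INVOICE"],
}

def path_count(dependent, target, masters):
    """Number of distinct dependent-to-master paths from `dependent` to `target`."""
    if dependent == target:
        return 1
    return sum(path_count(m, target, masters) for m in masters.get(dependent, []))

def path_joins(masters):
    """All (dependent, master) pairs reachable through more than one path."""
    all_types = set(masters) | {m for ms in masters.values() for m in ms}
    joins = []
    for d in masters:
        for m in all_types:
            if d != m and path_count(d, m, masters) > 1:
                joins.append((d, m))
    return joins

print(path_joins(masters))   # [('INVOICE LINE', 'CUSTOMER')] -> double propagation
```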
3.4 Alphabet Rule and Default Lifecycle Rule
The alphabet rule also states that the FSM that defines the behavior of an object type P must contain all and only the event types for which there is a 'C', 'M' or 'E' in the column of P in the OET. In addition, the sequence constraints imposed by the FSM must not violate the default lifecycle of create, modify, end. Hence, according to these rules, a default FSM can be generated for each object type. This FSM can be further refined by adding new events and states. Fig. 9 shows the FSM that can automatically be derived from the OET for TIME_REGISTRATION. This FSM can be further refined, for example to ensure that a time registration cannot be modified once it has been invoiced (as in Fig. 10). The case-tool ensures at any time that a creating event is only used for a transition departing from the initial state, that a modifying event is only associated to transitions between intermediate states, and that an ending event is only associated with transitions terminating in a final state.
Table 2. Object-event table for the extended project management example
Event                 CUSTOMER   PROJECT  EMPLOYEE  ASSIGNMENT  TIME REGISTRATION  INVOICE  INVOICE LINE
cr_customer           O/C
end_customer          O/E
cr_project            A/M        O/C
end_project           A/M        O/E
cr_employee                               O/C
end_employee                              O/E
assign                A/M        A/M      A/M       O/C
remove                A/M        A/M      A/M       O/E
register              A/M        A/M      A/M       A/M         O/C
modify_registration   A/M        A/M      A/M       A/M         O/M
end_registration      A/M        A/M      A/M       A/M         O/E
create_invoice        A/M                                                          O/C
pay_invoice           A/M                                                          O/M
end_invoice           A/M                                                          O/E
put_TR_on_invoice     A/M, A/M   A/M      A/M       A/M         A/M                A/M      O/C
modify_invoice_line   A/M, A/M   A/M      A/M       A/M         A/M                A/M      O/M
end_invoice_line      A/M, A/M   A/M      A/M       A/M         A/M                A/M      O/E
Fig. 9. Default FSM for TIME_REGISTRATION (single state "exists": register creates it; modify_registration, put_TR_on_invoice, modify_invoice_line and end_invoice_line are modifications; end_registration ends the life cycle)
Fig. 10. Modified FSM for TIME_REGISTRATION (an additional state "invoiced", entered via put_TR_on_invoice, ensures that a time registration can no longer be modified once it has been invoiced)
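The derivation of the default FSM can be sketched as follows; the code is illustrative only and assumes that the OET column of an object type is available as a mapping from event types to their effect ('C', 'M' or 'E'):

```python
# Sketch: generating the default life cycle of an object type from its OET column
# (illustrative only; MERMAID's internal representation may differ).

def default_fsm(column):
    """column maps each event in the object's alphabet to 'C', 'M' or 'E'.
    The default FSM has one intermediate state 'exists': creating events enter it,
    modifying events loop on it, ending events leave it for the final state."""
    transitions = []
    for event, effect in column.items():
        if effect == "C":
            transitions.append(("initial", event, "exists"))
        elif effect == "M":
            transitions.append(("exists", event, "exists"))
        else:  # 'E'
            transitions.append(("exists", event, "final"))
    return transitions

# Column of TIME REGISTRATION in Table 2 (owned and acquired participations):
time_registration = {
    "register": "C", "modify_registration": "M", "end_registration": "E",
    "put_TR_on_invoice": "M", "modify_invoice_line": "M", "end_invoice_line": "M",
}
for source, event, target in default_fsm(time_registration):
    print(f"{source} --{event}--> {target}")   # reproduces the FSM of Fig. 9
```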
4 Completeness
Traditionally, modeling is viewed as a mapping of an area or part of the real world into a model [1][9]. In this view, validity means that all statements made by the model are correct and relevant to the problem, whereas completeness means that the model contains all the statements about the domain that are correct and relevant. When checking the completeness of a model, the user requirements are the reference point. Hence, user sign-off is often considered to be a de facto measurement of completeness [10]. Unfortunately, users often don't even understand data models, let alone object-oriented conceptual models. It is therefore impossible to check the completeness of a model by having it checked by users alone. In this respect, the automatic generation of those parts of the specifications that can be inferred from the already existing specifications will simplify the checking for completeness of the specifications. What can be inferred is, however, tightly connected to the semantics of the techniques for conceptual modeling.
As an example, let us reconsider the class diagram of Fig. 2: it is semantically correct but incomplete, as some relevant constraints were neither identified nor explicitly incorporated into the model. It is certainly possible to add a note or a stereotype to express the differences between the two associations, or else to express the difference in the behavioral model. The important point is, however, that the diagramming technique does not force the requirements engineer to think of the difference: it does not help in discovering the incompleteness in the model. By transforming this graph into an EDG, the incompleteness is resolved, resulting in a model that is semantically more complete. In the project management example, the transformation of the class diagram to an EDG leads to the creation of the object type ASSIGNMENT. Subsequently, the alphabet rule requires the definition of a creating and an ending event type for the object type ASSIGNMENT, namely assign and remove. These event types allow specifying under what conditions it is allowed to assign and remove an employee to/from a project. The automatic generation of these two events helps in achieving the completeness of the model. Nothing in the original UML diagram will point the requirements engineer to consider modeling these events.
In [4][5] the propagation rule is motivated as follows. Since an existence dependent object cannot exist outside the life of its master, anything that happens to the dependent also affects the master, at least indirectly. By notifying the master of the occurrence of the events on its dependents, the master class is able to do some accounting (e.g. in EMPLOYEE, counting the number of projects an employee has ever worked on), or to enforce some constraints (e.g. PROJECT can set as a precondition for
the assign event that the state of the project should not be ’closed’). Again, the propagation rule illustrates how the automatic generation of object-event participations makes the specifications more complete: by propagating event type participations, all possible places for constraint definitions and information gathering are identified. In this way, the requirements engineer is invited to consider all these elements for the inclusion of potential business rules. In the end, when all requirements have been collected, some of the marked cells might have no constraint or method body associated with them. Those participations can easily be removed before implementation. Again, the rules of MERODE improve the self-completing character of requirements. Finally, the OET provides an automatic mechanism for identifying path joins, which in turn leads to the identification of relevant constraints in the domain.
5 Discussion
The key factor of the single model principle is the verification of consistency between the different views of a model. In [3], Paige and Ostroff illustrate how BON/Eiffel follows the single model principle and how the principle can be applied to UML/Java by using profiles. In order to achieve a single model approach, they strongly restrict the types of UML diagrams used: only class diagrams and collaboration diagrams are included in the deliverables of the approach. Paige and Ostroff also identify two types of dependencies between the deliverables: an automatic construction dependency, where a tool generates one deliverable from another and guarantees semantic consistency, and an algorithmic consistency checking dependency, where an algorithm is used to detect all inconsistencies between two deliverables and a report is thereafter generated for the developers.
MERODE also strongly restricts the types of diagrams used in order to meet the single model principle. In addition, the EDG takes an unusual approach to data modeling, but as explained in [4][5], it is exactly existence dependency that is the key to the semantic consistency checking. Achieving a single model approach with UML is rather difficult because of the lack of precise and formal semantics. The need for a formal underpinning of UML has long been recognized and significant advances have been made [11], [12], [13], [14]. Many of these efforts are, however, limited to the isolated definition of a single modeling notation [15], [13], [16], [17]. Advances have been made towards the integration of different UML views [18]. Examples of such integration efforts are the definition of state machine inheritance in relation to the generalization/specialization hierarchy [19], [20], the integration of the life-cycle model and the interaction model [21] [18], or the integration of behavior and the notion of composition [16].
In this paper we have illustrated how the MERODE case-tool addresses the consistency checking required to achieve the single model approach. In fact, the MERODE case-tool uses a mix of automated construction (consistency by construction) and algorithmic consistency checking (consistency by analysis). Indeed, since the requirements engineer can further modify the diagrams, the automatic construction must be complemented by an algorithmic verification for those parts of the diagrams that were not constructed automatically. As an example, the MERODE case-tool provides an algorithm for checking FSMs for unreachable states. However,
because a large part of the specifications were generated automatically, the number of remaining inconsistencies that have to be detected by algorithmic verification is much smaller than if the three views were built in an independent manner. The automatic generation of specifications is also a means to avoid a "big bang" approach to quality, that is to say, an approach where quality is only checked at the end of the specification process, causing rework and delay. An additional benefit of the automatic construction of specifications is that it helps to improve the completeness of the specifications. Since its creation, the MERMAID case-tool has proved its usefulness in several real-life projects, the largest of which counts over 44 enterprise objects and 134 business events. Since MERODE only covers the domain modeling part of a project, the tool has been provided with an XMI [22] interface. This allows exporting the specifications to other case-tools, e.g. those that support all types of UML diagrams.
References
[1] Lindland, O.I., Sindre, G., Sølvberg, A.: Understanding Quality in Conceptual Modeling. IEEE Software, March 1994, pp. 42–49
[2] UML, OMG, http://www.omg.org/UML
[3] Paige, R., Ostroff, J.: The Single Model Principle. Journal of Object Technology, vol. 1, no. 5, November–December 2002, pp. 63–81. Online at http://www.jot.fm/issues/issue_2002_11/column6
[4] Snoeck, M., Dedene, G.: Existence Dependency: The Key to Semantic Integrity Between Structural and Behavioral Aspects of Object Types. IEEE Transactions on Software Engineering, 24(24), 233–251
[5] Snoeck, M., Dedene, G., Verhelst, M., Depuydt, A.M.: Object-oriented Enterprise Modelling with MERODE. Leuven University Press, Leuven, 1999
[6] Jackson, M., Cameron, J.: System Development. Prentice Hall (1983)
[7] Chen, P.P.: The Entity Relationship Approach to Logical Database Design. QED Information Sciences, Wellesley (Mass.), 1977
[8] Cook, S., Daniels, J.: Designing Object Systems: Object-oriented Modelling with Syntropy. Prentice Hall (1994)
[9] Schuette, R., Rotthowe, T.: The Guidelines of Modeling – An Approach to Enhance the Quality in Information Models. In: Tok Wang Ling, Sudha Ram, Mong Li Lee (eds.), Conceptual Modeling – ER'98, 17th International Conference on Conceptual Modeling, Singapore, LNCS 1507, Springer
[10] Moody, D.L., Shanks, G.G., Darke, P.: Improving the Quality of Entity Relationship Models – Experience in Research and Practice. In: Tok Wang Ling, Sudha Ram, Mong Li Lee (eds.), Conceptual Modeling – ER'98, 17th International Conference on Conceptual Modeling, Singapore, LNCS 1507, Springer
[11] pUML, The Precise UML Group, http://www.cs.york.ac.uk/puml/
[12] Kuzniarz, L., Reggio, G., Sourrouille, J.L., Huzar, Z.: Workshop on Consistency Problems in UML-based Software Development, Workshop Materials, Research Report 2002:06, Blekinge Institute of Technology, Ronneby, 2002. Workshop at the UML 2002 Conference, online at http://www.ipd.bth.se/consistencyUML/
[13] Evans, A., France, R., Lano, K., Rumpe, B.: Developing the UML as a Formal Modelling Notation. In: UML'98 Beyond the Notation, International Workshop, Mulhouse, France, P.-A. Muller, J. Bézivin (eds.), 1998
[14] Rumpe, B.: A Note on Semantics (with an Emphasis on UML). In: Second ECOOP Workshop on Precise Behavioural Semantics, H. Kilov, B. Rumpe (eds.), Technische Universität München, TUM-I9813, 1998
[15] Saksena, M., France, R.B., Larrondo-Petrie, M.M.: A Characterization of Aggregation. In: Proceedings of the International Conference on Object Oriented Information Systems, 9–11 September, Paris, 1998
[16] Brunet, J.: An Enhanced Definition of Composition and its Use for Abstraction. In: Proceedings of the International Conference on Object Oriented Information Systems, 9–11 September, Paris, 1998
[17] Bourdeau, R.H., Cheng, B.H.C.: A Formal Semantics for Object Model Diagrams. IEEE Transactions on Software Engineering, 21(10), October 1995, pp. 799–821
[18] Bruel, J.M., Lilius, J., Moreira, A., France, R.B.: Defining Precise Semantics for UML. ECOOP 2000 Workshop Reader, LNCS 1964, Springer, 2000, pp. 113–122
[19] Snoeck, M., Dedene, G.: Generalisation/Specialisation and Role in Object-oriented Conceptual Modelling. Data and Knowledge Engineering, 19(2), 1996
[20] Le Grand: Specialisation of Object Lifecycles. In: Proceedings of the International Conference on Object Oriented Information Systems, 9–11 September, Paris, 1998
[21] Cheung, K.S., Chow, K.O., Cheung, T.Y.: Consistency Analysis on Lifecycle Model and Interaction Model. In: Proceedings of the International Conference on Object Oriented Information Systems, 9–11 September, Paris, 1998
[22] OMG, XML Metadata Interchange, http://www.omg.org/technology/documents/formal/xmi.htm
Defining Metrics for UML Statechart Diagrams in a Methodological Way
Marcela Genero, David Miranda, and Mario Piattini
ALARCOS Research Group, Department of Computer Science, University of Castilla-La Mancha, Paseo de la Universidad, 4, 13071 Ciudad Real, Spain
{Marcela.Genero,Mario.Piattini}@uclm.es
[email protected]
Abstract. The fact that the usage of metrics at early phases of OO development can help designers make better decisions is gaining relevance. Moreover, the need for early indicators of external quality attributes, such as understandability, based on early metrics is growing. Several works exist on metrics for UML structural diagrams such as class diagrams. However, metrics for UML behavioral diagrams have largely been disregarded in the software measurement arena. This fact led us to define a set of metrics for the size and structural complexity of UML statechart diagrams. Apart from the definition of the metrics, a contribution of this study is the methodological approach that was followed to validate them theoretically and to validate them empirically as understandability indicators.
Keywords: OO software, UML statechart diagrams, understandability, maintainability, structural complexity, size, metrics, theoretical validation, empirical validation, experiment replication
1 Introduction
It is widely recognised that the structural properties of OO software artefacts obtained at early phases of development have a great influence on the quality of the product that is finally implemented. For this reason several proposals of metrics exist that can be applied to measure the size, structural complexity, coupling, etc. of UML class diagrams [11, 15, 20, 27, 28] and use case diagrams [23, 28]. However, there is little reference in the existing literature to metrics for behavioural diagrams such as UML statechart diagrams. One of the first approaches towards the definition of metrics for behavioural diagrams can be found in [17], where metrics were applied to statechart diagrams developed with OMT [34]. Yacoub et al. [40] proposed structural complexity and coupling metrics for measuring the quality of dynamic executions. These metrics were defined based on concepts such as Petri nets and McCabe's cyclomatic complexity and were applied to simulated scenarios in Real-Time Object Modelling (ROOM) [35]. Poels and Dedene [33] defined structural complexity metrics for event-driven OO conceptual models using MERODE [37]. These proposals of metrics have not gone beyond the definition step. As far as we know, there are no published works related to their theoretical and
empirical validation (except Poels and Dedene, who performed the theoretical validation). The lack of metrics for diagrams that capture dynamic aspects of OO software motivated us to define metrics for behavioural diagrams, starting with metrics for UML statechart diagrams [21]. The aim of this paper is to define a set of metrics for measuring the size and structural complexity of UML statechart diagrams and to investigate through experimentation whether they are related to the understandability of UML statechart diagrams.1 If such a relationship exists and is confirmed by empirical studies, we will have really obtained early indicators of UML statechart diagram understandability. We consider understandability because it is an external quality attribute which directly influences several quality characteristics [24], among others maintainability.2 The definition and the theoretical and empirical validation of the metrics have been done in a disciplined manner following a method that emerged as a combination of two proposals [13, 14]. The rest of this paper is organised in the following way: Section 2 presents the identification of metric goals and the proposal of metrics for UML statechart diagrams. Sections 3 and 4 present the theoretical and the empirical validation of the proposed metrics, respectively. The paper ends with some conclusions and outlines the direction of our future research work.
2 Metric Definition
Using the GQM [1, 2] template for goal definition, the goal pursued in the definition of the metrics for UML statechart diagrams is: Analyse UML statechart diagrams for the purpose of Assessing with respect to their Maintainability from the point of view of the Conceptual modellers, OO software designers in the context of Software development organisations. Following this goal we have defined a set of metrics, each one focusing on a different UML statechart diagram element [21] (see Table 1).
3 Theoretical Validation of the Proposed Metrics
For the theoretical validation of the proposed metrics we followed Briand et al.'s framework [4] as a property-based framework, and Poels and Dedene's framework [32] as a measurement theory-based framework. Briand et al.'s framework [4] provides a set of mathematical properties that characterise and formalise several important measurement concepts, such as size, length, complexity, cohesion and coupling, related to internal software attributes.
1 The theoretical basis for developing quantitative models relating structural properties (size and structural complexity) and external quality attributes (understandability, maintainability) is based on the model provided by Briand et al. [6] and the standard ISO-9126 [24].
2 There exists a lot of work related to software measurement that considers understandability to be a factor that influences maintainability [19, 8, 22].
Table 1. Metrics for UML statechart diagrams
Size metrics
– NEntryA (Number of entry actions): the total number of entry actions, i.e. the actions performed each time a state is entered.
– NExitA (Number of exit actions): the total number of exit actions, i.e. the actions performed each time a state is left.
Structural complexity metrics
– NA (Number of activities): the total number of activities (do/activity).
– NSS (Number of simple states): the total number of states, considering also the simple states within the composite states.
– NCS (Number of composite states): the total number of composite states.
– NE (Number of events): the total number of events.
– NG (Number of guards): the total number of guard conditions.
– McCabe (Cyclomatic Number of McCabe [29]3): defined as |NT| - |NSS| + 2.
– NT (Number of transitions): the total number of transitions, considering common transitions (the source and the target states are different), the initial and final transitions, self-transitions (the source and the target states are the same) and internal transitions (transitions inside a state that respond to an event but without leaving the state).
3 Even though the Cyclomatic Number of McCabe was defined to calculate single module complexity and entire system complexity, we adapted it for measuring the structural complexity of UML statechart diagrams.
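To make the definitions of Table 1 concrete, the following sketch computes the metrics from a simple statechart representation. The State and Transition classes are assumptions made for this illustration; they are not part of the metric definitions themselves.

```python
# Sketch: computing the Table 1 metrics from an assumed statechart representation.
from dataclasses import dataclass, field
from typing import List

@dataclass
class State:
    name: str
    entry_actions: List[str] = field(default_factory=list)
    exit_actions: List[str] = field(default_factory=list)
    activities: List[str] = field(default_factory=list)     # do/activity
    substates: List["State"] = field(default_factory=list)  # non-empty => composite state

@dataclass
class Transition:
    source: str
    target: str
    event: str = ""
    guard: str = ""

def metrics(states, transitions):
    def flatten(ss):
        for s in ss:
            yield s
            yield from flatten(s.substates)
    all_states = list(flatten(states))
    simple = [s for s in all_states if not s.substates]
    m = {
        "NEntryA": sum(len(s.entry_actions) for s in all_states),
        "NExitA": sum(len(s.exit_actions) for s in all_states),
        "NA": sum(len(s.activities) for s in all_states),
        "NSS": len(simple),
        "NCS": len(all_states) - len(simple),
        "NE": len({t.event for t in transitions if t.event}),
        "NG": sum(1 for t in transitions if t.guard),
        "NT": len(transitions),
    }
    m["McCabe"] = m["NT"] - m["NSS"] + 2   # cyclomatic number adapted to statecharts
    return m
```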
Theoretical validation of the NSS metric. For our purpose, and in accordance with Briand et al.'s framework [4], we consider that a statechart diagram is a system composed of states (elements) and transitions (relations). A module is composed of a subset of the states and transitions. We will demonstrate that NSS fulfils all of the axioms that characterise size metrics, as follows:
– Nonnegativity. The number of states in a statechart diagram is never negative, so NSS can never be negative.
– Null value. If we have no states, NSS = 0.
– Module additivity. If we consider that a statechart diagram is composed of modules with no states in common, the number of states of the statechart diagram will always be the sum of the numbers of states of its modules.
Following reasoning analogous to that used for the NSS metric, it can be proved that the other metrics that count statechart diagram elements, such as NCS and NE, are also size metrics.
Theoretical validation of the NA metric. We consider that states are system modules, the activities are the elements, and relationships are represented by the relation
"belong to", which reflects that each activity belongs to a state. We will demonstrate that NA fulfils all of the axioms that characterise size metrics, as follows:
– Nonnegativity. A state can or cannot have activities, i.e. it could happen that NA = 0 or NA > 0, but never NA < 0.
– Null value. If we have no activities then NA = 0.
– Module additivity. If we consider that a state is composed of substates (modules) with no activities in common, the NA of the state will always be the sum of the NA of all its substates, because each activity of a substate is an activity of the state.
In a similar way it can be proved that the metrics NEntryA, NExitA and NG are also size metrics.
Theoretical validation of the NT metric. We consider that a statechart diagram is a system composed of states (elements) and transitions (relations). A module is composed of a subset of the statechart diagram states and statechart diagram transitions. We will demonstrate that NT fulfils all of the axioms that characterise complexity metrics, as follows:
– Nonnegativity. It is obvious that the number of transitions is always null or positive. Hence, NT ≥ 0.
– Null value. If there are no transitions within a statechart diagram then NT = 0.
– Symmetry. The number of transitions does not depend on the convention used to represent the transitions.
– Module monotonicity. According to the definition of this property, it is obvious that, for any two modules m1 and m2 of the statechart diagram SD with no transitions in common, NT(SD) ≥ NT(m1) + NT(m2).
– Disjoint module additivity. Let m1 and m2 be any two disjoint modules such that SD = m1 ∪ m2, and let NT1 and NT2 be the numbers of transitions in the modules m1 and m2. Obviously NT = NT1 + NT2, because m1 and m2 are disjoint modules.
A property-based approach such as Briand et al.'s framework proposes a measure property set that is necessary but not sufficient [4, 32]. The properties can be used as a filter to reject proposed measures [25], but they are not sufficient to prove the validity of a measure. A measurement theory-based approach to software metric validation, like DISTANCE [32], offers a measure construction procedure to model properties of software artefacts and define the corresponding software metrics. An important pragmatic consequence of the explicit link with measurement theory is that the resulting measures define ratio scales. Due to space constraints we cannot present the measure construction process for the proposed metrics. A detailed validation can be found in [21].
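The meaning of these properties can be illustrated with a small executable check on a toy diagram; the data and the partition into modules below are ours and serve only to exemplify module additivity (for NSS) and disjoint module additivity (for NT):

```python
# Toy check of two properties used in the theoretical validation (illustrative data).
states = {"s1", "s2", "s3", "s4"}
transitions = {("s1", "s2"), ("s2", "s3"), ("s3", "s4"), ("s2", "s4")}

def NSS(module_states):
    return len(module_states)

def NT(module_transitions):
    return len(module_transitions)

# Partition the diagram into two modules with nothing in common; m2 is taken to
# contain all remaining transitions, including those crossing the two state sets.
m1_states, m2_states = {"s1", "s2"}, {"s3", "s4"}
m1_trans = {(a, b) for (a, b) in transitions if a in m1_states and b in m1_states}
m2_trans = transitions - m1_trans

assert NSS(states) == NSS(m1_states) + NSS(m2_states)   # module additivity
assert NT(transitions) == NT(m1_trans) + NT(m2_trans)   # disjoint module additivity
assert NSS(set()) == 0 and NT(set()) == 0               # null value
```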
4 Empirical Validation of the Proposed Metrics
In this section we describe an experiment, and its replication, that we carried out to empirically validate the proposed metrics as early understandability indicators. We have followed some suggestions provided in [7, 26, 31, 39] on how to perform controlled experiments and have used (with only minor changes) the format proposed by Wohlin et al. [39] to describe them.
Definition. Using the GQM template for goal definition, the goal of the experiment is the following: Analyse UML statechart diagram structural complexity and size metrics for the purpose of Evaluating with respect to their capability of being used as understandability indicators of UML statechart diagrams from the point of view of OO software modellers and OO software designers in the context of Undergraduate students in the final year of Computer Science and teachers of the Area of Software Engineering at the Department of Computer Science in the University of Castilla-La Mancha.
Planning. The planning includes the following activities:
– Context selection. The subjects were eight teachers and eleven students. The students are enrolled in the final year of Computer Science at the Department of Computer Science in the University of Castilla-La Mancha in Spain. All of the teachers belong to the Software Engineering area. The experiment is specific since it focuses on UML statechart diagram structural complexity and size metrics. The ability to generalise from this specific context is further elaborated below when we discuss threats to the experiment. The experiment addresses a real problem, i.e., which indicators can be used to assess the understandability of UML statechart diagrams? To this end it investigates the correlation between metrics and understandability.
– Selection of subjects. The subjects were chosen for convenience, i.e. the subjects are students that have medium experience in the design and development of OO software.
– Variables selection. The independent variables are UML statechart diagram structural complexity and size, and the dependent variable is UML statechart diagram understandability.
– Instrumentation. The objects used in the experiment were 20 UML statechart diagrams. The independent variable was measured by the metrics presented in Section 2. The dependent variable was measured by the time each subject spent answering the questionnaire attached to each diagram. We called that time "understandability time".
– Hypothesis formulation. An important aspect of experiments is to know and to state in a clear and formal way what we intend to evaluate in the experiment. This leads us to the formulation of the following hypotheses: Null hypothesis, H0: there is no significant correlation between the UML statechart diagram structural complexity and size metrics and the understandability time. Alternative hypothesis, H1: there is a significant correlation between the UML statechart diagram structural complexity and size metrics and the understandability time.
– Experiment design. We selected a within-subject design, i.e. all the questionnaires had to be solved by each of the subjects. The subjects were given the tests in different orders.
Operation. It is in this phase that measurements are collected, including the following activities:
– Preparation. At the time the experiment was done all of the students had taken a course on Software Engineering, in which they learnt in depth how to design OO
software using UML. Moreover, the subjects were given an intensive training session before the experiment took place. However, the subjects were not aware of what aspects we intended to study, nor were they informed about the actual hypotheses stated. We prepared the material we handed to the subjects, which consisted of a guide explaining the UML statechart notation and 20 UML statechart diagrams. These diagrams were related to different universes of discourse that were easy enough to be understood by each of the subjects. The structural complexity and size of each diagram is different, covering a broad range of the metric values. Each diagram had a test enclosed, which includes a questionnaire in order to evaluate whether the subjects really understood the content of the UML statechart diagrams. Each questionnaire contained exactly the same number of questions (four), and the questions were conceptually similar and were written in identical order. Each subject had to write down the time at which he or she started answering the questionnaire and the time at which he or she finished. The difference between the two is what we called the understandability time (expressed in seconds).
– Execution. The subjects were given all of the material described in the previous paragraph. We explained to them how to carry out the experiment. We allowed them one week to do the experiment, i.e., each subject had to carry out the test alone, and could use unlimited time to solve it. We collected all of the data, with the understandability time calculated from the responses of the experiments.
– Data validation. Once the data were collected, we checked whether the tests were completed and whether the questions had been answered correctly. All the tests were considered valid because all the questions were correctly answered.
Analysis and Interpretation. First we summarized the data collected for each diagram. We had the metric values and we calculated the mean of the subjects' understandability time for each statechart diagram.4 We then applied the Kolmogorov-Smirnov test to ascertain whether the distribution of the data collected was normal. As the data were non-normal we decided to use a non-parametric test, Spearman's correlation coefficient, with a level of significance α = 0.05, which corresponds to a confidence level of 95% (i.e. the probability of rejecting H0 when H0 is in fact true is at most 5%, which is statistically acceptable). Each of the metrics was correlated separately to the mean of the subjects' understandability time (see Table 2).
Table 2. Spearman's correlation coefficients between metrics and understandability time
NEntryA   NExitA    NA       NSS      NCS      NT       NE       NG       McCabe
0.1808    -0.2521   0.4830   0.4999   0.3352   0.6049   0.4261   0.5535   0.0773
For a sample size of 20 (mean values for each diagram) and α = 0.05, the Spearman cut-off for accepting H0 is 0.44 [5, 16]. Because the computed Spearman's correlation coefficients for the metrics NA, NSS, NG and NT (see Table 2) are above this cut-off and the p-value < 0.05, the null hypothesis H0 is rejected. Given these results, we can conclude that there is a significant correlation between the NA, NSS, NT and NG metrics and the subjects' understandability time.
4 For analyzing the empirical data we used the Statistical Package for the Social Sciences (SPSS) [38].
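The statistical procedure described above (a normality check followed by Spearman's correlation) can be reproduced with standard libraries instead of SPSS. The sketch below uses SciPy; the data values are placeholders, not the values collected in the experiment.

```python
# Sketch of the analysis: normality check plus Spearman correlation (placeholder data).
import numpy as np
from scipy import stats

# One metric (e.g. NT) and the mean understandability time (seconds) per diagram;
# these 20 values are invented for illustration only.
nt   = np.array([5, 8, 12, 7, 15, 9, 11, 6, 14, 10, 13, 4, 16, 8, 12, 9, 7, 11, 15, 6])
time = np.array([62, 85, 130, 70, 160, 95, 120, 58, 150, 100, 140, 50, 170, 90,
                 125, 98, 72, 118, 155, 64])

# Kolmogorov-Smirnov test against a normal distribution fitted to the data.
ks_stat, ks_p = stats.kstest(time, "norm", args=(time.mean(), time.std()))

# Non-parametric association: Spearman's rank correlation coefficient.
rho, p_value = stats.spearmanr(nt, time)
print(f"KS p = {ks_p:.3f}, Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```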
Part of the information that these metrics provide might be redundant, which in statistical terms is equivalent to saying that the metrics might be highly correlated. This justifies the interest of analyzing the information that each metric captures in order to eliminate such redundancy. In experimental software engineering research [6, 9] this problem is addressed by using Principal Component Analysis (PCA) [18]. In this case, the purpose of the PCA is to reduce the space of 11 metric dimensions that contain the initial information. From the results of the PCA we may conclude that the rotated principal components are difficult to interpret, and that it is too premature to decide which of the metrics we proposed are redundant. This confirms what is already known [6, 9]: the results obtained in a PCA are dependent on the data. So further investigation is needed to obtain stronger findings and decide whether some of the metrics are redundant or not.
Validity evaluation. We will discuss the various issues that threaten the validity of the empirical study and how we attempted to alleviate them:
– Threats to Conclusion Validity. Conclusion validity defines the extent to which conclusions are statistically valid. The only issue that could affect the statistical validity of this study is the size of the sample data (20 values), which is perhaps not enough for either parametric or non-parametric statistical tests [5]. We are aware of this, so we will try to obtain bigger data samples through more experimentation.
– Threats to Construct Validity. Construct validity is the degree to which the independent and the dependent variables are accurately measured by the measurement instruments used in the experiment. For the dependent variable we use the understandability time, i.e., the time each subject spent answering the questions related to each diagram, which is considered the time they need to understand it. As it is an objective measure, we consider the understandability time to be constructively valid. The construct validity of the metrics used for the independent variable is guaranteed by the theoretical validation we carried out (see Section 3).
– Threats to Internal Validity. Internal validity is the degree of confidence in a cause-effect relationship between factors of interest and the observed results. The analysis performed here is correlational in nature. We have demonstrated that several of the metrics investigated had a statistically and practically significant relationship with understandability. Such a statistical relationship does not demonstrate per se a causal relationship; it only provides empirical evidence of it. Only controlled experiments, where the metrics would be varied in a controlled manner and all other factors would be held constant, could really demonstrate causality. However, such a controlled experiment would be difficult to run since varying structural complexity and size in a system, while preserving its functionality, is difficult in practice. On the other hand, it is difficult to imagine what alternative explanations for our results there could be besides a relationship between structural complexity, size and understandability. The following issues have also been dealt with: differences among subjects, knowledge of the universe of discourse among class diagrams, precision in the time values, learning effects, fatigue effects, persistence effects, subject motivation, plagiarism, influence between students, etc.
– Threats to External Validity. External validity is the degree to which the research results can be generalised to the population under study and to other research settings. The greater the external validity, the more the results of an empirical study can be generalised to actual software engineering practice. Two threats have been identified which limit the ability to apply any such generalisation:
Materials and tasks used. In the experiment we tried to use statechart diagrams and tasks which are representative of real cases, but more empirical studies taking "real cases" from software companies must be done in the future.
Subjects. To overcome the difficulty of obtaining professional subjects, we used teachers and students from advanced software engineering courses. We are aware that more experiments with practitioners and professionals must be carried out in order to be able to generalise these results. However, in this case the tasks to be performed do not require a high level of industrial experience, so experiments with students can be considered appropriate [3].
Presentation and package. As the diffusion of experimental data is important for the external replication of experiments [12], we have put all of the material of this experiment on the website http://alarcos.inf-cr.uclm.es.

4.1 Replication of the Experiment

In order to corroborate the findings obtained in the experiment described above, we carried out a strict internal replication [3, 12] of it. The most important differences between the previous experiment and this replication are:
– The subjects were third-year undergraduate students of Computer Science who had taken only one Software Engineering course, in which they learnt how to design OO software using UML. This means that the subjects were less experienced.
– The subjects had to solve the tests alone, in no more than two hours. Any doubt could be resolved by the person who monitored the experiment. The replication was thus carried out in a more controlled environment, since it was supervised, which also helps to control plagiarism between subjects.
After performing a correlational analysis using the data obtained in the replication, we obtained the results shown in table 3.
Table 3. Spearman's correlation coefficients between metrics and understandability time (replication)

NEntryA    NExitA     NA        NSS       NCS       NT        NE        NG        McCABE
-0.04581   -0.34611   0.51714   0.57474   0.42809   0.54980   0.36412   0.63063   -0.03260
Comparing the findings of both experiments (see tables 2 and 3) we realized that they are similar. This means that the metrics NA, NSS, NG and NT are to some extent correlated with the understandability time of UML statechart diagrams.
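For readers who wish to repeat this kind of analysis on their own data, the following sketch shows how Spearman coefficients such as those in tables 2 and 3 can be computed. The metric values and understandability times below are hypothetical placeholders, not the data of our experiments (which were analysed with SPSS [38]).

# Illustrative only: hypothetical values of four of the proposed metrics and
# hypothetical understandability times (in seconds) for six statechart diagrams.
from scipy.stats import spearmanr

understandability_time = [410, 530, 615, 720, 480, 590]
metrics = {
    "NA":  [4, 6, 9, 12, 5, 8],
    "NSS": [3, 5, 7, 10, 4, 6],
    "NT":  [5, 8, 11, 15, 6, 9],
    "NG":  [1, 2, 4, 6, 1, 3],
}

for name, values in metrics.items():
    rho, p_value = spearmanr(values, understandability_time)
    print(f"{name}: rho = {rho:.3f}, p = {p_value:.3f}")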
5 Conclusions and Future Work

With the hypothesis that the size and the structural complexity of UML statechart diagrams may influence their understandability (and therefore their maintainability), we defined a set of metrics for the structural complexity and size of UML statechart diagrams in a methodological way [21]. The theoretical validity of the proposed metrics, i.e. the fact that they really measure the attribute they purport to measure, was demonstrated through validation following two approaches: a property-based approach, the Briand et al. framework [4], and a measurement-theory-based approach, the DISTANCE framework [32]. Moreover, the use of DISTANCE guarantees that the metrics can be used as ratio scale measurement instruments. Our hypothesis was to some extent empirically corroborated by the controlled experiment we carried out and by its replication. As a result of all the experimental work, we can conclude that the metrics NA, NSS, NG and NT seem to be highly correlated with the understandability of UML statechart diagrams.
Nevertheless, despite the encouraging results obtained, we still consider them preliminary. Further replication, both internal and external, is of course necessary, and new experiments must also be carried out with practitioners who work in software development organizations. Only after performing a family of experiments can we build an adequate body of knowledge from which to extract useful conclusions regarding the use of these metrics in real measurement projects as early understandability indicators for UML statechart diagrams [3, 30, 36]. Once we obtain stronger results in this line, we think the metrics we proposed could also be used to allow OO software modellers a quantitative comparison of design alternatives, and therefore an objective selection among several statechart diagram alternatives with equivalent semantic content, as well as to predict external quality characteristics, like maintainability, in the initial stages of the OO software life cycle, enabling a better resource allocation based on these predictions. In this sense we plan to build a maintainability prediction model (based on the metric values) using traditional statistical techniques and advanced techniques borrowed from artificial intelligence. Finally, another research line of interest would be to evaluate the influence of the structural complexity and size of UML statechart diagrams on other maintainability factors such as modifiability and analysability.

Acknowledgements. This research is part of the DOLMEN (TIC 2000-1673-C06-06) and CALDEA (TIC 2000-1673-C06-06) projects, financed by the Subdirección General de Proyectos de Investigación – Ministerio de Ciencia y Tecnología (Spain).
References
1. Basili, V., Rombach, H.: The TAME project: towards improvement-oriented software environments. IEEE Transactions on Software Engineering 14(6) (1988) 758–773
2. Basili, V., Weiss, D.: A Methodology for Collecting Valid Software Engineering Data. IEEE Transactions on Software Engineering 10(6) (1984) 728–738
3. Basili, V., Shull, F., Lanubile, F.: Building Knowledge through Families of Experiments. IEEE Transactions on Software Engineering 25(4) (1999) 456–473
4. Briand, L., Morasca, S., Basili, V.: Property-based software engineering measurement. IEEE Transactions on Software Engineering 22(1) (1996) 68–85
5. Briand, L., El Emam, K., Morasca, S.: Theoretical and empirical validation of software product measures. Technical Report ISERN-95-03, International Software Engineering Research Network (1995)
6. Briand, L., Wüst, J., Lounis, H.: Replicated Case Studies for Investigating Quality Factors in Object-oriented Designs. Technical Report ISERN-98-29 (version 3), International Software Engineering Research Network (1998)
7. Briand, L., Arisholm, S., Counsell, F., Houdek, F., Thévenod-Fosse, P.: Empirical Studies of Object-Oriented Artifacts, Methods, and Processes: State of the Art and Future Directions. Empirical Software Engineering 4(4) (1999) 387–404
8. Briand, L., Bunse, C., Daly, J.: Controlled Experiment for Evaluating Quality Guidelines on the Maintainability of Object-Oriented Designs. IEEE Transactions on Software Engineering 27(6) (2001) 513–530
9. Briand, L., Wüst, J.: Empirical studies of quality models. In: Zelkowitz, M. (ed.): Advances in Computers 59, Academic Press (2002) 97–166
10. Briand, L., Morasca, S., Basili, V.: An operational process for goal-driven definition of measures. IEEE Transactions on Software Engineering 28(12) (2002) 1106–1125
11. Brito e Abreu, F., Carapuça, R.: Object-Oriented Software Engineering: Measuring and Controlling the Development Process. 4th International Conference on Software Quality, McLean, VA, USA (1994)
12. Brooks, A., Daly, J., Miller, J., Roper, M., Wood, M.: Replication of experimental results in software engineering. Technical Report ISERN-96-10, International Software Engineering Research Network (1996)
13. Calero, C., Piattini, M., Genero, M.: Method for obtaining correct metrics. International Conference on Enterprise and Information Systems (ICEIS 2001) (2001) 779–784
14. Cantone, G., Donzelli, P.: Production and maintenance of software measurement models. Journal of Software Engineering and Knowledge Engineering 5 (2000) 605–626
15. Chidamber, S., Kemerer, C.: A Metrics Suite for Object Oriented Design. IEEE Transactions on Software Engineering 20(6) (1994) 476–493
16. CUHK – Chinese University of Hong Kong, Department of Obstetrics and Gynaecology, http://department.obg.cuhk.edu.hk/ResearchSupport/Minimum_correlation.asp (last visited July 22nd, 2002)
17. Derr, K.: Applying OMT. SIGS Books, Prentice Hall, New York (1995)
18. Dunteman, G.: Principal Component Analysis. Sage University Paper 07–69, Thousand Oaks, CA (1989)
19. Fenton, N., Pfleeger, S.: Software Metrics: A Rigorous Approach. 2nd edition, Chapman & Hall, London (1997)
20. Genero, M.: Defining and Validating Metrics for Conceptual Models. Ph.D. thesis, University of Castilla–La Mancha (2002)
21. Genero, M., Miranda, D., Piattini, M.: Defining and Validating Metrics for UML Statechart Diagrams. 6th International ECOOP Workshop on Quantitative Approaches in Object-Oriented Software Engineering (QAOOSE 2002) (2002) 120–136
22. Harrison, R., Counsell, S., Nithi, R.: Experimental Assessment of the Effect of Inheritance on the Maintainability of Object-Oriented Systems. Journal of Systems and Software 52 (2000) 173–179
23. Henderson-Sellers, B., Zowghi, D., Klemola, T., Parasuram, S.: Sizing Use Cases: How to Create a Standard Metrical Approach. 8th International Conference (OOIS 2002), LNCS 2425 (2002) 409–421
24. ISO 9126: Software Product Evaluation – Quality Characteristics and Guidelines for their Use. ISO/IEC Standard 9126, Geneva (2001)
25. Kitchenham, B., Stell, J.: The danger of using axioms in software metrics. IEE Proc.-Soft. Eng. 144(5–6) (1997) 79–285
26. Kitchenham, B., Pfleeger, S., Pickard, L., Jones, P., Hoaglin, D., El-Emam, K., Rosenberg, J.: Preliminary Guidelines for Empirical Research in Software Engineering. IEEE Transactions on Software Engineering 28(8) (2002) 721–734
27. Lorenz, M., Kidd, J.: Object-Oriented Software Metrics: A Practical Guide. Prentice Hall, Englewood Cliffs, New Jersey (1994)
28. Marchesi, M.: OOA Metrics for the Unified Modeling Language. 2nd Euromicro Conference on Software Maintenance and Reengineering (1998) 67–73
29. McCabe, T.: A Complexity Measure. IEEE Transactions on Software Engineering 2(4) (1976) 308–320
30. Miller, J.: Applying Meta-Analytical Procedures to Software Engineering Experiments. Journal of Systems and Software 54 (2000) 29–39
31. Perry, D., Porter, A., Votta, L.: Empirical Studies of Software Engineering: A Roadmap. In: Finkelstein, A. (ed.): The Future of Software Engineering. ACM (2000) 345–355
32. Poels, G., Dedene, G.: Distance-Based Software Measurement: Necessary and Sufficient Properties for Software Measures. Information and Software Technology 42(1) (2000) 35–46
33. Poels, G., Dedene, G.: Measures for Assessing Dynamic Complexity Aspects of Object-Oriented Conceptual Schemes. 19th International Conference on Conceptual Modelling (ER 2000), LNCS 1920, Springer-Verlag (2000) 499–512
34. Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F., Lorensen, W.: Object-Oriented Modelling and Design. Prentice Hall, USA (1991)
35. Selic, B., Gullekson, G., Ward, P.: Real-Time Object Oriented Modelling. John Wiley & Sons, Inc. (1994)
36. Shull, F., Basili, V., Carver, J., Maldonado, J.: Replicating Software Engineering Experiments: Addressing the Tacit Knowledge Problem. International Symposium on Empirical Software Engineering (ISESE 2002), Nara, Japan, IEEE Computer Society (2002) 7–16
37. Snoeck, M.: On a Process Algebra Approach for the Construction and Analysis of M.E.R.O.D.E.-based Conceptual Models. Ph.D. thesis, Katholieke Universiteit Leuven (1995)
38. SPSS 11.0: Syntax Reference Guide. SPSS Inc., Chicago (2001)
39. Wohlin, C., Runeson, P., Höst, M., Ohlson, M., Regnell, B., Wesslén, A.: Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers (2000)
40. Yacoub, S., Ammar, H., Robinson, T.: Dynamic Metrics for Object Oriented Designs. Sixth IEEE International Symposium on Software Metrics (1998)
Visual SQL – High-Quality ER-Based Query Treatment

Hannu Jaakkola¹ and Bernhard Thalheim²

¹ Tampere University of Technology, Pori, P.O. Box 300, FIN-28101 Pori, [email protected]
² Computer Science Institute, Brandenburg University of Technology at Cottbus, PostBox 101344, D-03013 Cottbus, [email protected]
Abstract. Query formulation is still a difficult task whenever a database schema is large or complex. The user has to understand the schema entirely before a correct and complete formulation of the query can be found. Furthermore, users may overlook types in the SQL schema that must be used in the query. We show in this paper that visualization led in this case to higher conceptual correctness and conceptual completeness. Visualization is based on Visual SQL. Visual SQL follows the paradigm of entity-relationship representation. At the same time, it has the same expressive power as SQL-92. The quality of query formulation is, however, higher.
1 Visualization Increases Correctness and Completeness of Queries
The improvement of database management systems and of computing power, the increase in the size of the databases in use, and the availability of distributed applications have led to new and more challenging applications. Database schemata have grown to a size that is far beyond what an experienced programmer may survey. Therefore, new techniques are required to understand data structuring and to discover knowledge from databases.
Visualization. Visualization of complex conceptual structures is a key component of support for many applications in science and engineering. An ER schema is an abstract structure that is used to model information. ER schemata are used to represent information that can be modeled as objects and connections between those objects. Visualization of structures is only useful to the degree that the associated diagrams effectively convey information to the people that use them. A good diagram helps the reader to understand the system, but a poor diagram can be confusing and misleading. Diagrams should be of practical value. If diagrams
are used only for graphical visualization and not afterwards, they become mere illustrations. If we can, however, translate the diagram to languages such as SQL, we have graphical languages of high utility. Visualization of database structuring offers a number of advantages for querying purposes:
– Visualization helps in surveying and understanding the query and the associations among the relations that are used in one query.
– Visualization enables direct manipulation, e.g. changing restrictions or viewpoints.
– Visualization allows the co-ordination of multiple viewpoints. It supports understanding complex and multivariate data and the exploration of complex databases.
Quality of Querying Depends on Visualization. Characterization of database system quality is based on a number of parameters [Jaa02] such as functionality (suitability, accuracy, interoperability, security), reliability (maturity, fault tolerance, recoverability), efficiency (time behavior, resource utilization), maintainability (analyzability, changeability, stability, testability) [JKV01], portability (adaptivity, installability, co-existence, replaceability), and last but not least usability (understandability, learnability, operability, attractiveness). In this paper we concentrate on the last quality parameter, usability. Users must be able to query a large database system without a complete understanding of the large database schema, without understanding the impact of specific values such as null values, and without deep consideration of integrity constraints. Therefore, we can also evaluate the quality of an SQL query interface with respect to the correctness of the query formulation, the completeness of the query expression with respect to the underlying schema, its suitability for use, its adherence to a predetermined set of expectations, and its freedom from mistakes or flaws [ReG94]. We may furthermore distinguish between conceptual and syntactic quality parameters. Conceptual parameters reflect the business concepts of the enterprise based on the business user layer [Thal00]. These concepts depend on unstated plans and pragmatics of the application, and on policies, objectives, strategies and rules usually applied in the enterprise. Therefore, conceptual quality must be based on a meaningful and at the same time accurate representation. Furthermore, users must be supported to adequately describe the query in its full scope. Syntactic quality parameters are usually supported by query editors. We thus observe that query interfaces must support high-quality querying by supporting
– a correct understanding of business plans, policies, and strategies,
– a recognized set of rules common in the application area,
– a direct involvement of all stakeholders, i.e. business managers, domain experts, data modelers and facilitators, data management professionals, and information system developers,
– transformers to high-quality SQL queries,
– an integration into the context of the request leading to the query, such as architecture, enterprise, life cycle and infrastructure.
This high-level support cannot be provided on the basis of plain SQL. At least some visualization of the query formulation is necessary. We step ahead by enabling users to formulate the request in a visual form.
Early Approaches to Visualization of Queries. ER-SQL has been introduced by several authors [AtC81,Gog94,Hoh93,Kun82,SuM87]. ER-SQL has been extended, generalized and summarized in [Thal00]. Despite its simplicity and high utility it has neither been widely used in practice nor implemented in DBMSs. QBE has been proposed as a query language that allows queries to be represented in a tabular form. Unfortunately, QBE is not relationally complete. Therefore, it does not allow all SQL queries to be specified by QBE table sets. [BOO02] has proposed a graphical extension of SQL that supports handling geometric and spatial information. This query language is supported by a convenient query formulation tool.
Hypothesis: Visualization of Queries through Visual SQL Increases Correctness and Completeness. We have developed Visual SQL [Tha03]. It generalizes, corrects and extends the approaches of [ChT02,VSD] and [DAISY]. These approaches lack a formal semantics. We combine the formalization of [LSS96] with those provided by [CaT94,ScT00b] and [Thal00]. We have chosen Visual SQL as a compromise between SQL-92 and ER-SQL. Visual SQL keeps the advantage of the clear understandability of ER-SQL. At the same time, Visual SQL has the expressive power of SQL-92 and operates on top of the SQL-92 schema. Therefore, the language qualifies to improve the quality of querying. Visual SQL can be considered a compromise between the requirement to visualize querying and the rigidity of the textual query representation used in SQL.
The hypothesis has been tested in a university course environment by randomly assigning students to groups, examining the students with a set of standardized questions under a certain time restriction, and evaluating the results against the standardized queries.
2 Testing Visual SQL Querying against SQL-92 Querying

2.1 Main Constructs of Visual SQL
Visual SQL allows one to define an SQL schema graphically. The SQL schema is displayed together with the integrity constraints and the corresponding enforcement policy. It has the following constructs, which are used in Figure 1:
Relational types are represented by rectangles. Attributes may be shown in the type. Keys are represented by the key sign. If there is more than one minimal key, then we number the minimal keys and annotate each attribute with the numbers of all minimal keys in which the attribute is used. Output attributes are adorned by the surd sign.
Units cover a set of relational types or units. They are represented by rectangles. Output attributes of units are those to which the surd sign is attached.
Relational and unit variables may be assigned to types and units within a query schema.
Comparison predicates such as =, ≠, ≤, ≥, <, > are added to lines which associate attributes with relational types and units. The default comparison is the equality predicate. Special comparisons are tests on whether values appear in a relational class, e.g., the test on null appearance.
Set-valued comparison predicates and operators (any, all, some, exists, in, union, intersection, set difference) and their negations are displayed together with an association line. They are allowed whenever the predicate or the operation is definable in the relational type system [Thal00]. Predicates may be combined into Boolean formulas.
Operations (insert, delete, update, grant, order by, group by) are added to the types or units.
Temporary types and views can be introduced by associating them with a unit and are represented by dotted rectangles covering the unit and the temporary type.
Aggregation values can be defined as algebraic expressions. The expression is added to the type or unit which is used for the expression. Derived attributes are represented together with their expression within a dotted rectangle.
Integrity constraints are associated with the types in their scope by lines. An integrity enforcement policy can be added to the integrity constraint definition.
This set of constructs covers the entire relational algebra and supports aggregation, grouping and ordering. Any SQL-92 query to a database system may be pictured in Visual SQL, since Visual SQL has the same expressive power as SQL-92.

2.2 Using the ER Translation Profile for Generation of SQL Queries
Classically, queries are formulated on the basis of the relational schema. In this case, the relational schema is the main source for query formulation. This approach is appropriate as long as the query is not very complex and as long as the number of relational types to be used in the query is rather small. If the relational schema is large and the query is rather complex users are lost in the schema space and query formulation becomes error-prone. We use the translation profile of the ER schema for derivation of the query. The translation is rather straightforward and similar to the translation from
ER-SQL to SQL-92. The most difficult step is the treatment of set equalities, which can be handled either in a tuple_variable IN subquery form or in a NOT EXISTS subquery form, depending on whether null-valued attributes are involved or not. The translation profile corresponds to the stepwise translation procedure for extended ER schemata proposed and discussed in detail in [Thal00].

2.3 The Query Suite Used for the Hypotheses Test
We developed a number of requests Req_1, ..., Req_26 against the schema in Figure 2. This schema has been used for the experiments on a number of hypotheses. It displays information collected at the students office for lecture planning, student enrollment and project involvement. Each request Req_i has a number of SQL-92 query solutions Sol_i,1, ..., Sol_i,k_i. We call the set {(Req_i, {Sol_i,1, ..., Sol_i,k_i}) | 1 ≤ i ≤ 26} the solution standard.

2.4 An Example of Querying
For illustration let us use the following query Req_18: Provide data on students who have successfully completed those and only those courses which have successfully been given or which are currently given by the student's supervisor. The SQL-92 query Sol_18,1 is rather complex:

SELECT P1.Name, P1.BirthData, P1.Address, P2.Name AS "Name of supervisor"
FROM   Person P1, Professor P2, Student S1, Supervisor, Lecture L, Enroll E
WHERE  P1.Name = Student.Name AND P1.BirthData = Student.BirthData
AND    S1.StudNo = E.StudNo AND E.Result IS NOT NULL
AND    S1.StudNo = Supervisor.StudNo
AND    Supervisor.Name = Professor.Name
AND    Supervisor.BirthData = Professor.BirthData
AND    P2.Name = Professor.Name AND P2.BirthData = Professor.BirthData
AND    L.Name = Professor.Name AND L.BirthData = Professor.BirthData
AND    L.CourseNo IN
       (SELECT E2.CourseNo FROM Enroll E2
        WHERE S1.StudNo = E2.StudNo AND E2.Result IS NOT NULL)
AND    E.CourseNo IN
       (SELECT L2.CourseNo FROM Lecture L2
        WHERE L2.Name = P2.Name AND L2.BirthData = P2.BirthData);

Fig. 1. Visual SQL Involving Equality On Two Visual SQL Subqueries
The formulation of the query is much easier to comprehend through the corresponding Visual SQL expression given in Figure 1. We are able to translate the Visual SQL expression directly into the corresponding SQL query given above. This query may have to be rephrased depending on the translation of the ER schema.

2.5 Conceptual Correctness and Completeness of Queries
We can formally compare SQL expressions by two criteria, conceptual completeness and conceptual correctness, defined as follows:
Conceptual difference Diff(Sol1, Sol0) of Sol0 from Sol1 is the number of concepts in Sol1 that are missing in Sol0.
Conceptual completeness Complete(Sol1, Sol0) of Sol0 compared with Sol1 is defined as
  Complete(Sol1, Sol0) = (Concept(Sol1) − Diff(Sol1, Sol0)) / Concept(Sol1),
where Concept(Sol1) is the number of concepts in Sol1.
Conceptual errors Err(Sol1, Sol0) of Sol0 compared with Sol1 is the number of concepts in Sol0 that are not found in Sol1.
Conceptual correctness Correct(Sol1, Sol0) of Sol0 compared with Sol1 is defined as
  Correct(Sol1, Sol0) = (Concept(Sol1) − Err(Sol1, Sol0)) / Concept(Sol1).
Conceptual similarity Sim(Sol1, Sol0) of a query Sol0 compared with another query Sol1 is defined as the product of conceptual completeness and conceptual correctness.
We now define the conceptual completeness and conceptual correctness of a query Sol0 against a set of queries {Sol1, ..., Solk} as the conceptual completeness and conceptual correctness with respect to the most similar query in the set. To be as general as possible we use the subset of most similar queries for a query, i.e. let
  Sim(M, S0) = {S' ∈ M | Sim(S', S0) = max_{S ∈ M} Sim(S, S0)};
then we define
  Complete({S1, ..., Sk}, S0) = max_{S ∈ Sim({S1,...,Sk}, S0)} Complete(S, S0),
  Correct({S1, ..., Sk}, S0) = max_{S ∈ Sim({S1,...,Sk}, S0)} Correct(S, S0).
In most cases Sim({S1, ..., Sk}, S0) contains only one query. If, however, there is more than one most similar query, then we use this liberal approach for the computation of correctness and completeness.
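Under the assumption that the concepts occurring in a query have already been extracted into sets, these measures are straightforward to compute. The sketch below uses made-up helper names and example concept sets; it only illustrates the definitions above and is not part of the original tooling.

# Illustrative sketch: the concepts of a query are assumed to be given as a set of strings.
def completeness(ref: set, candidate: set) -> float:
    diff = len(ref - candidate)          # concepts of the reference missing in the candidate
    return (len(ref) - diff) / len(ref)

def correctness(ref: set, candidate: set) -> float:
    err = len(candidate - ref)           # concepts of the candidate not found in the reference
    return (len(ref) - err) / len(ref)

def similarity(ref: set, candidate: set) -> float:
    return completeness(ref, candidate) * correctness(ref, candidate)

def against_solution_set(solutions: list, candidate: set):
    best = max(similarity(s, candidate) for s in solutions)
    most_similar = [s for s in solutions if similarity(s, candidate) == best]
    return (max(completeness(s, candidate) for s in most_similar),
            max(correctness(s, candidate) for s in most_similar))

# Example: concepts of a reference solution vs. a student's formulation
ref = {"Person", "Student", "Enroll", "Lecture", "Supervisor", "Result IS NOT NULL"}
cand = {"Person", "Student", "Enroll", "Lecture", "Course"}
print(against_solution_set([ref], cand))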
3 Achievements and Advantages of Visual SQL

3.1 Testing a Hypothesis
We claimed that:
(H1) The conceptual correctness and conceptual completeness of SQL-92 query formulation are higher for complex requests if students use Visual SQL.
(H2) The conceptual correctness and conceptual completeness of SQL-92 query formulation are the same for simple requests if students use Visual SQL.
The test of the hypotheses has been based on a random assignment of students to groups. The random assignment led to an approximately equivalent level of grades between the students of both groups. We checked the subhypothesis that the two groups have the same expectation for grades under the
assumption that grades are distributed under the normal distribution N(x; 0, 1) and with an error probability of 0.05. We tested computer science students in two courses given in fall 2002/2003 with simple and complex query sets:
– Students in the first course got an introduction to SQL-92 based on the classical approach.
– Students in the second course got an introduction to Visual SQL in the same depth as those in the first course.
Both groups had to develop a set of queries in an examination time of 60 minutes. The following table displays the rounded average results of both groups. Correctness and completeness have been evaluated with measures between 0 (complete failure) and 1 (complete rightness). The test requests and solutions have been based on the observations made in the database programming course in Spring 2002. Students attending the database programming course at Cottbus University in Spring 2002 succeeded in formulating very complex search requests in more than 50% of the cases on the basis of Visual SQL.

                   Simple query sets                     Complex query sets
                   Solved   Conc. corr.   Conc. compl.   Solved   Conc. corr.   Conc. compl.
SQL-92 group       92 %     0.8           0.97           54 %     0.6           0.42
Visual SQL group   74 %     0.8           0.92           86 %     0.8           0.94
The number of students was 24 in both groups. Due to the small size of the groups, statistical laws cannot be properly applied: the significance level is too high and the security probability thus too low. For this reason, for both hypotheses the errors of the first kind and of the second kind become too high. If we neglect this, then the first hypothesis cannot be rejected with an α-value of 0.02. The second hypothesis is rejected with the same α-value. Nevertheless, the test demonstrated that Visual SQL creates some overhead. The formulation of queries in Visual SQL therefore becomes more time-consuming compared with the formulation in SQL-92, and hypothesis (H2) should be rejected. The correctness of the given answers was, however, the same; differences between the groups of less than 0.1 are not statistically relevant. Syntactical correctness and completeness have not been evaluated in the test. At the same time, hypothesis (H1) can be accepted based on the test. The difference between the two groups is high enough even for a large error probability of 0.08.
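As an illustration of how such small-sample differences can be checked, the sketch below applies Fisher's exact test to the solved rates for the complex query sets, using counts rounded from the percentages above (24 students per group). This is only an indicative re-analysis under our own assumptions, not the test procedure used in the study.

# Hedged illustration: approximate counts reconstructed from the rounded percentages above.
from scipy.stats import fisher_exact

n = 24
sql92_solved = round(0.54 * n)    # about 13 of 24 complex requests solved in the SQL-92 group
visual_solved = round(0.86 * n)   # about 21 of 24 complex requests solved in the Visual SQL group

table = [[sql92_solved, n - sql92_solved],
         [visual_solved, n - visual_solved]]

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")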
3.2 Observations on Students' Success
We have made a number of observations in university courses: Query formulation efficiency: Visual SQL has been used in our database courses over the last year. In our experience, students were able to formulate complex
queries in much shorter time. The query used in this paper was given in an examination in 2001; only about 10% of the students came up with an almost correct formulation. The same query was given this year in the examination after students had become familiar with Visual SQL. The success rate was higher than 50% for the final SQL query. At the same time, syntactical correctness was better.
Error reduction: Null values and special conditions in schemata are often very difficult to trace and to understand. For this reason, query formulation on relations which allow null values and which have specific restrictions becomes error-prone. Furthermore, query formulations which involve aggregation functions become a nightmare.
Query formulation skills: We have tested a set of more complex queries such as:
– Find all pairs of students who either succeed together or fail together or who both do not attend courses they could attend.
– Find all students who succeeded in exactly two modules out of the four modules in their study program at the university and failed in exactly one of the modules.
– Find all students with the best average success rate of attendance in courses, ordered by the number of terms they have been studying.
– Find all students who have completed at least the minimum required credits in all categories of their study program.
– Find the average student credits for all students who attended all courses in a category of their study program, and the average student credits for all students who did not attend all courses in some category of their study program.
Also simple requests such as the following have been developed faster and more conceptually correct and complete: Find the average number of terms a student needs to complete all courses in his/her study program.
4 Conclusion
We discussed how query formulation quality may be improved by the utilization of visualization concepts. Visual SQL is at the same time
– as powerful as SQL-92, and
– simpler to use and to comprehend, and less error-prone in complex settings.
We thus conclude that the utilization of Visual SQL should be preferred over the utilization of SQL-92 for complex queries which are applied to large database schemata.
Visual SQL leaves space for intensive research. The most severe problems are those of the management and treatment of integrity constraints. A graphical editor is currently under development. The translation profile has already been used in the RADD design system.
References

[AtC81] P. Atzeni and P.P. Chen, Completeness of query languages for the entity-relationship model. Proc. of the 2nd Int. Conf. on the Entity-Relationship Approach to Systems Analysis and Design (ed. P.P. Chen), 1981, 109–121.
[BOO02] N.H. Balki, G. Ozsoyoglu, and Z.M. Ozsoyoglu, A graphical query language: VISUAL and its query processing. IEEE Transactions on Knowledge and Data Engineering, 14, 2002, 5, 955–979.
[CaT94] T. Catarci and L. Tarantino, Database querying by hypergraph manipulation. Proc. IDS'94, Springer, Berlin, 1994, 84–103.
[ChT02] D. Chappell and J.H. Trimble Jr., A visual introduction to SQL. Wiley, NY, 2002.
[DAISY] Daisy200, Visual SQL charts. http://www.daisy2000.com
[Gog94] M. Gogolla, An extended entity-relationship model: Fundamentals and pragmatics. Lecture Notes in Computer Science, 767, Springer, Heidelberg, 1994.
[Hoh93] U. Hohenstein, Formale Semantik eines erweiterten Entity-Relationship-Modells. Teubner, Stuttgart, 1993.
[Jaa02] H. Jaakkola et al., Experiences in software process improvement with small organizations. Proc. IASTED'2002, Innsbruck, 2002, 13–17.
[JKV01] H. Jaakkola, J. Kukkonen, and T. Varkoi, Best practices as reuse infrastructure. Proc. ReTIS'2001, Österreichische Computer Gesellschaft, Vienna, 2001, 9–31.
[Kun82] H.S. Kunii, Graph data model and its data language. Springer, New York, 1982.
[LSS96] L.V.S. Lakshmanan, F. Sadri and I.N. Subramanian, SchemaSQL – A language for interoperability in relational multi-database systems. Proc. VLDB'1996.
[ReG94] M.C. Reingruber and W.W. Gregory, The data modeling handbook. John Wiley & Sons, New York, 1994.
[SaP94] G. Santucci and F. Palmisano, A dynamic form-based visualiser for semantic query languages. Proc. IDS'94, Springer, Berlin, 1994, 249–265.
[ScT00b] K.-D. Schewe and B. Thalheim, Modeling Interaction and Media Objects. Proc. NLDB'2000, LNCS 1959, 2002, 313–324.
[SuM87] K. Subieta and M. Missala, Semantics for the entity-relationship model. The Entity-Relationship Approach, ed. by S. Spaccapietra, North-Holland, Amsterdam, 1987, 197–216.
[Thal00] B. Thalheim, Entity-Relationship Modeling – Fundamentals of Database Technology. Springer, Berlin, 2000.
[Tha03] B. Thalheim, Visual SQL – An ER-Based Introduction to Database Querying. Report I-08/03 of the Computer Science Institute at BTU Cottbus, Cottbus, 2003.
[ThK01] B. Thalheim and T. Kobienia, Generating DB queries for web NL requests using schema information and DB content. Proc. NLDB'2001, LNI 3, Springer, 2001, 205–209.
[VeT02] V. Vestenicky and B. Thalheim, An Intelligent Query Generator. Proc. EJC'2002, IOS Press, Amsterdam, 2002, 135–141.
[VSD] Visual SQL-Designer (VSD), http://www.visualsoftru.com/support.htm
Schema for the University Application

We use the extended entity-relationship model [Thal00]¹.
Fig. 2. Part of the HERM Diagram Specifying the Structure of the University Database
¹ It generalizes the classical entity-relationship model by adding constructs for richer structures such as complex nested attributes, relationship types of higher order i which may have relationship types of order i−1, i−2, ..., 1 as their components, and cluster types that allow the disjoint union of types. Further, HERM extends the ER model by an algebra of operations, by rich sets of integrity constraints, by transactions, by workflows, by views, and by techniques for translating or compiling HERM schemes to relational and object-relational specifications. For an introduction see http://www.informatik.tu-cottbus.de/~thalheim/slides.htm
Multidimensional Schemas Quality: Assessing and Balancing Analyzability and Simplicity

Samira Si-Said Cherfi¹ and Nicolas Prat²

¹ CEDRIC-CNAM; 292, rue Saint-Martin; 75141 Paris Cedex 03; France [email protected]
² ESSEC; Avenue Bernard Hirsch; BP 105; 95021 Cergy Cedex; France [email protected]
Abstract. A data warehouse is a database focused on decision making. Decision makers typically access data warehouses through OLAP tools, based on a multidimensional representation of data. In the past, the key issue of data warehouse quality has often been centered on data quality. However, since OLAP tool users directly access multidimensional schemas, multidimensional schema quality evaluation is also crucial. This paper focuses on the quality of multidimensional schemas, more specifically on the analyzability and simplicity criteria. We present the underlying multidimensional model and address the problem of measuring and finding the right balance between analyzability and simplicity of multidimensional schemas. Analyzability and simplicity are assessed using quality metrics which are described and illustrated based on a case study. The main objective of our approach is to provide the data warehouse designer with precise measures to support him in the choice among several alternative multidimensional schemas. Keywords: Multidimensional schema design, quality criteria, quality metrics, analyzability, simplicity.
1 Introduction

A data warehouse is a database aimed at decision making. Data warehouses are built by integrating data from external sources and from internal OLTP (On-Line Transactional Processing) systems. Data warehouses are typically accessed through OLAP (On-Line Analytical Processing) tools, which enable business users to formulate queries and analyze data depending on the decision problem at hand. OLAP tools represent data in a specific, multidimensional format. Since data warehouses are highly strategic in nature, the quality issues raised by such systems are crucial. Previous work on data warehouse quality has often focused on data quality, in particular the quality of source transactional data. However, the quality of the data model, i.e. of multidimensional schemas, is equally important. This paper focuses on multidimensional schema quality evaluation, more specifically on the assessment of and trade-off between analyzability (i.e. the variety of analyses enabled by the multidimensional schema) and simplicity. The evaluation of the quality of multidimensional schemas relies on the concepts used in these schemas, i.e. on the multidimensional model. Therefore, the multidimensional model is central
to our approach. For the sake of this paper, we will assume that a multidimensional schema is defined based on (1) the user requirements in terms of analysis (these requirements may be expressed as queries, lists of attributes, schemas modeled with the ER [1] or Extended ER notation…) and/or (2) the schema of operational data sources (represented with the (E)ER notation for example). Once the multidimensional schema has been defined, it may be implemented in a target OLAP tool. Since many multidimensional models have been defined in the literature [2,3] and no standard has emerged yet, we use a unified multidimensional model based on previous work [4,5]. Our approach for multidimensional schemas quality evaluation adapts the framework defined in the OLTP context for conceptual schemas quality evaluation [6,7]. The framework proposes three views: the first view, named specification, is concerned with the data warehouse designer. The second view, called usage, considers the point of view of the data warehouse user i.e. the decision maker. Finally, the implementation view deals with physical issues and concerns the data warehouse/data mart developer. For each viewpoint, a set of criteria is defined, with associated metrics facilitating automatic or semi-automatic quality assessment. The paper is organized as follows. Section 2 describes related research on data warehouse quality. Section 3 presents the multidimensional model used in our approach. Section 4 defines within the specification view the concepts of analyzability and simplicity in the context of multidimensional schemas modeling. It also addresses the problem of finding the right balance between analyzability and simplicity based on a set of quality criteria and metrics. An evaluation exercise is used to illustrate the problem of balancing analyzability and simplicity and the support provided by the quality approach proposed in the paper. Finally, section 5 concludes and describes further research directions.
2 Related Research

Several approaches dealing with the evaluation of software products exist. They can be summarized according to how they consider the different phases of the software lifecycle, such as software design, software development and maintenance, and data quality. A synthetic presentation of the literature related to quality assessment can be found in [7]. Regarding quality evaluation in data warehouse environments, previous work can be classified into three categories:
– Due to the importance of the quality of operational data sources, and more generally of data quality in data warehouses, many papers address this issue. [8] proposes a risk-based approach to data quality assurance in data warehouses. [9] presents ideas and describes a model to support data quality enhancement in data warehouses.
– The second category includes research dedicated to multidimensional schema quality, reflecting the central role of multidimensional schemas in OLAP environments. These works often focus on the normalization of multidimensional schemas [10,11,12], which is only one aspect of their quality. In particular, correct multidimensional schemas should ensure correct summarization of data at various
[Figure 1 shows a sale-amount cube over the TIME (Day, Week, Month, Quarter, Year), PRODUCT (Product, Subcategory, Category) and GEOGRAPHY (City, District, Region) dimensions, together with the legend for dimensions, dimension levels, dimension level attributes, hierarchies and measures.]
Fig. 1. Multidimensional representation of data
levels of detail [13]. In [14], a set of metrics for evaluating multidimensional schema quality is proposed; however, the metrics are not related to quality criteria and are specific to ROLAP (Relational OLAP) environments.
– The third category describes global frameworks for data warehouse quality evaluation. The DWQ project (Foundations of Data Warehouse Quality) is representative of this approach [15]. DWQ uses and adapts the GQM (Goal-Question-Metric) approach from software quality management.
With respect to previous work, our approach belongs to the second category (multidimensional schema quality evaluation). It explores the criteria of analyzability and simplicity and the trade-off that often has to be made between these two criteria. Our approach may be used independently of any OLAP environment, since it is based on a multidimensional model which is itself independent from any target tool. We define metrics which are related to the quality criteria and may be computed semi-automatically. This approach is more operational than previous work on multidimensional schema normalization, which is often of little help in practice.
3 The Multidimensional Model

Many multidimensional models have been defined in the literature; however, we found no satisfying model encompassing all the important concepts of multidimensional modeling. Therefore, we defined our unified multidimensional model [4,5]. After an introduction to the key multidimensional concepts, we present a simplified version of our unified multidimensional model.

3.1 Multidimensional Concepts

Multidimensional models organize data in (hyper)cubes. Therefore, the key multidimensional concepts are cubes, which represent facts of interest for analysis, and dimensions. Figure 1 shows an example cube, which represents the fact Sale. Sales are analyzed according to three dimensions: time, product and geography. Facts are described by measures. A measure, like the sale amount (in K€) in Figure 1, is typically quantitative data. The measures correspond to the cells of the cube.
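To make the cube, dimension and measure vocabulary concrete, the following sketch builds a tiny sale-amount cube as a flat fact table and aggregates it along one dimension. The table contents are invented for illustration, and pandas is simply one convenient way to perform the aggregation.

# Minimal illustration of a cube as a fact table plus a roll-up, with invented data.
import pandas as pd

sales = pd.DataFrame({
    "day":     ["2002-03-03", "2002-03-03", "2002-03-04", "2002-03-05"],
    "product": ["P1", "P2", "P1", "P3"],
    "city":    ["Paris", "Lyon", "Paris", "Brest"],
    "amount":  [4, 9, 6, 12],     # measure: sale amount (in K EUR)
})

# Roll-up along the geography dimension: aggregate away the City level
by_product_day = sales.groupby(["day", "product"], as_index=False)["amount"].sum()
print(by_product_day)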
[Figure 2 shows the unified multidimensional model in ER notation, relating MULTIDIMENSIONAL_SCHEMA, FACT, MEASURE, DIMENSION, HIERARCHY, CLASSIFICATION_RELATIONSHIP, DIMENSION_LEVEL, DIMENSION_LEVEL_ATTRIBUTE and APPLICABLE_AGGREGATION_FUNCTIONS.]
Fig. 2. Unified multidimensional model
Every dimension may consist of one or several aggregation levels, called dimension levels. Dimension levels are organized in hierarchies, i.e. aggregation paths between successive dimension levels. In Figure 1, "Day→Week→Quarter→Year" is an example hierarchy. Hierarchies are used in conjunction with aggregation functions to aggregate ("roll up") or detail ("drill down") measures. In our example, the sale amount may be totaled at different levels of the Time, Product and Geography dimensions. Often, hierarchies are completed with the special dimension level All, thereby enabling aggregation of measures at the highest possible level (e.g. following the hierarchy "Day→Week→Quarter→Year→All", the total sale amount over the Time dimension may be computed). In addition to being organized in hierarchies, dimension levels may be described by attributes. For example, the dimension level Product is described by the name and weight of the product. Dimension level attributes are not the object of multidimensional analysis, as opposed to measures. Instances of dimension levels are called dimension members. For a given measure in an n-dimensional (hyper)cube, a combination of n dimension members, e.g. (3 March 02, "P1", "Paris"), uniquely identifies a cell and therefore a measure value (4 K€). More specifically, for each axis, the dimension members used as coordinates are instances of the least aggregated dimension level. In the sequel, we will refer to these dimension levels (in our example, Day, Product and City) as "base dimension levels" of their respective dimensions (Time, Product and Geography).

3.2 The Unified Multidimensional Model

Figure 2 represents a simplified version of the unified multidimensional model, using the ER notation. Any multidimensional schema is composed of dimensions and facts, which are interrelated and composed of hierarchies and measures respectively. Dimensions are defined by grouping dimension levels into hierarchies (through classification relationships) and then hierarchies into dimensions.
A classification relationship -e.g. “Day→Week”- links a child dimension level to a parent dimension level. Similarly to [16], we define a hierarchy -e.g. “Day→Week→ Quarter”- as a meaningful sequence of classification relationships where the parent dimension level of a classification relationship is also the child of the next classification relationship. In other words, a hierarchy is a meaningful aggregation path between dimension levels. An aggregation path is “meaningful” if valid sequences of drill-down and/or rollup operations can be performed by following the path. Different hierarchies may share common dimension levels and classification relationships. Dimension levels own dimension level attributes. Facts are composed of measures. Some facts have no measure. Facts are dimensioned by dimension levels. The relationship between a fact and its dimensioning dimension levels is called dimensioning. The definition of applicable aggregation functions to measures along the different hierarchies is crucial. For every measure, for every dimension level dimensioning the measure (i.e. dimensioning the fact which bears the measure), the set of aggregation functions applicable along the different hierarchies starting from the dimension level has to be specified. Our multidimensional model considers the following functions: SUM, AVG, MIN, MAX, MED (median), VAR (variance), STDDEV and COUNT. Following [17,18,19], we distinguish between three classes of aggregation functions. The first class, which includes all aggregation functions ({SUM, AVG, MIN, MAX, MED, VAR, STDDEV, COUNT}), is applicable to measures that can be summed. The second class ({AVG, MIN, MAX, MED, VAR, STDDEV, COUNT}) applies to measures that can be used for average calculations. The last class contains the single function COUNT. While it is often assumed that the aggregation functions applicable to a measure along a dimension level do not depend on the hierarchies starting from this dimension level, the unified multidimensional model explicitly states that the applicable functions depend on these hierarchies. For a given hierarchy, applicable aggregation functions may even depend on the levels of the hierarchy i.e. the aggregation functions are applicable only to the first n levels of the hierarchy. It may also happen that a measure is not summarizable, whatever the aggregation function and the hierarchy. We have defined a graphical notation associated with the unified multidimensional model. This notation is used in Figure 4 in the case study. Dimension levels are represented as 2D rectangles, with their name and attributes. Similarly, facts are represented as 3D rectangles, with their name and measures. Dimensioning relationships are represented as lines and classification relationships are represented with arrows. Among dimension levels, base dimension levels are represented with a double line rectangle instead of a single line rectangle. The name of a dimension is the name of its base dimension level. The graphical multidimensional schema does not include the specification of applicable aggregation functions. This information is specified in a separate table.
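The structure of Figure 2 can be mirrored almost one-to-one in code. The sketch below is one possible in-memory representation; the class and field names are ours (not prescribed by the model), and the aggregation-function classes follow the three classes described above.

# One possible encoding of the unified multidimensional model (illustrative only).
from dataclasses import dataclass, field

ALL_FUNCS  = frozenset({"SUM", "AVG", "MIN", "MAX", "MED", "VAR", "STDDEV", "COUNT"})
AVG_FUNCS  = ALL_FUNCS - {"SUM"}      # measures usable for averages but not for sums
COUNT_ONLY = frozenset({"COUNT"})

@dataclass
class DimensionLevel:
    name: str
    attributes: list = field(default_factory=list)

@dataclass
class Dimension:
    name: str
    hierarchies: list = field(default_factory=list)   # each hierarchy: list of levels, child first

@dataclass
class Measure:
    name: str
    applicable: dict = field(default_factory=dict)    # dimension level name -> applicable function class

@dataclass
class Fact:
    name: str
    measures: list = field(default_factory=list)
    dimensioned_by: list = field(default_factory=list)  # base dimension levels

day, week, quarter, year = (DimensionLevel(n) for n in ("Day", "Week", "Quarter", "Year"))
time_dim = Dimension("Time", hierarchies=[[day, week, quarter, year]])
sale_amount = Measure("sale amount", applicable={"Day": ALL_FUNCS, "Week": ALL_FUNCS})
sale = Fact("Sale", measures=[sale_amount], dimensioned_by=[day])
print(sale.name, [m.name for m in sale.measures], time_dim.name)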
4 Multidimensional Schemas Quality Evaluation Framework

In order to evaluate the quality of multidimensional schemas, we suggest adapting the quality framework proposed in [6], considering the quality assessment problem according to three views:
• the specification view, concerned with the data warehouse designer's objectives,
• the usage view, dealing with the decision maker's requirements,
• the implementation view, related to the developer's concerns.
The specification view captures the degree of fitness of the multidimensional schema with reality and, more specifically, with the users' needs. For the sake of this paper, we concentrate on two key criteria, namely expressiveness and simplicity. Our objective is to answer the question of which multidimensional schema to choose among a set of alternative schemas, taking into account two fundamental needs:
• increasing analyzability, as it is the main objective of a multidimensional schema, and
• decreasing complexity or, conversely, increasing simplicity, as schemas are directly accessed by decision makers.
In the context of multidimensional modeling, we define expressiveness as the richness of the analyses provided by a schema; therefore we adopt the term analyzability instead of expressiveness. Before detailing the measurement of analyzability and simplicity, we present the case study of a video rental store, which shall be used throughout section 4 to illustrate how analyzability and simplicity are assessed and balanced.

4.1 Presentation of the Case Study

The management of a major video (VHS and DVD) rental store has decided to build a data warehouse in order to better analyze the behavior of its customers and anticipate their needs. The information available to the designers is represented in Figure 3. This information consists of:
• the list of the users' needs in terms of analyses (left part of Figure 3),
• the EER schema of the source operational database (right part of Figure 3).
Based on the users' analysis needs and on the operational schema, the multidimensional schema is built. While building this schema, the data warehouse designer has to make several choices which often boil down to trading off between analyzability and simplicity. The resulting alternative schemas are presented in Figure 4.

4.2 Measuring Analyzability

A schema is said to be expressive when it represents user requirements in a natural way and can be easily understood without additional explanation [20]. In the case of multidimensional schemas, expressiveness measures the variety of analyses that decision makers will be able to perform on the schema. We distinguish between fact and schema analyzability.
Measuring fact analyzability. We assume that fact expressiveness depends on:
– the number of measures describing the fact,
– the number of dimensions, i.e. the number of base dimension levels dimensioning the fact,
[Figure 3 summarizes the case study. Users' analysis needs: analyse rentals (in terms of duration and amount paid) by film, medium, customer and date; analyse the opinion of subscribers on the films that they saw, on multiple criteria, e.g. film type, country, continent, producer and medium (VHS or DVD, depending on what was rented); analyse subscriptions and their evolution in time; analyse the characteristics of subscribers (average age, ...). Source operational database (EER schema): RENTAL OPERATION, RENTAL, COPY, MEDIUM, FILM, PRODUCER, OPINION, CUSTOMER, SUBSCRIBER, SUBSCRIPTION FORMULA.]
Fig. 3. Video rental store case study
– the number of dimension levels related to the fact,
– and the aggregation functions that can be applied to the measures of the fact.
To take these four aspects into account we identified four sub-criteria for fact expressiveness, namely: Fact richness, Fact dimensioning, Fact width-and-depth and Fact summarizability.
Fact richness: The underlying assumption for this metric is that the richness of a fact depends on the measure potential captured by its measures. This potential can be calculated with regard to the measure potential of the schema to which F belongs (Local Fact Richness) or with regard to a maximal measure potential calculated over a set of alternative multidimensional schemas proposed for the same reality (Global Fact Richness). The metrics are the following:

Local Fact Richness(F) = NBmeasures(F) / NBmeasures(S)

where F is a fact from a schema S and NBmeasures is a function counting the number of measures contained in either a fact or a schema.
Note that local fact richness only allows the comparison of the measure potential of facts within the same schema. However, we believe that global fact richness is more suitable when the concern is to compare several alternative schemas.

Global Fact Richness(F) = NBmeasures(F) / NBmeasures(S1 ∪ ... ∪ SN)

where F is a fact from a schema S, NBmeasures is a function counting the number of measures contained in either a fact or a schema, and S1 ∪ ... ∪ SN is the union of the set of alternative schemas (S1, ..., SN).
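With a schema represented simply as a mapping from facts to their measure lists, these two ratios are one-liners. The sketch below uses invented fact and measure names for the video-store example, and it interprets the union of the alternative schemas as the set of distinct (fact, measure) pairs, which is an assumption on our part.

# Illustrative computation of local and global fact richness (invented example data).
schema_a = {"Rental": ["duration", "amount paid"], "Subscription": ["subscription fee"]}
schema_b = {"Rental": ["amount paid"], "Opinion": ["rating"], "Subscription": ["subscription fee"]}
alternatives = [schema_a, schema_b]

def nb_measures(schema):
    return sum(len(measures) for measures in schema.values())

def local_fact_richness(schema, fact):
    return len(schema[fact]) / nb_measures(schema)

def global_fact_richness(schema, fact):
    union = {(f, m) for s in alternatives for f, ms in s.items() for m in ms}
    return len(schema[fact]) / len(union)

print(local_fact_richness(schema_a, "Rental"), global_fact_richness(schema_a, "Rental"))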
Fact dimensioning: This criterion assumes that an n-dimensional fact is more expressive than an m-dimensional fact when n > m. Similarly to the fact richness evaluation, and taking into account the same considerations, we propose two metrics for fact dimensioning: a local and a global one.
Local Fact Dimensioning(F) = NBdimensions(F) / max{NBdimensions(Fi) | Fi ∈ S}

where F is a fact from a multidimensional schema S and NBdimensions is a function counting the number of base dimension levels of a fact.
For the global fact dimensioning metric, we consider a union schema in which all the facts appearing in the alternative schemas are represented and each fact is related to a maximal number of dimension levels deduced from the several schemas. Such a union schema is not always correct because the union here is purely syntactic and does not correspond to an integration of the schemas. Global Fact Dimensioning (F) =
Where F is a fact from a multidimensional schema S. NBdimensions is a function counting the number of base dimension levels of a fact F. N MAX (N Bd imensions(Fi ) Fi ∈ * ( Si )) U(Si) is a function calculating the union of the set 1 of alternative schemas (S1, …SN).
N Bdimensions(F )
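A sketch in the same spirit, again with invented counts, shows how the local and global dimensioning metrics differ only in the population over which the maximum is taken; the union-schema construction below is one possible reading of the purely syntactic union just described.

```python
# Invented base-dimension-level counts per fact, for two alternative schemas.
schema1_dims = {"Rental": 4, "Subscription": 2}
schema2_dims = {"Rental": 6, "Opinion": 3}

def local_fact_dimensioning(fact, schema_dims):
    """NBdimensions(F) / MAX(NBdimensions(Fi) for Fi in the same schema)."""
    return schema_dims[fact] / max(schema_dims.values())

def global_fact_dimensioning(fact, schema_dims, alternatives):
    """As above, but the maximum is taken over the facts of the union schema,
    where each fact keeps the largest dimension-level count seen in any alternative."""
    union = {}
    for s in alternatives:
        for f, n in s.items():
            union[f] = max(union.get(f, 0), n)
    return schema_dims[fact] / max(union.values())

print(local_fact_dimensioning("Rental", schema1_dims))   # 4/4 = 1.0
print(global_fact_dimensioning("Rental", schema1_dims,
                               [schema1_dims, schema2_dims]))  # 4/6
```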
Fact width-and-depth: This criterion refines the one concerning fact dimensioning by considering not only the dimension levels dimensioning a fact, but also the dimension levels directly or indirectly linked to these dimension levels by classification relationships.
Fact summarizability: This criterion is related to the applicability of aggregation functions to the measures of a given fact. Let us consider the example of Figure 4. For the fact "Rental", several aggregation functions (SUM, AVG, COUNT, etc.) are applicable to the measure "amount paid". Moreover, these functions are applicable at each of the dimension levels directly or indirectly related to the fact "Rental". However, the "number of days" of a rental may not be summed along the dimension level Copy (e.g. if a customer has rented two copies simultaneously, the duration of his rental is not the sum of the durations of rental of the two copies); therefore, in this case an aggregation function like AVG or MAX may be used. To evaluate fact summarizability, we associate values with the classes of aggregation functions presented in Section 3 ({SUM, AVG, MIN, MAX, MED, VAR, STDDEV, COUNT} has the highest value, {AVG, MIN, MAX, MED, VAR, STDDEV, COUNT} a lower value, and so on). The metric proposed for fact summarizability is based on both the class of functions that can be applied and the number of dimension levels for which this application makes sense.

Local Fact Summarizability(Fk) = Σj Σi FuncApp(DLi, Mjk) / Σk Σj Σi FuncApp(DLi, Mjk)

Where Fk is a fact from a multidimensional schema S. FuncApp(DLi, Mjk) is a function associating a value for the applicability of aggregation functions to the measure Mjk and the dimension level DLi. Mjk is a measure belonging to the fact Fk and DLi is a dimension level related to Fk.

For the global fact summarizability metric, we consider again a union schema.

Global Fact Summarizability(Fk) = Σj Σi FuncApp(DLi, Mjk) / Σl Σj Σi FuncApp(DLi, Mjl)

Where Fk is a fact from a multidimensional schema S. FuncApp(DLi, Mjk) is a function associating a value for the applicability of aggregation functions to the measure Mjk and the dimension level DLi. Mjk is a measure belonging to the fact Fk and DLi is a dimension level related to Fk. The index l ranges over the facts of the union schema.
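Since the paper does not fix the numeric values attached to the classes of aggregation functions, the following sketch simply assumes some scores and an illustrative FuncApp table to show how the local summarizability ratio is obtained; none of these values come from the case study.

```python
# Assumed, illustrative scores for classes of applicable aggregation functions;
# the paper only states that larger classes get larger values.
CLASS_SCORES = {"all": 4, "no_sum": 3, "order_only": 2, "count_only": 1}

# FuncApp(DL, M): applicability score per (fact, measure) pair and dimension level.
# The pairs and scores below are illustrative assumptions, not the case-study values.
func_app = {
    ("Rental", "amount paid"): {"Date": CLASS_SCORES["all"], "Copy": CLASS_SCORES["all"]},
    ("Rental", "number of days"): {"Date": CLASS_SCORES["all"], "Copy": CLASS_SCORES["no_sum"]},
    ("Subscription", "subscription fee"): {"Date": CLASS_SCORES["all"]},
}

def fact_score(fact):
    """Sum of FuncApp over all measures of the fact and all related dimension levels."""
    return sum(sum(levels.values()) for (f, _m), levels in func_app.items() if f == fact)

def local_fact_summarizability(fact):
    """Share of the fact's applicability scores within the whole schema."""
    total = sum(sum(levels.values()) for levels in func_app.values())
    return fact_score(fact) / total

print(local_fact_summarizability("Rental"))  # 15 / 19 with the assumed scores above
```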
4.2.2 Measuring Schema Analyzability
For schema analyzability, we suggest an average calculated for each of the fact expressiveness sub-criteria. We will not detail the metrics here due to space limitations.
4.3 Measuring Simplicity
Our measure of simplicity is based on the following assumptions:
• simplicity decreases with the facts/measures ratio (respectively the dimension levels/attributes ratio),
• simplicity decreases with the number of relations between concepts.
We have therefore identified three metrics of simplicity, namely: Fact based simplicity, Dimension level based simplicity and Relationships based simplicity.
4.3.1 Measuring Fact Based Simplicity
If we assume that simplicity decreases with the number of facts, one means to increase simplicity is to merge facts so as to decrease their number in the schema and consequently increase the number of measures per fact. This leads to the metric:

Fact based simplicity(S) = 1 − NB(F) / Σi=1..NB(F) NB(Mi)
Where NB(F) is the number of facts from a schema S and NB(Mi) the number of measures from a fact Fi. Facts with no measure are not considered.
4.3.2 Measuring Dimension Levels Based Simplicity
Similarly to the previous metric, this metric considers that one means to decrease the number of dimension levels is to merge dimension levels and group their attributes. The metric is the following:

Dimension level based simplicity(S) = 1 − NB(DL) / Σi=1..NB(DL) NB(Atti)

Where NB(DL) is the number of dimension levels from a schema S and NB(Atti) the number of attributes of a dimension level DLi.
4.3.3 Measuring Relationships Based Simplicity
This metric is based on the assumption that relationships are more complex than independent concepts, as their understanding requires all the related concepts. The corresponding metric is the following:

Relationships based simplicity(S) = 1 − (NB(F) + NB(DL)) / (NB(F) + NB(DL) + NB(link))
Where NB(F), NB(DL) are respectively the number of facts and dimension levels in a schema S. NB(link) is the number of links (dimensioning relationships and classification relationships).
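All three simplicity metrics rely only on element counts, so they can be computed directly; the sketch below does this for a schema summarised by its counts. The sample numbers are assumptions and do not correspond to the schemas of Figure 4.

```python
def fact_based_simplicity(nb_facts, measures_per_fact):
    """1 - NB(F) / sum_i NB(Mi); facts without measures are excluded."""
    return 1 - nb_facts / sum(measures_per_fact)

def dimension_level_based_simplicity(nb_levels, attributes_per_level):
    """1 - NB(DL) / sum_i NB(Att_i)."""
    return 1 - nb_levels / sum(attributes_per_level)

def relationships_based_simplicity(nb_facts, nb_levels, nb_links):
    """1 - (NB(F) + NB(DL)) / (NB(F) + NB(DL) + NB(link))."""
    return 1 - (nb_facts + nb_levels) / (nb_facts + nb_levels + nb_links)

# Assumed counts for a small example schema: 2 facts with 2 and 1 measures,
# 5 dimension levels with 12 attributes in total, and 9 links.
print(fact_based_simplicity(2, [2, 1]))                      # 1 - 2/3  = 0.33
print(dimension_level_based_simplicity(5, [3, 2, 4, 2, 1]))  # 1 - 5/12 = 0.58
print(relationships_based_simplicity(2, 5, 9))               # 1 - 7/16 = 0.5625
```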
4.4 An Assessment Exercise We will now apply the metrics presented above to compare three alternative multidimensional schemas for the video rental store case study. The alternative schemas are presented in Figure 4 below.
In the first schema, what counts for analysis is the rental of copies, not the rental operations themselves. The rental date is defined as a dimension level to be able to analyze rentals over time. The amount paid for each rental is computed from the operational database. Both Subscriber and Customer from the EER schema are mapped to the dimension level Customer. From the attribute subscriber address, only the city is interesting, and it is defined for subscribers only. The dimension level Customer type captures the distinction between ordinary customers and subscribers and, for the latter, the type of subscription. Since the users want to analyze subscriptions, these are represented by a fact. The dimension level Film allows aggregations on this criterion. Similarly, the dimension levels Country and Continent are added. The designer chooses to represent the N-N relationship with producers as a fact without measures.
In the second schema, rental operations are considered relevant for analysis and are represented explicitly. Days, months and years are represented with different dimension levels. Both Customer and Subscriber are represented as dimension levels. The inheritance link of the EER schema is also mapped by defining the dimension level Customer type. The age of subscribers is represented both as a dimension level attribute and as a dimension level by defining age categories. Only the main producer of each film is represented.
The third schema is the simplest. Several pieces of information are omitted or represented as dimension level attributes, which will restrict analysis possibilities (e.g. subscriptions are not represented explicitly as a fact but as attributes of the dimension level Subscriber). The problem of the N-N relationship between Film and Producer in the EER schema is solved by omitting this relationship and, consequently, the entity Producer.
Fig. 4. Alternative multidimensional schemas for the video rental store example
Table 1 summarizes the quality metric based calculations. Note that the metrics applied are those defined at the schema level.

Table 1. A comparative evaluation of alternative multidimensional schemas

Quality criterion                                       Schema 1   Schema 2   Schema 3
Analyzability   Fact Richness                           0.25       0.33       0.37
                Fact Dimensioning                       0.38       0.62       0.35
                Summarizability                         0.33       0.55       0.16
Simplicity      Fact Based Simplicity                   0.25       0.25       0.33
                Dimension Level Based Simplicity        0.50       0.12       0.58
                Relationship Based Simplicity           0.67       0.68       0.73
These calculations show that schema 2 is the most analyzable, but it is also the least simple. Indeed, schema 2 offers the widest analysis possibilities through the multiplication of dimension levels, which also decreases its simplicity. Schema 3 is the least analyzable and the simplest. Table 1 also shows that schema 1 could be the suggested solution: it has medium values for both analyzability and simplicity and could consequently be a good compromise between simplicity and analyzability.
5 Conclusion and Further Research
In order to take quality aspects into account in decision systems engineering in general, and in multidimensional modeling in particular, we proposed an approach based on a unified multidimensional model and a multidimensional schema quality evaluation framework. The framework consists of three viewpoints, namely the specification view (for the data warehouse designer), the usage view (for the decision maker) and the implementation view (for the data warehouse/data mart developer). Focusing on the specification view, we addressed the problem of assessing and balancing multidimensional schema analyzability and simplicity. To this end, we presented and illustrated a set of criteria and associated metrics, which can be computed in a (semi-)automated way. Our approach is especially useful when the data warehouse designer has to choose between several alternative designs. Further work concerns the empirical and formal validation of the metrics proposed in this paper (e.g. formal verification of the properties of the metrics, clarification of the relationship between analyzability and simplicity using a large set of schemas). We are currently working on these issues.
Acknowledgements. We thank our colleagues from CEDRIC-CNAM, more specifically Jacky AKOKA and Isabelle COMYN-WATTIAU, for their helpful comments on this paper. This work is partly funded by the Research Center of ESSEC.
References
1. Chen, P.P.: The entity-relationship model – toward a unified view of data. ACM TODS, volume 1, number 1, March 1976
2. Blaschka, M., Sapia, C., Höfling, G., Dinter, B.: Finding your way through multidimensional data models. DEXA Workshop on Data Warehouse Design and OLAP Technology (DWDOT ’98), Vienna, Austria, 1998
3. Vassiliadis, P., Sellis, T.: A survey of logical models for OLAP databases. SIGMOD Record, volume 28, number 4, December 1999
4. Akoka, J., Comyn-Wattiau, I., Prat, N.: Dimension hierarchies design from UML generalizations and aggregations. 20th International Conference on Conceptual Modeling (ER 2001), Yokohama, Japan, November 2001
5. Prat, N., Akoka, J., Wattiau, I.: A data warehouse design method based on UML, to be submitted for publication
6. Si-Saïd, S., Akoka, J., Comyn-Wattiau, I.: Conceptual Modeling Quality – From EER to UML Schemas Evaluation. Proceedings of the 21st International Conference on Conceptual Modeling (ER 2002), LNCS, Springer
7. Si-Saïd, S., Akoka, J., Comyn-Wattiau, I.: Measuring UML Conceptual Modeling Quality – Method and Implementation. Proceedings of the BDA Conference, Ed. P. Pucheral, Collection INT, France, 2002
8. Hufford, D.: Data warehouse quality: special feature from January 1996. DM Review, January 1996
9. Ballou, D.P., Tayi, G.K.: Enhancing data quality in data warehouse environments. Communications of the ACM, volume 42, number 1, January 1999
10. Lechtenbörger, J., Vossen, G.: Multidimensional normal forms for data warehouse design. Information Systems, volume 28, number 5, 2003
11. Lehner, W., Albrecht, J., Wedekind, H.: Normal forms for multidimensional databases. 10th International Conference on Statistical and Scientific Database Management (SSDBM ’98), Capri, Italy, July 1998
12. Levene, M., Loizou, G.: Why is the snowflake schema a good candidate for data warehouse design? Information Systems, volume 28, number 3, 2003
13. Lenz, H.-J., Shoshani, A.: Summarizability in OLAP and statistical data bases. 9th International Conference on Statistical and Scientific Database Management (SSDBM ’97), Olympia, Washington, USA, August 1997
14. Calero, C., Piattini, M., Pascual, C., Serrano, M.A.: Towards data warehouse quality metrics. 3rd International Workshop on Design and Management of Data Warehouses (DMDW 2001), Interlaken, Switzerland, June 2001
15. Jarke, M., Jeusfeld, M., Quix, C., Vassiliadis, P.: Architecture and quality in data warehouses: an extended repository approach. Information Systems, volume 24, number 3, 1999
16. Tsois, A., Karayannidis, N., Sellis, T.: MAC: conceptual data modeling for OLAP. 3rd International Workshop on Design and Management of Data Warehouses (DMDW 2001), Interlaken, Switzerland, June 2001
17. Lehner, W.: Modeling large scale OLAP scenarios. 6th International Conference on Extending Database Technology (EDBT ’98), Valencia, Spain, March 1998
18. Pedersen, T.B., Jensen, C.S.: Multidimensional data modeling for complex data. 15th International Conference on Data Engineering (ICDE ’99), Sydney, Australia, March 1999
19. Rafanelli, M., Ricci, F.: Proposal of a logical model for statistical databases. 2nd International Workshop on Statistical Database Management (SSDBM ’83), Los Altos, California, September 1983
20. Batini, C., Ceri, S., Navathe, S.B.: Conceptual database design: an entity relationship approach. Benjamin/Cummings, Redwood City, California, 1992
Conceptual Modeling of Accounting Information Systems: A Comparative Study of REA and ER Diagrams Geert Poels Faculty of Economics and Business Administration Ghent University Hoveniersberg 24, 9000 Gent, Belgium [email protected]
Abstract. The Resource-Event-Agent (REA) accounting model is an enterprise domain ontology for developing accounting systems. As a semantic data model, the REA accounting model is based on the Entity-Relationship (ER) metamodel, but contains additional ontological primitives and axioms, and associated modeling guidelines, that help in constructing and validating conceptual models of accounting information systems. In this paper we evaluate REA modeling by means of a controlled experiment. The goal of this empirical investigation is to assess and compare the pragmatic quality of REA and ER diagrams when used as conceptual models of transaction-oriented business processes. We hope that this work contributes towards an improved understanding of the benefits of using the REA accounting model as a quality assurance tool for the conceptual modeling of accounting information systems.
1 Introduction
Quality has been identified as a main topic in contemporary conceptual modeling research [1]. Improving the quality of conceptual representations provides a major opportunity to improve the productivity of information systems development [2]. Although the need for more quality-related process guidelines and quality assurance of modeling processes has been emphasized in a number of quality frameworks [3-4], most of today’s quality research initiatives in conceptual modeling focus on product quality [5]. Existing research defines criteria and measures for evaluating the quality of models but not how to develop such models in a high quality manner [2]. There clearly is a need for more method evaluation research in the conceptual modeling field, which is generally true for IS development methods [6]. In this paper we evaluate McCarthy’s REA accounting model [7-8]. This data model and associated modeling method is an extension of Chen’s Entity-Relationship (ER) approach [9]. It has been specifically proposed for the conceptual modeling of transaction-oriented business processes with the aim of establishing a sound basis for the development of an integrated and enterprise-wide accounting information system (AIS). The inclusion of REA in two standard AIS textbooks [10-11] has resulted in a widespread diffusion of REA-oriented development of accounting systems in AIS education [12]. Although adoption in practice is still limited to a few companies,
including IBM-Japan and Price-Waterhouse Consulting [13], over twenty years after its conception, REA modeling is getting increasing attention and recognition [14]. Moreover, continuing research efforts have resulted in the extension of the REA semantic data model into a comprehensive enterprise domain ontology, called the REA enterprise information architecture [15-17]. In spite of its importance in the AIS field, empirical validations of REA are scarce. According to [13] most validation studies are instances of ‘design science’ aimed at building prototypes to demonstrate the feasibility of implementing REA conceptual models using various types of database and knowledge-base technologies. Apart from these proofs of concept, the only ‘natural science’ research efforts related to REA mentioned in [13] are in the accounting realm and do not focus on REA conceptual modeling. Nevertheless, REA researchers, amongst them McCarthy, recognize that the more promising REA research is work on the empirical end [12]. In our opinion, this work should include also investigations of the usefulness of the REA accounting model for conceptual modeling. In this paper we present an empirical study that evaluates the REA modeling of transaction-oriented business processes from the perspective of the user who needs to understand the conceptual models. Such users include database designers and system developers, but also the information analysts and AIS end-users (e.g. accountants, controllers, auditors, financial managers) that need to specify, analyse and validate the conceptual representations of business activities and rules. Our study takes the form of a laboratory experiment in which we evaluate the pragmatic quality of REA diagrams (i.e. the product of a REA modeling effort) as compared to ‘standard’ ER diagrams. Pragmatic quality is a quality dimension taken from the framework of Lindland, Sindre, and Sølvberg [18] that captures how well a conceptual model is understood by its users. In the experiment we wished to corroborate the hypothesis that REA diagrams are better understood than equivalent ER diagrams that model the same real-world phenomena. As our focus is primarily on business users of AIS conceptual models, the experimental subjects are business administration students enrolled in a post-graduate AIS course, which we consider as being a more representative sample of the population under study than a group of computer science students. The rest of this paper is structured as follows: Section 2 reviews the main concepts of the REA accounting model. Section 3 describes the definition, planning and operation of the experiment. Section 4 presents the analysis and interpretation of the data that was collected and evaluates study validity. Finally, section 5 contains conclusions.
2 The REA Accounting Model The R(esource)-E(vent)-A(gent) framework of McCarthy [7-8] is based on the concept of ‘economic exchange’, i.e. the process of giving up some resources to obtain others [15]. The REA accounting model provides the constructs to model such transaction-oriented business processes (or transaction cycles as they are called by accountants [12]) and to integrate these models into a conceptual representation of a company’s value cycle [14], value chain [12] or inter-enterprise value system [19].
The foundation of the REA accounting model is the ER model. Three categories of entities are distinguished: (1) the events that occur as part of the exchange, (2) the resources that are affected, and (3) the participating agents. Apart from this classification, the REA accounting model imposes a normative structure on the relationships between events [13]. Economic duality relationships between event entities reflect the mirror-image nature of exchanges, where one resource is given up (outflow event) for another (inflow event). Connections between resource entities and event entities are represented by stock-flow relationships. Participation relationships link agents to the events they participate in. The stereotypical pattern of an economic exchange is described in the basic REA template (Fig. 1(a)), which can be instantiated to any transaction cycle (see e.g. Fig. 1(b)). Cardinality constraints can be added to express common business practices or specific business policies [11].
(Figure 1 is graphical. Part (a) shows the basic REA template: the resources Resource A and Resource B are connected through inflow and outflow stock-flow relationships to the paired events "Get resource A" and "Give up resource B", which are linked by an economic duality relationship and connected by participation relationships to internal and external agents. Part (b) instantiates the template for the expenditure cycle of a retail company, with the resources Inventory and Cash, the events Purchases and Cash Disbursements linked by a Payment_for duality relationship, and the agents Vendor, Purchase agent and Cashier.)
Fig. 1. (a) Basic REA template (based on Fig. 6-2 in [11]) (b) REA diagram of expenditure cycle for a retail company
Associated with the basic REA template are modeling rules and guidelines like normative axioms (e.g. all events effecting an outflow must eventually be paired in duality relationships with events effecting an inflow and vice-versa [15]), optional naming conventions for relationships and usual placement conventions (e.g. three left-to-right columns for respectively resources, events, and agents [11]). Any ER diagram that is obtained through the instantiation of the basic REA template, including the application of the REA modeling rules and guidelines, is called a REA diagram (as in [11]). Such REA diagrams may also include other types of relationships, not shown in the basic REA template, like custody relationships between agents and resources [15]. Amongst the mentioned benefits of REA modeling is the support for conceptual design [15]. The ontological primitives and axioms help assuring the perceived semantic quality of the transaction cycle models (i.e. validity and completeness of the model with respect to the modeled domain as it is known [3]). Moreover, the reusable stereotypical pattern and readability guidelines inherent in the basic REA template are meant to improve the pragmatic quality of the models, where the main goal is comprehension [20]. According to quality frameworks for conceptual models, structure [18] and diagram layout and aesthetics [3, 20] are means to achieve this goal. It is especially this claim of improved comprehension (as compared to using the standard ER approach) that is investigated in this paper.
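The duality axiom just cited lends itself to a mechanical check on a diagram. The sketch below is a simplified, illustrative validator; the way events and duality links are encoded here is our own assumption and not part of the REA specification.

```python
# A REA diagram reduced to its events and duality links (illustrative encoding).
events = {
    "Purchases": "inflow",            # the company receives Inventory
    "Cash Disbursements": "outflow",  # the company gives up Cash
}
duality_links = [("Purchases", "Cash Disbursements")]

def unpaired_events(events, duality_links):
    """Return events that are not paired with a dual event of the opposite flow type."""
    paired = set()
    for a, b in duality_links:
        if {events[a], events[b]} == {"inflow", "outflow"}:
            paired.update({a, b})
    return [e for e in events if e not in paired]

print(unpaired_events(events, duality_links))  # [] -> the duality axiom holds here
```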
3 Preparing the Experiment
Using the framework suggested by Wohlin et al. [21], the experiment can be defined as follows. The object of study is a diagrammatic conceptual representation of a single transaction-oriented business process. The purpose of the experiment is to evaluate the REA accounting model with respect to its pragmatic quality, using the ER model as a benchmark. This evaluation is performed from the point of view of the AIS researcher. The context of the experiment is a university, graduate-level AIS course, using course participants as subjects and conceptual models of a fictitious company’s finance cycle as objects. It is organized as a classroom experiment and can be classified as a synthetic environment experiment [22] that is run off-line, i.e. not as (part of) a course or real project [21]. In this section we describe the planning and operation of the experiment. In the next section, the data obtained through the experiment is analyzed, the results of the data analysis are interpreted, and their validity is evaluated.
3.1 Variables Selection and Hypotheses
For the purpose of the experiment we define the pragmatic quality of a REA or ER diagram in terms of (human) user comprehension (to differentiate it from technical pragmatic quality, which considers how well models are interpreted by technical actors [23]). The concept of user comprehension is operationalized using two variables: (1) the accuracy of comprehension, measured as the correctness of answers to comprehension questions about the diagrams, and (2) comprehension time, defined as the time required to perform the comprehension task (i.e. answering the questions). These variables are representative criteria of two orthogonal performance-based dimensions, described respectively as effectiveness and efficiency, against which to evaluate methods, as proposed in Moody’s Method Evaluation Model [24]. Since the experiment aims at investigating the effect of the type of diagram on the response variables, the ‘diagram type’ is the main factor under investigation and its treatments are REA and ER. Alternatively, since the REA accounting model is based on the ER approach, we could consider REA as the treatment and ER as the control. Given the alleged benefits of the REA accounting model, the direction of the effect has been specified in the experimental hypotheses. The first hypothesis set relates to the effectiveness of the REA accounting model as a conceptual modeling tool that increases user understanding:
H0a: There is no significant difference in accuracy of comprehension between a REA diagram and an ER diagram;
H1a: The accuracy of comprehension of a REA diagram is significantly higher than that of an ER diagram.
The second hypothesis set is related to the efficiency of understanding REA diagrams as compared to ER diagrams:
H0b: There is no significant difference in comprehension time between a REA diagram and an ER diagram;
H1b: The comprehension time of a REA diagram is significantly less than the comprehension time of an ER diagram.
In all of the above hypotheses it is assumed that the REA diagram and ER diagram that are compared are alternative conceptual representations of the same real-world phenomena, which is an issue we should take into account when preparing the experimental materials and selecting an appropriate experimental design.
3.2 Subjects
The subjects that participated in this study were 21 students enrolled in the AIS course of the international Master in Accounting program, organized by Ghent University. These students had diverse backgrounds, representing seven countries and three continents, aged between 22 and 32 years, and having various levels of working experience (including no experience). As business administration students, nearly all of them had previously taken a MIS course. As part of the AIS course, students attended a 4.5-hour session on ER modeling of transaction-oriented business processes, during which they were also introduced to the REA accounting model and learned to instantiate the basic REA template. Students were taught the principles of the REA event categories, the structured relationships of the basic REA template, and the usual placement conventions as found in textbooks like [11]. They were, however, not aware of the full details of the REA ontology, nor did this course session focus on the added value of using REA diagrams to model transaction-oriented business processes. During subsequent course sessions on the nature and structure of transaction cycles for various types of company, frequent use was made of conceptual models to illustrate ideas and stimulate discussion. Most of these examples were REA diagrams, but ER diagrams were also used that implemented only parts of the basic REA template and applied only some of the modeling rules and guidelines. Although it was hoped that students would appreciate the advantages of the REA accounting model when constructing or analyzing conceptual models in the AIS domain, this was not explicitly emphasized during the course. Moreover, the course required them to understand and develop both types of diagram. We therefore do not believe that the experimental subjects were strongly biased towards the use of REA modeling. We are aware, though, that this certainly is an issue that must be discussed when evaluating the study validity. The experiment took place at the end of the course and was part of the course examination. The students were familiar with the type of questions asked (i.e. the comprehension questions), but the transaction cycle modeled (see below) was new to them. They were not aware that they were experimental subjects.
3.3 Objects and Experimental Design
Because of the heterogeneous nature of the student group, a between-subjects design with randomized allocation to the two treatments could have introduced considerable error variance due to differences among subjects. As there was little information available on the ability, prior knowledge, and relevant working experience of the
students, we could not resort to blocking or matching techniques. We therefore chose a within-subjects experiment. This choice implies that each experimental subject contributes an observation for each treatment, which is an additional advantage given the small sample size. Hence each subject had to perform the experimental tasks on both types of diagram. Since each REA diagram is also an ER diagram, we could not provide subjects with alternative diagrams of the same reality. Therefore we used conceptual representations of two different business processes, stock issuance and loan acquisition, involving sequences of transactions that, in AIS terms, build a finance transaction cycle. Although both processes are similar in scope, structure, and goal (i.e. acquiring funds for the company), the process used is an independent variable that could have introduced a confounding effect if only two experimental objects were used (i.e. an ER diagram for one process and a REA diagram for the other process) and if one process is inherently more comprehensible than the other. To control this instrumentation effect it was decided to use four experimental objects, meaning a REA diagram and a standard ER diagram for both the stock issuance and the loan acquisition business processes. To control for further distorting effects, like a learning effect, we used a counterbalancing procedure where half the subjects were given an ER diagram of the loan acquisition process in a first experimental trial and a REA diagram of the stock issuance process in the second trial. The other half of the subjects received a REA diagram of the loan acquisition process in the first trial and an ER diagram of the stock issuance process in the second trial. Our experimental design can thus be described as a standard within-subject 2 x 2 factorial design with random allocation of subjects to two groups, hereafter referred to as groups A and B (Table 1). Such a design does not allow separating the effects that the experimental run (i.e. trial) and the process modeled have on user comprehension (a 3 x 2 design would be necessary for that). It should be noted, though, that the factor under investigation is the type of diagram. The design was chosen to control the confounding effects of the other independent variables, not to investigate these learning and instrumentation effects in detail.

Table 1. Experimental design used
Trial / Process           Treatment (type of diagram)
                          ER diagram     REA diagram
1 / Loan acquisition      A              B
2 / Stock issuance        B              A
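As an aside, the counterbalanced allocation of Table 1 can be operationalised with a simple random split of the subject pool; the snippet below sketches such an assignment with invented subject identifiers.

```python
import random

subjects = [f"S{i:02d}" for i in range(1, 22)]   # 21 hypothetical subject codes
random.shuffle(subjects)
group_a, group_b = subjects[:11], subjects[11:]  # random allocation to two groups

# Trial 1 uses the loan acquisition process, trial 2 the stock issuance process.
plan = {
    ("trial 1", "ER diagram"): group_a, ("trial 1", "REA diagram"): group_b,
    ("trial 2", "REA diagram"): group_a, ("trial 2", "ER diagram"): group_b,
}
for (trial, treatment), group in plan.items():
    print(trial, treatment, len(group), "subjects")
```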
Another confounding effect could be caused by differences in the degree of difficulty of the experimental tasks (see below) to be performed on the four experimental objects. Especially with respect to the use of ER or REA diagrams, such an effect could pose a serious threat to the study’s validity. To control for this effect we wished to use exactly the same comprehension questions, except perhaps for some rewording, for the alternative diagrams of the same process. This required ensuring the semantic equivalence of both types of diagram. This was done by deriving the ER diagram from the REA diagram using three types of modification: (1) reifying the economic duality relationship, (2) not obeying the naming conventions for the stock-flow and participation relationships, and (3) not adhering to the usual
placement conventions regarding the sequential top-down ordering of event entities and the left-to-right column classification of resource, event and agent entities. We wish to remark here that these modifications are relatively limited in scope and that we did not wish to depart from basic REA principles that ensure the correctness and completeness of the transaction cycle representation. Although many of the claimed benefits of the REA accounting model relate to such issues of (perceived) semantic quality, the quality perspective taken here is that of pragmatic quality. The modifications therefore primarily aim at hindering the quick and easy recognition of the underlying REA economic exchange pattern in the ER diagrams used. Another reason for not altering the REA diagrams too much when deriving ER diagrams is that previous empirical studies have demonstrated that an ER diagram’s size (mainly in terms of number of entities) and structural complexity (mainly in terms of number of relationships) are negatively correlated to the diagram’s ease of understanding [2, 25]. Hence a difference in size and structural complexity between semantically equivalent ER and REA diagrams could be another confounding effect. Table 2 presents size and complexity metrics for the experimental objects. As can be seen in the table, the differences amongst diagrams of a same process are not of an order that can be considered as influential. Table 2. Size and structural complexity of the experimental objects
Size and complexity metrics           REA-loan   ER-loan   REA-stock   ER-stock
Number of entity types                    7          8          6          7
Number of 1:1 relationship types          0          1          1          1
Number of 1:N relationship types          7          7          5          6
Number of N:M relationship types          2          2          3          3
3.4 Tasks
For each experimental object there was a questionnaire with 13 questions that were aimed at assessing the subject’s comprehension of the conceptual model. There were questions about the correctness of the business activities or business rules modeled (relative to textual descriptions of the parts of the model involved). An example question is: Is the conceptual model correct given that in reality each dividend payment transaction relates to exactly one stock issuance? There were questions on the right interpretation of the business policies and practices modeled. For example: what specific business policy is reflected by the structural constraints on the relationships between stock issuance and dividend payment? There were also questions about the completeness of the transaction cycle that was modeled. An example here is: Is the business cycle complete in the sense that for every business activity there is an employee that can be held accountable? Other questions related to the identification of management and AIS concepts like the value cycle and give-to-get relationships (e.g. Identify transaction types for which the economic dual transaction is not modeled). Because of the semantic equivalence between the REA diagram and the ER diagram of the same business process, the same questionnaire could be used for both types of diagram, giving us effective control over possible differences in the degree of
difficulty of the experimental tasks. The questions for the two processes were also very similar and generally differed only with respect to the names of the entity and relationship types. A potential confounding effect due to the order of presentation (i.e. solving the same kind of questions again) could arise here, but it is controlled with the counterbalancing procedure and cannot be distinguished from the effects that the experimental trial and the process modeled have on the response variables. At the same time, we believe that the limited number of questions, easily answered in about half an hour (see below), reduced the probability of a fatigue effect occurring.
3.5 Operation and Data Collection Procedures
The experiment took place in a classroom where there was sufficient space for each subject. Each subject sat next to a subject from the other group. It was strictly controlled that no interaction whatsoever between subjects occurred. The experimenter was present at all times and questions could only be directed to him. There was no real time limit on answering the comprehension questions (avoiding a ceiling effect), though the more time a student spent on the experimental tasks, the less time remained for doing the (rest of the) exam. As a guideline, students were told that it was perfectly feasible to fill in a questionnaire in less than 30 minutes. Since there was no break between experimental trials, the experimental tasks could be performed in about one hour. By allowing no break we had stronger control over the subjects. On the other hand, it increases the probability of a persistence or maturation effect, especially for the group that was asked to understand a REA diagram first (i.e. group B). There is a chance that these subjects might find it easier in the second run to recognize the REA economic exchange pattern in the ER diagram than the subjects of group A, who were confronted with an ER diagram in their first run. Since a possible persistence effect would make it harder to corroborate our alternate hypotheses (favoring REA diagrams above ER diagrams), we do not consider it a serious threat to the study’s validity. After they finished their first questionnaire, students had to hand in all experimental materials (experimental object, questionnaire and answers) of the first run and received the experimental object and questionnaire of the second run. The comprehension time (i.e., the time between receiving the experimental materials of a run and handing in the answers) was measured by the experimenter (in minutes), who also calculated the ratio of correct answers to the (13) questions, i.e. our measure for accuracy of comprehension.
4 Data Analysis and Discussion of Results
4.1 Analysis and Interpretation
We first test hypothesis H1a, related to the effectiveness of the REA accounting model in terms of accuracy of comprehension. Descriptive parametric and non-parametric statistics for the ER and REA diagram correctness data are presented in Table 3.
Table 3. Descriptive statistics correctness data

Correctness data                              ER diagrams         REA diagrams
Number of observations                        21                  21
Mean                                          0.5635              0.6329
Standard deviation                            0.21635             0.14650
95% confidence interval of mean               0.4650 to 0.6620    0.5663 to 0.6996
Median                                        0.5417              0.6250
Interquartile range (upper/lower quartile)    0.2917              0.1667
95% confidence interval of median             0.4167 to 0.7083    0.5417 to 0.7083
In order to evaluate the significance of the observed difference, we applied a statistical test with a significance level of 5%, i.e. α = 0.05. Formal tests for normality of the differences between the paired samples data only weakly support the assumption of a normal distribution (Shapiro-Wilk W statistic: 0.9174, p-value: 0.0770; Kolmogorov-Smirnov statistic: 0.6557, p-value > 0.15). Moreover, the skewness of the distribution is close to one (0.9601, p-value: 0.0578), meaning more observations in the left tail than normal. The kurtosis is even significantly larger than one (2.6891, p-value: 0.0381), meaning that more observations are clustered around the mean, with fewer in the tails, than normal. Because of these indications of a non-normal distribution, we used the less powerful Wilcoxon signed rank test for the difference in medians, which is a non-parametric alternative to the paired samples t-test. The result of the one-tailed test (see Table 4) allows rejecting the null hypothesis (H0a), meaning that we empirically corroborated that the accuracy of comprehension of the REA diagrams is significantly higher than that of the ER diagrams.

Table 4. 1-tailed Wilcoxon’s signed rank test for differences in median correctness ratio (REA diagram versus ER diagram; α = 0.05)
Response variable            Size of the effect detected   95% CI           Wilcoxon’s W-statistic   1-tailed p-value
Accuracy of comprehension    0.0625                        0.0208 to +∞     116.5                    0.0055
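With paired per-subject correctness ratios in hand, a one-tailed Wilcoxon signed rank test of this kind can be reproduced with standard statistical libraries. The sketch below uses SciPy on invented example data (the study's raw per-subject scores are not published in the paper), assuming a SciPy version recent enough to accept the `alternative` argument.

```python
import numpy as np
from scipy import stats

# Invented paired correctness ratios for illustration (one pair per subject);
# these are NOT the study's raw data.
rng = np.random.default_rng(0)
er_scores = rng.uniform(0.3, 0.8, size=21)
rea_scores = np.clip(er_scores + rng.normal(0.07, 0.1, size=21), 0, 1)

# One-tailed Wilcoxon signed rank test: H1 states that REA correctness exceeds ER correctness.
statistic, p_value = stats.wilcoxon(rea_scores, er_scores, alternative="greater")
print(f"W = {statistic:.1f}, one-tailed p = {p_value:.4f}")
```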
Table 5 shows descriptive statistics of the comprehension time data for the ER and REA diagrams. As the differences between the paired samples comprehension time data show a near-normal distribution (skewness: -0.1689, p-value: 0.7209; kurtosis: 0.8532, p-value: 0.3145; Shapiro-Wilk W statistic: 0.9412, p-value: 0.2300; Kolmogorov-Smirnov statistic: 0.5863, p-value > 0.15), we applied the one-tailed paired samples t-test for the difference in means, which is robust against minor deviations from normality [26]. The significance level was again set at 5%, i.e. α = 0.05. The results of the test are shown in Table 6. On the basis of these results we cannot accept hypothesis H1b, meaning that there is no significant difference in comprehension time between REA diagrams and ER diagrams. Only if we agreed to a lower confidence level, such as α = 0.10 (i.e. the probability that we reject H0b when H0b is false is at least 90%), would a significant difference be found, meaning that we found some empirical evidence that REA diagrams require less time to understand than ER diagrams.
Table 5. Descriptive statistics comprehension time data
Comprehension time data                       ER diagrams     REA diagrams
Number of observations                        21              21
Mean                                          27.6            24.0
Standard deviation                            5.89            7.04
95% confidence interval of mean               24.9 to 30.3    20.8 to 27.2
Median                                        28.0            26.0
Interquartile range (upper/lower quartile)    9.0             10.0
95% confidence interval of median             24.0 to 33.0    18.0 to 28.0
Table 6. 1-tailed paired samples t-test for differences in mean comprehension time (REA diagram versus ER diagram; α = 0.05)

Response variable      Difference between means   95% CI        t-statistic   1-tailed p-value
Comprehension time     -3.6                       -∞ to 0.6     -1.49         0.0757
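The comprehension-time analysis follows the same pattern: check the paired differences for normality and, if acceptable, run a one-tailed paired samples t-test. The SciPy sketch below does this on invented timing data (again, not the study's raw measurements), assuming a SciPy version that supports the `alternative` keyword of `ttest_rel`.

```python
import numpy as np
from scipy import stats

# Invented comprehension times in minutes for 21 subjects (illustration only).
rng = np.random.default_rng(1)
er_time = rng.normal(27.6, 5.9, size=21)
rea_time = er_time - rng.normal(3.6, 6.0, size=21)

# Normality check on the paired differences.
diff = rea_time - er_time
print("Shapiro-Wilk on differences:", stats.shapiro(diff))

# One-tailed paired t-test: H1 states that REA diagrams take less time than ER diagrams.
t_stat, p_value = stats.ttest_rel(rea_time, er_time, alternative="less")
print(f"t = {t_stat:.2f}, one-tailed p = {p_value:.4f}")
```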
4.2 Validity Evaluation The response variables used in the experiment, i.e. accuracy of comprehension and comprehension time, are only pseudo-measures of user comprehension, which relates to the pragmatic quality of the diagrams. We agree with Briand et al. [26] that it is impossible to capture all dimensions of a concept in a single controlled experiment. Correctness ratios and timings relative to comprehension questionnaires are frequently used measurements of user comprehension in this type of research [26-30]. One might object that the time to answer the questionnaire is not a good measure of user comprehension if the student’s performance was bad (i.e. a low correctness ratio). This threat to construct validity can be investigated with a correlational analysis. The Pearson correlation coefficient expressing the degree of association between the paired comprehension time and correctness values (42 observations, normally distributed) is close to zero (Pearson’s r: -0.08, p-value: 0.6131), meaning that there is no correlation between the two types of measurement. We already described how potential confounding effects like learning, instrumentation, ceiling, and fatigue effects were alleviated through counterbalancing and other operational procedures. A ‘real’ threat to the internal validity of the study’s results is the order of learning effect. Students studied the ER model first, before they were introduced to the REA model. Towards the end of the course, most example conceptual models were REA diagrams and this increasing familiarity with the REA model might explain why the experiment, organized after the course ended, showed a better comprehension of the REA diagrams. Switching the order of learning does of course not help, as the REA accounting model is based on the ER approach. A solution is to organize the experiment before the REA model is explained to students. We believe, however, that in that case the experimental comparison can only focus on the improved readability of REA diagrams due to the naming and placement
conventions, as the subjects would not be familiar with the more essential REA model features such as event categories, structured relationships and ontological axioms. The use of students as experimental subjects, though common in empirical software engineering research [31-32], is of course a threat to the external validity of the study if the population that is studied is mainly composed of professionals. According to Kitchenham et al. [33] this is not a major issue as students are the next generation of professionals and are close to the population under study. This certainly holds for our experiment where the subjects were graduate business administration students. A more serious threat is the choice of experimental objects. The diagrams used were models of a single isolated transaction cycle, whereas in practice the REA enterprise information architecture aims at the development of an enterprise-wide or inter-enterprise accounting data model and information system. Many of the claimed benefits therefore relate to the integration of models and systems and this is something we couldn’t evaluate in the experiment. The conclusions drawn in this paper therefore relate to a limited use of the REA accounting model, more in particular to the instantiation of the REA template to develop a conceptual model of a single transaction-oriented business process.
5 Conclusions
The results of our experiment show that the user comprehension of REA diagrams used as conceptual representations of economic exchanges is higher than that of standard ER diagrams. Especially the accuracy of comprehension, measured as the correctness of answers to comprehension questions about the business process as modeled, is significantly higher (at the 5% significance level). The hypothesis that the time required to understand a REA diagram is less than that of an equivalent ER diagram could only be supported at the 10% level of significance. The empirical evidence gathered through this experiment validates to some extent the claimed pragmatic quality of conceptual representations that are developed using the REA accounting model. In this sense, the results demonstrate the effectiveness of REA modeling in the AIS domain. We should, however, be careful in drawing definite conclusions about the validity of the REA model. Several threats to the internal and external validity of this study have been identified in the paper. In particular, the choice of experimental objects (limited in scope to a single transaction cycle) and experimental tasks (relatively simple comprehension questions, not going deep into the details of the REA enterprise domain ontology), and the timing of the experiment warrant caution. Moreover, our focus on pragmatic quality and the consequent necessity to use only semantically equivalent models in the experiment (which constrained the way we derived ER diagrams from REA diagrams) prevented us from evaluating alleged benefits related to the semantic quality assurance properties of the REA model.
References
[1] Olivé, A.: Specific Relationship Types in Conceptual Modeling: The Cases of Generic and with Common Participants. Unpublished keynote lecture, 4th Int’l Conf. Enterprise Information Systems (ICEIS'02), Ciudad Real, Spain, April 2002.
[2] Moody, D.L., Shanks, G.G.: Improving the quality of data models: empirical validation of a quality management framework. Information Systems, 28 (6), 2003, 619–650.
[3] Krogstie, J., Lindland, O.I., Sindre, G.: Towards a Deeper Understanding of Quality in Requirements Engineering. In: Lecture Notes in Computer Science, 932. Proc. 7th Int’l Conf. Advanced Information Systems Engineering (CAiSE'95), Jyvaskyla, Finland, June 1995, 82–95.
[4] Nelson, H.J., Monarchi, D.E., Nelson, K.M.: Ensuring the "Goodness" of a Conceptual Representation. In: Proc. 4th European Conf. Software Measurement and ICT Control (FESMA'01), Heidelberg, Germany, May 2001.
[5] Poels, G., Nelson, J., Genero, M., Piattini, M.: Quality in Conceptual Modeling – New Research Directions. In: Proc. ER’02 Workshops, 1st Int’l Workshop on Conceptual Modeling Quality (IWCMQ’02), Tampere, Finland, October 2002, 1–8.
[6] Wynekoop, J.L., Russo, N.L.: Studying System Development Methodologies: An Examination of Research Methods. Information Systems, 7 (1), 1997, 47–66.
[7] McCarthy, W.E.: An Entity-Relationship View of Accounting Models. The Accounting Review, 54 (4), 1979, 667–686.
[8] McCarthy, W.E.: The REA Accounting Model: A Generalized Framework for Accounting Systems in a Shared Data Environment. The Accounting Review, 57 (3), 1982, 554–578.
[9] Chen, P.P.: The Entity-Relationship Model – Towards a Unified View of Data. ACM Transactions on Database Systems, 1 (1), 1976, 9–36.
[10] Hollander, A.S., Denna, E.L., Cherrington, J.O.: Accounting, Information Technology and Business Solutions, 1st edition, 1996. Irwin, Chicago.
[11] Romney, M.B., Steinbart, P.J.: Accounting Information Systems, 8th edition, 2000. Prentice-Hall.
[12] McCarthy, W.E.: Semantic Modeling in Accounting Education, Practice, and Research: Some Progress and Impediments. In: Chen, P.P., Akoka, J., Kangassalo, H., Thalheim, B. (Eds.): Conceptual Modeling: Current Issues and Future Directions. Springer Verlag, Berlin, 1999, 144–153.
[13] Dunn, C.L., McCarthy, W.E.: The REA Accounting Model: Intellectual Heritage and Prospects for Progress. J. Information Systems, 11 (Spring), 1997, 31–51.
[14] Vaassens, J.: Accounting Information Systems: A Managerial Approach, 2002. John Wiley & Sons.
[15] Geerts, G.L., McCarthy, W.E.: The Ontological Foundation of REA Enterprise Information Systems. Paper presented at the American Accounting Association Conference, Philadelphia, USA, August 2000.
[16] Geerts, G.L., McCarthy, W.E.: Augmented Intensional Reasoning in Knowledge-Based Accounting Systems. J. Information Systems, 15 (Fall), 2001, 127–150.
[17] Geerts, G.L., McCarthy, W.E.: An Ontological Analysis of the Primitives of the Extended-REA Enterprise Information Architecture. Int’l J. Accounting Information Systems, 3, 2002, 1–16.
[18] Lindland, O.I., Sindre, G., Sølvberg, A.: Understanding Quality in Conceptual Modeling. IEEE Software, 11 (2), 1994, 42–49.
[19] McCarthy, W.E.: Inter-Enterprise REA Models. Unpublished keynote lecture, 3rd European Conf. Accounting Information Systems (ECAIS’00), Munich, Germany, March 2000.
[20] Krogstie, J., Lindland, O.I., Sindre, G.: Defining quality aspects for conceptual models. In: Proc. IFIP8.1 Working Conf. Information Systems. Marburg, Germany, 1995, 216–231.
[21] Wohlin, C. et al.: Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, 2000.
[22] Zelkowitz, M.V., Wallace, D.: Experimental Validation in Software Engineering. Paper presented at the 1st Int’l Conf. Empirical Assessment & Evaluation in Software Engineering (EASE’97), Keele, UK, March 1997.
[23] Krogstie, J., Jørgensen, H.D.: Quality of Interactive Models. In: Proc. ER’02 Workshops, 1st Int’l Workshop on Conceptual Modeling Quality (IWCMQ’02), Tampere, Finland, October 2002, 115–126.
[24] Moody, D.L., Sindre, G., Brasethvik, T., Sølvberg, A.: Evaluating the Quality of Process Models: Empirical Testing of a Quality Framework. In: Lecture Notes in Computer Science, 2503. Proc. 21st Int’l Conf. Conceptual Modeling (ER’02), Tampere, Finland, October 2002, 380–396.
[25] Genero, M., Poels, G., Piattini, M.: Defining and Validating Measures for Conceptual Data Model Quality. In: Lecture Notes in Computer Science, 2348. Proc. 14th Int’l Conf. Advanced Information Systems Engineering (CAiSE'02), Toronto, Canada, May 2002, 724–727.
[26] Briand, L.C., Bunse, C., Daly, J.W.: A Controlled Experiment for Evaluating Quality Guidelines on the Maintainability of Object-Oriented Designs. IEEE Transactions on Software Engineering, 27 (6), 2001, 513–530.
[27] Agarwal, R., De, P., Sinha, A.: Comprehending object and process models: An empirical study. IEEE Transactions on Software Engineering, 25 (4), 1999, 541–555.
[28] Briand, L.C., Bunse, C., Daly, J., Differding, C.: An Experimental Comparison of the Maintainability of Object-Oriented and Structured Design Documents. Empirical Software Engineering, 2 (3), 1997, 291–312.
[29] Harrison, R., Counsell, S., Nithi, R.: Experimental assessment of the effect of inheritance on the maintainability of object-oriented systems. J. Systems and Software, 52 (2–3), 2000, 173–179.
[30] Danoch, R., Shoval, P., Balaban, M.: Hierarchical ER Diagrams (HERD) – The Method and Experimental Evaluation. In: Proc. ER’02 Workshops, 1st Int’l Workshop on Conceptual Modeling Quality (IWCMQ’02), Tampere, Finland, October 2002, 23–34.
[31] Briand, L.C. et al.: Empirical Studies of Object-Oriented Artifacts, Methods, and Processes: State of the Art and Future Directions. Empirical Software Engineering, 4 (4), 1999, 387–404.
[32] Deligiannis, I.S., Shepperd, M., Webster, S., Roumeliotis, M.: A Review of Experimental Investigations into Object-Oriented Technology. Empirical Software Engineering, 7, 2002, 193–231.
[33] Kitchenham, B.A. et al.: Preliminary Guidelines for Empirical Research in Software Engineering. IEEE Transactions on Software Engineering, 28 (8), 2002, 721–734.
Preface to AOIS 2003
Information systems have become the backbone of all kinds of organizations today. In almost every sector – manufacturing, education, health care, government, and businesses large and small – information systems are relied upon for everyday work, communication, information gathering, and decision-making. Yet the inflexibilities in current technologies and methods have also resulted in poor performance, incompatibilities, and obstacles to change. As many organizations are reinventing themselves to meet the challenges of global competition and e-commerce, there is increasing pressure to develop and deploy new technologies that are flexible, robust, and responsive to rapid and unexpected change. Agent concepts hold great promise for responding to the new realities of information systems. They offer higher level abstractions and mechanisms which address issues such as knowledge representation and reasoning, communication, coordination, cooperation among heterogeneous and autonomous parties, perception, commitments, goals, beliefs, intentions, etc. all of which need conceptual modeling. On the one hand, the concrete implementation of these concepts can lead to advanced functionalities, e.g., in inference-based query answering, transaction control, adaptive work flows, brokering and integration of disparate information sources, and automated communication processes. On the other hand, their rich representational capabilities allow more faithful and flexible treatments of complex organizational processes, leading to more effective requirements analysis, and architectural/detailed design. The workshop will focus on how agent concepts and techniques will contribute to meeting information systems needs today and tomorrow. To foster greater communication and interaction between the Information Systems and Agents communities, we are organizing the workshop as a bi-conference event. It is intended to be a single “logical” event with two “physical” venues. It is hoped that this arrangement will encourage greater participation from, and more exchange between, both communities. The first part of the workshop was held on the 14th of July at AAMAS’03 – The 2nd International Joint Conference on Autonomous Agents and Multi-Agent Systems, in Melbourne (Australia), 14-18 July 2003. The second part of the workshop will be held in October at the 22nd International Conference on Conceptual Modeling ER03, in Chicago. We would like to gratefully acknowledge all the contributions to the workshop: those by the authors, the participants, and the reviewers. We believe that these accepted papers reflect the field’s state of the art very well. Furthermore, we anticipate that they constitute an excellent basis for an in-depth and fruitful exchange of thoughts and ideas on the various issues of agent-oriented information systems. October 2003
Paolo Giorgini
Brian Henderson-Sellers
Bringing Multi-agent Systems into Human Organizations: Application to a Multi-agent Information System
Emmanuel Adam and René Mandiau
LAMIH UMR CNRS 8530, University of Valenciennes, Le Mont Houy, 59313 Valenciennes Cedex 9, France
{emmanuel.adam,rene.mandiau}@univ-valenciennes.fr
Abstract. Agents are increasingly used to search for information on the Internet. However, they are principally used as single agents and not as parts of a multi-agent system. Indeed, few projects use agents that communicate or collaborate with each other. This lack of communication often leaves users isolated in front of their computers. We think that it is necessary for the user of an information search system (such as an actor of a technological watch cell or a researcher in a laboratory, for example) to be aware of what his/her colleagues are searching for (or at least to have knowledge of a part of their searches). This should avoid redundancies of information and work, and should (re)generate a team feeling among the actors. That is why we propose, in this article, a multi-agent information system, which is itself composed of multi-agent systems located on the computer of each actor of a technological watch department, and which collaborate with each other and with the actors. This multi-agent architecture has been chosen in agreement with the actors of a real department, following the analysis and modeling of their activities. The method for integrating a multi-agent system into a human organization is also discussed in this article.
1 Introduction

The boom in Internet technology and company networks has contributed to completely changing a good number of habits that had been well established in companies for several decades. Enterprises are now launched in a race for information: being the first to find the correct information has become an essential objective for competitive enterprises. It is therefore important to have a fast tool for information search and distribution. Admittedly, tools have already been proposed, such as search engines, meta-engines, tools for automatic search (which search at regular, predetermined intervals) and, more recently, information agents, capable of searching for information, sorting it and filtering it. The problem with these solutions is that they do not take into account human factors, such as the notion of the group or even human-machine cooperation. We previously developed a method (AMOMCASYS, the Adaptable Modeling Method for Complex Administrative Systems) to design and set up multi-agent systems within human organizations, more precisely in the cooperative processes of
these organizations [1]. We have reused this method to develop multi-agent systems intended to facilitate cooperative information management within a technological watch team. The main advantage of our method, and of our system, is that it takes into account the cooperation between the actors of workflow processes. Indeed, we have noticed that most human organizations follow a holonic model (each part of the organization is stable, autonomous and cooperative, and is composed of sub-holonic organizations for which it is responsible), and we have built our method by integrating these notions. This paper first describes the AMOMCASYS method, which we built to model human processes and to design multi-agent systems that interact with the actors of the processes. Then, the method that we propose for the design of such multi-agent systems is presented in two steps: a step for the individual characterization of the agents, and a step for the design of the agents' cooperative working method. Finally, the article presents an application of our method to the design of multi-agent systems that help the actors of a technological watch cell to search for and exchange information.
2 An Adaptable Modeling Method for Complex Administrative Systems (AMOMCASYS)

2.1 Holonic Principles

Before designing a multi-agent system to assist information management within a human organization, we think that it is necessary to understand its working mechanisms. This should allow us to take into account all of its characteristics and to integrate multi-agent systems into it in a pertinent way. We have shown in [1] that most complex administrative systems follow the holonic model proposed by Arthur Koestler in 1969 [2]. A series of grouped rules defines holonic systems, which are called Open Hierarchical Systems (OHS). Here we propose an interpretation of these rules from the multi-agent point of view. We can retain the following principles from the rules:
– A holonic system possesses a tree structure. In fact it can be seen as a set of interwoven hierarchies. Each holon is responsible for a layer of other holons.
– A holon is a stable, autonomous and cooperative part of the system and can be considered as a whole. Holons unite to form a whole, and each one can be broken down into holonic agents; this corresponds to the recursive breakdown of a problem into sub-problems [3].
– A holon obeys precise principles, but is able to adopt different strategies according to its needs.
– The complex activities and behavior are situated at the top of the hierarchy; the interactions with the environment and the "simpler" reactive acts are located at the base of the holarchy.
– The communications must follow the hierarchy and, according to the direction, must be filtered or detailed.
The hierarchy is defined by the responsibility that a holon, which composes the system, has over a process or a sub-process and, thus, over the holons that act within it. For
example, an organization by projects can be represented by a tree structure where each node (each holon) is responsible for a part of the project. As the holonic architecture is well adapted to human organizations in which actors exchange information, we have proposed to reuse this architecture to design a multi-agent system that has to manage and exchange data. Here we find at least two of the characteristics of agents in the MAS sense: autonomy and cooperation. The third characteristic, the capacity to adapt to an environment, is suggested by stability. A holon can therefore be seen as an agent whose stability is an essential point (note that if holons are stable, they do not have to be rigid: the stability of the whole system is more important than the stability of each of its parts, so it is sometimes necessary for some holons to be temporarily destabilized so that the whole system can adopt more long-term protection strategies). Our aim is the design of a multi-agent organization providing assistance to the actors in cooperative processes. This organization must be fixed (which does not imply rigidity) in order to be able to meet user demands as quickly as possible. This is why we have used the social rules defined in the holonic concept in order to simplify and to accelerate the design of a multi-agent society (in the sense of [4]). In such a holonic multi-agent system, agents located at the top of the system are more cognitive than agents located at its base, which are more reactive. We think that this holonic concept is especially usable and useful in structured and cooperative fields [3]. Indeed, we think that systems composed of reactive agents, or systems that have a less rigid structure, are better suited to more flexible environments. However, before setting up a multi-agent system, and more generally any software, it is necessary to model the human organization in which it has to work. That is why we have proposed a method adapted to human organizations in which actors have different levels of responsibility and have to cooperate around documents.

2.2 Use of AMOMCASYS

AMOMCASYS (the Adaptable Modeling Method for Complex Administrative Systems) was designed for modeling the cooperative functioning of procedures in holonic human organizations, by integrating several software engineering methods after comparing them [5]. Indeed, as our aim is to bring multi-agent systems into human organizations, we wanted to involve their actors during the modeling step. So, we built a benchmark to compare six families of methods generally used in industry. As none of the compared methods totally fulfilled our needs, we integrated the most relevant parts of the methods to build the AMOMCASYS method. Such an integration allows a method suited to one's needs to be built in a relatively short period of time; in our case, the need was a clear method allowing an explicit description of cooperation (communication, coordination, and collaboration) and of the degrees of responsibility of the actors. AMOMCASYS is made up of four models: a data model, a dataflow model, a processing model and a dynamic model.
– Concerning the data model, the one proposed by UML makes it possible to represent the structure of the information (documents, …) and their relations
(inheritance, …). Regarding the specification of a MAS, this model makes it possible to define the agents' structure.
– The dataflow model, which represents the information flows between actors, is based on the SADT dataflow model (IDEF0), which we have adapted to represent the levels of responsibility of the actors and of the agents.
– Although the activity model represents all possible flows between actors, it does not represent the conditions under which the information follows a particular path. That is why it is sometimes necessary to use a processing model. This model also has to represent cooperation between actors and their hierarchy and/or responsibility relationships. For this, we use the data processing model of the OSSAD method. Like the previous model, this model can be used for MAS design: the actors having different responsibility levels are replaced by agents that also have different responsibility levels. These two models allow us to check whether the specified organization follows the holonic communication rules. However, although these models can be reused in the low-level design step, they are not sufficient for the high-level one. Indeed, there is a considerable lack relating to the dynamics of the human organization being studied.
– So, the AMOMCASYS method also includes a dynamic model, which uses parameterized Petri nets. This model implies the definition of three levels of rules for the process working method: global rules, local rules and personal rules. This model is not yet used in MAS design but only in the modeling of human organizations.
This method, supported by a CASE tool (a Visual Basic layer based on the commercial software VISIO), enabled us not only to reveal the key points of the procedures where multi-agent systems should be brought in, but also to improve them in an organizational way (for example, the time for dealing with one procedure involving about 15 actors was halved, with at least 20 days of processing eliminated, by improving cooperation and increasing the responsibilities of the actors; the simplifications of the procedure were proposed by its actors and are concretely applied in their department). Three steps are necessary to set up a MAS with AMOMCASYS: firstly, the processes in which the multi-agent system has to be set up are modeled using the data model and the dataflow model; secondly, the agents are introduced into the processes with the dataflow model, in cooperation with the process actors (fig. 1); and finally, the data exchanges and the working mechanism of the multi-agent system are modeled with the processing model. Figure 1 presents the integration of software agents into the information retrieval process of a technological watch team. Each agent is linked to an actor. Agents search for information, filter it, compare it and transmit it to the actors, who check it before recording it in a database. The integration of agents into the process has been done in cooperation with the actors by using the dataflow model, and corresponds to the second step of our method. That is why our MAS are designed in two steps: a step to design the roles that the agents play, and a step to design the cooperative working of the agents and their interactions with the human actors of the process in which they are integrated.
Fig. 1. Example of integration of software agents into a cooperative process
3 Design of a MAS in a Human Organization

Although the definition of our MAS structure has been facilitated by the use of holonic principles, the modeling of the system organization and the characterization of the functionality of the agents remain problematic. Indeed, the research published on this subject is mainly theoretical. Only a few recent research projects allow organization modeling with an application goal, and they are mainly devoted to the field of collective robotics [6], [7]. Here, we propose modeling and specification in two stages: the first stage concerns the individual functioning of each type of holonic agent; the second concerns the functioning of the group, describing communications between agents and between actors and agents.

3.1 Individual Design of the Holonic Agents

In order to describe the general characteristics of the various types of agent, we use a grid adapted from Ferber [8]. This grid gives a description in three dimensions, instead of the five dimensions initially suggested (cf. Table 1). The physical dimension, which is too dependent upon the system, does not appear in the description of general characteristics. The relational dimension is attached to the social dimension. Concerning the functions, the conative and organizational functions, both dealing with planning, have been grouped together, the conative function being more oriented towards needs, desires and urges, which our holonic agents do not have, at least for the time being. This grid enables us to define the functions of each holonic agent relating to: knowledge (the representational function, which also describes non-procedural knowledge); action planning (the organizational function); interactions; maintenance (the preservation function); and the actions specific to the role of the agent (the productive function). These functions are described in relation to the agent's environment, the other agents and the agent itself.
This grid is applied to design the different types (different roles) of agents. However, it is interesting to note that a MAS can be considered as one single holonic agent. So it is possible to use this grid to define, at a higher abstraction level, the different functions of the multi-agent system.

Table 1. Design grid adapted from Ferber's analysis grid [8]

Representational function. Social: representation of the group (of the other roles); Environmental: representation of the world; Personal: representation of itself, of its capacities.
Organizational function. Social: planning of social actions, communications; Environmental: planning of actions in the environment; Personal: planning control, meta-planning.
Interactive function. Social: description of agent/society interaction, performatives; Environmental: perception and action mechanisms in relation to the environment; Personal: auto-communication, auto-action.
Productive function. Social: management, coordination and negotiation tasks; Environmental: analysis, modification and creation tasks; Personal: auto-modification, learning.
Preservation function. Social: preservation of the society, the relations, the network of contacts; Environmental: preservation of resources, defense and maintenance of territory; Personal: self-preservation, repair, maintenance.
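To make the reduced grid more tangible, the sketch below renders it as a hypothetical Java skeleton for a holonic agent role: one operation per function of the grid, each parameterised by the three retained dimensions. All names are ours and the sketch is only illustrative; the original method defines the grid as a design aid, not as code.

```java
// Hypothetical sketch: the three dimensions of the reduced Ferber grid.
enum Dimension { SOCIAL, ENVIRONMENTAL, PERSONAL }

// One method per function of the grid; a concrete role (coordinator agent,
// interface agent, request agent, ...) would implement each of them for the
// three dimensions listed in Table 1.
interface HolonicAgentRole {
    // Representational function: knowledge about the group, the world, itself.
    Object represent(Dimension dimension);

    // Organizational function: planning of social actions, of actions in the
    // environment, and of the agent's own control (meta-planning).
    void plan(Dimension dimension);

    // Interactive function: performatives towards the group, perception and
    // action in the environment, auto-communication.
    void interact(Dimension dimension, Object message);

    // Productive function: coordination/negotiation, analysis/creation,
    // auto-modification and learning.
    void produce(Dimension dimension);

    // Preservation function: preservation of the society, of resources,
    // and of the agent itself.
    void preserve(Dimension dimension);
}
```

A concrete role such as the coordinator or request agent of Sect. 4 would then fill in these operations with the corresponding entries of Table 1.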
But even though this grid enables us to have a clear view of the agents' actions with respect to the environment and to the other agents, it does not allow a definition of the functioning of the whole organization. Indeed, it does not allow the design of the cooperative functioning of the whole multi-agent organization. So, it is necessary to use a method, such as AMOMCASYS, which allows us to do this.

3.2 Cooperative Working of the Holonic Agents

Regarding the design of the MAS to be integrated into the human process, the AMOMCASYS data model allows us to represent the principal Holon class (which describes the general agent structure) as well as the classes associated with the knowledge concerning the environment (representation of the process, the actor, the workstation, the agent responsible, the subordinates). Each holonic agent has five main functions: to plan its actions according to the process and its current state (corresponding to the organizational function); to receive messages from and send messages to other holonic agents (corresponding to the interactive function); to act (corresponding to the productive function, i.e. to the specialty of the agent); and to manage the links with the agent responsible and the subordinates (corresponding to the preservation function). Of course, each holonic agent also has an implicit function, 'initialize', enabling it to acquire knowledge concerning the MAS. The four main functions (the organizational, interactive, productive and preservation functions) imply co-operations between holonic agents and sometimes between the agents and the actors (the users). The processing model of the AMOMCASYS method can model these co-operations, as we will see in the following case study.
4 Bringing an Information Multi-agent System into a Technological Watch Department

The case study presented in this article was performed in the technological watch department of a large company. In this application, we have designed a MAS in order to assist the actors of a technological watch department [9] in their tasks. This specification was done following the analysis and the modeling of the department's processes. In these processes, actors (called watchmen) have to collect information on the Internet, manage it and distribute it to their clients. So we had to design a MAS for information retrieval.

4.1 Structure of the Multi-agent Information System

A multi-agent information system (IMAS) is generally composed of information agents that search, on the basis of requests sent to them (directly, or indirectly through a database), for information in databases (local or distributed) or on Internet sites. The information agents' activities are often coordinated through coordinator agents. These agents have knowledge about the information agents (such as their addresses and their search domains) to which they send requests (in a targeted way if they know their competences, or by broadcast techniques). Coordinator agents have to gather the collected information in order to check it, compare it or filter it. Most information multi-agent systems are directly in touch with the user, upstream (to receive new requests) and/or downstream (to display search results). In order to have a reactive interface and to distribute results to the users, some IMAS propose the use of interface agents acting as interfaces between the users and the system. In our IMAS, called CIASCOTEWA (CO-operative Information Agents' System for the COoperative TEchnological WAtch), each agent proposed in the second step of our method (figure 1) is in fact an IMAS composed of agents. So, we associate an information agents' system (a CIASTEWA, for CO-operative Information Agents' System for the TEchnological WAtch) with each actor of the watch team. This gives the global system greater flexibility and each watchman more autonomy. Indeed, it is easier to integrate a new watchman by adding a sub-system to the global system than by reconfiguring a centralized system. A CIASTEWA is a holonic multi-agent sub-system, which has to search for information, collect it, sort it and communicate relevant information to the other CIASTEWAs. The architecture of a CIASTEWA is shown in figure 2. Each of these sub-systems is made up of the following components (see the sketch after this list):
– a local database that contains the user requests, their results and information on the user,
– an interface agent that assists the users in expressing their requests and allows them to interact with the results provided by the information agents, or with the other users of the group,
– a coordinator agent that has the task of coordinating the actions of the other agents,
– an agent responsible for information, which distributes the requests that are recorded in the local database to the request agents, according to a search strategy,
– request agents that distribute the request for which they are responsible to search engine agents,
– search engine agents that have to find information on the Internet.
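The sketch below, referred to above, illustrates how such a CIASTEWA hierarchy could be wired up in Java. The Holon class, its methods and the role names are hypothetical illustrations of the holonic structure; they are not the MAGIQUE API used for the actual prototype (Sect. 4.3).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one CIASTEWA hierarchy; class and role names are ours.
class Holon {
    final String role;
    Holon superior;
    final List<Holon> subordinates = new ArrayList<>();

    Holon(String role) { this.role = role; }

    // Holonic rule: a holon is responsible for the layer of holons below it.
    Holon addSubordinate(Holon h) {
        h.superior = this;
        subordinates.add(h);
        return h;
    }
}

class CiastewaBuilder {
    static Holon build(String actorName, List<String> searchEngines) {
        Holon coordinator = new Holon("coordinator:" + actorName);
        coordinator.addSubordinate(new Holon("interface"));
        Holon responsible = coordinator.addSubordinate(new Holon("information-responsible"));
        // One request agent per pending request; a single example request here.
        Holon request = responsible.addSubordinate(new Holon("request:holonic systems"));
        // One search engine agent per engine specified in the request.
        for (String engine : searchEngines) {
            request.addSubordinate(new Holon("search-engine:" + engine));
        }
        return coordinator;
    }
}
```

A CIASCOTEWA is then obtained by giving the coordinator agents of all the actors a common superior, as described in Sect. 4.3.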
So, each CIASCOTEWA helps the user to whom it is dedicated to search for relevant information and to communicate it to other actors. In order to maintain or create a feeling of community or group among the actors, which is often lost with the use of new technologies (individuals are isolated at their workstations), we have proposed the development of self-organizing capacities in order to generate communities of CIASTEWAs that have to answer the same kinds of request. This reorganization is indicated to the users in order to encourage them to cooperate, if they wish, with other users who have the same centers of interest.
Fig. 2. Architecture of a CIASTEWA
In fact, work on the generation or identification of communities in IMAS has appeared recently, as in [10], where the agentification of a Web server requires the creation of agent communities. We think that "finding the right person", who may know where the answers are located, is the best way of finding the correct information. Some work has been carried out in this direction in [11] and [12]. For example, in a large laboratory, it is frequent for researchers to momentarily have the same center of interest without knowing it. In our system, they are informed of this and thus encouraged to exchange their information. The cooperation among the several agents of a CIASTEWA is organized around a user database that contains data about the user, the user's requests and the corresponding results. A CIASTEWA is also linked to a group database (fig. 2). This database collects the requests of the group that are qualified as public by the users and, temporarily, the results of these requests during the comparison step (fig. 1). This knowledge about the team ensures coherence in the global MAS and allows a cooperative watch by the watchmen.
As an example of the cooperation afforded by the system, each watchman knows what the others have collected; this avoids redundancy in the management and storage of information. Of course, the redundancy of stored information could also be suppressed by the use of a centralized system (like a proxy server or a centralized agent), but such centralized systems are less flexible than a distributed system regarding the addition or removal of an actor from the information retrieval system. Another example: when a user adds a new request, it is compared to the others; if it is a subset of an existing one, its results are extracted from the results of the existing request, and the watchman is informed by his/her MAS that the request is analogous to that of another actor (we suppose that this should make cooperation easier between watchmen having the same centers of interest). But the cooperative functioning of the different agents of the CIASCOTEWA can only be specified after the definition of the different roles composing a CIASTEWA. The individual design of the CIASTEWA agents is made by reusing the design grid (Table 1) for the five roles that have to be designed in a CIASCOTEWA: the interface agent role, the coordinator agent role, the agent responsible for information role, the request agent role, and the search engine agent role.

4.2 Cooperative Functioning of a CIASTEWA

After having described the individual roles of the agents that compose a CIASTEWA, we have to define the cooperative interactions between them. For that, we use the processing model of the AMOMCASYS method. As an example, we have designed the recording of a new request in a CIASTEWA. The user adds a request to his/her CIASTEWA through the interface agent. The interface agent asks the coordinator agent whether the request exists in the group database, whether it is a subset of another one, or whether it includes requests of other actors. In all these cases, a message is displayed to the user. The request is then recorded in the user database and a message is sent to the agent responsible for information. This message asks it to execute the requests not yet carried out. For this, the agent creates a request agent for each request. Each request agent creates a search engine agent for each search engine specified in the request. Each of these search engine agents connects to the Internet in order to find results and sends them to the request agent responsible for it. When the request agent has received the results from each of its subordinates, it filters them (it deletes duplicates) and sends the results to the agent responsible for it. When the agent responsible for information has received a response from each of the request agents, it compares the results with the group results, provided by the coordinator agent, and notes in the result characteristics which actors have also received them. Thanks to the AMOMCASYS method, we have defined other interactions between the users and the agents of the CIASCOTEWA, such as mailing a result, annotating a result, deleting a result, and modifying or deleting a request.

4.3 Application of the CIASCOTEWA

A prototype has been built from these specifications and is currently used within our laboratory.
Firstly, we have to define the structure of a CIASTEWA by creating the agents and by defining their links (their acquaintances). Then, as each CIASTEWA is composed of five kinds of agent, we have to implement five behaviors. Each behavior contains attributes relating to the knowledge of the agent with which it is associated, and functions that the acquaintances of the agent can call. Thanks to the MAGIQUE platform [13], a Java library dedicated to the design of hierarchical multi-agent systems, we have set up a multi-agent prototype. Indeed, MAGIQUE is dedicated to hierarchical multi-agent systems in which agents own competencies (skills) that they can learn or lose at run time. By default, all MAGIQUE agents contain a communication skill, which allows them to communicate with the other agents of their hierarchy. To define a CIASTEWA with MAGIQUE, we created five empty agents to which we added skills (sets of behaviors): SupSkill for the coordinator agent; AgentFrameSkill for the interface agent; LaunchSearchSkill for the agent responsible for information; SearchSkill for the request agents; and OneEngineSearchSkill for the search engine agents. When an agent responsible for information creates request agents, it creates empty agents and informs them that it is their superior. Then it asks them to learn their competency (SearchSkill) and to perform their tasks. The same mechanism applies when a request agent creates search engine agents. These search engine agents could create other sub-agents if necessary (to interrogate different addresses of the same search engine, for example) without having an impact on the working of the global system. Indeed, as long as a search engine agent returns the required data, it does not matter whether it does so alone or with the help of sub-agents. This principle is applicable to all the agents, as specified by the holonic model. In our prototype, each CIASTEWA (fig. 3) allows an actor to: consult the results of a request; modify, delete or add a request; send a result to other actors in his/her neighborhood; add notes to a result; and delete a result. The user has the following information about a result: its address; its page name; its summary; its size; its date; its owner; the names of the actors who own it (for that, the CIASTEWA communicates with the other CIASTEWAs of the group); the requests of the user that are linked to the result; and the search engines that returned the result. A MAGIQUE agent can communicate with other agents only if they are members of the same hierarchy. In order to make communication possible between the CIASTEWAs, which are each linked to an actor of the technological watch cell, and thus to create a CIASCOTEWA, it was necessary to create a "super agent" which is the superior of all the coordinator agents. This agent has the BossSkill skill, which contains no functions (the interactive function and the knowledge of all the agents of the CIASCOTEWA system are implicit in a MAGIQUE agent). At the present time, the subgroups of CIASTEWAs are created by the users through a configuration file that defines, amongst other things, the acquaintances of a CIASTEWA's coordinator agent. Indeed, a set of XML files is necessary to initialize a CIASTEWA. Each of them corresponds to a type of knowledge: agent-config.xml corresponds to the personal knowledge; actors.xml, wastebasket.xml and search-engines.xml correspond to the environment knowledge; requests.xml and results.xml correspond to the user database.
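The following sketch illustrates the skill mechanism described above: agents are created empty, are told who their superior is, and then learn a competency on request. It is a simplified, hypothetical rendering of the idea in plain Java; the real MAGIQUE class and method names may differ, so this should not be read as the MAGIQUE API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of skill-based hierarchical agents; not the MAGIQUE API.
class SkillAgent {
    private final String name;
    private SkillAgent superior;
    private final Map<String, Runnable> skills = new HashMap<>();

    SkillAgent(String name) { this.name = name; }

    void setSuperior(SkillAgent boss) { this.superior = boss; }

    // An agent can learn (or lose) a competency at run time.
    void learnSkill(String skillName, Runnable behaviour) { skills.put(skillName, behaviour); }

    void invoke(String skillName) {
        Runnable behaviour = skills.get(skillName);
        if (behaviour != null) {
            behaviour.run();
        } else if (superior != null) {
            // Communication follows the hierarchy: an unknown request goes up.
            superior.invoke(skillName);
        }
    }
}

class Demo {
    public static void main(String[] args) {
        SkillAgent responsible = new SkillAgent("information-responsible");
        // The agent responsible for information creates an empty request agent,
        // declares itself superior, then asks it to learn SearchSkill and work.
        SkillAgent requestAgent = new SkillAgent("request-agent");
        requestAgent.setSuperior(responsible);
        requestAgent.learnSkill("SearchSkill",
                () -> System.out.println("distributing the request to search engine agents"));
        requestAgent.invoke("SearchSkill");
    }
}
```

The same pattern applies one level down, where a request agent asks newly created search engine agents to learn OneEngineSearchSkill.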
Fig. 3. Screenshot of a CIASTEWA.
We have added some capabilities to the BossSkill skill which should allow the system to self-organize according to the users' centers of interest, by using Kohonen self-organizing maps and interest distances, but we have not yet examined the results of this functionality. This reorganization of the system aims at decreasing the number of communications between CIASTEWAs and at increasing the team feeling amongst the users by informing them that they have centers of interest in common (we think that this should increase cooperation between them).
5 Conclusion

In order to specify a co-operative information agents' system within a human organization, we have defined a method composed of three steps: a step of analysis and modeling of the human organization; a step of modeling the insertion of agent systems into the human organization; and a step of design of the multi-agent system. Our work uses the AMOMCASYS method, which we defined to analyze and model complex administrative systems, and is backed up by the holonic model. Indeed, this model allows us to understand human organizations, and we have shown [1] that it allows the design of multi-agent systems particularly well adapted to the human organizations being studied. Currently, the first prototype of the CIASCOTEWA system that we have proposed has been set up for short-term use (a few months), and has been particularly well accepted by the actors, thanks to the participative design that we proposed. At this time, the CIASCOTEWA is being used in our laboratory in order to develop capacities of self-organization. A centralized CIASCOTEWA, accessible by
JSP pages, is currently being set up in a knowledge management department of a large company. We aim at developing a more automatic design of the agents, using our latest work, such as the deployment of a holonic multi-agent system over a network from an XML configuration file, and the use of the MAGIQUE platform.
References
1. Adam, E.: Modèle d'organisation multi-agent pour l'aide au travail coopératif dans les processus d'entreprise : application aux systèmes administratifs complexes (in French). PhD Thesis, Université de Valenciennes et du Hainaut-Cambrésis, (2000)
2. Koestler, A.: The Ghost in the Machine. Arkana Books, London, (1969)
3. Gerber, C., Siekmann, J., Vierke, G.: Holonic Multi-Agent Systems. Research report RR99-03, DFKI GmbH, Germany, March (1999)
4. Mandiau, R., Le Strugeon, E., Agimont, G.: Study of the influence of organizational structure on the efficiency of a multi-agent system. Networking and Information Systems Journal, 2(2), (1999), 153–179
5. Adam, E., Kolski, C.: Etude comparative de méthodes de génie logiciel utiles au développement de systèmes interactifs dans les processus administratifs complexes. Génie Logiciel, 49, (1999), 40–54
6. Collinot, A., Drogoul, A.: Approche orientée agent pour la conception d'organisations : application à la robotique collective. Revue d'intelligence artificielle, 12(1), (1998), 125–147
7. Burckert, H.-J., Fischer, K., Vierke, G.: Transportation Scheduling with Holonic MAS: The TeleTruck Approach. In: Proceedings of the Third International Conference on Practical Applications of Intelligent Agents and Multiagents (PAAM'98), (1998)
8. Ferber, J.: Les systèmes multi-agents. Vers une intelligence collective. InterEditions, Paris, (1995)
9. Jonnequin, L., Adam, E., Kolski, C., Mandiau, R.: Co-operative Agents for a Co-operative Technological Watch. In: CADUI'02 – 4th International Conference on Computer-Aided Design of User Interfaces, University of Valenciennes, (2002)
10. Helmy, T., Amamiya, S., Mine, T., Amamiya, M.: An Agent-Oriented Personalized Web Searching System. In: Giorgini, P., Lespérance, Y., Wagner, G., Yu, E. (eds.): Proceedings of the Fourth International Bi-Conference Workshop on Agent-Oriented Information Systems (AOIS-2002 at AAMAS'02), Bologna, Italy, (2002)
11. Jie, M., Karlapalem, K., Lochovsky, F.: A Multi-agent framework for expertise location. In: Wagner, G., Lesperance, Y., Yu, E. (eds.): Agent-Oriented Information Systems 2000, iCue Publishing, Berlin, (2000)
12. Kanawati, R., Malek, M.: A Multiagent for collaborative Bookmarking. In: Giorgini, P., Lespérance, Y., Wagner, G., Yu, E. (eds.): Proceedings of the Fourth International Bi-Conference Workshop on Agent-Oriented Information Systems (AOIS-2002 at AAMAS'02), Bologna, Italy, (2002)
13. Routier, J.C., Mathieu, P., Secq, Y.: Dynamic Skills Learning: A Support to Agent Evolution. In: Proceedings of AISB'01, York, (2001), 25–32
Reconciling Physical, Communicative, and Social/Institutional Domains in Agent Oriented Information Systems – A Unified Framework
Maria Bergholtz, Prasad Jayaweera, Paul Johannesson, and Petia Wohed
Department of Computer and Systems Sciences, Stockholm University and Royal Institute of Technology, Forum 100, SE-164 40 Kista, Sweden
{maria,prasad,pajo,petia}@dsv.su.se
Abstract. One of a business system's roles is to provide a representation of a Universe of Discourse that reflects its structure and behaviour. An equally important function of the system is to support communication within an organisation by structuring and co-ordinating the actions performed by the organisation's agents. These two roles of a business system may be represented in terms of business models and process models, i.e. by separating the declarative aspects of the system from its procedural control-flow aspects. Although this separation of concerns has many advantages, the difference in representation techniques and focus between the two model types constitutes a problem in itself. The main contribution of this paper is a unified framework based on agent-oriented concepts that facilitates the analysis and integration of business models and process models in e-Commerce in a systematic way. The suggested approach bridges the gap between the declarative and social/economic aspects of a business model and the procedural and communicative aspects of a process model. We illustrate how our approach can be used to facilitate integration, process specification, process pattern interpretation and process choreography.
1 Introduction

Agent-oriented concepts have recently been applied to the area of information systems design. One of the most promising applications of agent orientation could be the development of e-Commerce systems. In e-Commerce, systems design is based on two fundamental types of models: business models and process models. A business model is concerned with value exchanges among business partners [10], while a process model focuses on operational and procedural aspects of business communication. Thus, a business model defines the what in an e-Commerce system, while a process model defines the how. This means that the process of designing e-Commerce systems consists of two main phases: first, a business requirement capture phase focusing on value exchanges, and second, a phase focused on operational and procedural realisation. In the business requirement capture phase, coarse-grained views of business activities, as well as of their relationships and arrangements in business collaborations, are represented by means of business model constructs at an abstract level. In contrast,
the specification of a process model deals with more fine-grained views of business transactions, their relationships and their choreography in business collaborations. Although the two phases in e-Commerce design, and their related models, have different focuses, there is clearly a need to integrate them. A unified framework, covering views from coarse-grained business modelling to fine-grained process specification, provides several benefits. It can be used to support different user views of the system being designed, and it can form the basis of a precise understanding of the modelling views and their inter-relationships. It can also provide a basis for design guidelines that assist in developing process models. The purpose of this paper is to propose a framework integrating the contents of business models and process models. The framework is based on agent-oriented concepts like agent, commitment, event, action, etc. [18]. We use ebXML [8] and UMM [3] as the basis of our framework, more specifically the UMM Business Requirements View (BRV) for business models and the UMM Business Transaction View (BTV) for process models. UMM BRV already includes a number of agent-oriented concepts, which we extend by adding a number of constructs for bridging business and process models, in particular speech acts. The work presented in this paper builds on [4], where Speech Act Theory (SAT) [17] and the language/action approach [6] are used for analysing processes, as well as for clarifying the relationships between agents in business and process models. The rest of the paper is organised as follows. Section 2 gives an overview of related research and informally introduces the basic concepts. Section 3 introduces the UMM BRV and BTV. Section 4 contains the main contribution of the paper and presents the integrated framework. Section 5 illustrates two applications of the introduced framework: the analysis and the design of business process patterns. Section 6 introduces rules for governing the choreography of transactions and collaborations. Section 7, finally, concludes the paper and discusses the results.
2 Basic Concepts and Related Research

A starting point for understanding the relationships between business models and process models is the observation that a person can carry out several different actions by performing one single physical act. An everyday example could be a person who turns on the water sprinkler and thereby both waters the lawn and fulfils the promise to take care of the garden – one physical act (turning on the sprinkler) which can be viewed as "carrying" two other actions (watering the lawn and fulfilling a promise). Relationships like these are particularly common for communicative actions, which are carried out by means of physical actions. One way to look at the role of communicative actions and their relationships to other actions is to view human actions as taking place in three different domains:
* The physical domain. In this domain, people carry out physical actions – they utter sounds, wave their hands, send electronic messages, etc.
* The communicative domain. In this domain, people express their intentions and feelings. They tell other people what they know, and they try to influence the behaviour of other actors by communicating with them. People perform such communicative actions by performing actions in the physical domain.
* The social/institutional domain. In this domain, people change the social and institutional relationships among them. For example, people become married or they acquire possession of property. People change social and institutional relationships by performing actions in the communicative domain.
Using this division, business models can be seen as describing the social/institutional domain, in particular economic relationships and actions like ownership and resource transfers. Process models, on the other hand, describe the communicative domain, in particular how people establish and fulfil obligations. The three-fold division above is based on an agent-oriented approach to information systems design [19], [20]. A key assumption of this approach is that an enterprise can be viewed as a set of co-operating agents that establish, modify, cancel and fulfil commitments and contracts [7]. In carrying out these activities, agents rely on so-called speech acts, which are actions that change the universe of discourse when a speaker utters them and a recipient grasps them. A speech act may be oral as well as written, or even expressed via some other communication form such as sign language. The feasibility of speech act theory for electronic communication systems is supported by several researchers; see [16] for a review. The work reported on in this paper differs from these approaches since it uses SAT for analysing and integrating different modelling domains in e-Commerce, rather than for facilitating electronic message handling per se. One of the pioneers in the development of a theory of speech acts is John Searle [17], who introduced a taxonomy of five different kinds of speech acts: assertive, directive, commissive, expressive, and declarative, also called illocutionary points. An assertive is a speech act the purpose of which is to convey information about some state of affairs of the world from one agent, the speaker, to another, the hearer. A commissive is a speech act the purpose of which is to commit the speaker to carry out some action or to bring about some state of affairs. A directive is a speech act where the speaker requests the hearer to carry out some action or to bring about some state of affairs. A declarative is a speech act where the speaker brings about some state of affairs by the mere performance of the speech act, e.g. "I declare you husband and wife". Finally, an expressive is a speech act the purpose of which is to express the speaker's attitude to some state of affairs. In addition to its illocutionary point, a speech act also has a propositional content. The speech acts "I hereby pronounce you husband and wife" and "You are hereby divorced", which are both declaratives, have different propositional contents. A speech act is often viewed as consisting of two parts, its propositional content and its illocutionary force. The illocutionary force is the illocutionary point together with the manner (for example ordering, asking, begging) in which the speech act is performed and the context in which it occurs.
3 UMM Business and Process Models – BRV and BTV

The Resource-Event-Agent (REA) framework [15] has recently been applied in the UN/CEFACT Modelling Methodology (UMM) for business process modelling [3]. The scope of UMM is to provide a procedure for specifying, in a technology-neutral and implementation-independent manner, business processes involving information
exchange. In UMM, a number of meta-models are defined to support incremental model development and to provide different levels of specification granularity.
• A business meta-model, called the Business Operations Map (BOM), partitions business processes into business areas and business categories.
• A requirements meta-model, called the Business Requirements View (BRV), specifies business processes and business collaborations.
• An analysis meta-model, called the Business Transaction View (BTV), captures the semantics of business information entities and their flow of exchange between business partners as they perform business activities.
• A design meta-model, called the Business Service View (BSV), models the network component services and agents and their message (information) exchange.
The two meta-models relevant for our work are BRV and BTV (see [Fig. 1]), and we describe them briefly in the following subsections.

3.1 Business Requirements View
As it is based on REA, the BRV models EconomicEvents, the Resources transferred through the EconomicEvents, and the Agents, here called Partners, between whom the EconomicEvents are performed. An EconomicEvent is the transfer of control of a Resource from one Partner to another. Each EconomicEvent has a counterpart, i.e. the EconomicEvent that is performed in return, realising an exchange. For instance, the counterpart of a goods-transfer economic event could be a payment, i.e. a money-transfer economic event. This connection between two economic events is modelled through the relationship duality. Furthermore, an EconomicEvent fulfils an EconomicCommitment. An EconomicCommitment can be seen as the result of a commissive speech act and is intended to model an obligation for the performance of an EconomicEvent. The duality between EconomicEvents is inherited by the EconomicCommitments, where it is represented by the relationship reciprocal. In order to represent collections of related commitments, the concept of EconomicContract is used. A Contract is an aggregation of two or more reciprocal EconomicCommitments. An example of a Contract is a purchase order composed of one or more order lines, each one representing a corresponding EconomicCommitment in the contract. The product type specified in each line is the ResourceType that is the subject of the EconomicCommitment. EconomicContracts are often made within the boundaries of different Agreements. An Agreement is an arrangement between two Partners that specifies the conditions under which they will trade.
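For readers who find code easier to scan than meta-models, the BRV notions just introduced can be paraphrased as the following Java sketch. It is purely explanatory: the class shapes, field names and cardinalities are our simplifications, not the normative UMM definitions.

```java
import java.util.ArrayList;
import java.util.List;

// Explanatory sketch of the UMM BRV concepts described above; not normative.
class Partner { String name; Partner(String n) { name = n; } }

class EconomicEvent {
    Partner from, to;
    String resource;                  // the Resource whose control is transferred
    EconomicEvent dualityCounterpart; // e.g. goods transfer <-> payment
    EconomicCommitment fulfils;       // the commitment this event fulfils
}

class EconomicCommitment {
    String resourceType;              // the ResourceType that is its subject
    List<EconomicCommitment> reciprocal = new ArrayList<>(); // inherited duality
}

class EconomicContract {
    // A contract (e.g. a purchase order) aggregates two or more reciprocal
    // commitments (e.g. its order lines and the corresponding payment).
    List<EconomicCommitment> commitments = new ArrayList<>();
}

class Agreement {
    // The conditions under which two partners will trade; contracts are made
    // within the boundaries of an agreement.
    Partner first, second;
    List<EconomicContract> contracts = new ArrayList<>();
}
```

In this reading, a purchase order is an EconomicContract whose order lines are EconomicCommitments reciprocal to a payment commitment, each later fulfilled by an EconomicEvent.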
3.2 Business Transaction View
The Business Transaction View (BTV) specifies the flow of business information between business roles as they perform business activities. A BusinessTransaction is a unit of work through which information and signals are exchanged (in agreed format, sequence and time interval) between two business partners. These information exchange chunks, called BusinessActions, are either Requesting Business
Activities or Responding Business Activities (depending on whether they are performed by a Partner Role who is requesting a business service, or whether they are the response to such a request). A transaction completes when all the interactions within it succeed; otherwise it is rolled back. Furthermore, the flow between different BusinessTransactions can be choreographed through BusinessCollaborationProtocols.
Fig. 1. UMM Business Requirement and Business Transaction Views
4 An Agent-Oriented Integration Framework

In terms of the three domains introduced in Section 2, UMM explicitly addresses only the physical and the social/institutional domains. The physical domain is modelled through classes like BusinessTransaction and BusinessAction, while the social/institutional domain is modelled through EconomicCommitment, EconomicEvent, and other classes. The details of the communicative domain, however, are not explicitly modelled. This state of affairs causes two main problems. First, the relationship between the physical and the social/institutional domains is very coarsely modelled; essentially the UMM only states that a completed collaboration may influence objects in the social/institutional world, but it does not tell how the components of a collaboration affect the social/institutional objects. Secondly, there is no structured or systematic way of specifying how events in the physical domain influence the social/institutional domain. These problems can be overcome by introducing the communicative domain as an additional layer in the UMM, thereby creating a bridge between the physical and social/institutional domains.
As a preparation for modelling the communicative domain, a minor modification of the UMM BRV is made, see [Fig. 2]. A class EconomicEffect is introduced as a superclass of EconomicCommitment, Agreement, and EconomicEvent. The power type [14] of EconomicEffect, called EconomicEffectType, is also added for the purpose of differentiating between the modelling of concrete, tangible objects in a domain and the abstract characteristic categories of these objects. These modifications allow for a more concise representation of the effects of communicative actions. In addition to these changes, the classes BusinessActionEnactment and BusinessTransactionEnactment are added. These represent the actual execution of a business action or a business transaction, respectively.
Fig. 2. Extended Business Requirement View
The basic notions introduced for modelling the communicative domain are those of a pragmatic action and its execution, i.e. PragmaticAction and PragmaticActionEnactment, see Fig. 2. A pragmatic action is a speech act as introduced in Section 2. It consists of three parts, denoted as a triple <Illocution, Action, EffectType>. Intuitively, these components of a pragmatic action mean the following:
• EffectType specifies an EconomicEffectType, i.e. it tells what kind of object the pragmatic action may affect
• Action is the type of action to be applied – create, change, or cancel
• Illocution specifies the illocutionary force of the pragmatic action, i.e. it tells what intention the actor has towards the Action on the EffectType
Formally, Illocution and Action are defined through enumeration:
Action = {create, change, cancel, none}
Illocution = {propose, accept, reject, declare, query, reply, assert}
The meanings of the illocutions are as follows:
propose – someone proposes to create, change, or cancel an object
accept – someone accepts a previous proposal
reject – someone rejects a previous proposal
declare – someone unilaterally creates, changes, or cancels an object
query – someone asks for information
reply – someone replies to a previous query
assert – someone makes a statement about one or several objects
For 'query', 'reply', and 'assert', there is no relevant Action involved, so only the "dummy" 'none' can be used. The class PragmaticActionEnactment is used to represent the actual executions of pragmatic actions. A PragmaticActionEnactment specifies a PragmaticAction as well as an EconomicEffect, i.e. the agreement, commitment, or economic event to be affected. Some examples of PragmaticActions are: "Query status of a sales order" would be modelled as <query, none, salesOrder>, and "Request purchase order" would be modelled as <propose, create, purchaseOrder>, where 'salesOrder' and 'purchaseOrder' are EconomicEffectTypes.
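The triple notation lends itself directly to a small typed encoding. The sketch below is our own illustration: the two enumerations are taken from the definitions above, while the record and the example constants (including the <query, none, salesOrder> reading of the sales-order example) are assumptions made for the sake of the example.

```java
// Sketch of the pragmatic action triple <Illocution, Action, EffectType>.
enum Illocution { PROPOSE, ACCEPT, REJECT, DECLARE, QUERY, REPLY, ASSERT }
enum Action { CREATE, CHANGE, CANCEL, NONE }

// EffectType is an EconomicEffectType, i.e. a kind of agreement, commitment
// or economic event; here simply identified by name.
record PragmaticAction(Illocution illocution, Action action, String effectType) { }

class Examples {
    // "Request purchase order": <propose, create, purchaseOrder>
    static final PragmaticAction REQUEST_PURCHASE_ORDER =
            new PragmaticAction(Illocution.PROPOSE, Action.CREATE, "purchaseOrder");

    // "Query status of a sales order": a query carries no Action, so the dummy
    // 'none' is used: <query, none, salesOrder> (our reading of the example).
    static final PragmaticAction QUERY_SALES_ORDER_STATUS =
            new PragmaticAction(Illocution.QUERY, Action.NONE, "salesOrder");
}
```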
4.1 Integrated View of Process and Business Models
The glue between the physical domain and the communicative domain is made up by the associations between the classes BusinessAction and PragmaticAction, and BusinessActionEnactment and PragmaticActionEnactment. These associations express that a business action can carry one or more pragmatic actions, i.e. by performing a business action, an actor simultaneously performs one or several pragmatic actions. Often, only one pragmatic action is performed, but in some cases several can be performed, e.g. when creating a commitment and its contract at the same time. The global integrated view of BRV and BTV is shown graphically in [Fig. 3]. The original BTV parts are grouped within the darker (lower) grey area boundary, the BRV parts are grouped within the lighter grey area, and the new parts introduced in this chapter are depicted in the white area.
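As a minimal illustration of this "carrying" association, the sketch below attaches a list of pragmatic actions to a single business action enactment, using the commitment-plus-contract case mentioned above. The record, its fields, and the exact triples chosen are our assumptions; the pragmatic actions are shown as plain strings to keep the example self-contained.

```java
import java.util.List;

// Illustrative sketch only: one business action (a physical message exchange)
// can carry several pragmatic actions at once.
record BusinessActionEnactment(String documentSent, List<String> carriesPragmaticActions) { }

class Glue {
    // Example from the text: a single order message both creates a commitment
    // and creates the contract it belongs to (triples chosen for illustration).
    static final BusinessActionEnactment ORDER_MESSAGE = new BusinessActionEnactment(
            "PurchaseOrderDocument",
            List.of("<propose, create, economicCommitment>",
                    "<propose, create, economicContract>"));
}
```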
5 Application/Analysis of Transaction and Collaboration Patterns

In this section, a number of applications of the proposed framework with respect to business modelling patterns are introduced. A pattern is a description of a problem, its solution, when to apply the solution, and when and how to apply the solution in new contexts [12]. First, we discuss how the framework can be used for analysing the semantics of UMM business transaction patterns. Secondly, different collaboration patterns for incremental development are suggested.
5.1 Analysing UMM Business Transaction Patterns
UN/CEFACT has defined a number of business transaction patterns as part of UMM with the intention of providing an established semantics of frequently occurring business interactions. Below, we list a number of these patterns and show how they can be understood based on the framework introduced in the previous section.
Fig. 3. Integrated Global view
Design patterns are defined as "descriptions of communicating objects and classes customised to solve a general design problem in a particular context" [9]. We adapt this definition to the UMM transaction patterns and view a transaction pattern as a template of exactly one pair of a Requesting and a Responding Business Activity, customised to encode the intentions and effects of a business interaction in a context. Definition: A transaction pattern (TP) is an activity diagram with two states designating the Requesting and the Responding Business Activity. Every other state is an end state. All state transitions are labelled by pragmatic actions, carried by the Requesting and Responding Business Activity; see [Fig. 4], [Fig. 5] and [Table 1] below. The analysis suggests one way to interpret the definitions of the UMM transaction patterns, but it does not make any claims to be the final, "correct" interpretation of these definitions. This is not an achievable goal, as the definitions are only formulated in natural language, sometimes quite vaguely. The value of the analysis is that it provides explicit interpretations that can be judged for their validity, and thereby can help in formulating more precise and unambiguous definitions of the patterns. Another use of the analysis is to suggest additional patterns beyond those already present in UMM. The Fulfilment, Contract Proposal, Bilateral Cancellation and Unilateral Cancellation patterns (from [Table 1]) are obvious candidates for additional business transaction patterns.
Table 1. Analysis of transaction patterns in terms of pragmatic actions
TP: Commercial (Offer/Accept)
Definition: “This design pattern is best used to model the ‘offer and acceptance’ business transaction process that results in a residual obligation between both parties to fulfil the terms of the contract. The pattern specifies an originating business activity sending a business document to a responding business activity that may return a business signal or business document as the last responding message.” [3]
Analysis: Request, Response

TP: Query/Response
Definition: “The query/response design pattern specifies a query for information that a responding partner already has e.g. against a fixed data set that resides in a database. The response comprises zero or more results each of which meets the constraining criterion in the query.” [3]
Analysis: Request, Response (1)

TP: Request/Confirm
Definition: “The request/confirm activity pattern shall be used for business contracts when an initiating partner requests confirmation about their status with respect to previously established contracts or with respect to a responding partner’s business rules.” [3]
Analysis: Request, Response

TP: Request/Response
Definition: “The request/response activity pattern shall be used for business contracts when an initiating partner requests information that a responding partner already has and when the request for business information requires a complex interdependent set of results.” [3]
Analysis: Request, Response (1)

TP: Information Distribution
Definition: “This pattern specifies the exchange of a requesting business document and the return of an acknowledgement of receipt signal. The pattern is used to model an informal information exchange business transaction that therefore has no nonrepudiation requirements.” [3]
Analysis: Request; Response carries no pragmatic action

TP: Notification
Definition: “This pattern specifies the exchange of a requesting business document and the return of an acknowledgement of receipt signal. The pattern is used to model a formal information exchange business transaction that therefore has nonrepudiation requirements.” [3]
Analysis: Request; Response carries no pragmatic action (2)

(1) Note that the analysis fails to make a distinction between the query/response and the request/response patterns; the reason for this is that the difference between the patterns does not reside in different business effects but in different ways of computing the responses.
(2) The motivation for this analysis is that a notification results in a binding specification of business conditions for the initiating partner and, thus, in a (partial) agreement.
Table 1. (continued)

TP: Fulfilment
Definition: The fulfilment pattern specifies the completion of an Economic Event, see [Fig. 4].
Analysis: Request, Response

TP: Contract Proposal
Definition: The Contract Proposal transaction pattern is a variation of the aforementioned Offer/Accept (Commercial) transaction pattern in which the partners do not have to make their assertions of intentions legally binding, see [Fig. 4].
Analysis: Request, Response

TP: Bilateral Cancellation
Definition: The Bilateral Cancellation transaction pattern refers to the bilateral cancellation of an Economic Contract or of Commitment(s) within an Economic Contract. See the left part of [Fig. 5].
Analysis: Request, Response

TP: Unilateral Cancellation
Definition: The Unilateral Cancellation transaction pattern refers to the unilateral cancellation of an Economic Contract or of Commitment(s) within an Economic Contract. See the right part of [Fig. 5].
Analysis: Request; Response carries no pragmatic action
Fig. 4. Fulfilment and Contract Proposal Transaction Patterns
Fig. 5. Bilateral and Unilateral Cancellation Transaction Patterns
5.2 Collaboration Patterns

A Business Collaboration Pattern defines the orchestration of activities between partners by specifying a set of BusinessTransaction patterns and/or more basic collaboration patterns, plus the rules for transitioning from one transaction/collaboration to another [1]. The significance of a Business Collaboration Pattern is that it serves as a predefined template: it encodes business rules and business structure according to well-established best practices. A problem with the UMM collaboration patterns is that their complexity increases dramatically as new patterns are assembled from basic ones, making the resulting activity diagrams hard to understand. To overcome this difficulty we use a layered approach in which the transaction patterns constitute nodes in the activity diagram of the collaboration patterns. In this way the internal interactions between business partners within a transaction are modelled in a set of well-defined transaction patterns. In the collaboration pattern this complexity is hidden, and only the outcome of the transaction pattern is taken into consideration.

Definition: A collaboration pattern is a state chart over Transaction and Collaboration pattern(s). A collaboration pattern has exactly two end states, representing success or failure of the collaboration respectively.

5.2.1 Fulfilment Collaboration Pattern

Definition: The fulfilment collaboration pattern specifies the relevant transaction patterns (see Fig. 6) and the rules for transitioning among them within the completion of an EconomicEvent. The pattern is assembled from the Fulfilment and Unilateral Cancellation transaction patterns defined in the previous section.
Fig. 6. Fulfilment Collaboration Pattern
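A minimal sketch of how such a collaboration pattern could be encoded as a state machine over transaction-pattern outcomes is given below. The enum values and the particular transitions are illustrative choices made for this sketch; they follow the definition (a state chart over transaction patterns with exactly two end states) rather than transcribing Fig. 6 literally.

// Illustrative sketch: a collaboration pattern as a state machine over the
// outcomes of its constituent transaction patterns. The states and the
// transitions chosen here are hypothetical and serve only as an example.
public class CollaborationPatternSketch {

    enum Outcome { ACCEPT, REJECT, CANCEL }
    enum State { FULFILMENT, UNILATERAL_CANCELLATION, SUCCESS, FAILURE }  // two end states

    // One possible transition function for a fulfilment-like collaboration.
    static State next(State current, Outcome outcome) {
        return switch (current) {
            case FULFILMENT -> switch (outcome) {
                case ACCEPT -> State.SUCCESS;
                case REJECT -> State.FULFILMENT;               // retry the fulfilment transaction
                case CANCEL -> State.UNILATERAL_CANCELLATION;
            };
            case UNILATERAL_CANCELLATION -> switch (outcome) {
                case ACCEPT, CANCEL -> State.FAILURE;
                case REJECT -> State.FULFILMENT;
            };
            default -> current;                                 // SUCCESS and FAILURE are end states
        };
    }

    public static void main(String[] args) {
        State s = State.FULFILMENT;
        s = next(s, Outcome.CANCEL);    // move to the cancellation transaction
        s = next(s, Outcome.ACCEPT);    // cancellation accepted, so the collaboration fails
        System.out.println(s);
    }
}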
Fig. 7. Contract Proposal and Contract Offer Collaboration patterns
5.2.2 Contract Proposal and Contract Offer Collaboration Patterns

Two basic collaboration patterns for business negotiation in contract formation are given by the Proposal and Offer collaboration patterns [3]. The Proposal collaboration pattern models the non-legally binding negotiation phase of contract formation, whereas the Offer collaboration pattern expresses the formal creation phase of a contract, see Fig. 7. These patterns are assembled from the Contract Proposal transaction pattern and the Commercial transaction pattern (described in Section 5.1), respectively. The two recursive paths taken when a contract offer/proposal has been rejected have a natural correspondence in the business concepts ‘Counter Offer’ and ‘Bidding’ (or ‘Auctioning’), respectively. ‘Counter Offer’ refers to a switch of roles between the agents, i.e. when the responding agent has rejected the requesting agent’s offer, the former makes an offer of her own. ‘Bidding’ is modelled via the other transition from the decision activity, i.e. when the Responding Business Activity has turned down a contract offer, the Requesting Business Activity immediately initiates a new transaction with a new (changed) contract offer.
6 Modelling with Patterns – Governing the Choreography of Transactions and Collaborations

Patterns evolve from structures and/or interactions that occur frequently in a certain context or domain. An issue is how to combine the patterns, i.e. how to avoid combining them in an incorrect way that diminishes their usefulness in solving problems. In this section, we propose rules governing the choreography, i.e. the sequencing, of business transactions and business collaborations. Due to space limitations we illustrate rules for the ordering of transactions only.

6.1 Ordering of Transactions
When a designer constructs a choreography for a collaboration, it is helpful to consider the dependencies that exist among the transactions of the collaboration. Two kinds of dependencies occur across many domains: trust dependencies [11] and flow dependencies [13]. A trust dependency is an ordered pair of transactions <A, B>, which expresses that A has to be performed before B as a consequence of limited trust between the initiator and the responder. As an example, it is possible to require that a product be paid for before it can be delivered. A flow dependency is an ordered pair of transactions <A, B>, which expresses that A has to be performed before B because the Economic Resources obtained in A are needed for carrying out B. We now define two partial orders, Flow and Trust, whose members are ordered pairs of BusinessTransactions between which a trust or flow dependency holds. Furthermore, a BusinessTransaction is classified according to the EconomicEffectType of the pragmatic action it targets, i.e. the EconomicContract, EconomicCommitment, or EconomicEvent to be affected.
Fulfilment transactions target EconomicEvents, commitment transactions target EconomicCommitments, and contract transactions target EconomicContracts. Cancellation transactions refer to all types of pragmatic actions where the Action is of type ‘Cancel’. The signatures of the partial orders are given below, where Ful, Com, Ctr and Can refer to the sets of fulfilment, commitment, contract, and cancellation transactions, respectively. Trust is a partial order over {Ful ∪ Com ∪ Ctr} × {Ful ∪ Com ∪ Ctr}. Flow is a partial order over Ful × Ful. A set of rules that govern the orchestration of activities (as defined by a pair of Requesting/Responding Business Activities in a Transaction) can now be defined.

Rule 1: If A and B are nodes in a choreography C, and <A, B> ∈ Flow ∪ Trust, then there must exist a path from A to B in C.

Furthermore, we observe that the establishment of a commitment or contract must precede its cancellation, which gives rise to the following rule:

Rule 2: If A and B are nodes in a choreography C, A ∈ Com ∪ Ctr and B ∈ Can, where B cancels the contract or commitment established by A, then there must exist a path from A to B in C. (Note that a fulfilment transaction can be performed or not performed, but it cannot be cancelled.)

Returning to the relationships between EconomicCommitment, EconomicContract and EconomicEvent, as stated in [15], we observe that Economic Contracts are subtypes of Agreements carrying Economic Commitments that some actual economic exchange will be fulfilled in the future. Thus we identify the following rule:

Rule 3: If A and B are nodes in a choreography C, A ∈ Com ∪ Ctr and B ∈ Ful, where B establishes the economic event that fulfils the commitment established by A, then there must exist a path from A to B in C.

Rules 1–3 can be used to guide and restrict the design of a choreography, i.e. to suggest possible paths between different transactions, as well as sequences of transactions, and to rule out incorrect paths.
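As a sketch, Rules 1–3 amount to reachability checks over the choreography graph. The snippet below checks that every Flow or Trust dependency <A, B> is realised by a path from A to B (Rule 1); the class and method names, and the toy choreography, are our own and only illustrate the idea.

// Illustrative sketch: checking Rule 1 (every Flow/Trust dependency <A, B>
// must be realised by a path from A to B in the choreography).
// All identifiers are hypothetical; they are not taken from UMM or the paper.
import java.util.*;

public class ChoreographyRuleCheck {

    record Dependency(String from, String to) {}   // an ordered pair <A, B>

    // choreography edges: transaction -> directly following transactions
    static boolean pathExists(Map<String, Set<String>> edges, String from, String to) {
        Deque<String> stack = new ArrayDeque<>(List.of(from));
        Set<String> seen = new HashSet<>();
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (node.equals(to)) return true;
            if (seen.add(node)) stack.addAll(edges.getOrDefault(node, Set.of()));
        }
        return false;
    }

    static List<Dependency> violations(Map<String, Set<String>> choreography,
                                       Set<Dependency> flowAndTrust) {
        return flowAndTrust.stream()
                .filter(d -> !pathExists(choreography, d.from(), d.to()))
                .toList();
    }

    public static void main(String[] args) {
        // A toy choreography: contract formation, then payment, then delivery.
        Map<String, Set<String>> choreography = Map.of(
                "ContractOffer", Set.of("Payment"),
                "Payment", Set.of("Delivery"));
        // Trust dependency: payment before delivery (limited trust in the buyer).
        Set<Dependency> deps = Set.of(new Dependency("Payment", "Delivery"),
                                      new Dependency("ContractOffer", "Delivery"));
        System.out.println(violations(choreography, deps));   // prints [] - no violations
    }
}

Rules 2 and 3 reduce to the same check once the relevant cancellation and fulfilment pairs are collected as dependencies.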
7 Concluding Remarks

Integrating process and business models poses a number of problems along several dimensions. Differences in focus, abstraction level, and domain give rise to different types of discrepancies that must be resolved. Process models may be seen as describing the communicative world, in particular how agents establish and fulfil obligations, while business models depict the social/institutional world, where economic relationships such as ‘ownership’ hold and actions such as the transfer of economic resources occur. The main contribution of this paper is a unified framework to facilitate the integration of business models and process models in e-Commerce. The suggested approach bridges the gap between the communicative aspects of a process model and the social/institutional aspects of a business model. A key assumption of this approach is that an enterprise can be viewed as a set of co-operating agents that establish,
modify, cancel and fulfil commitments and contracts. In carrying out these activities, agents rely on so-called pragmatic acts (speech acts), which are actions that change the universe of discourse when a speaker utters them and a recipient grasps them.

Besides facilitating process and business model integration, the proposed framework offers several benefits:

Simplified Analysis and Design. It will be easier for business users to participate in analysis and design if they are able to express themselves using concepts that have a business meaning (like propose, declare, commit, cancel) instead of technical concepts like message structures and state machines. Furthermore, the specification of a pragmatic action is simple, as it can be viewed as filling in a template.

Technology Independence. An approach based on pragmatic actions makes it possible to abstract business semantic conversations out of technical messaging protocols, so that pragmatic actions can be used with any technical collaboration protocol (UMM BCP [3], ebXML BPSS [2], BPEL4WS [5], etc.). Thus, pragmatic actions provide a clean interface to collaboration protocols.
References

1. “eBTWG – Business Collaboration Patterns/Business Commitment Patterns Technical Specification (BCP2) – Monitored Commitments”, Valid on 20030401, http://www.collaborativedomain.com/standards/
2. “ebXML Business Process Specification Schema (BPSS)”, Valid on 20030404, http://www.unece.org/cefact/ebxml/Documents/ebBPSS1.05.pdf.zip
3. “UN/CEFACT Modelling Methodology (UMM-N090 Revision 10)”, Valid on 20030328, http://webster.disa.org/cefact-groups/tmg/doc_bpwg.html
4. Bergholtz M., Jayaweera P., Johannesson P., Wohed P., “Business and Process Models – a Unified Framework”, in Proc. of eCOMO 2002, held in conjunction with the 21st International Conference on Conceptual Modeling (ER 2002), Tampere, Finland
5. Curbera F., Goland Y., Klein J., Leymann F., Roller D., Thatte S., and Weerawarana S., “Business Process Execution Language for Web Services”, Valid on 20030404, http://dev2dev.bea.com/techtrack/BPEL4WS.jsp
6. Dietz J., “Modelling Communication in Organisations”, in Linguistic Instruments in Knowledge Engineering, Ed. R. v.d. Riet, pp. 131–142, Elsevier Science Publishers, 1992
7. Dignum F. and Weigand H., “Modelling Communication between Cooperative Systems”, in Proceedings of the 7th Conference on Advanced Information Systems Engineering (CAiSE), Lecture Notes in Computer Science, Springer-Verlag, 1995
8. ebXML Deliverables, Valid on 20030328, http://www.ebxml.org/specs/index.htm
9. Gamma E., Helm R., Johnson R., and Vlissides J., “Design Patterns: Elements of Reusable Object-Oriented Software”, Addison-Wesley, 1995
10. Gordijn J., Akkermans J. M. and Vliet J. C., “Business Modelling is not Process Modelling”, Proc. of the 1st International Workshop on Conceptual Modeling Approaches for e-Business (eCOMO 2000), held in conjunction with the 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, Utah, USA
11. Jayaweera P., Johannesson P., Wohed P., “Collaborative Process Patterns for e-Business”, ACM SIGGROUP Bulletin, Vol. 22, No. 2, 2001
12. Larman C., “Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design”, ISBN 0-13-74880-7
13. Malone T. et al., “Towards a Handbook of Organisational Processes”, MIT eBusiness Process Handbook, Valid on 20030404, http://ccs.mit.edu/21c/mgtsci/index.htm
14. Martin J., Odell J., “Object-Oriented Methods: A Foundation”, Prentice Hall, 1994
15. McCarthy W. E., “REA Enterprise Ontology”, Valid on 20030404, http://www.msu.edu/user/mccarth4/rea-ontology/
16. Moore S. A., “A Foundation for Flexible Automated Electronic Communication”, Information Systems Research, 12:1, March 2001
17. Searle J. R., “A Taxonomy of Illocutionary Acts”, in K. Gunderson (Ed.), Language, Mind and Knowledge, Minneapolis: University of Minnesota Press, 1975
18. Taveter K. and Wagner G., “Agent-Oriented Enterprise Modelling Based on Business Rules”, in Proceedings of the 20th International Conference on Conceptual Modelling, Yokohama, Japan, November 2001
19. Wagner G., “The Agent-Object-Relationship Meta-Model: Towards a Unified View of State and Behaviour”, to appear in Information Systems, 2003, Valid on 20030404, http://tmitwww.tm.tue.nl/staff/gwagner/AORML/AOR.pdf
20. Yu L. and Schmid B. F., “A Conceptual Framework for Agent-Oriented and Role-Based Workflow Modelling”, in G. Wagner and E. Yu (Eds.), Proc. of the 1st Int. Workshop on Agent-Oriented Information Systems, 1999
An Agent-Based Active Portal Framework

Aizhong Lin, Igor T. Hawryszkiewycz, and Brian Henderson-Sellers

Faculty of Information Technology, University of Technology, Sydney
PO Box 123, Broadway NSW 2007, Australia
{alin,igorh,brian}@it.uts.edu.au
Abstract. The paper introduces an agent-based active portal framework. It aims to support the construction of active portals for various purposes, such as for public Internet users, for specific groups of users, for teaching, or for finance. Agents in a portal are capable of autonomous actions to capture, select, organize, and analyze the sharable resources. These actions are traditionally taken by a number of people called information engineers or knowledge engineers. By using agents, the portal can rapidly provide users with newly created and modified resources distributed across different places, and the costs of executing these actions are considerably reduced. The framework is applied to active portal development to reduce development time and costs.
1 Introduction

A portal, as defined here, is a specific web-based system that provides a collection of sharable resources or uniform resource locators (URLs). Following this definition, Yahoo [8] and Google [4] are portals, since each of them provides a collection of publicly sharable information or information URLs. The resources of a portal can be universal (any kind of information or knowledge) or constrained (the information or knowledge related to a specific domain such as process management, teaching, finance and so on). The use of a portal can be public (it can be accessed by anyone on the Internet) or internal (only a specific group of people can access it). Portals can be built for various purposes, and different portals may have different names. A portal, for example, is called a knowledge portal when built for knowledge sharing, or a teaching portal when used in a university for teaching subjects. Therefore, we use “portal” (rather than “information portal” or “knowledge portal”) in the title as a generic name to indicate various kinds of portals.

To a user, a portal is only a web site containing numerous meaningful links, each of which can bring the user to a new world. Behind the web site, however, a large number of actions have to be taken for capturing, selecting, creating, organizing, analyzing, evaluating and synthesizing the resources. Traditionally, those actions are taken manually by a number of people called information engineers or knowledge engineers. For rapid provision of newly created and modified resources that are distributed in different places to users, and for reducing the cost of executing actions so as to make itself more competitive than its counterparts, a portal employs
intelligent agents to execute the actions that capture, select, organize and analyze resources. The term “active” in this paper indicates that the contents of a portal are automatically and dynamically refreshed by using intelligent agents. An intelligent agent (or an agent for short) is a software component, situated in some environment, that is capable of autonomous and reasonable actions to achieve its designed goals. As applied here, the environment of the intelligent agents is the portal, and the actions are resource capturing, selecting, organizing, and analyzing. An intelligent agent is built with one or more such actions and associated functions that decide when it should execute an action and which action it should execute. The associated functions in an intelligent agent are fulfilled using a BDI (Belief, Desire, Intention) [1] [7] based agent architecture [5].

Generally, resources and the actions that operate on the resources are two important aspects of a portal. Therefore, firstly, a portal should provide a component to represent, store, and maintain the resources. Then, a portal has to provide the actions to operate on the resources. Building a portal normally includes three steps: (1) building a resource repository; (2) gathering distributed resources or URLs and organizing them into the repository; and (3) defining actions to operate on the resources. Traditionally, the development of each individual portal undergoes these three steps. Considering that different portals normally have the same resource repository and most of the same actions, this kind of portal development wastes development time and costs.

This research proposes a generic method to build active portals that overcomes the disadvantages of the development method described above. In this method, a reusable active portal framework is constructed first. The framework consists of a generic resource repository that can be used to represent, store, and maintain resources for different purposes, and a collection of intelligent agents, each of which is built with one or more actions (e.g. text analysis) that operate on resources. When installed with a specific set of resources, the framework becomes a specific active portal. Consequently, the work of building an active portal is reduced from the traditional three steps to only one step – install specific resources into the resource repository and configure the goals of the intelligent agents.

This paper introduces an agent-based active portal framework. In section 2, we describe the architecture of the agent-based active portal framework, followed by the introduction of a resource repository and a multi-agent system. Section 3 is a detailed description of the resource repository, including a meta-model of the portal resource organization. Section 4 describes the multi-agent system, including the multi-agent architecture and the individual intelligent agent architecture. Finally, in section 5, an active teaching portal is constructed from the agent-based active portal framework.
2 The Architecture of the Agent-Based Active Portal Framework

As described above, building an active portal requires one to (1) construct a resource repository to represent, store, and maintain resources such as information resources or knowledge resources; (2) construct a set of actions that operate on the resources; and (3) install a large number of collected resources. To reduce the development time and costs of an active portal, we propose a generic development method. We believe that
the resource repository and most of the actions are similar in different active portals, so a generic resource repository and generic actions can be built in an active portal framework. Once a specific active portal is to be constructed, what the developers should do is adopt the active portal framework and install the resources into the resource repository. The developers might have to build some customized actions for the specific needs of the users of this portal, but compared with the large number of actions that have already been built in the framework, this work costs little time and money.

Figure 1 illustrates the architecture of the agent-based active portal framework (AbAPF). Three kinds of roles use the framework. Developers use the framework to install specific resources into the resource repository, configuring the framework into a specific portal. They may also build some customized actions for the specific portal. Users, a large number of portal participants, use the portal frequently to access the resources in the resource repository. Maintainers manage user accounts and resources. Interfaces, including interfaces for users and interfaces for developers and maintainers, are provided in the framework to support their usage.

Actions built in this framework are assigned to individual intelligent agents, each of which has one or more actions. Assigning actions to agents brings three benefits: (1) actions are categorized, because a specific type of action can be assigned to one agent; (2) agents are autonomous entities, so they know when an action should be taken and which action should be taken at a given time without human intervention; and (3) agents can interact with each other to share actions. Because actions reside in agents, agents can automatically gather resources from the Internet or a particular intranet and save them in the resource repository, analyze the resources, and organize resources into suitable workspaces (described in section 3).
Fig. 1. The architecture of the agent-based active portal framework
3 Resource Repository

The resource repository of the agent-based active portal framework is separated into two layers, as illustrated in Figure 2. The bottom layer is a resource element repository (RER) that is used to represent and store resource elements.
Fig. 2. The resource repository that consists of two layers: the bottom layer is the resource element repository and the top layer is the sets of the workspace trees
Fig. 3. The semantic model of a workspace that represents, stores, and maintains the goal related documents and activities
A resource element is a description (including the name, URL, type, abstract, length, and so on) of a resource. A resource is a piece of information or knowledge expressed explicitly, electronically, and persistently. Following this definition, as a piece of information, a resource could be an electronic document (text based, picture/image based, or audio/video based), and, as a piece of knowledge, a resource could be an activity (called a task in [6]) that supports achieving a goal. In the agent-based active portal framework, we employ these two terms – document and activity – to describe the resource repository.

The top layer of the resource repository is composed of a set of workspace trees, each of which organizes a set of workspaces in a tree structure. A workspace is a complex object that represents, stores, and maintains a set of documents related to a specific goal and a set of activities (such as “organizing a meeting”) that support achieving that goal. A workspace is created by an agent for a goal, and the documents and activities related to the goal are then filled into the workspace. Figure 3 shows the semantic model of a workspace. Based on a workspace, one or more new workspaces can be created by an agent. The new workspace(s) and the background workspace form parent/child relationships, in which the background workspace is the parent and the new workspace(s) are the child(ren). A workspace tree is formed by multiple levels of such parent/child relationships. A workspace tree represents, stores, and maintains separate groups of documents and tasks related to a high-level or common goal.
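A minimal sketch of this two-layer organisation is given below, assuming illustrative field names; they are not taken from the actual implementation.

// Illustrative sketch of the two-layer resource repository:
// a flat resource element repository plus workspace trees.
// Field and class names are hypothetical, not taken from the actual system.
import java.util.ArrayList;
import java.util.List;

public class ResourceRepositorySketch {

    // A resource element describes a resource (a document or an activity).
    record ResourceElement(String name, String url, String type,
                           String abstractText, long length) {}

    // A workspace groups documents and activities around one goal and may
    // have child workspaces, forming a workspace tree.
    static class Workspace {
        final String goal;                                   // e.g. "to_learn(\"C#\")"
        final List<ResourceElement> documents = new ArrayList<>();
        final List<ResourceElement> activities = new ArrayList<>();
        final List<Workspace> children = new ArrayList<>();  // parent/child relationships

        Workspace(String goal) { this.goal = goal; }

        Workspace createChild(String subGoal) {
            Workspace child = new Workspace(subGoal);
            children.add(child);
            return child;
        }
    }

    public static void main(String[] args) {
        Workspace personal = new Workspace("personal workspace of a student");
        Workspace csharp = personal.createChild("to_learn(\"C#\")");
        csharp.documents.add(new ResourceElement(
                "C# tutorial", "http://example.org/csharp", "text", "An introduction", 24_000));
        System.out.println(csharp.goal + ": " + csharp.documents.size() + " document(s)");
    }
}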
4 Multi-agent System

A multi-agent system (MAS) plays an important role in the agent-based active portal framework. As shown in Figure 4, the multi-agent system looks after two environments (some agents in the multi-agent system may look after the Internet/intranet, some agents may look after the resource repository, and some may look after both). When an individual agent is designed to look after an environment (e.g. the resource repository), it perceives the environment and produces events according to its perceptions, then derives actions using its reasoning, decision-making, and interaction mechanisms, and then executes the actions to respond to the events.

Agents in the multi-agent system interact with each other to meet the users’ goals. For example, if a student is using a teaching portal, she has a goal to learn the subject C# (a Microsoft programming language, called C sharp) and she inputs the goal in her personal workspace. The goal is expressed in first-order logic [9] (e.g. to_learn(“C#”)). An agent that is looking after the resource repository perceives that goal and then tells a workspace organizing agent to generate a C# workspace for the student. However, the workspace organizing agent may find that there is no C# resource in the resource element repository that could be selected. It therefore asks a knowledge gathering agent to collect the knowledge related to the C# subject. As required by the workspace organizing agent, the knowledge gathering agent perceives the Internet or a specific intranet, searches for a large number of resource elements related to C#, and saves them in the resource element repository. Then the workspace organizing agent generates a C# subject workspace and selects a set of resource elements into the workspace so that the student can learn the C# subject from the C# subject workspace. Figure 5 illustrates the interactive behaviour of individual agents.
Fig. 4. The role of the multi-agent system. It perceives the Internet/Intranet and the resource repository and takes autonomous actions to both of them
Fig. 5. Individual agents interact with other agents to achieve the users’ goals
Fig. 6. Agents register or un-register themselves to the personal information management agent
4.1 Facilitative Agents
To support the multi-agent interaction, two facilitative agents are built. One is the personal information management agent (PIMA), which manages the personal information of the individual agents in the multi-agent system. The other is the message management agent (MMA), which manages and transfers messages between agents.

Figure 6 illustrates the communication between the personal information management agent and the individual agents. When an agent starts in the multi-agent system, it sends a “register” message that contains its personal information (e.g. ID, name, creator, birthdate, and actions) to the personal information management agent. The personal information management agent saves the agent’s personal information in the personal information database. Similarly, before an agent stops running, it sends an “un-register” message to the personal information management agent, so that the personal information management agent deletes the personal information from the personal information database. The multi-agent system can be dynamically organized because agents can dynamically start or stop running in the multi-agent system.

To achieve a user’s goal, an agent interacts with other agents to share actions. Agents interact with each other by exchanging messages. A message carries a piece of information that the sender wants the receiver(s) to know. In the multi-agent system, the message management agent (Figure 7) receives messages from agents and then dispenses them to the destination agents. An agent message has a specific format that obeys either the KQML (Knowledge Query and Manipulation Language) [2] or the ACL (Agent Communication Language) [3] standard. In this multi-agent system, agents use ACL to represent and transfer messages.
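As a sketch, message dispatch through an MMA-like component might look as follows; for brevity, registration here doubles as mailbox subscription, which conflates the PIMA and MMA roles, and all names are hypothetical rather than taken from the implementation.

// Illustrative sketch of the facilitative-agent idea: agents register and
// exchange ACL-style messages via a message management component.
// Class and method names are hypothetical, not taken from the implementation.
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class FacilitativeAgentsSketch {

    // A simplified ACL-like message: performative, sender, receiver, content.
    record Message(String performative, String sender, String receiver, String content) {}

    static class MessageManagementAgent {
        private final Map<String, Consumer<Message>> inboxes = new HashMap<>();

        void register(String agentId, Consumer<Message> inbox) { inboxes.put(agentId, inbox); }
        void unregister(String agentId) { inboxes.remove(agentId); }

        // Receive a message from the sender and dispense it to the destination agent.
        void dispatch(Message m) {
            Consumer<Message> inbox = inboxes.get(m.receiver());
            if (inbox != null) inbox.accept(m);
        }
    }

    public static void main(String[] args) {
        MessageManagementAgent mma = new MessageManagementAgent();
        mma.register("organizingAgent", m -> System.out.println("organizing agent got: " + m));
        mma.register("gatheringAgent", m -> System.out.println("gathering agent got: " + m));

        // The organizing agent delegates resource gathering to the gathering agent.
        mma.dispatch(new Message("request", "organizingAgent", "gatheringAgent",
                "gather resources about C#"));
    }
}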
Fig. 7. Messages are exchanged between individual agents via the message management agent
Fig. 8. The conceptual model of the hybrid intelligent agent architecture (from [5])
4.2 Individual Agent
A number of individual agents are built in the multi-agent system of the agent-based active portal framework. Each agent in the multi-agent system supports a hybrid agent architecture (as shown in Figure 8) that combines reactive reasoning with BDI (Belief, Desire, and Intention) [1] [7] based proactive reasoning. An agent architecture is a specification of how an agent derives its actions from the events perceived at a given time. A hybrid agent architecture realizes both reactive and proactive reasoning; the proactive reasoning employs BDI reasoning. Figure 8 illustrates the hybrid agent architecture at the conceptual level.

At the conceptual level, events produced in the environment can be perceived by an agent. When an event is detected, the agent uses the event to match the Event-Condition-Action (ECA) rules. According to the matched ECA rules, the event can trigger an action to be executed (reactive reasoning) or activate a goal to be achieved (initiating BDI-based proactive reasoning). If an action is triggered, the action is executed to change the environment. If a goal is activated, the goal, beliefs, and Inference (I) rules are employed to derive a suitable plan to achieve the goal. When a plan is selected, the actions contained in the plan are scheduled and then executed. The results of executing the actions change the environment; this change may result in new events, which are then perceived by the agent. In this architecture, events can also trigger the revision of beliefs: the ECA rules, the original beliefs, and the I rules are used to decide which beliefs need to be revised and which actions are used to revise them. In addition, ECA rules can trigger I rules that are used to derive plans for achieving a goal, or actions to be scheduled. The definitions of the concepts and detailed descriptions of this model can be found in [5]. Figure 9 shows the interface of the Java implementation for individual agents.
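A minimal sketch of the reactive half of this architecture is given below: an ECA rule either triggers an action directly or activates a goal for BDI-style deliberation. The record and method names are ours, and the BDI machinery (plans, inference rules) is deliberately left out.

// Illustrative sketch of the reactive part of the hybrid architecture:
// Event-Condition-Action rules that either trigger an action directly or
// activate a goal for BDI-style deliberation. Names are hypothetical.
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class EcaRuleSketch {

    record Event(String name) {}

    // An ECA rule: on a matching event, if the condition holds over the
    // beliefs, either execute an action or activate a goal (one of the two).
    record EcaRule(String eventName,
                   Predicate<Set<String>> condition,
                   Runnable action,          // null if the rule activates a goal instead
                   String goalToActivate) {} // null if the rule triggers an action instead

    static void perceive(Event e, Set<String> beliefs, List<EcaRule> rules) {
        for (EcaRule r : rules) {
            if (r.eventName().equals(e.name()) && r.condition().test(beliefs)) {
                if (r.action() != null) r.action().run();
                else System.out.println("activate goal: " + r.goalToActivate());
            }
        }
    }

    public static void main(String[] args) {
        List<EcaRule> rules = List.of(
            new EcaRule("goal_created", beliefs -> true, null, "organize_workspace"));
        perceive(new Event("goal_created"), Set.of(), rules);
    }
}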
Fig. 9. The individual agent interface of the Java implementation
5 An Example (Active Teaching Portal) Built from the Agent-Based Active Portal Framework

To trial the feasibility of the agent-based active portal framework, it is used here to build an active teaching portal in a university environment. After setting the title (“An Active Teaching Portal”), the logo (a small picture in the top-left), and the copyright, the agent-based active portal framework becomes a teaching portal. Users can start to access it, but at this stage the resource repository is empty.

Each user has an account in the portal. The account can be created by a maintainer of the active teaching portal, or generated automatically by the register function when a user registers herself in the portal. As the account is created, a personal workspace is also created for the user automatically. A user can express a learning goal or intention (such as to_learn(“C#”)) in the personal workspace.

A workspace organizing agent is running to look after the workspaces. Suppose a user inputs a goal to_learn(“C#”) in her personal workspace; a goal_created(Goal g) event is then perceived by the organizing agent. The agent automatically creates a C# workspace for that user and tries to select the C#-related documents and activities from the resource element repository and install them into the workspace. If no documents and activities related to C# are found in the resource element repository, the organizing agent sends a “request” message to a gathering agent to delegate a knowledge capturing action. When a gathering agent receives the “request” message, it schedules the execution of the capturing action. When a large number of resources have been captured from the Internet or a specific intranet, they are saved in the resource element repository as elements. The analyzing agent analyzes the resource elements and provides a related-level value (a real number between 0.00 and 1.00) for each element. Then the selecting agent selects some resources based on the related values and fills those elements into the C# workspace.

The user can also manually select some elements and add them to the workspace. Figure 10 illustrates how the user adds some elements to a workspace. The user can input or upload some resources into the resource element repository and then select them into the workspace by using the interfaces provided in the portal. A workspace is created for a specific user and owned by that user, but any other user can access the workspace by clicking on the link of the workspace. In this way, documents and activities created for or by one user can be shared by other users. Figure 11 shows the document interface when the user “alin”, who is not the owner of the workspace, accesses the C# workspace.
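For illustration, the selection step could be as simple as a threshold over the related-level values produced by the analyzing agent; the scores and the threshold below are invented for this sketch and do not reflect the actual algorithm.

// Illustrative sketch of how a selecting agent might pick resource elements
// for a workspace based on the related-level values produced by an analyzing
// agent. The scoring values and the threshold are placeholders.
import java.util.List;
import java.util.Map;

public class SelectingAgentSketch {

    // Keep every element whose relatedness to the goal is above a threshold.
    static List<String> select(Map<String, Double> relatedness, double threshold) {
        return relatedness.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of(
                "C# language reference", 0.92,
                "Java generics tutorial", 0.35,
                "C# exercises", 0.78);
        System.out.println(select(scores, 0.5));  // the two C#-related elements
    }
}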
6 Related Work and Future Work

An agent-based generic framework supports the development of active portals and can greatly reduce the development time and costs for specific active portals. The agent-based generic framework is especially useful nowadays because numerous portals (e.g. information portals, knowledge portals, teaching portals) are being developed or are under development. The research on the agent-based active portal framework also identified some related work:
Fig. 10. The interface that shows how a user selects some resources and adds them to a workspace
Fig. 11. User “alin” accesses the documents by sharing the C# workspace
(1) Mack, Ravin, and Byrd in [6] described the concepts and architectures of knowledge portals and the tasks needed to build knowledge portals; however, their work does not focus on a generic tool to assist the development of knowledge portals. (2) The functions provided in traditional information portals such as Yahoo [8] and Google [4]; those information portals do not provide the storage and management of private information or knowledge for a specific purpose. (3) A virtual collaborative environment (LiveNet [10]) that provides the representation and storage of resources. And (4) a generic intelligent agent framework [5] that supports the development of specific agents.

The agent-based active portal framework described here has been implemented in Java, JSP, Java Servlets, and XML. It is employed in a university environment to build teaching portals. To make the agent-based active portal framework more reusable, the workspace model (which currently contains only the goal, documents, and activities) will be extended in the future (e.g. by adding role and participant to the model) so that the
workspace will suit various situations. Furthermore, to make the agent-based active portal framework more efficient, more agents should be provided in the multi-agent system to support more actions for different purposes. The future direction of this research will also concentrate on building more agents that contain generic actions to process resources.

Acknowledgements. We wish to thank the Australian Research Council for providing funding, especially for the work concerning agents in collaboration. This is contribution number 03/12 of the Centre for Object Technology Applications and Research.
References

1. M. E. Bratman. Intention, Plans and Practical Reason. Harvard University Press, Cambridge, Massachusetts, 1987.
2. T. Finin and Y. Labrou. A Proposal for a new KQML Specification. University of Maryland Baltimore County (UMBC), Baltimore, 1997.
3. FIPA specification. “Agent Communication Language”. http://www.fipa.org/specs/fipa00003/OC00003A.html
4. Google. http://www.google.com
5. A. Lin and B. Henderson-Sellers. A Reusable Intelligent Agent Framework. Proceedings of Agent-Based Technologies and Systems (ATS 2003), Calgary, August 27–29, 2003 (in press).
6. R. Mack, Y. Ravin, and R. J. Byrd. Knowledge Portals and the Emerging Digital Knowledge Workplace. IBM Systems Journal 40, No. 4, 925–955, 2001.
7. A. S. Rao and M. P. Georgeff. Modeling rational agents within a BDI-architecture. In Proceedings of the Second International Conference on Principles of Knowledge Representation and Reasoning, KR '91, pages 473–484, Cambridge, MA, 1991.
8. Yahoo. http://www.yahoo.com
9. First-Order Logic. http://www.cs.wisc.edu/~dyer/cs540/notes/fopc.html
10. I. T. Hawryszkiewycz. “Knowledge Sharing through Workspace Networks”. Proceedings of the Special Interest Group on Computer Personnel Research (SIGCPR 99), April 1999, New Orleans (ISBN 1-58133-063-5), pp. 79–85.
Agent-Oriented Modeling and Agent-Based Simulation

Gerd Wagner and Florin Tulba

Faculty of Technology Management, I&T, Eindhoven University of Technology,
P.O. Box 513, 5600 MB Eindhoven, The Netherlands
{G.Wagner,F.Tulba}@tm.tue.nl
http://www.tm.tue.nl/it/staff/gwagner/
Abstract. Agent-oriented modeling of software and information systems and agent-based simulation are commonly viewed as two separate fields with different concepts and techniques. We argue that a sufficiently expressive agent-oriented modeling language for information systems analysis and design should – with some minor extensions – also be usable for specifying simulation models that can be executed by an agent-based simulation system. Specifically, we investigate the suitability of the Agent-Object-Relationship modeling language (AORML) proposed in [Wag03] for simulation. We show that the AOR meta-model and the meta-model of discrete event simulation can be combined into a model of agent-based discrete event simulation in a natural way.
1 Introduction

Agent-based simulation (ABS) is a new paradigm which has been applied to simulation problems in biology, engineering, economics and sociology. In ABS, a scenario of systems that interact with each other and with their environment is modeled, and simulated, as a multiagent system. The participating agents – animals, humans, social institutions, software systems or machines – can perform actions, perceive their environment and react to changes in it. They also have a mental state comprising components such as knowledge/beliefs, goals, memories and commitments. Compared to traditional simulation methods – like mathematical equations, discrete event simulation, cellular automata and game theory – ABS is less abstract and closer to reality, since it explicitly attempts to model the specific behavior of individuals, in contrast to macro simulation techniques that are typically based on mathematical models averaging the behavior effects of individuals or of entire populations (see [Dav02]).

There are many different formalisms and implemented systems that are all subsumed under the title ‘agent-based simulation’. For example, one of the most prominent ABS systems is SWARM [MBLA96], an object-oriented programming library that provides special support for event management but does not support any cognitive agent concept. Like SWARM, many other ABS systems do not have a theoretical foundation in the form of a metamodel. They therefore do not allow a simulation model to be specified in a high-level declarative language, but require simulation models to be specified in a rather low-level programming language.
We take a different approach and aim at establishing an ABS framework based on a high-level declarative specification language with a UML-based visual syntax and an underlying simulation metamodel, as well as an abstract simulator architecture and execution model. The starting point for this enterprise is an extension and refinement of the classical discrete event simulation paradigm by enriching it with the basic concepts of the Agent-Object-Relationship (AOR) metamodel [Wag03] (see also http://AOR.rezearch.info).
2 Modeling Agents and Multiagent Systems

2.1 AOR Modeling

The AOR modeling language is based on an ontological distinction between active and passive entities, that is, between agents and (non-agentive) objects of the real world. The agent metaphor subsumes both artificial (software and robotic) and natural (human and animal) agents. In AORML, an entity is either an agent, an event, an action, a claim, a commitment, or an ordinary object. Only agents can communicate, perceive, act, make commitments and satisfy claims. Objects are passive entities with no such capabilities. Besides human and artificial agents, AORML also includes the concept of institutional agents, which are composed of a number of other agents that act on their behalf. Organizations and organizational units are important examples of institutional agents.

There are two basic types of AOR models: external and internal models. An external AOR model adopts the perspective of an external observer who is looking at the (prototypical) agents and their interactions in the problem domain under consideration. In an internal AOR model, we adopt the internal (first-person) view of a particular agent to be modeled. While a (business) domain model corresponds to an external model, a design model (for a specific information system) corresponds to an internal model, which can be derived from the external one. Fig. 1 shows the most important elements of external AOR state structure modeling. There is a distinction between action events and non-action events, and between a communicative action event (or message) and a non-communicative action event. Fig. 1 also shows that a commitment/claim is coupled with the action event that fulfills that commitment (or satisfies that claim).

The most important behavior modeling elements of AORML are reaction rules, which are used to express interaction patterns. A reaction rule is visualized as a circle with incoming and outgoing arrows drawn within the agent rectangle whose reaction pattern it represents. Each reaction rule has exactly one incoming arrow with a solid arrowhead: it specifies the triggering event type. In addition, there may be ordinary incoming arrows representing state conditions (referring to corresponding instances of other entity types). There are two kinds of outgoing arrows: one for specifying mental effects (changing beliefs and/or commitments) and one for specifying the performance of (physical and communicative) actions. An outgoing arrow with a double arrowhead denotes a mental effect. An outgoing connector to an action event type denotes the performance of an action of that type. Fig. 2 shows an example of an interaction pattern diagram.
Fig. 1. The core state structure modeling elements of external AOR diagrams.
Fig. 2. An interaction pattern (when arriving at a floor, the elevator halts and sets its HaltFloor attribute to the new position) expressed by means of the reaction rule R1: the triggering event type is arriveAt, the triggered action event type is halt, and the effect is an update of the HaltFloor attribute whose value is set to the value of Floor (the postcondition is HaltFloor=Floor).
In symbolic form, a reaction rule is defined as a quadruple e, C → a, P, where e denotes an event term (the triggering event), C denotes a logical formula (the mental state condition), a denotes an action term (the triggered action), and P denotes a logical formula (the mental effect or postcondition).
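As a sketch, such a quadruple can be captured directly as a record; the elevator rule R1 from Fig. 2 then reads roughly as follows (the string encoding of terms and formulas is our own simplification).

// Illustrative sketch of an AOR reaction rule as a quadruple (e, C, a, P).
// The string-based encoding of terms and formulas is our own simplification.
public class ReactionRuleSketch {

    // e: triggering event term, c: mental state condition,
    // a: triggered action term, p: mental effect (postcondition).
    record ReactionRule(String e, String c, String a, String p) {}

    public static void main(String[] args) {
        // Roughly the rule R1 of the elevator example in Fig. 2.
        ReactionRule r1 = new ReactionRule(
                "arriveAt(Floor)", "true", "halt", "HaltFloor = Floor");
        System.out.println(r1);
    }
}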
2.2 Example: Communicating Elevators

We consider an example scenario where two elevators operate in the same shaft and must take care to avoid collisions. For simplicity, we restrict our consideration to the case with three floors. Elevator A serves floors 1 and 2, and elevator B serves floors 2 and 3; hence the critical zone, requiring coordination, is floor 2. This scenario is depicted in Fig. 3.
Fig. 3. Two elevators operating in one shaft.
The external AOR diagram in Fig. 4 models this scenario.
Fig. 4. An external AOR diagram modeling the elevator scenario from Fig. 3. There are two agent types (Elevator and ElevatorUser), one object type (Shaft), three message types (reqTransp, reqPerm, grantPerm), one non-action event type (arriveAt), two noncommunicative action event types (move, halt) and commitment/claim types for move and for grantPerm.
2.3 Modeling and Simulating Communication and Perception

For modeling and simulating communication between agents, we do not consider nonverbal communication and abstract away from the physical layer, where the speaker realizes a communication act (or, synonymously, sends a message) by performing some physical action (such as making an utterance), and the listener has to perceive this action event, implying that, due to the physical signal transmission, there can be noise in the listener’s percepts referring to the message send event. In general, for each (external) event E and each agent, the simulation system has to compute a corresponding potential percept PP (according to physical laws), from which the
actual percept AP has to be derived according to the perceptive capability of the agent. The mental availability of the actual percept, then, is the (internal) perception event corresponding to the external event. There are two options for simplifying the E → PP → AP chain: we can either assume that

1. E = PP = AP, so we do not have to distinguish between an external event and the corresponding perception event; or
2. PP = AP, that is, all agents have perfect perception capabilities.

For communication events, it makes sense to assume that E = PP = AP, i.e. the message received is equal to the corresponding message sent. Yet, there may be a delay between these two events, depending on the type of the message transport channel and the current physical state of the speaker and listener. For the perception of a non-communicative action event such an assumption may not be justified and would mean a severe simplification. However, the less severe simplification expressed by the assumption that PP = AP may be justified for many purposes.
3 Agent-Based Discrete Event Simulation

In Discrete Event Simulation (DES), systems are modeled in terms of system states and discrete events, i.e. as discrete dynamic systems. Since a system is composed of entities, its state consists of the combination (Cartesian product) of all states of its entities. All state changes are brought about by events. DES is a very generally applicable and powerful approach, since many systems, in particular technical and social systems, can be viewed as discrete dynamic systems. In event-driven DES, the simulated time is advanced according to the occurrence time of the next event. In time-driven DES, the simulated time is advanced in regular time steps.

In many ABS approaches, the basic DES model is refined in some way by making certain additional conceptual distinctions, including the fundamental distinction between interacting agents and passive objects. These simulation approaches may be classified as Agent-Based Discrete Event Simulation (ABDES). In our version of ABDES, extending and refining the basic DES model into a model of Agent-Object-Relationship Simulation (AORS), we start with time-driven DES (since we need small regular time steps for simulating the perception-reaction cycle of agents) and adopt a number of essential ontological distinctions from AORML:

– The enduring entities of a system (also called endurants in foundational ontologies) are distinguished into agents and objects.
– Agents maintain beliefs (referring to the state of their environment) and process percepts (referring to events).
– Events can be either action events or non-action events. Action events can be either communicative (messages) or non-communicative.
Fig. 5. The basic model of discrete event simulation as a UML class diagram. As a system consists of a number of entities, its state consists of all the states of these entities. A system transition function SysTF takes a system state S and a state-changing event e and determines the resulting successor state S’.
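A minimal sketch of this transition-function view, with placeholder state and event types of our own choosing, is the following.

// Illustrative sketch of the DES view in Fig. 5: a system transition function
// maps a system state and a state-changing event to the successor state.
// Types are placeholders; real simulators would use richer state and event models.
import java.util.Map;

public class DesTransitionSketch {

    record Event(String name, long occurrenceTime) {}
    record SystemState(Map<String, String> entityStates) {}   // entity id -> entity state

    @FunctionalInterface
    interface SysTF {
        SystemState apply(SystemState s, Event e);
    }

    public static void main(String[] args) {
        // A trivial transition function: record the last event seen by entity "e1".
        SysTF sysTF = (s, e) -> new SystemState(Map.of("e1", "saw " + e.name()));
        SystemState s0 = new SystemState(Map.of("e1", "initial"));
        System.out.println(sysTF.apply(s0, new Event("arriveAt", 1)));
    }
}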
Fig. 6. A UML class diagram describing the basic ontology of AORS: Agents maintain beliefs about entities, send/receive messages, and process percepts, which refer either to a noncommunicative action event or to a non-action event. Notice that communication (sending/receiving messages) is separated from perception (perceiving non-communicative action events and non-action events).
In addition to these conceptual distinctions from AORML, we need to introduce the notion of exogenous events, which drive the simulation and which are generated at random. Exogenous events are either non-action events that are not caused by other events or exogenous action events in the sense that their actor is not included in the simulation. Fig. 6 shows the AORS extension of the basic DES model of Fig. 5.
3.1 An Abstract Architecture and Execution Model for ABDES Systems

In ABDES, it is natural to partition the simulation system into

1. the environment simulator, responsible for managing the state of all external (or physical) objects and the external/physical state of each agent;
2. a number of agent simulators, responsible for managing the internal (or mental) state of agents.

The state of an ABDES system consists of:

– the simulated time t,
– the environment state, representing the environment (as a collection of objects) and the external states of all agents (e.g., their physical state, their geographic position, etc.),
– the internal agent states (e.g., representing perceptions, beliefs, memory, goals, etc.),
– a (possibly empty) list of future events.

A simulation cycle consists of the following steps:

1. At the beginning of a new simulation cycle, say at simulated time t, the environment simulator determines the current events, comprising a) all the events of the future events list whose occurrence time is now and b) exogenous events whose occurrence time is now (e.g. stochastic non-action events or events created by actors that do not belong to the simulated system).
2. The environment simulator computes, on the basis of the current environment state and the current events, a) a new environment state, b) a set of successor events to be added to the future events list at different moments (representing events physically caused by the current events), and c) for each agent, its perception events.
3. Each agent simulator computes, on the basis of the current internal agent state and its current perceptions, a) the new internal agent state and b) a set of action events representing the actions performed by this agent (which are added to the future events list with time stamp t+1).
4. The future events list is updated by removing all the processed events and adding the computed action and successor events.
5. The environment simulator sets the simulated time t to t+1 and starts over with step 1 of the simulation cycle.

The simulation ends when the future events list is empty. This abstract architecture and execution model for ABDES systems can be instantiated by different concrete architectures and systems. In section 4, we present a Prolog program which implements an AORS system and instantiates this architecture.
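As a sketch, the partition into an environment simulator and agent simulators can be captured by two interfaces that the simulation cycle calls in steps 2 and 3; the type and method names are ours and do not correspond to an existing AORS implementation (the Prolog program in section 4 plays the same role).

// Illustrative sketch of the partition into one environment simulator and
// several agent simulators, as called from the simulation cycle.
// Interface and type names are our own, not those of an existing AORS system.
import java.util.List;

public class AbdesArchitectureSketch {

    record Event(String name, long occurrenceTime) {}
    record EnvResult(Object newEnvState, List<Event> causedEvents, List<Event> percepts) {}
    record AgentResult(Object newInternalState, List<Event> actionEvents) {}

    interface EnvironmentSimulator {
        // Step 2: compute the new environment state, caused events and percepts.
        EnvResult step(long now, Object envState, List<Event> currentEvents);
    }

    interface AgentSimulator {
        // Step 3: compute the new internal state and the actions performed.
        AgentResult step(long now, Object internalState, List<Event> percepts);
    }

    public static void main(String[] args) {
        EnvironmentSimulator env = (now, s, evts) ->
                new EnvResult(s, List.of(), List.of(new Event("percept", now)));
        AgentSimulator agent = (now, s, percepts) ->
                new AgentResult(s, List.of(new Event("action", now + 1)));
        EnvResult er = env.step(0, "envState", List.of());
        System.out.println(agent.step(0, "agentState", er.percepts()).actionEvents());
    }
}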
The AORML distinction between external and internal models provides the means needed for modeling both the environment and the agents involved in a simulation scenario. An external model describes the perspective of the environment simulator, whereas the internal models derived from the external one describe the perspectives of the involved agents. This suggests the following methodology for developing an AOR simulation model:
1. In the domain analysis of the simulation problem, develop an external AOR model of the scenario from the perspective of an external observer. This model is the basis both for designing the environment simulation and for deriving the specification of the involved agent simulators.
2. For each involved agent, transform the external AOR model of the simulation scenario into an internal AOR model for specifying the corresponding agent simulator.

3.2 Advantages of ABDES and AORS

ABDES and AORS support
− structure-preserving modeling and closer-to-reality simulation:
  − passive entities with certain properties are modeled as objects with corresponding attributes;
  − interactive entities (actors) are modeled as agents, which have beliefs and perceptions, and interact with each other and with their environment;
− functionally distributed simulation, where any of the participating simulators (the environment simulator and all involved agent simulators) may be deployed to different threads or processes, possibly running on different machines (realizing vertical distribution);
− interactive simulation, where any of the involved agent simulators may be replaced by its real counterpart;
− modeling and simulating pro-active behavior, in addition to the basic reactive behavior.
4 A Prolog Prototype of an AORS System

Implemented as a Prolog program, the AORS simulation cycle yields the following procedure:

1: cycle( _, _, _, []) :- !.
2: cycle( Now, EnvSt, IntAgtSts, EvtList) :-
3:    extractCrtEvts( Now, EvtList, CrtEnvEvts, CrtPercEvts),
4:    envSimulator( Now, CrtEnvEvts, EnvSt, NewEnvSt, TranslCausEvts),
5:    agtsSimulator( Now, CrtPercEvts, IntAgtSts, NewIntAgtSts, TranslActEvts),
6:    computeNewEvtList( EvtList, CrtEnvEvts, TranslCausEvts, TranslActEvts, NewEvtList),
7:    NextMoment is Now+1,
8:    cycle( NextMoment, NewEnvSt, NewIntAgtSts, NewEvtList).
Line 1 represents the exit condition (when the future events list is empty). In line 3, the current environment events (steps 1a and 1b of the simulation cycle) and also the current perception events are extracted from the future events list. Lines 4 and 5 simulate the system in the current cycle by first calling the environment simulator and then calling all agent simulators. In line 6, the future events list is updated (step 4). The last two lines update the time and start a new cycle (step 5). NewEnvSt and NewIntAgtSts stand for the new environment state and the new internal states of the agents.

We represent physical causality as a transition function, which takes an environment state and an event and provides a new environment state and a set of caused events. This function is specified as a set of reaction rules for the environment simulator in the form of

rrEnv( RuleName, Now, Evt, Cond, CausEvt, Eff)
with obvious parameter meanings. Agent behavior, as a function from a mental state and a perception event to a new mental state and a set of action events, is also specified by a set of reaction rules:

rr( AgentName, RuleName, OwnTime, Evt, Cond, ActEvt, Eff)
For processing these rules we use two meta-predicates:
1. prove( X, P), where X is a list of atomic propositions (representing an environment state or an internal agent state) and P is a proposition.
2. update( X, P, X'), where X' is the new state resulting from updating X by assimilating P (in our simple example this means asserting/retracting atoms).

When E is a current event and there is an environment simulator rule whose event term matches E such that prove( EnvSt, Cond) holds, then the specified CausEvt is added to the caused events list of step 2b) and the environment state is updated by performing

update( EnvSt, Eff, NewEnvSt)
In a similar way, the reaction rules of each agent are applied, updating its internal state by update( IntAgtSt, Eff, NewIntAgtSt)
Concerning step 2c), notice that if there are only communication events (messages), then the perceptions of an agent are the messages sent to it. We now present the environment simulator:

1: envSimulator( Now, CrtEvts, EnvSt, NewEnvSt, TranslCausEvts) :-
2:    findall( [CausEvt, Eff],
              ( member( Evt/_, CrtEvts),
                rrEnv( RuleName, Now, Evt, Cond, CausEvt, Eff),
                prove( EnvSt, Cond) ),
              ListOfResults),
3:    extractEffects( ListOfResults, Effects),
4:    computeNewEnvState( EnvSt, Effects, NewEnvSt),
5:    extractEvents( ListOfResults, CausEvts),
6:    translateCausEvts( Now, CausEvts, TranslCausEvts).
In line 2 all events (and their accompanying effects) that are caused by an event from the CrtEvts list are collected in ListOfResults. Based on the effects of the current environment events (extracted in line 3), the new environment state is determined (line 4). After also extracting the caused events from ListOfResults (in line 5), their absolute time stamps are computed with respect to the current moment (line 6). A similar procedure is performed for each agent:

1: agtSimulator( AgtName, Now, CrtPercEvts, IntAgtSt, NewIntAgtSt, ActEvts) :-
2:    timeFunction( AgtName, Now, OwnTime),
3:    findall( [ActEvt, Eff],
              ( member( Evt, CrtPercEvts),
                rr( AgtName, RuleName, OwnTime, Evt, Cond, ActEvt, Eff),
                prove( IntAgtSt, Cond) ),
              ListOfResults),
4:    extractEvents( ListOfResults, ActEvts),
5:    extractEffects( ListOfResults, Effects),
6:    computeNewState( IntAgtSt, Effects, NewIntAgtSt).
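The listings above rely on several auxiliary predicates (agtsSimulator, extractEvents, extractEffects, translateCausEvts, computeNewEnvState/computeNewState) that are not reproduced in the paper. The following sketch shows one plausible realization, under the assumptions that internal agent states are kept as agtState(Name, State) terms, that caused events carry a relative delay as Event/Delay, and that update/3 is the meta-predicate sketched above; the actual prototype may differ. Each agent simulator here receives the full set of current percepts, so per-agent filtering would have to happen either here or in the rule conditions.

:- use_module(library(lists)).   % append/3

% Run the agent simulator for every agent and collect all action events.
agtsSimulator( _Now, _CrtPercEvts, [], [], []).
agtsSimulator( Now, CrtPercEvts, [agtState(A, St)|Rest],
               [agtState(A, NewSt)|NewRest], AllActEvts) :-
    agtSimulator( A, Now, CrtPercEvts, St, NewSt, ActEvts),
    agtsSimulator( Now, CrtPercEvts, Rest, NewRest, RestActEvts),
    append( ActEvts, RestActEvts, AllActEvts).

% Split the [Event, Effect] pairs collected by findall/3.
extractEvents( [], []).
extractEvents( [[Evt, _Eff]|Rest], [Evt|Evts]) :- extractEvents( Rest, Evts).

extractEffects( [], []).
extractEffects( [[_Evt, Eff]|Rest], [Eff|Effs]) :- extractEffects( Rest, Effs).

% Fold the collected effects into a state using update/3.
computeNewState( St, [], St).
computeNewState( St, [Eff|Effs], NewSt) :-
    update( St, Eff, S1),
    computeNewState( S1, Effs, NewSt).
computeNewEnvState( EnvSt, Effects, NewEnvSt) :-
    computeNewState( EnvSt, Effects, NewEnvSt).

% Turn the relative delays of caused events into absolute time stamps.
translateCausEvts( _Now, [], []).
translateCausEvts( Now, [Evt/Delay|Rest], [Evt/Time|TRest]) :-
    Time is Now + Delay,
    translateCausEvts( Now, Rest, TRest).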
5 Simulating the Communicating Elevators Scenario

When making a simulation model, we have to draw a boundary around those entities we want to include in our simulation and those we want to exclude. In our simulation of the communicating elevators we choose not to include the shaft and the elevator users depicted in Fig. 4. This modeling decision turns the reqTransp messages of Fig. 4 into exogenous action events, which have to be generated at random (on the basis of some probability distribution).

In AORS, a simulation model is expressed by means of
1. a model of the environment (obtained from the external AOR model of the scenario), consisting of
   − a state structure model specifying all entity types, including exogenous event types, and
   − a causality model, which is specified by means of reaction rules;
2. a model for each involved agent (obtained by internalizing the external AOR model of the scenario into a suitable projection to the mental state of the agent under consideration), consisting of
   − a state structure model and
   − a behavior model, which is specified by means of reaction rules;
3. a specification of the initial states for the environment and for all agents.

The environment and agent models can be defined visually by means of AORML diagrams. Also the initial states can be defined by means of instance diagrams (similar to UML object diagrams). The encoding of a simulation model by means of a
high-level UML-based modeling language provides a platform-independent representation and allows platform-specific code to be generated automatically. In the case of our Prolog simulation platform, we have to generate Prolog predicates from the AOR agent diagram shown in Fig. 6. We also have to generate the reaction rules for specifying causality and agent behavior in the format of the simulator. Please consult the web page http://tmitwww.tm.tue.nl/staff/gwagner/AORS
for obtaining further information about AORS and for downloading our Prolog AORS system.
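To give an impression of what the generated simulator input might look like, below is a hand-written, hypothetical sketch of a causality rule, an agent behavior rule, and an exogenous event generator for the elevator scenario, using the rrEnv/rr formats of Section 4. The rule contents and predicate names are illustrative assumptions, not the output of an actual code generator.

:- use_module(library(random)).   % random/1, random_between/3

% Causality rule: when elevator E starts moving towards floor F, it
% physically arrives there three time units later (a caused event with a
% relative delay), and the environment records that E is moving.
rrEnv( moving, _Now, startMoving(E, F), at(E, _),
       arrive(E, F)/3, [retract(at(E, _)), assert(movingTo(E, F))]).

% Behavior rule of the elevator agent: on perceiving a transport request
% for floor F while idle, it performs the action startMoving and updates
% its beliefs accordingly.
rr( elevator1, serveRequest, _OwnTime, reqTransp(F), idle,
    startMoving(elevator1, F), [retract(idle), assert(servingFloor(F))]).

% Exogenous events: with probability 0.2, a transport request for a random
% floor occurs at the current moment (assumed to be picked up by extractCrtEvts).
exogenousEvent( Now, reqTransp(F)/Now) :-
    random( X), X < 0.2,
    random_between( 1, 5, F).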
6 Related Work

Agent-Based Simulation is being used in various research areas today, e.g., in
− Biology, e.g., for investigating eco-systems or in population ethology (especially with respect to ants and other insects), see, e.g., [Klü01];
− Engineering, for analyzing and designing complex (socio-)technical systems, such as Automatically Guided Vehicle Transport Systems [RW02];
− Economics, e.g., in the simulation of auctions and markets (see the Trading Agent Competition [TAC02]) and in the simulation of supply chains [LTFE03];
− Social Sciences, e.g., in [CD01] the phenomena of social monitoring and norm-based social influence are studied, and in [Hal02] cooperation in teams is studied.

Some well-known platforms for Agent-Based Simulation are Swarm [MBLA96], SDML [MGWE98], Sesam [Klü01], MadKit [MadKit00], and CORMAS [Cormas01, Cormas00]. A particularly interesting class of simulation systems is formed by international technology competitions such as RoboCup [Robo98] and the Trading Agent Competition (TAC) [TAC02]. Both RoboCup and TAC can be classified as interactive agent-based real-time simulation systems.
7 Conclusions

We have presented a general approach to modeling and simulating scenarios of interacting systems as multiagent systems, based on the Agent-Object-Relationship (AOR) modeling language. Although there is a large body of work on agent-based simulation, our AORS approach appears to be the first general UML-based declarative approach to agent-based discrete event simulation. Our Prolog implementation of the AOR simulation system is still in an early prototype stage. In the future, we will transfer it to the Java platform.
References

[Boo99] G. Booth: CourseWare Programmer's Guide, Yale Institute for Biospheric Studies, 1999.
[Cormas00] C. Le Page, F. Bousquet, I. Bakam, A. Bah, C. Baron: CORMAS: A multiagent simulation toolkit to model natural and social dynamics at multiple scales. In Proceedings of the Workshop "The Ecology of Scales", Wageningen (The Netherlands), 2000.
[CD01] R. Conte and F. Dignum: From Social Monitoring to Normative Influence. Journal of Artificial Societies and Social Simulation 4:2 (2001).
[Dav02] P. Davidsson: Agent Based Social Simulation: A Computer Science View. Journal of Artificial Societies and Social Simulation 5:1 (2002). [http://jasss.soc.surrey.ac.uk/5/1/7.html]
[Dav02] A. Davies: EcoSim: An Interactive Simulation, Duquesne University, Pittsburgh, 2002.
[Den71] D.C. Dennett: Intentional Systems. The Journal of Philosophy, 68 (1971).
[EW02] B. Edmonds, S. Wallis: Towards an Ideal Social Simulation Language, Manchester Metropolitan University, 2002.
[FG98] J. Ferber, O. Gutknecht: A meta-model for the analysis and design of organizations in multi-agent systems. Proceedings of the Third International Conference on Multi-Agent Systems (ICMAS'98), IEEE Computer Society Press, pp. 128–135, 1998.
[Hal02] D. Hales: Evolving Specialisation, Altruism and Group-Level Optimisation Using Tags. Presented at the MABS'02 workshop at the AAMAS 2002 Conference. Springer-Verlag, LNCS, 2002.
[HLA02] Defense Modelling and Simulation Office: High Level Architecture, 2002.
[Jac94] I. Jacobson: The Object Advantage. Addison-Wesley, Workingham (England), 1994.
[Klü01] F. Klügl: Multiagentensimulation, Addison-Wesley Verlag, 2001.
[LTFE03] O. Labarthe, E. Tranvouez, A. Ferrarini and B. Espinasse: A Heterogeneous Multi-Agent Modeling for Distributed Simulation of Supply Chains. Proc. of the HOLOMAS 2003 Workshop.
[MadKit00] J. Ferber, O. Gutknecht, F. Michel: MadKit Development Guide, 2002.
[MBLA96] N. Minar, R. Burkhart, C. Langton, M. Askenazi: The Swarm Simulation System: A Toolkit for Building Multi-Agent Simulations, 1996.
[MGWE98] S. Moss, H. Gaylard, S. Wallis and B. Edmonds: SDML: A Multi-Agent Language for Organizational Modelling. Computational and Mathematical Organization Theory 4:1 (1998), 43–70.
[Robo98] I. Noda, H. Matsubara, K. Hiraki, I. Frank: Soccer Server: a tool for research on multiagent systems. Applied Artificial Intelligence 12:2-3 (1998).
[TAC02] SICS: Trading Agent Competition 2002. See http://www.sics.se/tac/.
[Wag03] G. Wagner: The Agent-Object-Relationship Meta-Model: Towards a Unified View of State and Behavior. Information Systems 28:5 (2003), 475–504.
REF: A Practical Agent-Based Requirement Engineering Framework

Paolo Bresciani¹ and Paolo Donzelli²

¹ ITC-irst, Via Sommarive 18, I-38050 Trento-Povo (Italy)
[email protected]
² Department of Computer Science, University of Maryland, College Park, MD (USA)
[email protected]
Abstract. Requirements Engineering techniques, based on the fundamental notions of agency, i.e., Agent, Goal, and Intentional Dependency, have been recognized as having the potential to lead towards a more homogeneous and natural software engineering process, ranging from high-level organization needs to system deployment. However, the availability of simple representational tools for Requirements Engineering still remains a key factor to guarantee stakeholders' involvement, facilitating their understanding and participation. This paper introduces REF, an agent-based Requirements Engineering Framework designed around the adoption of a simple, but effective, representational graphical notation. However, the limited expressiveness of the graphical language may constrain the analysis process, reducing its flexibility and effectiveness. Some extensions are proposed to enhance REF's capability to support requirements engineers in planning and implementing their analysis strategies, without, however, affecting REF's clarity and intuitiveness.
1 Introduction

Agent- and goal-based Requirements Engineering (RE) approaches have the potential to fill the gap between RE and Software Engineering [5,4]. The concepts of Agent, Goal, and Intentional Dependency, in fact, applied to describe the social setting in which the system has to operate, lead towards a smooth and natural system development process, spanning from high-level organizational needs to system deployment [4]. Goals are valuable in identifying, organizing and justifying system requirements [14,2], whereas the notion of agent provides a quite flexible mechanism to model the stakeholders. However, the concrete application of such approaches has so far been limited to only a few case studies. Several causes of this still immature adoption of agent- and goal-based paradigms for RE may be identified. Below we consider only two of them.

First, although the notion of goal is central in some consolidated RE approaches like i* [15], GBRAM [1,2], and KAOS [8], an integrated and comprehensive requirements analysis methodology, clearly linked, or linkable, to the subsequent phases of software development, is still an open issue. To the best of our knowledge, only the Tropos methodology [5,4] fully addresses this issue. Yet, full consideration has not been given by Tropos
itself to the design of a precise process for the RE phases (early requirements and late requirements), due to the wide set of aspects that have to be captured. Second, concerning the RE component of Tropos (or i*, by which Tropos RE is largely inspired), it is worth noticing that its considerably rich modeling framework, although it promises to capture several aspects relevant for the following phases, shows a certain level of complexity, making it understandable to only a small group of practitioners. When the use of an i*-like modeling language has to be extended to non-technical stakeholders, it may be appropriate to give up the full language expressiveness and modeling flexibility in favor of a more straightforward and simple way to communicate with the stakeholders.

In such a perspective, this paper introduces an agent- and goal-based RE framework (called REF) previously applied to an extensive project for the definition of the requirements of a simulation environment [11]. Simple, yet reasonably expressive, REF allows non-technical stakeholders to elicit requirements in collaboration with a requirements engineer who, at the same time, is provided with an effective methodology and process for requirements acquisition, analysis and refinement, and for communicating, in an easily intelligible way, the results of her analysis to the stakeholders.

In the following, after a brief introduction to REF (Section 2), a case study is adopted (Section 3) to critically revise the analysis process underlying the current methodology, to point out some of its current limits, and, finally, to suggest some notational and methodological extensions (Section 4). The trade-off between REF's simplicity (and usability) and its expressiveness is carefully analyzed. Finally, the observed advantages are discussed in the concluding section.
2 REF
REF is designed to provide the analysts and the stakeholders with a powerful tool to capture high-level organizational needs and to transform them into system requirements, while redesigning the organizational structure to better exploit the new system. The framework tackles the modeling effort by breaking the activity down into more intellectually manageable components, and by adopting a combination of different approaches, on the basis of a common conceptual notation.

Agents are used to model the organization [9,11,16]. The organizational context is modeled as a network of interacting agents (any kind of active entity, e.g., teams, humans and machines, one of which is the target system), collaborating or conflicting in order to achieve both individual and organizational goals. Goals [9,11,8] are used to model agents' relationships and, eventually, to link organizational needs to system requirements.

According to the nature of a goal, a distinction is made between hard-goals and soft-goals. A goal is classified as hard when its achievement criterion is sharply defined. For example, the goal "document be available" is a hard-goal, as it is easy to check whether or not it has been achieved (i.e., is the document available, or not?). For a soft-goal, instead, it is up to the goal originator, or to an agreement between the involved agents, to decide when the goal is considered to have been achieved. For example, the goal "document easily and promptly available" is a soft-goal, given that when we introduce concepts such as easy and prompt, different persons usually have different opinions.
Fig. 1. The Requirements Engineering Framework (REF)
REF tackles the modeling effort by supporting three inter-related activities, as listed below (see also Figure 1). The three modeling activities do not exist in isolation; rather, they are different views of the same modeling effort, linked by a continuous flow of information, schematized as the Development and the Elicitation & Validation flows.

Organization Modeling, during which the organizational context is analyzed and the agents and their goals identified. Any agent may generate its own goals, may operate to achieve goals on behalf of some other agents, may decide to collaborate with or delegate to other agents for a specific goal, and might clash on some other ones. The resulting goals will then be refined, through interaction with the involved agents, by hard- and soft-goal modeling.

Hard-Goal Modeling seeks to determine how an agent can achieve a received hard-goal, by decomposing it into more elementary subordinate hard-goals, tasks¹, and resources². Supported by the REF graphical notation, the analyst and the agent will work together to understand and formalize how the agent thinks to achieve the goal, in terms of subordinate hard-goals and tasks that he or she will have to achieve and perform directly, or indirectly, by passing them to other agents.

¹ A task is a well-specified prescriptive activity.
² A resource is any concrete or information item necessary to perform tasks or achieve goals.

Soft-Goal Modeling aims at producing the operational definitions of the soft-goals, sufficient to capture and make explicit the semantics that are usually assigned implicitly by the involved agents [3,6,7]. Unlike for a hard-goal, for a soft-goal the achievement criterion is not, by definition, sharply defined, but implicit in the originator's intentions. The analyst's objective during soft-goal modeling is to make explicit
such intentions, in collaboration with the goal originator. However, depending on the issue at hand, and the corresponding roles played by the two agents (i.e., the originator and the recipient) within the organization, the recipient may also be involved in the process, to reach a sharply defined achievement criterion upon which both of them can agree. Again, the analyst and the agents will cooperate through the support of the REF graphical notation.

In the three modeling activities, REF uses a diagrammatic notation which immediately conveys the dependencies among different agents and allows for a detailed analysis of the goals upon which the agents depend. The adopted graphical notation is widely inspired by the i* framework [15] for RE [16] and business analysis and re-engineering [17], and is thus open to being integrated in or extended by the Tropos methodology. An important aspect of REF is to adopt the i* notational ingredients at a basic and essential level, in favor of a higher usability and acceptability by the stakeholders.

In the next sections the notation and the methodology are briefly introduced by means of a case study. Mainly, Soft-Goal Modeling will be considered. The main aim during Soft-Goal Modeling is to iteratively refine each soft-goal in terms of subordinate elements, until only hard-goals, tasks, resources, and constraints are obtained (that is, until all the soft aspects have been dealt with) or each unrefined soft-goal is passed on to another agent, in whose context it will then be refined. Constraints may be associated with hard-goals, tasks, and resources to specify the corresponding quality attributes. Thus, the resulting set of constraints represents the final and operationalized view of the involved quality attributes, i.e., the quality models that formalize the attributes for the specific context [3,6].
3 The Case Study

We refer to an on-going project aiming at introducing an Electronic Record Management System (ERMS) within a government unit. The impact of such a system on the common practices of the communities and sub-communities of knowledge workers is quite relevant. An ERMS is a complex Information and Communication Technology (ICT) system which allows for efficient storage and retrieval of document-based unstructured information, by combining classical filing strategies (e.g., classification of documents on a multi-level directory, cross-referencing between documents, etc.) with modern information retrieval techniques. Moreover, it usually provides mechanisms for facilitating the routing and notification of information/documents among the users, and for supporting interoperability with similar (typically remote) systems, through e-mail and XML. Several factors (international benchmarking studies, citizens' demands, shrinking budgets, etc.) called for the decision to leverage new technologies to transform the organization into a more creative and knowledgeable environment.

The initial organization model is shown in Figure 2. Circles represent agents, and dotted lines are used to bound the internal structure of complex agents, that is, agents containing other agents. In Figure 2, the complex agent Organizational Unit corresponds to the organizational fragment into which it is planned to introduce the new ERMS, whereas the Head of Unit is the agent, acting within the Organizational Unit, responsible for achieving the required
Fig. 2. Introducing the ERMS: the initial organization model
organizational improvement (modeled by the soft-goals exploit ICT to increase performance while avoiding risks and cost/effective and quick solution). Goals, tasks, resources and agents (see also the next figures) are connected by dependency links, represented by arrowhead lines. An agent is linked to a goal when it needs or wants that goal to be achieved; a goal is linked to an agent when it depends on that agent to be achieved. Similarly, an agent is linked to a task when it wants the task to be performed; a task is linked to an agent when the agent is committed to performing the task. Again, an agent is linked to a resource when it needs that resource; a resource is linked to an agent when the agent has to provide it. By combining dependency links, we can establish dependencies among two or more agents.

As mentioned, the soft-goal modeling process allows the analysts and the stakeholders to operationalize all the soft aspects implicitly included in the meaning of a soft-goal. Thus, for example, Figure 3 describes how the soft-goal exploit ICT to increase performance while avoiding risks is iteratively top-down decomposed to finally produce a set of tasks, hard-goals, and constraints that precisely defines the meaning of the soft-goal, i.e., the way to achieve it. Figure 3, in other terms, represents the strategy that the Head of Unit (as the result of a personal choice or of a negotiation with the upper organizational level) will apply to achieve the assigned goal.

Again, the arrowhead lines indicate dependency links. A soft-goal depends on a subordinate soft-goal, hard-goal, task, resource or constraint when it requires that goal, task, resource or constraint to be achieved, performed, provided, or implemented in order to be achieved itself. These dependency links may be seen as a kind of top-down decomposition of the soft-goal. Soft-goal decompositions may be conjunctive (all the sub-components must be satisfied to satisfy the original soft-goal), indicated by the label A on the dependency link, or disjunctive (it is sufficient that only one of the components is satisfied), indicated by the label O on the dependency link (see Figure 5).
Fig. 3. The “exploit ICT to increase performance while avoiding risks” Soft-Goal Model
According to Figure 3, the Head of Unit has to increase personal performance, to increase productivity of the whole unit, and also to avoid risks due to new technology. Let us consider in detail only the first sub-soft-goal, i.e., increase personal performance. It spawns two subordinate soft-goals: easy document access, for which the Head of Unit will require a multi-channel access system in order to be able to check and transfer documents to the employees even when away from the office, and increase process visibility, to take better informed decisions. In particular, the soft-goal increase process visibility will eventually lead to the identification of some tasks (functionalities) the system will have to implement in order to collect and make available some data about the process (e.g., number of documents waiting) and about the employees (provide employee's number of documents that have been assigned), and of some associated constraints, represented by a rounded rectangle with a horizontal line, characterizing such data. In Figure 3, for example, they specify the frequency of update: daily for the process data and weekly for the employees' data.
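REF is a purely graphical notation and the paper does not define any machine-readable encoding; still, to make the structure of Figure 3 concrete, the following is a hedged sketch of how a supporting tool might record a fragment of the model as simple Prolog facts (all predicate and symbol names are hypothetical):

% Hypothetical encoding of a fragment of the Figure 3 soft-goal model.
wants( headOfUnit, softgoal(exploitICT) ).

% decomposes(Parent, and/or, SubElements): AND/OR decomposition links.
decomposes( softgoal(exploitICT), and,
            [ softgoal(increasePersonalPerformance),
              softgoal(increaseProductivity),
              softgoal(avoidRisksOfNewTechnology) ]).

decomposes( softgoal(increasePersonalPerformance), and,
            [ softgoal(easyDocumentAccess),
              softgoal(increaseProcessVisibility) ]).

decomposes( softgoal(increaseProcessVisibility), and,
            [ task(provideProcessPerformance),
              task(provideEmployeePerformance) ]).

% Constraints attached to leaf elements (cf. the update frequencies).
constraint( task(provideProcessPerformance),  update(daily) ).
constraint( task(provideEmployeePerformance), update(weekly) ).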
4 Adding Special Links to Support the Analysis Process

As described in Section 2, REF aims at providing a representational framework for requirements discovery and analysis, characterized by a sufficiently expressive graphical notation that, at the same time, is simple enough to be easily and quickly grasped by the stakeholders, even by those unfamiliar with RE. Indeed, these are very important aspects that, as demonstrated by several case studies [13,12,11], make REF applicable
Fig. 4. The “exploit ICT to increase performance while avoiding risks” Soft-Goal Model revised
Fig. 5. A sharing between goals of the same agent
to real projects, and ensure a concrete involvement of the stakeholders, allowing for a quicker and more effective knowledge acquisition and requirements elicitation process. REF's simplicity and effectiveness are mainly based on two key points: 1) the use of only one type of link (the dependency link); 2) the focus, during both hard- and soft-goal analysis, on only one goal (and the involved agents) at a time; this leads to the drawing of very simple goal analysis diagrams, strictly generated by a top-down process. These two aspects make REF different from other approaches, in particular from i*. Indeed, we believe that these two characteristics allow for a very easy reading of the goal models; the second feature, in particular, allows the stakeholders (and the analysts as well) to
concentrate their attention on one problem at a time, without worrying about the order in which different analyses of different sub-diagrams should be interleaved in order to obtain different possible diagrams: the goal diagram is always a tree, and it is always generated in the same shape, whatever node expansion sequence is followed. In the following, we analyze whether or not these two very important REF characteristics may represent a limit to some relevant aspects of the process of domain description. In particular, we tackle two possible cases in which the present version of REF tends to show some limits, describe them by means of our ERMS case study, and propose simple extensions to REF to allow for a finer control during the process of model description and requirements elicitation.

4.1 Sharing Goals (Tasks, Constraints, ...)
Let us here analyze the fact that REF produces only trees as goal diagrams. Thus, there is no explicit possibility to deal with sub-goals (or constraints, or tasks, or resources) that may be shared by different upper-level goals. This situation may be further distinguished into at least three different sub-cases.

First Case: top-down tree expansion and analysis leads to introducing different sub-goals (or constraints, or tasks, or resources) for every different goal that is found during the goal analysis activity, even if different goals could be satisfied by the same sub-goal (or constraint, or task, or resource). For example, in Figure 3, two distinct constraints have been introduced for satisfying the two soft-goals provide process performance and provide employee's performance, namely the constraints daily update and weekly update. Instead, the Head of Unit could have accepted that the two soft-goals, rather than requiring two different specialized constraints (as in Figure 3), would share the same constraint, e.g., a twice a week update (as in Figure 4). After all, according to REF, any sequence may have been followed in analyzing the two soft-goals, and the two constraints may have been introduced at two very different moments, making it very difficult to spot that a common (although slightly different) constraint could have been adopted. This compromise, instead, could have been identified and judged acceptable if considered by the analyst together with the Head of Unit at the proper moment during the design activity.

The difference between Figure 3 and Figure 4 is minimal, regarding only leaf nodes, as highlighted by the dotted circle. Thus, Figure 4 can be obtained as a simple transformation of Figure 3. But let us consider a more complex hypothetical case, in which the two nodes collapsing into one are non-leaf nodes, with possibly deep trees expanding from them: relevant parts of the two sub-trees, rooted in the two nodes, would have to be revised, in order to consider an alternative non-tree-based analysis. Thus, in this case, it would be strategic to be able to introduce a common child for the two different nodes before proceeding with the analysis of the nodes' sub-trees. It is clear that, now, different diagram evolution strategies, and thus development sequences, may lead to quite different results or, even when producing the same result, this may be obtained with different degrees of efficiency. For example, a top-down breadth-first diagram expansion would probably be preferred to a top-down depth-first strategy. In this way, it may appear appropriate to develop a shared sub-tree
only once, with two advantages: 1) at the design level, the analysis does not have to be carried out twice; 2) at the implementation level, the complexity of the system to be implemented will be reduced, since two potentially different requirements, and all the derived artifacts – from architectural design down to implemented code – collapse into one.

Second Case: as a specialization of the first one, we can consider the case in which a similar sub-goal sharing happens among goals that have been attached to the same agent since its introduction in the organizational model, and not as a result of goal modeling. In this case, the REF methodology would lead the analyst to duplicate the sub-goal in two different diagrams, possibly with slightly different labels, although with the same semantics. Catching these cases as early as possible is very important in order to avoid duplicated analysis and to assign higher priority and relevance to the analysis of the shared items.

Third Case: a more intriguing situation may arise when the very same sub-goal can be shared between two different agents, as a consequence of two different and autonomous analyses of two different goals of the two agents (there is no room here to present the figures to illustrate an example of this case, but see [10]). Again, the analysis of such a shared soft-goal immediately assumes a higher relevance and priority over the analysis of other goals. Its satisfaction is desired by two agents! For example, this may lead to the selection, among all the tools available on the market, of only one kind of mobile access channel able to satisfy both agents.

From the analysis of the previous three cases, the need clearly emerges to introduce in REF some mechanism to support the analysts during their refinement and navigation through the goal and sub-goal models. In particular, we propose to provide the analysts with a specific notation to be used to highlight situations where they believe that some commonalities could be hidden, i.e., that shared goals could arise during the analysis. In other terms, we introduce in REF a notation suitable to act as a high-level reasoning support tool that enables the analysts to record their intuitions while building the goal models, by making notes to drive their strategies, e.g., to highlight where a top-down breadth-first diagram expansion may be preferable to a top-down depth-first one.

As such a notation, to denote situations in which a possible sharing could arise, we introduce what we call the S-connection, a link that does not have arrows, the relationship being perfectly symmetric, and that is marked by the label "S", which stands for Sharing. Figure 5 shows a fragment of Figure 3 where the S-connection has been adopted. In particular, it shows how the S-connection could have been used during the analysis of the soft-goal to highlight in advance the possibility of sharing between the soft-goals provide employee's performance and provide process performance (the first example case previously analyzed). In the same way, Figure 6 depicts the use of the S-notation to highlight, within the soft-goal analyzed in Figure 3, a possible sharing between the soft-goal increase personal performance, which the Head of Unit wants to achieve, and the soft-goal be more productive, which the Head of Unit imposes on, transfers to, the Employee (the third example case previously analyzed). It is worth noting that the S-notation is only a reasoning support mechanism that tends to disappear once the
Fig. 6. A sharing between goals of different agents
Fig. 7. A conflict between goals of different agents
analysis proceeds. In other terms, the S-notation's purpose is to mark possible sharing situations to drive the analysis (e.g., breadth-first expansion, multi-agent analysis, repeated back-to-back comparisons, and so on), but it does not have any reason to exist any more once the goals have been exploded: the initial REF notation, with its simplicity, is sufficient in that regard.
4.2 Clashing Goals (Tasks, Constraints, ...)
Another common situation regards the possibility of efficiently dealing with clashing needs (goals, constraints, tasks, and resources). Just as, during a top-down analysis, a top-down introduced sub-goal may be recognized as helpful for another goal (possibly of another agent), the introduced sub-goal may similarly be recognized as (possibly) harmful for another goal. In addition, during the analysis, new agents may have to be introduced into the context (e.g., the Head of Unit requires the Employee to be more productive), and such new agents may express their own needs by introducing new goals that may very easily clash with other goals already in the context. Indeed, REF already provides the tools for the detailed recognition of such situations. In fact, when fully operationalized in terms of tasks and constraints, goal models can be adopted to detect clashing situations and to resolve them. Nevertheless, it is critical to foresee such possibly clashing situations as early as possible, even only at a very qualitative level. To enable the analysts to mark possible conflicting situations (and build their refinement strategy to deal with them), we introduce the H-connection ("H" for hurting). This is a powerful tool to detect possible conflicts and try to reconcile different stakeholders' points of view, allowing the analysis to evolve only along the most promising alternatives. An example of application is given in Figure 7, where an H-connection is used to highlight a possible conflict between two goals before proceeding with their analysis (i.e., the soft-goal provide employee's performance is not broken down into tasks before taking into account the protect my privacy one – see also [10]).
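Continuing the hypothetical fact encoding sketched at the end of Section 3 (again an illustrative assumption, not part of REF itself), an analyst's tool could surface candidate S- and H-connections mechanically, for instance by querying for sub-elements that occur under more than one parent and by recording analyst-declared conflicts:

:- use_module(library(lists)).   % member/2

% A sub-element appearing under two different parents is a candidate
% for an S-connection (e.g., a shared constraint as in Figure 4).
sharedCandidate( Sub, Parent1, Parent2) :-
    decomposes( Parent1, _, Subs1), member( Sub, Subs1),
    decomposes( Parent2, _, Subs2), member( Sub, Subs2),
    Parent1 \== Parent2.

% Analyst annotations of potentially hurting pairs (H-connections).
hurts( task(provideEmployeePerformance), softgoal(protectMyPrivacy) ).

% Clash candidates are reported symmetrically.
clashCandidate( A, B) :- hurts( A, B).
clashCandidate( A, B) :- hurts( B, A).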
5 Conclusions

The paper introduced an agent-oriented Requirements Engineering Framework (REF), explicitly designed to support the analysts in transforming high-level organizational needs into system requirements, while redesigning the organizational structure itself. The underlying concepts and the adopted notations make REF a very effective and easy-to-use tool, able to tackle complex real-world situations, while remaining simple enough to allow a concrete and effective stakeholder involvement. REF is strongly based upon i*, the modeling framework suggested by Eric Yu [15,17,16]. However, it introduces some simplifications and tends to adopt a more pragmatic approach in order to obtain a greater and more active involvement of the stakeholders during the requirements discovery, elicitation and formalization process.

However, we felt that REF could be improved with regard to the support it provides to the analysts in dealing with more complex, system/organizational design related issues, such as shared and clashing stakeholders' needs. In both cases, an early detection of such situations could lead to better analysis results: shared needs could be the object of a more intensive analysis effort, to exploit commonalities to reduce complexity and increase re-usability; clashing needs could be resolved at a very early stage, so as to then focus the analysis only on the most promising alternatives. Two graphical notations (i.e., the S-connection and the H-connection) have therefore been introduced to allow the analysts to mark such situations and better reason about how to build their strategy while performing their analyses.
References

1. A. I. Antón. Goal-based requirements analysis. In Proceedings of the IEEE International Conference on Requirements Engineering (ICRE '96), Colorado Springs, USA, Apr. 1996.
2. A. I. Antón and C. Potts. Requirements for evolving systems. In Proceedings of the International Conference on Software Engineering (ICSE '98), Kyoto, Japan, Apr. 1998.
3. V. R. Basili, G. Caldiera, and H. D. Rombach. Encyclopedia of Software Engineering, chapter The Goal Question Metric Approach. Wiley & Sons Inc., 1994.
4. P. Bresciani, P. Giorgini, F. Giunchiglia, J. Mylopoulos, and A. Perini. TROPOS: An agent-oriented software development methodology. Autonomous Agents and Multi-Agent Systems, 2003. In press.
5. P. Bresciani, A. Perini, F. Giunchiglia, P. Giorgini, and J. Mylopoulos. A Knowledge Level Software Engineering Methodology for Agent Oriented Programming. In Proceedings of the Fifth International Conference on Autonomous Agents, Montreal, Canada, May 2001.
6. G. Cantone and P. Donzelli. Production and maintenance of goal-oriented software measurement models. International Journal of Knowledge Engineering and Software Engineering, 10(5):605–626, 2000.
7. L. K. Chung, B. A. Nixon, E. Yu, and J. Mylopoulos. Non-Functional Requirements in Software Engineering. Kluwer Publishing, 2000.
8. A. Dardenne, A. van Lamsweerde, and S. Fickas. Goal-directed requirements acquisition. Science of Computer Programming, 20(1–2):3–50, 1993.
9. M. D'Inverno and M. Luck. Development and application of an agent based framework. In Proceedings of the First IEEE International Conference on Formal Engineering Methods, Hiroshima, Japan, 1997.
10. P. Donzelli and P. Bresciani. Goal-oriented requirements engineering: a case study in e-government. In J. Eder and M. Missikoff, editors, Advanced Information Systems Engineering (CAiSE '03), number 2681 in LNCS, pages 605–620, Klagenfurt/Velden, Austria, June 2003. Springer-Verlag.
11. P. Donzelli and M. Moulding. Developments in application domain modelling for the verification and validation of synthetic environments: A formal requirements engineering framework. In Proceedings of the Spring 99 Simulation Interoperability Workshop, LNCS, Orlando, FL, 2000. Springer-Verlag.
12. P. Donzelli and R. Setola. Putting the customer at the center of the IT system – a case study. In Proceedings of the Euro-Web 2001 Conference – The Web in the Public Administration, Pisa, Italy, Dec. 2001.
13. P. Donzelli and R. Setola. Handling the knowledge acquired during the requirements engineering process. In Proceedings of the Fourteenth International Conference on Knowledge Engineering and Software Engineering (SEKE), 2002.
14. A. van Lamsweerde. Goal-oriented requirements engineering: A guided tour. In Proceedings of RE'01 – International Joint Conference on Requirements Engineering, pages 249–263, Toronto, Aug. 2001. IEEE.
15. E. Yu. Modeling Strategic Relationships for Process Reengineering. PhD thesis, Department of Computer Science, University of Toronto, 1995.
16. E. Yu. Why agent-oriented requirements engineering. In Proceedings of the 3rd Workshop on Requirements Engineering for Software Quality, Barcelona, Catalonia, June 1997.
17. E. Yu and J. Mylopoulos. Using goals, rules, and methods to support reasoning in business process reengineering. International Journal of Intelligent Systems in Accounting, Finance and Management, 1(5):1–13, Jan. 1996.
Patterns for Motivating an Agent-Based Approach

Michael Weiss

School of Computer Science, Carleton University, Ottawa, Canada
[email protected]
Abstract. The advantages of the agent-based approach are still not widely recognized outside the agent research community. In this paper we use patterns as a way of motivating the use of agents. Patterns have proven to be an effective means for communicating design knowledge, describing not only solutions, but also documenting the context and motivation for applying these solutions. The agent community has already started to use patterns for describing best practices of agent design. However, these patterns tend to pre-suppose that the decision to follow an agent approach has already been made. Yet, as this author has experienced on many occasions, that is usually far from a given. There is a need for guidelines that summarize the key benefits of the agent approach, and serve as a context for more specific agent patterns. Our response to this need is a pattern language – a set of patterns that build on each other – that introduces the concepts of agent society, roles, common vocabulary, delegation, and mediation. We also argue that authors of agent patterns should aim to organize their patterns in the form of pattern languages, and present a template for pattern languages for agents.
1 Introduction

Agents are rapidly emerging as a new paradigm for developing software applications. They are being used in an increasing variety of applications, ranging from relatively small systems such as personal assistants to large and open mission-critical systems such as switches, electronic marketplaces, or health care information systems. There is no universally accepted definition of the notion of an agent. However, the following four properties are widely accepted to characterize agents: autonomy, social ability, reactivity and proactiveness [29]. Agents are autonomous computational entities (autonomy), which interact with their environment (reactivity) and other agents (social ability) in order to achieve their own goals (proactiveness). Agents typically represent different users, on whose behalf they act. Most interesting agent-based systems are thus collections of collaborating autonomous agents (typically referred to as multi-agent systems), each representing an independent locus of control. Multiple agents can, of course, be acting on behalf of the same user. Agents also provide an appropriate metaphor for conceptualizing certain applications, as the behavior of agents more closely reflects that of the users whose work they are delegated to perform or support. This reflects the fact that most complex software systems support the activities of a group of users, not individual users. Such agents are then treated as actors in a system that comprises human actors as well.
The following domain characteristics are commonly quoted as reasons for adopting agent technology: an inherent distribution of data, control, knowledge, or resources; the system can be naturally regarded as a society of autonomous collaborating entities; and legacy components must be made to interoperate with new applications [24]. However, the advantages of the agent-based approach are still not widely recognized outside the agent research community. While there are several papers discussing the differences between agents and objects [12, 13], on the one hand, and agents and components [8], on the other, these papers do not provide actual guidelines for assessing whether a particular development project can benefit from using an agent-based approach. Patterns, on the other hand, are an effective way of guiding non-experts [11]. In this paper we use patterns as a way of motivating the use of agents. In the following sections we first summarize related work on agent patterns, then document a set of patterns that introduce key agent concepts. These serve as a conceptual framework and context for documenting more specific agent patterns (e.g., security patterns). We also make a case that agent pattern authors should organize their patterns in the form of pattern languages, and present a template for pattern languages for agents.
2 Related Work

Patterns are reusable solutions to recurring design problems, and provide a vocabulary for communicating these solutions to others. The documentation of a pattern goes beyond documenting a problem and its solution. It also describes the forces or design constraints that give rise to the proposed solution [1]. These are the undocumented and generally misunderstood features of a design. Forces can be thought of as pushing or pulling the problem towards different solutions. A good pattern balances the forces.

There is by now a growing literature on the use of patterns to capture common design practices for agent systems [2, 12, 9, 16, 28]. The separate notion of an agent pattern can be justified by differences between the way agents and objects communicate, their level of autonomy, and their social ability [7]. Agent patterns are documented in a similar manner to other software patterns, except for the structure of an agent pattern, where we will make use of role models [20, 13]. The distinction between role models and collaboration diagrams is the level of abstraction: a collaboration diagram shows the interaction of instances, whereas a role model shows the interaction of roles.

Aridor and Lange [2] describe a set of domain-independent patterns for the design of mobile agent systems. They classify mobile agent patterns into traveling, task, and interaction patterns. Kendall et al. [12] use patterns to capture common building blocks for the architecture of agents. They integrate these patterns into the Layered Agent pattern, which serves as a starting point for a pattern language for agent systems based on the strong notion of agency. Schelfthout et al. [21], on the other hand, document agent implementation patterns suitable for developing weak agents. Deugo et al. [9] identify a set of patterns for agent coordination, which are, again, domain-independent. They classify agent patterns into architectural, communication, traveling, and coordination patterns. They also describe an initial set of global forces
that push and pull solutions for coordination. Kolp et al. [15] document domain-independent organizational styles for multi-agent systems using the Tropos methodology. On the other hand, Kendall [13] reports on work on a domain-specific pattern catalog developed at BT Exact. Several of these patterns are described in the ZEUS Agent Building Kit documentation [5] using role models. Shu and Norrie [22], and Weiss [27] have also documented domain-specific patterns, respectively for agent-based manufacturing and for electronic commerce. However, unlike most other authors, their patterns are organized in the form of a pattern language. This means that the patterns are connected to each other in such a way that they guide a developer through the process of designing a system. Lind [16], and Mouratidis et al. [18] suggest that we can benefit from integrating patterns with a development process, while Tahara et al. [25], and Weiss [28] propose pattern-driven development processes. Tahara et al. [25] propose a development method based on agent patterns, and distinguish between macro and micro architecture patterns. Weiss [27] documents a process for mining for and applying agent patterns. Lind [16] suggests a view-based categorization scheme for patterns based on the MASSIVE methodology. Mouratidis et al. [18] document a pattern language for secure agent systems that uses the modeling concepts of the Tropos methodology.

As the overview of related work has shown, most agent patterns are documented in the form of pattern catalogs. Usually, the patterns are loosely related, but there is a lack of cohesion between them. Such collections of patterns provide point solutions to particular problems, but do not guide the developer through the process of designing a system using those patterns. This can only be achieved by a pattern language. We argue that agent pattern authors have to put more emphasis on organizing their patterns in the form of pattern languages for them to become truly useful. We therefore next suggest a template for pattern languages. A secondary goal of our pattern language for motivating the use of agents is to illustrate the use of that template.
3 Template for Pattern Languages

Patterns are not used in isolation. Although individual patterns are useful for solving specific design problems, we can benefit further from positioning them among one another to form a pattern language. Each pattern occupies a position in a network of related patterns, in which each pattern contributes to the completion of patterns "preceding" it in the network, and is completed by patterns "succeeding" it. A pattern language guides developers through the process of generating a system. Beck and Johnson [3] describe this generative quality of patterns: "Describing an architecture with patterns is like the process of cell division and specialization that drives growth in biological organisms. The design starts as a fuzzy cloud representing the system to be realized. As patterns are applied to the cloud, parts of it come into focus. When no more patterns are applicable, the design is finished."

Unlike a pattern catalog that classifies patterns into categories, the goal of a good pattern language is, foremost, to create cohesion among the patterns. We want the patterns to be closely related to each other. References to patterns should therefore largely be to other patterns in the same pattern language, and the patterns should be
organized from higher-level to lower-level patterns in a refinement process. We can also expect a pattern language to have a reasonable degree of coverage of its application domain. We want to be able to generate most of the possible designs. Finally, the goal of a pattern language is to make the links between the patterns easy to use and understandable. This we refer to as the navigability of a pattern language. A pattern language should contain: a roadmap, a set of global forces, references to other patterns in the language in the context section of each pattern, and a resulting context section in each pattern. The roadmap shows the structure of the pattern language. The arrows in the roadmap point from a pattern to a set of patterns that system designers may want to consult next, once this pattern has been applied. It is also often useful to identify the forces that need to be resolved in the design of the systems targeted by the pattern language. These global forces establish a common vocabulary among the patterns, and can be used to summarize their contributions. The context section of each pattern in the pattern language describes a (specific) situation in which the pattern should be considered. In particular, the context includes references to other patterns in the language in whose context the pattern can be applied (“You are using pattern X, and now wish to address concern Y”). More than just referring to “related patterns” (usually external to this set of patterns), the resulting context section of each pattern similarly refers to other patterns in the same language that should be consulted next, together with a rationale (that is, the trade-off addressed) for each choice (“Also consult pattern X for dealing with concern Y”).
4 Pattern Language for Motivating an Agent-Based Approach
The structure of our pattern language is shown in the roadmap in Figure 1. The starting point (root) for the pattern language is AGENT SOCIETY. This pattern depends on AGENT AS DELEGATE, AGENT AS MEDIATOR, and COMMON VOCABULARY. The nature of these dependencies, that is, the rationale for applying each of these patterns, is documented in the Resulting Context section of the AGENT SOCIETY pattern.
Fig. 1. Roadmap for the pattern language
Although we lack the space for a full description of the global forces involved in agent-based design, we can identify the following trade-offs between:
• the autonomy of an agent and its need to interact
• the user's information overload and the degree of control the user has over agents acting on his behalf
• the openness of an agent society and the resulting dynamicity and heterogeneity
• the need for intermediaries to facilitate agent interactions and the concern for the privacy of users' sensitive data (one form of trust)
• the heterogeneity and concern for quality of service (another form of trust)
The patterns themselves will elaborate on these trade-offs in more detail. However, it is important to note that these trade-offs are independent of the application domain. As a motivating example for using the pattern language, consider the design of information agents for filtering news. The example uses AGENT AS DELEGATE inasmuch as users are represented by USER AGENTS that maintain their profiles, and filter search results against them. Search requests are represented by TASK AGENTS: this includes one-shot searches, as well as subscriptions to periodically repeated searches. The example also requires AGENT AS MEDIATOR inasmuch as users can obtain recommendations from each other. A recommender agent mediates between the users. Together these agents form an AGENT SOCIETY, and they therefore need to agree on a COMMON VOCABULARY to communicate with each other, and with news sources.
AGENT SOCIETY
Context
Your application domain satisfies at least one of the following criteria: your domain data, control, knowledge, or resources are decentralized; your application can be naturally thought of as a system of autonomous cooperating entities; or you have legacy components that must be made to interoperate with new applications.
Problem
How do you model systems of autonomous cooperating entities in software?
Forces
• The entities are autonomous in the sense that they do not require the user's approval at every step of executing their tasks, but can act on their own.
• However, they rely on other entities to achieve goals that are outside their scope or reach, and need to cooperate with each other.
• They also need to coordinate their behaviors with those of others to ensure that their own goals can be met, avoiding interference with each other.
Solution
Model your application as a society of agents. Agents are autonomous computational entities (autonomy), which interact with their environment (reactivity) and other agents (social ability) in order to achieve their own goals (proactiveness). Often,
agents will be able to adapt to their environment, and have some degree of intelligence, although these are not considered mandatory characteristics. These computational entities act on behalf of users, or groups of users [17]. Thus agents can be classified as delegates, representing a single user, and acting on her behalf, or mediators, acting on behalf of a group of users, facilitating between them. The key differentiator between agents and objects is their autonomy. Autonomy is here used in an extended sense. It not only comprises the notion that agents operate in their own thread of control, but also implies that agents are long-lived (they execute unattended for long periods), they take initiative (they do not simply act in response to their environment), they react to stimuli from the environment as guided by their goals (the receiving agent decides whether and how to respond to a stimulus), and interact with other agents to leverage their abilities in support of their own as well as collective goals. Active objects, on the other hand, are autonomous only in the first of these senses. They are not guided by individual, and/or collective goals. A society of agents can be viewed from two dual perspectives: either a society of agents emerges as a result of the interaction of agents; or the society imposes constraints and policies on its constituent agents. Both perspectives, which we can refer to as micro and macro view of the society, respectively, mutually reinforce each other, as shown in Fig. 2. Specifically, emerging agent specialization leads to the notion of roles. Roles, in turn, impose restrictions on the possible behaviors of agents [10].
Fig. 2. Micro-macro view of an agent society
This suggests two approaches to systematically designing agent societies. In the first approach, we identify top-level goals for the system, and decompose them recursively, until we can assign them to individual agents (as exemplified by the Gaia methodology [30]). In the second approach, we construct an agent society incrementally from a catalog of interaction patterns, as exemplified by [13]. These interaction patterns are described in terms of roles that agents can play and their interactions, and may also specify any societal constraints or policies that need to be satisfied. Roles are abstract loci of control [13, 10, 20]. Protocols (or patterns of interaction) describe the way the roles interact. Policies define constraints imposed by the society on these roles. As an example of a policy, consider an agent-mediated auction, which specifies conventions specific to its auction type (for example, regarding the order of
bids; ascending in an English auction, descending in a Dutch auction) that participating agents must comply with in order for the auction to function correctly.
Roles and their subtypes can be documented in a role diagram, using the notation introduced in [13]. Role diagrams are more abstract than class diagrams. Each role in the diagram defines a position, and a set of responsibilities. A role has collaborators – other roles it interacts with. Arrows between roles indicate dynamic interactions between roles; the direction of an arrow represents the direction in which messages are sent between the roles. The triangle indicates a subtyping relationship between roles; subtypes inherit the responsibilities of their parent roles. Many types of applications, such as call control [19], groupware [26], or electronic commerce applications [27], can be modeled using user, task, service, and resource roles, and their subtypes. The user role encapsulates the behavior of managing a user's task agents, and controlling access to the user's data. The task role represents users in a specific task. This is typically a long-lived, rather than one-shot, transaction. Agents in the service role typically provide a service to a group of users. They mediate the interaction between two or more agents through this service. The resource role abstracts information sources. These could be legacy data sources wrapped by "glue" code that converts standardized requests to the API of the data source.
Resulting Context
• For members of an agent society to understand each other, they need to agree on common exchange formats, as described in Common Vocabulary.
• If you are dealing with an agent that acts on behalf of a single user, consult Agent as Delegate.
• For the motivation of agents that facilitate between a group of users, and their respective agents, refer to Agent as Mediator.
AGENT AS DELEGATE
Context
You are designing your system as a society of autonomous agents using AGENT SOCIETY, and you wish to delegate a single user's time-consuming, peripheral tasks.
Problem
How do you instruct agents on what to do? How much discretion (authority) should you give to an agent? How do agents interact with their environment?
Forces
• Repetitive, time-consuming tasks should be delegated to agents that can perform the tasks on behalf of their users, and require only minimal intervention.
• However, when delegating a task to an agent, users must be able to trust the agent to perform the task in an informed and unbiased manner.
• The user also wants to control what actions the agent can perform on the user's behalf and which it cannot (its degree of autonomy).
Solution
Use agents to act on behalf of the user performing a specific task. Such user agents manage a user's task agents, and control access to the user's data. The structure of this pattern is shown in Fig. 3. Task agents represent the user in different task contexts. For example, in the call control domain, a user placing, and a user receiving a call could both be represented as task agents. Each Concrete Task is a subtype of the generic Task role. The generic role contains beliefs and behaviors common to all concrete tasks. In the electronic commerce domain, we might have a Trader role, and Buyer and Seller roles that share the common belief of a desired price.
Fig. 3. Role diagram for AGENT AS DELEGATE
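As one way to read Fig. 3, the sketch below encodes the Trader/Buyer/Seller example from the solution in plain Java; only the role names come from the text, and the fields and methods are illustrative assumptions.

```java
// Hypothetical encoding of role subtyping: subtypes inherit the responsibilities
// of their parent role and share the belief of a desired price.
abstract class Role {
    private final String position;
    protected Role(String position) { this.position = position; }
    public String position() { return position; }
}

// Generic Task role for the e-commerce example.
abstract class Trader extends Role {
    protected double desiredPrice;  // belief common to all concrete tasks
    protected Trader(String position, double desiredPrice) {
        super(position);
        this.desiredPrice = desiredPrice;
    }
    abstract boolean acceptable(double offer);
}

class Buyer extends Trader {
    Buyer(double desiredPrice) { super("buyer", desiredPrice); }
    boolean acceptable(double offer) { return offer <= desiredPrice; }
}

class Seller extends Trader {
    Seller(double desiredPrice) { super("seller", desiredPrice); }
    boolean acceptable(double offer) { return offer >= desiredPrice; }
}
```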
Resulting Context
• For organizing the interaction with the user to gather their requirements and feedback on the performance of a task, consult User Agent¹.
• Also consult User Agent for measures to control access to user data.
• For the design of task agents, consult Task Agent¹.
¹ Descriptions of USER AGENT and TASK AGENT have been omitted from this paper due to space restrictions, but will be included in a future version of this language.
AGENT AS MEDIATOR
Context
You are designing your system as a society of autonomous agents using AGENT SOCIETY, and you wish to facilitate between a group of users, and their agents.
Problem
How can agents find each other, and coordinate their behaviors?
Forces
• In a closed agent society of known composition, agents can maintain lists of acquaintances with whom they need to interact (to obtain data or services).
• However, in an open agent society, whose composition changes dynamically, agents need help locating other agents with which they can interact.
• Agents that have no mutual history of interaction may need the help of trusted intermediaries to protect sensitive data, and ensure service quality.
• Sometimes, the agents do not simply need to locate each other, but their interaction needs to follow a coordination protocol (for example, in an auction).
• A special case is that agents need to gain access to resources, which creates the need for an intermediary that can find relevant resources and forward queries to them.
Solution
Use a mediator to facilitate between the members of a group of agents. Examples of mediators are directories, translators, market makers, and rating services. We distinguish two cases, one where task agents need to locate other task agents, and another where task agents need to gain access to relevant resources [14, 26]. The mediator can either just pair up agents with each other (task agent–task agent, or task agent–resource agent), or coordinate their interactions beyond the initial introduction.
Fig. 4. Role diagram for AGENT AS MEDIATOR
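A minimal sketch of the "pair up" responsibility of such a mediator, written as a hypothetical matchmaker directory; the registration key (an advertised capability string) and the Agent interface are assumptions made for illustration, not part of the pattern description.

```java
import java.util.*;

interface Agent {
    String name();
}

// A matchmaker that pairs requesters with providers based on a search criterion.
class Matchmaker {
    // advertised capability -> agents offering it
    private final Map<String, List<Agent>> registry = new HashMap<>();

    void advertise(String capability, Agent provider) {
        registry.computeIfAbsent(capability, k -> new ArrayList<>()).add(provider);
    }

    // Basic mediation service: return the providers matching the requested capability.
    List<Agent> match(String capability) {
        return registry.getOrDefault(capability, List.of());
    }
}
```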
In [14], mediators are referred to as middle-agents. Different types of middle-agents can be distinguished based on what services they provide. Basic mediation services comprise matching agents based on search criteria, and translation services. Interaction services include the capability to coordinate the behaviors of task agents according to given protocols, conventions, and policies, for example, the rules of an auction. Finally, reliability services comprise trustworthiness (of the mediator itself), and quality assurance (of the mediated services and data, as well as of the mediator).
Resulting Context
• The Agent as Mediator pattern is a starting point for many specific agent patterns, such as for search agents, recommender systems, or auctions.
COMMON VOCABULARY
Context
When agents in an AGENT SOCIETY interact, they need to agree on common exchange formats. One scenario is that you are using agents to represent users in individual, long-living transactions as described in TASK AGENT. These task agents (for example, buyer and seller agents) need to understand each other in order to exchange messages with one another (for example, when negotiating about a price).
Problem
How do you enable agents (for example, task agents) to exchange information?
Forces
• Agents may use different internal representations of concepts.
• To exchange information, agents need to agree on common exchange formats.
• However, common exchange formats must be widely adopted to be useful.
• If an agent needs to use multiple exchange formats to interact with different agents, it may not be able to perform all the required mappings itself.
Solution
For agents to understand each other, they need to agree on a common message format that is grounded in a common ontology. The ontology defines the concepts that each party must use during the interaction, together with their attributes and valid value ranges. The purpose of this ontology is agent interaction, and it does not impose any restrictions on the internal representations of the agents. In a heterogeneous, open environment, agents may even need to use multiple ontologies to interact with different agents. The structure of this pattern is shown in Fig. 5.
Fig. 5. Role diagram for COMMON VOCABULARY
It is generally impractical to define general-purpose ontologies for agent interaction. These are unlikely to include the intricacies of all possible domains. Instead, the common ontology will be application-specific. Given such a shared ontology, the communicating agents need to map their internal representations to the shared ontology. Much progress has been made on XML-based ontologies; for example, in the e-commerce domain, xCBL, cXML, and RosettaNet are quite popular [4]. If an agent needs to interact with many agents using different common ontologies, it becomes impractical for the agent to be aware of all the different mappings. In this case, the need arises for translation agents that can map between ontologies on behalf of other agents. Fortunately, these are relatively straightforward to build using XSLT, a language for transforming XML documents into other XML documents.
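The following sketch shows how such a translation agent might be assembled on top of the standard JAXP transformation API, assuming one XSLT stylesheet per pair of ontologies; the stylesheet file name is purely illustrative.

```java
import javax.xml.transform.*;
import javax.xml.transform.stream.*;
import java.io.*;

// A translation agent that rewrites messages from one vocabulary into another
// by applying a mapping stylesheet (e.g. a hypothetical cxml-to-xcbl.xsl).
class TranslationAgent {
    private final Transformer transformer;

    TranslationAgent(File mappingStylesheet) throws TransformerConfigurationException {
        transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(mappingStylesheet));
    }

    // Rewrite an incoming message into the vocabulary expected by the receiver.
    void translate(File inMessage, File outMessage) throws TransformerException {
        transformer.transform(new StreamSource(inMessage), new StreamResult(outMessage));
    }
}
```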
Resulting Context
• If agents need to interact with many agents using different common ontologies, apply Agent as Mediator to the construction of translation agents.
5 Conclusion
There are three main take-home messages from this paper:
• As the advantages of the agent-based approach are still not widely recognized outside our community, we need to educate non-experts in its use.
• We need guidelines for non-agent technology experts that summarize the key benefits of the agent approach, and agent patterns can provide that guidance.
• The pattern language at the center of this paper is intended to provide such guidelines, and to serve as a starting point for more specific pattern languages.
We also urge authors of agent patterns to organize their patterns in the form of pattern languages. To this end, a template for pattern languages has been provided.
References
1. Alexander, C., A Pattern Language, Oxford University Press, 1977
2. Aridor, Y., Lange, D., Agent Design Patterns: Elements of Agent Application Design, Second Intl. Conference on Autonomous Agents, IEEE, 1998
3. Beck, K., and Johnson, R., Patterns Generate Architectures, European Conference on Object Oriented Programming (ECOOP), 139–149, 1994
4. Carlson, D., Modeling XML Applications with UML: Practical e-Business Applications, Addison-Wesley, 2001
5. Collis, J., and Ndumu, D., The ZEUS Role Modelling Guide, BT Exact, 1999
6. Coplien, J., Software Patterns, SIGS Books, 1996
7. Deugo, D., and Weiss, M., A Case for Mobile Agent Patterns, Mobile Agents in the Context of Competition and Cooperation (MAC3) Workshop Notes, 19–22, 1999
8. Deugo, D., Oppacher, F., Ashfield, B., Weiss, M., Communication as a Means to Differentiate Objects, Components and Agents, Technology of Object-Oriented Languages and Systems Conference (TOOLS), IEEE, 376–386, 1999
9. Deugo, D., Weiss, M., and Kendall, E., Reusable Patterns for Agent Coordination, in: Omicini, A., et al (eds.), Coordination of Internet Agents, Springer, 2001
10. Ferber, J., Multi-Agent Systems: An Introduction to Distributed Artificial Intelligence, Addison-Wesley, 13–16, 1999
11. Fernandez, E., and Pan, R., A Pattern Language for Security Models, Conference on Pattern Languages of Programming (PLoP), 2001
12. Kendall, E., Murali Krishna, P., Pathak, C., et al, Patterns of Intelligent and Mobile Agents, Conference on Autonomous Agents, IEEE, 1998
13. Kendall, E., Role Models: Patterns of Agent System Analysis and Design, Symposium on Agent Systems and Applications/Mobile Agents (ASA/MA), ACM, 1999
14. Klusch, M., and Sycara, K., Brokering and Matchmaking for Coordination of Agent Societies: A Survey, in: Omicini, A., et al (eds.), Coordination of Internet Agents, Springer, 2001
15. Kolp, M., Giorgini, P., and Mylopoulos, J., A Goal-Based Organizational Perspective on Multi-Agent Architectures, Workshop on Agent Theories, Architectures, and Languages (ATAL), 2001
16. Lind, J., Patterns in Agent-Oriented Software Engineering, Workshop on Agent-Oriented Software Engineering (AOSE), 2002
17. Maes, P., Agents that Reduce Work and Information Overload, Communications of the ACM, 31–41, July 1994
18. Mouratidis, H., Giorgini, P., Schumacher, M., and Weiss, M., Integrating Security Patterns in the Development of Secure Agent-Based Systems, submitted, 2003
19. Pinard, D., Gray, T., Mankovski, S., and Weiss, M., Issues in Using an Agent Framework for Converged Voice and Data Applications, Conference on Practical Applications of Agents and Multi-Agents (PAAM), 1997
20. Riehle, D., and Gross, T., Role Model Based Framework Design and Integration, Conference on Object-Oriented Programs, Systems, Languages, and Applications (OOPSLA), 1998
21. Schelfthout, K., Coninx, T., et al, Agent Implementation Patterns, OOPSLA Workshop on Agent-Oriented Methodologies, 2002
22. Shu, S., and Norrie, D., Patterns for Adaptive Multi-Agent Systems in Intelligent Manufacturing, Intl. Workshop on Intelligent Manufacturing Systems (IMS), 1999
23. Silva, A., and Delgado, J., The Agent Pattern, European Conference on Pattern Languages of Programming and Computing (EuroPLoP), 1998
24. Sycara, K., Multiagent Systems, AI Magazine, 79–92, Summer 1998
25. Tahara, Y., Oshuga, A., and Hiniden, S., Agent System Development Method Based on Agent Patterns, Intl. Conference on Software Engineering (ICSE), ACM, 1999
26. Voss, A., and Kreifelts, T., SOaP: Social Agents Providing People with Useful Information, Conference on Supporting Groupwork (GROUP), ACM, 1997
27. Weiss, M., Patterns for e-Commerce Agent Architectures: Using Agents as Delegates, Conference on Pattern Languages of Programming (PLoP), 2001
28. Weiss, M., Pattern-Driven Design of Agent Systems: Approach and Case Study, Conference on Advanced Information System Engineering (CAiSE), Springer, 2003
29. Wooldridge, M., and Jennings, N., Intelligent Agents: Theory and Practice, The Knowledge Engineering Review, 10(2):115–152, 1995
30. Wooldridge, M., Jennings, N., et al, The Gaia Methodology for Agent-oriented Analysis and Design, Journal of Autonomous Agents and Multi-Agent Systems, 2002
Using Scenarios for Contextual Design in Agent-Oriented Information Systems Kibum Kim, John M. Carroll, and Mary Beth Rosson Department of Computer Science and Center for Human-Computer Interaction, Virginia Tech, Blacksburg, 24061, USA {kikim,carroll,rosson}@vt.edu
Abstract. In this position paper, we argue that current agent-oriented development methodologies are limited in their ability to model social aspects of the agents and human-software agent interactions. We identify how these limitations can be rectified using scenarios for contextual design in agent-oriented information systems (AOIS).
1 Limitations of Traditional Approaches to Agent-Oriented Development Methodologies
Currently, the primary approaches to agent-oriented development methodologies involve either adopting conventional software engineering methodologies—for example, Agent UML (AUML) and Methodology for Engineering Systems of Software Agents (MESSAGE)—or extending knowledge-engineering methodologies with agent-related concepts such as Conceptual Modeling of Multi-Agent Systems (CoMoMas), Gaia, and MAS-CommonKads [1]. However, because these approaches are system-centric rather than user-centric, they remain inappropriate for dealing with the important developmental concerns of human-software agent interaction, as well as the human factors involved in developing interactive agents. Most problems associated with the adoption of conventional software engineering methodologies derive from the essential differences between distributed objects and agents. In particular, although objects are not social, agents are characterized by their social aspects, and existing software development techniques usually do not adapt well to this purpose. In addition, while extending knowledge-engineering methodologies to agent development can provide techniques for modeling agent knowledge, such extensions do not effectively deal with the distributed or social aspects of agents, or with modeling their social interactions. Therefore, theoretical frameworks are needed that analyze how people communicate and interact with the variety of agents that constitute their work environments.
2 Scenario-Based Contextual Design in Agent-Oriented Information Systems
Increasingly, information technology companies are moving from being technology-driven to being customer-driven, focused on ensuring system functions and structures that will work for the customer. A broad range of analysis involving human-computer interaction has already recognized that system design can profit from explicitly studying the context in which users work [2]. To achieve systems that are more “customer-centered,” one must depend upon Contextual Design as the backbone methodology for front-end design [3]. AOIS can achieve its goal most effectively when its design methodology takes into account what customers need, as well as how human-software agent interactions and social interactions between agents are structured within a usage context.
Scenarios—descriptions of meaningful usage episodes—have become key to making abstract models understandable. They help designers examine and test their ideas in practice, with each narrative created to evoke an image of people doing things, pursuing goals, and using technology to support these goals. In Scenario-Based Design (SBD), designers also evaluate scenarios through claims analysis, wherein the positive and negative implications of design features are considered through “what if” discussions and the scenarios serve as a usage context for considering interaction options. Scenarios and claims analysis are useful for describing initiatives or actions taken by a software agent, for considering their usability implications, and for emphasizing the context of work in the real world.
The Point-of-View Scenarios (POVs) describe each agent's responsibilities in the scenario, including the extent of its collaboration with other agents [4]. Creating POVs encourages software designers to anthropomorphize agents and their behaviors, as a heuristic for reasoning about what the agent could or should do to support user needs (see Table 1). They help designers construct an initial analysis of the attributes and responsibilities of individual agents, which might lead them to consider how different agents might influence what users will be able to do with the system. In light of the POV analysis described in Table 1, Table 2 depicts usability tradeoffs that must be considered.
Table 1. Point-of-view scenarios (POVs) created from the perspective of a software agent
Scenario Agent: Social Network Visualizer
Point-of-View Scenario: I was created based on Mrs. Parry's constant email correspondence with her colleagues. When she first opened me, I asked a database manager for information about her social networks and displayed her personal connections and groups. When she sent email to a new person, I worked with it to set up a new relationship in her social network. Whenever I was asked to display myself, I made sure all my nodes and links were shown correctly within the frame.
Table 2. Examples of usability tradeoffs to consider in light of the POV analysis
Scenario Feature: Automatic creation based on email correspondence
Possible Positive (+) and Negative (–) Attributes of Feature:
+ assists users by maintaining networks of colleagues, acquaintances, and friends based on their personal histories or behaviors.
+ users can quickly and conveniently explore who knows whom through social networks.
– user intentions might be incorrectly determined by the agent.
– there is potential for losing the user's control, predictability, and comprehensibility.
3 Conclusion
Developing a sound, well-defined agent-oriented development methodology is not an easy task. We have briefly explored the possible contributions that a scenario-based contextual design approach for AOIS might provide to designers interested in pursuing a methodology that can facilitate modeling the social aspects of agents and human-software agent interactions. For AOIS development, we recommend using Point-of-View Scenarios created with the aid of anthropomorphism, because they can envision a user's task in terms of a usage context relevant to the problem domain. Several usability issues are also raised by these POVs. Such a methodology will provide a well-organized framework within which software engineers can more effectively develop agent-oriented information systems.
References
1. Iglesias, C.A., Garijo, M., Gonzalez, J.C.: A Survey of Agent-Oriented Methodologies. In: Intelligent Agents V: Proceedings of the ATAL'98, LNAI, vol. 1555. Springer, Berlin Heidelberg NY (1999)
2. Nardi, B.A.: Context and Consciousness: Activity Theory and Human Computer Interaction, 1st edn. MIT Press, Cambridge, MA (1996)
3. Beyer, H., Holtzblatt, K.: Contextual Design: Defining Customer-Centered Systems. Morgan Kaufmann Publishers, Inc., San Francisco, CA (1998)
4. Rosson, M.B., Carroll, J.M.: Scenarios, objects, and points-of-view in user interface design. In: M. van Harmelen (ed.): Object Modeling and User Interface Design. Addison-Wesley, London (2000)
Dynamic Matchmaking between Messages and Services in Multi-agent Information Systems Muhammed Al-Muhammed and David W. Embley Department of Computer Science, Brigham Young University Provo, UT 84602 USA {mja47, embley}@cs.byu.edu
1 Problem Statement
Agents do not work in isolation; instead they work in cooperative groups to accomplish their assigned tasks. In a multi-agent information system, we assume that each of the agents has and acquires knowledge. We further assume that it is important and useful to be able to share this knowledge and to provide useful knowledge sources to enable activities such as comparison shopping, meeting scheduling, and supply-chain management. In order for agents to cooperate, they need to be able to communicate with one another. Communication essentially requires mutual understanding among agents. To achieve this mutual understanding, researchers frequently make three assumptions:
1. Agents share ontologies that define the concepts used by these agents;
2. Agents communicate with the same agent communication language so that they can understand the semantics of the messages; and
3. Agents pre-agree on a message format so that they can correctly parse and understand the communicative act.
These three assumptions are sufficient for agents to communicate; however, they introduce several problems. First and foremost, they imply that agents cannot communicate (interoperate with one another) without agreeing in advance on these three assumptions. Hence, these assumptions preclude agents from interoperating on the fly (without a-priori agreement). Second, they explicitly mean that unless one designer or a group of designers (with full communication among them) develops these agents, the communication among agents is not likely to succeed because all or some of the three assumptions will not hold. Third, the assumptions require a designer who develops a new agent for a multi-agent system to know what ontologies other agents in that system use, what language they speak, and what message format they use. This imposes a stiff requirement on an outside developer. The importance of making agents interoperate on the fly becomes paramount. Indeed, in an interesting paper on agent interoperability, Uschold says that “the holy grail of semantic integration in architectures” is to “allow two agents to generate needed mappings between them on the fly without a-priori agreement and without them having built-in knowledge of any common ontology.” Consequently, in our research we are working on eliminating all three assumptions and allowing agents to interoperate on the fly without having to share knowledge of any ontology, language, or message format.
2 Research Questions
To achieve interoperability among agents on the fly, we must have answers to five major research issues: (1) translating between different ontologies, (2) mapping between services and messages, (3) reconciling differences in data formats, (4) reconciling type mismatches, and (5) handling outputs of services so that only the information requested by a message is provided.
1. Given that agents do not share ontologies – they may represent the same concept using different vocabularies – how can translation among different ontologies be done?
1.a. How can the concepts of independent ontologies – related to the same domain – be matched? In particular, answers for the following subquestions are vital.
1.a.1. How can the semantically related concepts be determined?
1.a.2. How can concepts whose names are the same but whose semantics are different be distinguished?
1.b. What is the information needed to make the translation work?
1.b.1. Can this information be extracted from the agents themselves?
1.b.2. What other resources are needed?
1.b.3. How much information from a multi-agent system is sufficient to do the translation correctly?
2. How can a message be mapped to an appropriate service?
2.a. How can the semantics of a message be captured?
2.a.1. What is the provided information – what will be the values for the service input parameters?
2.a.2. What is the required output – what is the message asking for?
2.a.3. What are the constraints imposed on input (output) parameters?
2.b. How can a message be mapped to some service provided by a receiving agent? This question requires the ability to know the semantics of a service, which requires answers to the following.
2.b.1. What are the semantics of the input parameters and how do these input parameters match with those of a message?
2.b.2. What are the semantics of the outputs and do the outputs constitute an answer for a message?
2.b.3. What are the constraints imposed on inputs and outputs and how can the mismatches between service input (output) constraints and message input (output) constraints be resolved?
3. How can differences among the data formats (different date formats, time formats, currencies) of the communicating agents be recognized and then converted correctly? (A small illustrative sketch follows at the end of this section.)
3.a. What are the problems that arise when converting from one format to another and how can they be resolved?
3.b. How can alternative value representations be guaranteed to match under various conversions?
4. How can the mismatches between types be handled?
4.a. Can the proper conversion be guaranteed?
4.b. Can the loss of precision be recovered?
5. How can any unwanted output of a service be filtered out?
5.a. How can the expected output, which is represented in the local vocabulary of a receiving agent, be recognized?
5.b. Then, how can this output be sifted so that only the required information is delivered to a requesting agent?
We realize that fully resolving these questions is an extremely hard problem. Nevertheless, we believe that the benefits of resolving these issues are of great value. These benefits include: (1) easy development of agents, because developers need not be concerned with the ontologies other agents use, the agent communication languages other agents use, or the format of messages; and (2) increased interoperability, as agents can generate needed mappings on the fly. Thus, tackling these questions and solving the heterogeneous, semantic-mapping problem would be of great benefit.
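To make the data-format issue of question 3 concrete, the small sketch below converts a date value from one agent's representation to another's; the two formats chosen here are arbitrary assumptions used only to illustrate the mismatch, not part of the proposed solution.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Two agents exchange the same date in different formats; a conversion step is
// needed before the values can be matched.
class DateFormatReconciler {
    static final DateTimeFormatter US  = DateTimeFormatter.ofPattern("MM/dd/yyyy");
    static final DateTimeFormatter ISO = DateTimeFormatter.ISO_LOCAL_DATE; // yyyy-MM-dd

    // Convert a value from the sender's format to the receiver's format.
    static String usToIso(String usDate) {
        return LocalDate.parse(usDate, US).format(ISO);
    }

    public static void main(String[] args) {
        System.out.println(usToIso("10/13/2003"));  // prints 2003-10-13
    }
}
```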
Preface to XSDM 2003
XSDM'03 (XML Schema and Data Management) was the first international workshop held in conjunction with the 22nd International Conference on Conceptual Modeling, on October 13, 2003 in Chicago, USA. Web data management systems are rapidly being influenced by XML technologies and are driven by their growth and proliferation to create next-generation web information systems. The purpose of the XSDM workshop is to provide a forum for the exchange of ideas and experiences among theoreticians and practitioners of XML technologies who are involved in the design, management, and implementation of XML-based web information systems. Topics of interest in XSDM'03 included, but were not limited to:
XML schema discovery, XML data integration Indexing XML data, XML query languages XML data semantics, Semantic web and XML Mining of XML data, XML change management XML views and data mappings, Securing XML data XML in new domains- sensor and biological data management
The workshop received an overwhelming response from many different countries. We received a total of 30 papers; the international program committee members reviewed the papers and finally selected 12 full papers and 2 short papers for presentation and inclusion in the proceedings. The workshop program consisted of 4 sessions: XML Change Management and Indexing, Querying and Storing XML Data, XML Transformation and Generation, and XML Mapping and Extraction. I thank all the authors who contributed to the workshop. I also thank the program committee members of the workshop, who selected such high-quality papers, resulting in an excellent technical program. I thank the workshop co-chairs Manfred Jeusfeld and Oscar Pastor for selecting this workshop, and for excellent co-ordination and co-operation during this period. I would also like to thank the ER conference organization committees for their support and help. Finally, I would like to thank the local organizing committee for the wonderful arrangements and all the participants for attending the workshop and stimulating the technical discussions. We hope that all participants enjoyed the workshop and the local sightseeing.
October 2003
Sanjay Madria
A Sufficient and Necessary Condition for the Consistency of XML DTDs Shiyong Lu, Yezhou Sun, Mustafa Atay, and Farshad Fotouhi Department of Computer Science Wayne State University Detroit, MI 48202 {shiyong,sunny,matay,fotouhi}@cs.wayne.edu
Abstract. Recently, XML has emerged as a standard for representing and exchanging data on the World Wide Web. As a result, an increasing number of XML documents are being published on the Web from various data sources. A Document Type Definition (DTD) describes the structure of a set of similar XML documents and serves as the schema for XML documents. The World Wide Web Consortium has defined the grammar for specifying DTDs; however, even a syntactically correct DTD might be inconsistent in the sense that there exist no XML documents conforming to the structure imposed by the DTD. In this paper, we formalize the notion of the consistency of DTDs, and identify a sufficient and necessary condition for a DTD to be consistent.
1 Introduction
XML [2] is rapidly emerging on the World Wide Web as a standard for representing and exchanging data. In contrast to HTML, which describes how data should be displayed to humans, XML describes the meanings and structures of data elements themselves, and therefore makes data self-describing and interpretable to programs. Currently, XML has been used in a wide range of applications as this is facilitated by standard interfaces such as SAX [13] and DOM [1], and the development of techniques and tools for XML such as XSL (Extensible Stylesheet Language) [5], XSLT (XSL Transformation) [3], XPath [4], XLink [11], XPointer [12] and XML parsers. It is well recognized that XML will continue to play an essential role in the development of the Semantic Web [9], the next generation web. XML Document Type Definitions (DTDs) [2] describe the structure of XML documents. With a DTD, independent groups of people can agree to use a common DTD for interchanging data. In addition, an application can use a standard DTD to verify if the data that the application receives from the outside world is valid or not. The World Wide Web Consortium has defined the grammar for DTDs [2]. Essentially, a DTD defines the constraints on the logical structure of XML documents, and an XML document is valid if it has an associated DTD and if the document complies with the constraints expressed in the DTD. Unfortunately, a syntactically correct DTD might be inconsistent in the sense
Fig. 1. A DTD example
that there exist no XML documents conforming to the structure imposed by the DTD. Figure 4 shows some such inconsistent DTDs. Inconsistent DTDs are of course useless, and they should be avoided. In practice, the consistency of small DTDs can be ensured by careful observation based on common sense; tools for checking the consistency of DTDs might be desirable for large DTDs or for DTDs that are generated automatically from other data models such as the ER model and the relational model. In this paper, we first formalize the notion of consistency of an XML DTD: an XML DTD is consistent if and only if there exists at least one XML document that is valid w.r.t. it. We then identify a sufficient and necessary condition for an XML DTD to be consistent, and this condition also implies an algorithm for checking the consistency of XML DTDs. We believe that these results are fundamentally important for XML theory and the semistructured data model. Organization. The rest of the paper is organized as follows. Section 2 gives a brief overview of XML Document Type Definitions (DTDs), and formalizes the notion of consistency of DTDs. Section 3 identifies a sufficient and necessary condition for the consistency of DTDs, which also implies an algorithm that checks the consistency of DTDs. Section 4 discusses our implementation of the algorithm and related work. Finally, Section 5 concludes the paper.
2 Consistency of DTDs
XML Document Type Definitions (DTDs) [2] describe the structure of XML documents and are considered the schemas for XML documents. A DTD example is shown in Figure 1 for memorandum XML documents. In this paper, we model both XML elements and XML attributes as XML elements since XML attributes can be considered as XML elements without further nesting structure. A DTD D is modeled as a set of XML element definitions {d1, d2, · · ·, dk}. Each XML element definition di (i = 1, · · ·, k) is of the form ni = ei, where ni is the name of an XML element, and ei is a DTD expression.
Each DTD expression is composed from XML element names (called primitive DTD expressions) and other DTD subexpressions using the following operators:
– Tuple operator. (e1, e2, · · ·, en) denotes a tuple of DTD subexpressions. In particular, we consider (e) to be a singleton tuple. The tuple operator is denoted by ",".
– Star operator. e* represents zero or more occurrences of subexpression e.
– Plus operator. e+ represents one or more occurrences of subexpression e.
– Optional operator. e? represents an optional occurrence (0 or 1) of subexpression e.
– Or operator. (e1 | e2 | · · · | en) represents one occurrence of one of the subexpressions e1, e2, · · ·, en.
We ignore the encoding mechanisms that are used in data types PCDATA and CDATA and model both of them as data type string. The DOCTYPE declaration states which XML element will be used as the schema for XML documents. This XML element is called the root element. We define a DTD expression formally as follows.
Definition 1. A DTD expression e is defined recursively in the following BNF notation, where n ranges over XML element names and e1, · · ·, en range over DTD expressions:
e ::= string | n | e+ | e* | e? | (e1, · · ·, en) | (e1 | · · · | en)
where the symbol "::=" should be read as "is defined as" and "|" as "or".
Example 1. With our modeling notations, the DTD shown in Figure 1 can be represented as the following set of XML element definitions { memo = (to, from, date, subject?, body, security, lang), security = string, lang = string, to = string, from = string, date = string, subject = string, body = (para+), para = string }.
Informally, an XML document is valid with respect to a DTD if the structure of the XML document conforms to the constraints expressed in the DTD (see [2] for a formal definition of validity), and invalid otherwise. Figure 2 shows a valid XML document with respect to the DTD shown in Figure 1. Figure 3 illustrates an invalid XML document with respect to the same DTD since XML element cc is not defined in the DTD and the required element date is missing. However, some DTDs are inconsistent in the sense that there exist no XML documents that are valid with respect to them. For example, the four DTDs shown in Figure 4 are inconsistent. DTD1 is inconsistent since it requires that element a contains an element b and vice versa. This is impossible for any XML document with a finite size. For similar reasons, the other DTDs are inconsistent. We formalize the consistency of DTDs as follows.
Definition 2. A DTD is consistent if and only if there exists at least one XML document that is valid with respect to it; otherwise, it is inconsistent.
<memo security="5" lang="eng">
  <to>yezhou</to>
  <from>mustafa</from>
  <date>October 27, 2002</date>
  <subject>Why XML</subject>
  <body>
    <para>Is XML really better than HTML?</para>
  </body>
</memo>
Fig. 2. A valid XML document
<memo security="5" lang="eng">
  <to>mustafa</to>
  <from>shiyong</from>
  <cc>yezhou</cc>
  <subject>Re: Why XML</subject>
  <body>
    <para>What HTML are you talking about?</para>
  </body>
</memo>
Fig. 3. An invalid XML document
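To make Definition 1 concrete, the sketch below encodes DTD expressions as a small abstract syntax in Java; the class names are illustrative assumptions and do not come from the paper.

```java
import java.util.List;

abstract class DtdExpr {}

class StringType extends DtdExpr {}                 // #PCDATA / CDATA, modelled as string
class ElementRef extends DtdExpr {                  // a reference to an element name n
    final String name;
    ElementRef(String name) { this.name = name; }
}
class Plus extends DtdExpr {                        // e+
    final DtdExpr e;
    Plus(DtdExpr e) { this.e = e; }
}
class Star extends DtdExpr {                        // e*
    final DtdExpr e;
    Star(DtdExpr e) { this.e = e; }
}
class Optional extends DtdExpr {                    // e?
    final DtdExpr e;
    Optional(DtdExpr e) { this.e = e; }
}
class Tuple extends DtdExpr {                       // (e1, ..., en)
    final List<DtdExpr> parts;
    Tuple(List<DtdExpr> parts) { this.parts = parts; }
}
class Or extends DtdExpr {                          // (e1 | ... | en)
    final List<DtdExpr> choices;
    Or(List<DtdExpr> choices) { this.choices = choices; }
}

class Example1 {
    // The memo definition from Example 1: memo = (to, from, date, subject?, body, security, lang)
    static DtdExpr memo() {
        return new Tuple(List.of(
            new ElementRef("to"), new ElementRef("from"), new ElementRef("date"),
            new Optional(new ElementRef("subject")), new ElementRef("body"),
            new ElementRef("security"), new ElementRef("lang")));
    }
}
```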
3 Condition for the Consistency of DTDs
Obviously, inconsistent DTDs are not useful in real life although they are syntactically correct. Hence, it is important to characterize consistent DTDs. To do this, we introduce the notion of a DTD graph, which graphically represents the structure of a DTD. In this graph, nodes represent XML elements and edges represent operators over them and are labeled by the corresponding operators. An edge in a DTD graph is called a ?-edge, *-edge, +-edge, ,-edge or |-edge respectively according to the operator label on it. In this paper, we call ,- and +-edges hard edges, and other kinds of edges soft edges. Similarly, a path (or a cycle) p = e1 → e2 → · · · → en is hard if each edge ei → ei+1 (for i = 1, · · ·, n-1) in p is hard, and soft otherwise. For brevity, the label of a ,-edge is omitted and made implicit. Nodes without incoming edges are called sources and nodes without outgoing edges are called terminals. Note that our notion of DTD graph differs from the one defined in [14] in which operators are also treated as nodes of the graph.
DTD1, DTD2, DTD3, DTD4 (element declarations not shown)
Fig. 4. Inconsistent DTDs
Fig. 5. Cyclic and inconsistent DTD Graphs
Definition 3. Given a DTD graph g, we say XML element e leads to f if e = f or if there is a hard path from e to f in g. Definition 4. XML element e leads to a cycle c if e leads to any XML element in c. Example 2. The DTD graphs for the DTDs in Figure 4 are illustrated in Figure 5. In all of these graphs, the root element a leads to a cycle. All of the four DTDs are inconsistent.
DTD5, DTD6, DTD7, DTD8 (element declarations not shown)
Fig. 6. Consistent DTDs despite mutual recursions
However, not all cycles imply the inconsistency of DTDs since soft edges do not cause the inconsistency of a DTD. This is illustrated by the following example. Example 3. The DTDs shown in Figure 6 are all consistent although they contain mutual recursions and their corresponding DTD graphs (Figure 7) have cycles. In particular, DTD5 and DTD7 contain soft cycles, but DTD6 and DTD8 contain hard cycles.
Fig. 7. Cyclic but consistent DTD Graphs
The operator | introduces complexity. For example, in DTD8 of Figure 7, the DTD is consistent despite the presence of a hard cycle since the XML document <a><d>Hello world!</d></a> is valid w.r.t. it. In the following, to identify the condition for a DTD to be consistent, we first consider those DTDs that do not involve operator "|". We call these DTDs |-free DTDs. We will consider DTDs that involve operator "|" later. In Figure 6, DTD5, DTD6 and DTD7 are all |-free DTDs while DTD8 is not. The following lemma identifies a necessary condition for an |-free DTD to be consistent.
Lemma 1. A |-free DTD D with root r is consistent only if r does not lead to a hard cycle in its DTD graph.
Proof. Suppose r leads to a hard cycle e0 → e1 → · · · → en → e0; then any XML document that is valid w.r.t. D must contain all the elements in this cycle. In the following, we prove by contradiction that D is inconsistent. Suppose D is consistent; then there exists an XML document x that is valid w.r.t. D. x must contain all the elements e0, · · ·, en. Let ei be the innermost element of x in the sense that ei does not contain any other elements from e0, · · ·, en. The finite size of x implies that such an innermost element ei exists. However, the edge from ei to ei+1 is either a ,-edge or a +-edge. This implies that ei must contain at least one occurrence of ei+1. This contradicts the assumption that ei is the innermost element of x. Therefore, there exists no XML document x that is valid w.r.t. D and hence D is inconsistent.
The following lemma identifies a sufficient condition for a DTD (including |-free DTDs) to be consistent.
Lemma 2. A DTD D with root r is consistent if r does not lead to any hard cycle.
Proof. Suppose r does not lead to any hard cycle in the DTD graph of D. In the following, we prove D is consistent by constructing an XML document that is valid w.r.t. D. First, we convert D into another DTD D′ using the following transformation rules, where empty represents the empty element, i.e., e? ≡ e | empty, and e, e1, · · ·, en range over DTD expressions: (1) e+ ⇒ e; (2) e* ⇒ empty; (3) e? ⇒ empty; (4) (e1 | · · · | en) ⇒ e1. Obviously, the DTD graph of D′ only contains ,-edges. In addition, the strongly connected subgraph g′ of the DTD graph of D′ that contains r must be acyclic since r does not lead to any hard cycle in the DTD graph of D. Based on the acyclicity of g′, it is straightforward to create an XML document x that conforms to g′ (we leave the details of the creation to the readers) and thus is valid w.r.t. D′. Since any document that is valid w.r.t. D′ must also be valid w.r.t. D, x is valid w.r.t. D as well. Hence, D is consistent.
Based on Lemmas 1 and 2, the following theorem states a sufficient and necessary condition for an |-free DTD to be consistent.
Theorem 1. An |-free DTD with root r is consistent if and only if r does not lead to any hard cycle.
Proof. According to Lemmas 1 and 2.
With the above theorem, the algorithm to decide whether an |-free DTD D with root r is consistent is straightforward: create the DTD graph for D and do a depth-first traversal of this graph starting at r, following only hard edges. Each node is marked as "visited" the first time it is reached. If a node that is still on the current traversal path is reached again (i.e., a back edge is found), then r leads to a hard cycle and D is inconsistent; otherwise, D is consistent.
Example 4. Consider the DTDs and their DTD graphs shown in Figures 6 and 7, respectively. DTD5 is consistent since there is no hard cycle present in its DTD graph; DTD6 is consistent although there is a hard cycle in its DTD graph, since the root element a does not lead to that hard cycle; DTD7 is also consistent since the cycle is not a hard one. The consistency problem of DTD8 will be discussed later.
Theorem 2 (Complexity). The time complexity of checking whether an |-free DTD D is consistent or not is O(n) where n is the size of D.
Proof. Both the creation of the DTD graph for D and the checking of the cyclicity of this graph can be done in O(n) where n is the size of D. We leave the details of the proof to the readers.
To deal with | operators, we split a DTD D into a set of |-free DTDs D1, D2, · · ·, Dm such that D is consistent if and only if one of D1, D2, · · ·, Dm is consistent. We first formalize the notion of a split of a DTD expression, and then we extend this notion to one for an XML element definition and one for an XML DTD.
Definition 5. The split of a DTD expression e is a set of |-free DTD expressions split(e) that is defined recursively by the following rules:
– split(string) = {string}.
– split(n) = {n}.
– split(e+) = {g+ | g ∈ split(e)}.
– split(e*) = {g* | g ∈ split(e)}.
– split(e?) = {g? | g ∈ split(e)}.
– split((e1, e2, · · ·, en)) = {(g1, g2, · · ·, gn) | gi ∈ split(ei) for i = 1, · · ·, n}.
– split((e1 | e2 | · · · | en)) = split(e1) ∪ split(e2) ∪ · · · ∪ split(en).
Example 5. split((a+, b?)*) = {(a+, b?)*} and split((a?, b* | c , d | e)) = {(a?, b*, d), (a?, b*, e), (a?, c, d), (a?, c, e)}. The following lemma indicates that |-free DTD expressions are fixpoints of the split transformation function. Lemma 3. All DTD expressions in split(e) are |-free and for each |-free DTD expression e, we have split(e) = e.
Proof. We can easily prove it by applying a structural induction on e.
We extend the notion of split to one for an XML element definition.
Definition 6. The split of an XML element definition di (with the form ni = ei) is a set of |-free XML element definitions split(di) that is defined as split(ni = ei) = {ni = g | g ∈ split(ei)}.
Example 6. split(n=(a+, b?)*) = {n=(a+, b?)*} and split(n=(a?, b* | c, d | e)) = {n=(a?, b*, d), n=(a?, b*, e), n=(a?, c, d), n=(a?, c, e)}.
The following lemma indicates that |-free XML element definitions are fixpoints of the split transformation function.
Lemma 4. All XML element definitions in split(d) are |-free and for each |-free XML element definition d, we have split(d) = d.
Proof. It follows from Lemma 3 and the notion of split for XML element definitions in Definition 6.
Finally, we extend the notion of split to one for a DTD.
Definition 7. The split of an XML DTD D = {d1, d2, · · ·, dk} is a set of |-free DTDs split(D) that is defined as split({d1, d2, · · ·, dk}) = {{d1′, d2′, · · ·, dk′} | di′ ∈ split(di) for i = 1, 2, · · ·, k}.
Example 7. DTD8 in Figure 6 can be represented as D = {a = b | d, b = c, c = b, d = string} and we have split(D) = {{a = b, b = c, c = b, d = string}, {a = d, b = c, c = b, d = string}}. Notice that the first DTD in split(D) is inconsistent since root a leads to a hard cycle, but the second one is consistent since root a does not lead to any hard cycle (despite the presence of a hard cycle).
The following theorem indicates that |-free DTDs are fixpoints of the split transformation function.
Theorem 3. All DTDs in split(D) are |-free and for each |-free DTD D, we have split(D) = D.
Proof. It follows from Lemma 4 and the notion of split for XML DTDs in Definition 7.
Finally, we identify a sufficient and necessary condition for an arbitrary DTD to be consistent.
Theorem 4. A DTD D with root r is consistent if and only if at least one of the DTDs in split(D) with root r is consistent.
Proof. It follows from the split notions defined in Definitions 5, 6 and 7.
For example, DTD8 shown in Figure 7 is consistent, as one of its splits (the one in which root a no longer leads to the cycle of b and c) is consistent. With the above theorem, the algorithm to decide whether an arbitrary DTD D with root r is consistent is straightforward: calculate split(D), which is a set of |-free DTDs, and check whether there exists a DTD in split(D) that is consistent, using the decision procedure for an |-free DTD. The following theorem states the complexity of checking the consistency of an arbitrary DTD; it shows that | might increase the complexity in an exponential fashion.
Theorem 5 (Complexity). The time complexity of checking whether a DTD D is consistent or not is O(n · 2^m) where n is the size of D and m is the number of | operators in D.
Proof. Based on Theorem 2 and the fact that split(D) contains at most 2^m |-free DTDs, each of which has size n.
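The core of the decision procedure can be sketched compactly in Java, assuming that the hard edges of the DTD graph (the ,-edges and +-edges) have already been extracted into an adjacency map and that, for an arbitrary DTD, split(D) is computed first so that each |-free DTD can be checked in turn. This is only an illustration of the idea, not the implementation the authors describe in the next section; all names are hypothetical.

```java
import java.util.*;

class ConsistencyChecker {

    // true iff the root leads to a hard cycle, i.e. the |-free DTD is inconsistent.
    static boolean leadsToHardCycle(Map<String, List<String>> hardEdges, String root) {
        return dfs(root, hardEdges, new HashSet<>(), new HashSet<>());
    }

    private static boolean dfs(String node, Map<String, List<String>> hardEdges,
                               Set<String> onPath, Set<String> done) {
        if (onPath.contains(node)) return true;   // back edge: hard cycle reachable from root
        if (done.contains(node)) return false;    // already explored, no cycle through it
        onPath.add(node);
        for (String next : hardEdges.getOrDefault(node, List.of())) {
            if (dfs(next, hardEdges, onPath, done)) return true;
        }
        onPath.remove(node);
        done.add(node);
        return false;
    }

    public static void main(String[] args) {
        // DTD1 of Fig. 4: a and b require each other, so the root leads to a hard cycle.
        Map<String, List<String>> dtd1 = Map.of("a", List.of("b"), "b", List.of("a"));
        System.out.println(leadsToHardCycle(dtd1, "a"));   // true -> inconsistent
    }
}
```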
4 Implementation and Related Work
We have implemented our algorithm in Java and have used it in one of our XML projects. The source code is downloadable at http://database.cs.wayne.edu/download/dtd con.zip. Recently, we noticed that Fan and Libkin have studied the consistency problem in a more general context in which the structural constraints enforced by XML DTDs might interact with integrity constraints [7] [8]. They have shown that the consistency problem of a DTD can be reduced to the emptiness problem for a context free grammar, which is decidable in linear time. In contrast, our algorithm is linear for |-free DTDs but exponential for DTDs in which |’s are present. Our contribution is that we identified an explicit sufficient and necessary condition for the consistency of XML DTDs in terms of the cyclicity of DTD graphs. Therefore, our algorithm and implementation are useful in a context in which DTD graphs are used, such as [14] [10]. An experimental comparison study between these two approaches is an interesting future work but is beyond the scope of this paper.
5 Conclusions and Future Work
In this paper, we have formalized the notion of consistency of XML DTDs and identified a sufficient and necessary condition for a DTD to be consistent. This condition implies an algorithm for checking the consistency of DTDs. We have implemented the algorithm in Java and have used it in one of our projects on XML. We expect that this algorithm might be integrated into various XML tools where the consistency of XML DTDs is critical. XML Schema [15] is a recent W3C standard. In addition to the features in DTDs, XML Schema supports typing of values and set size specification. There are several studies on converting a DTD into an XML Schema [6]. A DTD still needs
to be consistent before it is converted to an XML Schema, and the algorithm introduced in this paper will be applicable to XML Schema with minor extensions; we leave this as future work.
Acknowledgements. We are grateful to the anonymous reviewers for their helpful comments and suggestions. We are also thankful to Lei Chen who implemented our proposed consistency checking algorithm in Java.
References
1. Document Object Model (DOM), October 1998. http://www.w3.org/DOM/.
2. T. Bray, J. Paoli, C. Sperberg-McQueen, and E. Maler. Extensible Markup Language (XML) 1.0, October 2000. http://www.w3.org/TR/REC-xml.
3. J. Clark. XSL Transformation (XSLT) Version 1.0, November 1999. http://www.w3.org/TR/xslt.
4. J. Clark and S. DeRose. XML Path Language (XPath) Recommendation. http://www.w3.org/TR/xpath.
5. S. Deach. Extensible Stylesheet Language (XSL) Specification. http://www.w3.org/TR/xsl.
6. R. dos Santos Mello and C. A. Heuser. A rule-based conversion of a DTD to a conceptual schema. Lecture Notes in Computer Science, 2224:133, 2001.
7. W. Fan and L. Libkin. On XML integrity constraints in the presence of DTDs. In Proc. ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pages 114–125, Santa Barbara, California, May 2001.
8. W. Fan and L. Libkin. On XML integrity constraints in the presence of DTDs. Journal of the ACM, 49(3):368–406, 2002.
9. S. Lu, M. Dong, and F. Fotouhi. The semantic web: Opportunities and challenges for next-generation web applications. International Journal of Information Research, 7(4), 2002. Special Issue on the Semantic Web.
10. S. Lu, Y. Sun, M. Atay, and F. Fotouhi. A new inlining algorithm for mapping XML DTDs to relational schemas. In Proc. of the 1st International Workshop on XML Schema and Data Management (Lecture Notes in Computer Science), Chicago, Illinois, USA, October 2003. To appear.
11. E. Maler and S. DeRose. XML Linking Language (XLink). http://www.w3.org/TR/xlink.
12. E. Maler and S. DeRose. XML Pointer Language (XPointer). http://www.w3.org/TR/xptr.
13. D. Megginson. SAX – The Simple API for XML. http://www.saxproject.org/.
14. J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. DeWitt, and J. Naughton. Relational databases for querying XML documents: Limitations and opportunities. The VLDB Journal, pages 302–314, 1999.
15. C. Sperberg-McQueen and H. Thompson. W3C XML Schema, April 2000. http://www.w3.org/XML/Schema.
Index Selection for Efficient XML Path Expression Processing
Zhimao Guo, Zhengchuan Xu, Shuigeng Zhou, Aoying Zhou, and Ming Li
Dept. of Computer Science and Engineering, Fudan University, China
{zmguo,zcxu,sgzhou,ayzhou,mingli}@fudan.edu.cn
Abstract. One approach to building an efficient XML query processor is to use an RDBMS to store and query XML documents. XML queries contain a number of features that are either hard to translate into SQL or for which the resulting SQL is complex and inefficient. Among them, path expressions pose a new challenge for efficient XML query processing in RDBMSs, and building index structures for path expressions is necessary. Indexes, however, occupy considerable disk space, so there is a tradeoff between the consumption of disk space and the efficiency of query evaluation. In this paper, we present a cost model for the space consumption of indexes and their benefit to XML queries. Making use of the statistics of the XML data and the characteristics of the target application, we adopt a greedy algorithm to select the map indexes to be built. Our experimental study demonstrates that query performance improves significantly over the case without indexes while only a modest amount of disk space is consumed.
1 Introduction
Due to its flexibility, XML is rapidly emerging as the de facto standard for representing, exchanging and accessing data over the Internet. XML data is an instance of semi-structured data [1]. It comprises hierarchically nested collections of elements, and the tags stored with elements describe the semantics of the data. Thus, XML data, like semi-structured data, is hierarchically structured and self-describing. As Web applications process an increasing amount of XML data, there is a growing interest in storing XML data in relational databases so that these applications can use a complete set of data management services (including concurrency control, scalability, etc.) and benefit from highly optimized relational query processors. Storing XML data in relational databases has recently been studied extensively [2,3,4]. However, how to build an efficient XML query processor is still largely an open problem.
The work was supported by the Hi-Tech Research and Development Program of China under grant No. 2002AA413110. Shuigeng Zhou was also supported by the Hi-Tech Research and Development Program of China under grant No. 2002AA135340 and partially supported by the Open Research Fund Program of the State Key Lab of Software Engineering of China under grant No. SKL(4)003.
Various languages for querying semi-structured data or XML have been proposed [5,6,7,8]. One of the most important features of these languages is path expressions. XML query languages such as XPath [7] and XQuery [8] use path expressions to traverse XML data, so processing path expressions plays a key role in XML query evaluation. Naive evaluation of path expressions requires exhaustively searching the entire XML document, which is inefficient. In particular, in relational databases, path traversal is realized by joining the relevant tables. Join operations are well known to be costly, which substantially compromises the benefit of using a relational engine for processing XML data. An efficient way to process path expression queries, and thereby reduce join costs, is to use indexes.

Indexing techniques have been studied extensively in various contexts. The join index was first proposed in relational databases [9]. Several data structures were also proposed to handle path queries in object-oriented databases, e.g., access support relations [10] and the join index hierarchy [11]. However, unlike relational and object-oriented databases, the structure of XML documents is highly flexible, and path expressions in XML query languages are highly expressive, so the existing index structures are not directly applicable to XML query processing. In so-called native systems, several indexes were proposed: the strong DataGuide [12] in the Lore system [13], and the T-index proposed in [14]. They can be used to process simple path expressions. However, the strong DataGuide and the T-index may be very large while offering only a small improvement over the naive evaluation of path expressions.

In [15], Zheng et al. proposed a new index structure, called the structural map, for efficiently processing path expression queries on top of any existing database system. The structural map consists of two parts, a guide and a set of maps. They present two kinds of maps: the 1-way map for evaluating simple path expressions, and the n-way map for evaluating regular path expressions. They create maps for all path expressions appearing in the query workload. The structural map can significantly speed up the evaluation of path expression queries. Meanwhile, these maps (referred to as map indexes below) occupy disk space. If more map indexes are built, some queries are likely to be processed more efficiently, but the disk space overhead becomes overwhelming. If map indexes are built for only part of the path expressions, then the evaluation of some queries will be slowed down to some extent. Thus there is a tradeoff between the consumption of disk space and the efficiency of query evaluation. A group of map indexes chosen to be built is called an index scheme.

Different XML data sets have different statistics, and different applications may present a variety of access patterns. For example, a Web site may perform a large volume of simple, localized lookup queries, whereas a catalog printing application may require large and complex queries with deeply nested results [16]. In this paper, we introduce a novel cost-based approach to index selection for efficiently processing XML path expressions. We describe the design and implementation of our approach, which automatically finds a near-optimal index scheme for a specified target application.
Fig. 1. The tree structure of an example XML document
The design principles underlying our approach are a cost-based search for a near-optimal index scheme and the reuse of existing technologies. The first principle is to take the application into consideration. More precisely, a query workload contains a set of queries and, for each query, an associated weight that reflects its importance for the application. The second principle is to leverage existing relational database technologies whenever possible; we use the relational optimizer to obtain cost estimates.

Our work is based on that of [15]. We present a cost model for the space consumption of map indexes and their resulting benefit to XML queries by making use of the statistics of the XML data and the characteristics of the workload. We adopt a greedy algorithm to select a near-optimal set of map indexes to be created, i.e., a near-optimal index scheme. Under a predefined disk space constraint, this group of map indexes brings the most benefit. Our experimental study demonstrates that query performance improves significantly over the case without indexes at the cost of a modest disk space overhead. To the best of our knowledge, we are the first to take the characteristics of the application and the disk space constraint of the indexes into consideration together.
2 Structural Map
First we briefly review the XML storage model and the work of [15]. An XML document can be represented as a data graph. In the data graph, a node represents an element or attribute. Each node has a unique identifier and is labelled by its tag name. A leaf node is also attached with a string, which represents the text content of an element or attribute. An edge represents a parent-child relationship. For example, Fig. 1 shows a data graph for a fragment of an XML document specified in the XML benchmark XMach-1 [17].

The XML storage model has been studied extensively [2,3,4]. In our study, for simplicity and without loss of generality, we assume a rather simple DTD-based storage model that maps each element type in the DTD into a relation and handles the parent-child relationships as pairs of identifiers. For example, the XML fragment in Fig. 1 is stored in five relations: document, chapter, head, section and paragraph. Fig. 2 shows the relations for document, chapter, section and paragraph.
Fig. 2. The storage example for the XML document
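To make this storage model concrete, the following Java (JDBC) sketch creates relations of the shape just described, with parent/child identifier columns (explained below) and a text column for atomic element types. It is only an illustration, not the schema used in the paper: the column names (xfrom, xto, chosen to avoid the SQL reserved words from and to), the id and text columns, the column types, and the JDBC URL are all assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateStorageSchema {
    public static void main(String[] args) throws Exception {
        // URL and credentials are placeholders; any JDBC-accessible RDBMS would do.
        try (Connection con = DriverManager.getConnection("jdbc:db2:xmldb", "user", "pwd");
             Statement st = con.createStatement()) {
            // One relation per element type; xfrom holds the parent identifier,
            // xto the identifier of the node the tuple represents.
            st.executeUpdate("CREATE TABLE document  (xfrom INT, xto INT)");
            st.executeUpdate("CREATE TABLE chapter   (xfrom INT, xto INT, id VARCHAR(32))");
            st.executeUpdate("CREATE TABLE section   (xfrom INT, xto INT)");
            // Atomic element types additionally carry a text column for the content string.
            st.executeUpdate("CREATE TABLE head      (xfrom INT, xto INT, text VARCHAR(1024))");
            st.executeUpdate("CREATE TABLE paragraph (xfrom INT, xto INT, text VARCHAR(4096))");
        }
    }
}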
The from and to fields of each relation represent the identifiers of the parent and of the node itself to which the tuple refers, respectively. In a relation for an atomic element type, a text field is added to hold the content string.

In the structural map [15], there are two kinds of map indexes, the 1-way-map and the n-way-map. Given a label path l1/l2/.../ln, a map is a set of identifier pairs (id1, idn). More specifically, for each label pair (li, lj) in the guide, if there is only a simple path li/li+1/.../lj, then a 1-way-map is built for it; if there is more than one path from li to lj, then an n-way-map is built for the regular path li//lj. For more detail on the structural map, please refer to [15].

Although mapping XQuery statements to SQL statements is an important task of the query evaluation process, it is not the focus of this paper; we refer the interested reader to recently proposed mapping algorithms from XML query languages to SQL [18,19]. Next, we examine how to evaluate an XML query with the assistance of map indexes. As an example, consider the XQuery statement Qx:

for $x in /document/chapter[@id=‘c1’]//head return $x.

If no map indexes have been built, then the corresponding SQL statement would be something like Q1:

select h1.text from chapter c1, head h1
where c1.to=h1.from and c1.id=‘c1’
union
select h2.text from chapter c2, section s2, head h2
where c2.to=s2.from and s2.to=h2.from and c2.id=‘c1’
union ...

However, if the map index mi for the label path chapter//head has been built, the SQL statement would be like Q2:

select h.text from mi, chapter c, head h
where mi.from = c.to and mi.to = h.to and c.id=‘c1’.

It is obvious that this SQL query is more efficient than the former, tedious one. Of course, in order to build the map index mi, we have to commit the following SQL statement Qmat to the RDBMS:
(To avoid cluttering the queries, we have omitted sorting and tagging here.)
create table mi as
select c1.to as from, h1.to as to
from chapter c1, head h1
where c1.to=h1.from
union
select c2.to as from, h2.to as to
from chapter c2, section s2, head h2
where c2.to=s2.from and s2.to=h2.from
union ...
In fact, the idea is much like that of materialization in query optimization. Since the “materialized” map indexes occupy much disk space, we have to trade off the overhead of disk space consumption against their benefit to query evaluation: we choose and build only part of the map indexes, rather than build map indexes for all label pairs. To justify this choice, we next report the results of our preliminary experiments, in which we measured the disk consumption when all map indexes are built. The results are shown in Table 1. We varied the size of the original XML document from 2M to 10M, and measured the disk space required by the database and by the map indexes, respectively.

Table 1. Size of map indexes

XML Data (M)   Database (M)   Map Indexes (M)
2              2.1            7.9
4              4.0            16.1
6              6.0            24.0
8              8.1            32.2
10             9.9            40.1

As shown in Table 1, the overhead brought by the map indexes is overwhelming: for the 10M XML document, the map indexes alone require 40M of disk space, which is hardly bearable. Therefore, we should carefully choose an optimal group of map indexes to be built.
3 Cost Model
Now we formally state the map index selection problem. First, we model the query workload wkld of the specified application as

wkld = {(Qi, wi), i = 1, 2, ..., n},   (1)

where Qi is an XML query and wi is its weight, reflecting the relative importance of Qi within the workload wkld. A high weight means that the query is requested more frequently, or that it should be executed with less cost (i.e., has high priority). With different sets of map indexes built, the cost of the workload is different. The cost of the workload wkld against a particular index scheme S is
the weighted sum of the cost of the different queries in wkld,

cost(wkld, S) = Σ_{i=1..n} wi · cost(Qi, S).   (2)
Here S is the set of map indexes chosen; if no map index is chosen, S = ∅. The map index selection problem can then be defined as follows.

Definition 1. Given an XML document D, a space constraint Cons on the disk size of the map indexes, and a workload wkld, determine the optimal index scheme S for this document, such that cost(wkld, S) is minimal.

Unfortunately, this problem can be proven to be NP-hard. An index scheme consists of some map indexes; if there are n map indexes that can be built, then each of the 2^n sets of map indexes is a candidate for the optimal index scheme, i.e., there are 2^n states in the entire search space. Therefore, exhaustive search is not practical, and our search strategy is based on greedy heuristics. We adopt a greedy algorithm to find a near-optimal set of map indexes that brings the most benefit to the evaluation of the specified workload under a predefined disk space constraint.

In fact, we pay little attention to the exact value of cost(wkld, S). Rather, we are interested in the difference between cost(wkld, S1) and cost(wkld, S2): if cost(wkld, S1) is less than cost(wkld, S2), then the index scheme S1 is more beneficial than S2. The benefit of an index scheme S is defined as

bf(S) = cost(wkld, ∅) − cost(wkld, S).   (3)

However, this definition of bf(S) is not conveniently operational. Therefore, we give the operational definition

bf(S) = Σ_{i=1..n} wi · bf_ind^i(S).   (4)

Here, bf_ind^i(S) denotes the benefit resulting from S with respect to Qi individually. It is defined as

bf_ind^i(S) = Σ_j db_cost(Qmat^j),   (5)

where the sum ranges over those Qmat^j that are used by Qi.
In this definition, Qmat^j is the select part of the SQL statement that is used to build the corresponding map index. db_cost(Q) denotes the cost of evaluating an SQL query Q as returned by the RDBMS optimizer (most commercial DBMSs provide such statistics). Similarly, let db_size(Q) denote the total size of the result of a query Q; DBMSs can also provide this kind of estimate. It is worth pointing out that the above definition is approximate. The rationale behind this approximation is that, if the map index corresponding to Qmat^j has been built, then we can use it to rewrite Qi, and during the evaluation process the cost spent on evaluating the “map index” part is reduced.
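For illustration, one possible way to obtain db_cost(Q) with Java/JDBC is sketched below, assuming an IBM DB2 back end with the explain tables created. The special register and the EXPLAIN_STATEMENT/TOTAL_COST names follow DB2's explain facility but should be verified against the product documentation; other DBMSs expose their optimizer estimates through different interfaces.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DbCost {
    /** Estimated optimizer cost of query q (assumes DB2 explain tables exist). */
    static double dbCost(Connection con, String q) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.execute("SET CURRENT EXPLAIN MODE EXPLAIN");   // compile and explain only
            try {
                st.execute(q);                                // populates the explain tables
            } catch (SQLException expected) {
                // In explain mode DB2 reports that the statement was not executed;
                // whether this surfaces as a warning or an exception is driver-dependent.
            }
            st.execute("SET CURRENT EXPLAIN MODE NO");
            try (ResultSet rs = st.executeQuery(
                    "SELECT TOTAL_COST FROM EXPLAIN_STATEMENT " +
                    "ORDER BY EXPLAIN_TIME DESC FETCH FIRST 1 ROW ONLY")) {
                return rs.next() ? rs.getDouble(1) : 0.0;
            }
        }
    }
}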
Procedure 1 MaxBenefit(G, S)
1:  for p = 1 to n do
2:    for q = 1 to n do
3:      if mi_pq ∈ S then
4:        A[p, q] = db_cost(Qmat^{mi_pq}); P[p, q] = {mi_pq}
5:      else
6:        A[p, q] = 0; P[p, q] = ∅
7:      end if
8:    end for
9:  end for
10: for r = 1 to n do
11:   for p = 1 to n do
12:     for q = 1 to n do
13:       if (A[p, r] + A[r, q] > A[p, q]) and (p ≠ q) then
14:         A[p, q] = A[p, r] + A[r, q]; P[p, q] = P[p, r] + P[r, q]
15:       end if
16:     end for
17:   end for
18: end for
19: return A and P
Even if the set of map indexes has been fixed, how to make use of them to assist query evaluation is still a problem: with respect to a query, we may have many options. For example, regarding the path expression l1/l2/l3/l4/l5, if map indexes for the label paths l1/l2/l3 and l2/l3/l4 have both been built, then selecting the first one means we cannot make any use of the second, and vice versa. Thus it is not easy to make a proper decision between the two choices. In order to tackle this problem, we denote each label li as a vertex in a directed graph G. In the graph G, the edge connecting the vertexes li and lj is labelled by a distance d_ij = db_cost(Qmat^ij), where Qmat^ij is the select part of the SQL statement that builds the map index for the label path li/lj or li//lj. In order to improve the query evaluation process as much as possible, we wish to find, between each pair of vertexes of the graph G, the paths that maximize the benefit between the two vertexes. Hence, we devise a procedure MaxBenefit, shown in Procedure 1, that, given the graph G and the index scheme S, finds these paths. The input of Procedure MaxBenefit consists of two arguments: G is the corresponding directed graph, and S is the index scheme that is being examined. Lines 1–9 initialize the matrices A and P. A[p, q] represents the maximal benefit computed so far for the label path lp/.../lq or lp//lq, while P[p, q] records the corresponding map indexes used in evaluating this label path. Lines 10–18 use a dynamic programming technique to calculate the final maximal benefit and the set of map indexes being used.
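A compact Java rendering of Procedure 1 is given below for illustration; the class, array, and key names are ours, and the db_cost values of the candidate map indexes are assumed to be supplied in the cost matrix.

import java.util.HashSet;
import java.util.Set;

public class MaxBenefit {

    static final class Result {
        final double[][] benefit;       // A[p][q]: maximal benefit for the label pair (p, q)
        final Set<String>[][] indexes;  // P[p][q]: map indexes used to achieve that benefit
        Result(double[][] b, Set<String>[][] i) { benefit = b; indexes = i; }
    }

    /** cost[p][q] holds db_cost(Qmat) of candidate map index mi_pq if mi_pq is in S, else 0. */
    @SuppressWarnings("unchecked")
    static Result maxBenefit(double[][] cost) {
        int n = cost.length;
        double[][] a = new double[n][n];
        Set<String>[][] p = (Set<String>[][]) new Set[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                a[i][j] = cost[i][j];
                p[i][j] = new HashSet<>();
                if (cost[i][j] > 0) p[i][j].add("mi_" + i + "_" + j);
            }
        // Dynamic-programming step of Procedure 1: combine benefits through intermediate labels r.
        for (int r = 0; r < n; r++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (i != j && a[i][r] + a[r][j] > a[i][j]) {
                        a[i][j] = a[i][r] + a[r][j];
                        Set<String> merged = new HashSet<>(p[i][r]);
                        merged.addAll(p[r][j]);
                        p[i][j] = merged;
                    }
        return new Result(a, p);
    }
}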
After carrying out this algorithm, we obtain {Qmat^ij | Qmat^ij is used by Qi} for each i = 1, 2, ..., n. Then we can compute bf_ind^i(S) and bf(S).

In order to exploit the greedy algorithm, we should distinguish the benefit contributed by an individual map index mi from that contributed by the others. However, with respect to different index schemes, the benefit contributed by mi may differ. The reason for this phenomenon is that, although mi1 is used by Q1 under an index scheme S1, it may not be used by Q1 under another index scheme S2, since S2 contains another, more beneficial map index mi2. In order to measure the contribution of each individual map index mi, we propose the following definition, similar to that of [20].

Definition 2 (Marginal Benefit). mbf(mi, S) is a function measuring the marginal benefit contributed by mi, given that all map indexes in S − {mi} are already built. More formally,

mbf(mi, S) = bf(S) − bf(S − {mi})   if mi ∈ S,
mbf(mi, S) = bf(S + {mi}) − bf(S)   if mi ∉ S.   (6)

Therefore, we can evaluate the benefit of building a map index and then, in view of the cost model, develop the optimal index selection algorithm.
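Equation (6) translates directly into code. A minimal Java sketch is given below, with the benefit function bf left abstract (it would be evaluated via Eq. (4) and the db_cost estimates above); the class and method names are ours.

import java.util.HashSet;
import java.util.Set;
import java.util.function.ToDoubleFunction;

public class MarginalBenefit {
    /** mbf(mi, S) following Eq. (6); bf evaluates the benefit of an index scheme. */
    static double mbf(String mi, Set<String> s, ToDoubleFunction<Set<String>> bf) {
        Set<String> other = new HashSet<>(s);
        if (s.contains(mi)) {
            other.remove(mi);                                     // S - {mi}
            return bf.applyAsDouble(s) - bf.applyAsDouble(other);
        } else {
            other.add(mi);                                        // S + {mi}
            return bf.applyAsDouble(other) - bf.applyAsDouble(s);
        }
    }
}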
4 Search Strategies
Assuming that S is the current group of map indexes to be built, let us examine how to obtain another index scheme S′ that is better than S, according to whether the total size of S exceeds the disk space constraint or not. The principles that guide our choices are: (a) if the total size of S exceeds the space constraint, then we have to discard the map index that brings little benefit while occupying much disk space among the currently chosen map indexes; (b) if the total size of S is within the predefined space constraint, then we can pick another map index from the remaining candidates that brings much benefit with little storage overhead; and (c) each time we change S, we recompute the total size of S and the marginal benefit of each individual map index mi, and then rearrange the map indexes in ascending order of mbf(mi, S)/size(mi).

The entire procedure is iterative: we add or discard map indexes, one at a time. Since the problem is NP-hard, our search strategies are based on greedy heuristics. Guided by a particular search strategy, we obtain a sequence of index schemes S0, S1, ..., and finally a near-optimal index scheme Sf. In our approach, the selection of the initial index scheme S0 is important, as it significantly influences the final Sf. Therefore, we experiment with three different initial index schemes to examine its effect: empty, full and random. In the case of empty, the initial set S0 of map indexes is empty; that is, we continually insert map indexes into this set in a greedy fashion until the disk space limit does not allow us to do so. In the case of full, we assume that in the beginning all map indexes are chosen; their total disk consumption will surely exceed the space constraint, so we have to gradually discard poor map indexes. Here, “poor” map indexes are those that bring only a little benefit while devouring much disk storage. In the case of random, we choose an initial index scheme S0 randomly.
If S0's total size goes beyond the predefined space constraint, we discard poor map indexes one at a time, whereas if its size is below the space constraint, we gradually add good map indexes. It is worth pointing out that, in these three cases, the sequence S0, S1, ..., Sf is monotonic: either S0 ⊂ S1 ⊂ ... ⊂ Sf or S0 ⊃ S1 ⊃ ... ⊃ Sf holds. The set S of chosen map indexes either grows larger and larger or shrinks continually. We do not believe that such a unidirectional search will always yield the best solutions. Therefore, we propose a bidirectional search strategy, called the backforth strategy. Here, backforth means that we may add a map index in one iteration but, after recalculating the marginal benefit function of the map indexes and rearranging them, remove a poor map index in the next iteration. In the following iteration, we may add or remove a map index depending on whether the total size of the currently examined index scheme S is below or beyond the predefined disk space constraint.

However, some delicate problems arise from this back-and-forth strategy (which is the origin of the name backforth). Assume that the current index scheme is Sc and that we add a map index mic into Sc. After we rearrange the currently chosen map indexes in ascending order of mbf(mii, Sc)/size(mii), or in other words reorganize the heap of Sc, we may remove mic from Sc in the next iteration. If this happens, we fall into an endless iterative process, and this sort of adding and removing is entirely fruitless. Thus we must handle this delicate case carefully in our algorithms. Note that the selection of the initial index scheme S and the adoption of the backforth strategy are orthogonal; that is, we can employ the backforth strategy no matter which initial index scheme (empty, full, or random) is adopted.

We design the algorithm Empty for the case of an empty S0, as shown in Procedure 2, which does not employ the backforth strategy. Lines 5–10 are the main body of the algorithm Empty, which is based on greedy heuristics: mi is the best one among the remaining map indexes, hence we pick it and add it into the current index scheme S. This iterative process continues until the disk space consumed by S exceeds the predefined disk space constraint Cons. Next, in order to show how to combine a different selection policy for the initial index scheme with the backforth strategy, we present the algorithm RandomBF, in which the initial index scheme is constructed randomly and the backforth strategy is employed. Procedure 3 gives the pseudocode of the algorithm RandomBF. This algorithm naturally introduces some perturbation during the search for the optimal solution, which can overcome some of the weaknesses of the greedy algorithm. Lines 6–7 show the operation of adding a beneficial map index to the current index scheme S when the disk space constraint allows it. Lines 9–10 give that of discarding a poor map index when the size of S is larger than the disk space constraint. The mechanism for avoiding an infinite loop is shown in lines 12 and 13: if a certain index scheme S that appeared in an earlier iteration appears once again, the loop terminates; otherwise, it would run ceaselessly.
Procedure 2 Empty()
1:  Cands ← (the set of all candidate map indexes)
2:  Cons ← (disk space constraint)
3:  S ← ∅ {S0 is empty}
4:  Recalculate(Cands, S)
5:  loop
6:    find mi ∈ (Cands − S) with maximal mbf(mi, S)/size(mi)
7:    if disk_size(S + {mi}) ≤ Cons then
8:      S ← S + {mi}; Recalculate(Cands, S)
9:    else
10:     exit loop
11:   end if
12: end loop
13: return S
Procedure 3 RandomBF()
1:  ...
2:  S ← (a random subset of Cands) {S0 is randomly constructed}
3:  Recalculate(Cands, S)
4:  loop
5:    if disk_size(S) ≤ Cons then
6:      find mi ∈ (Cands − S) with maximal mbf(mi, S)/size(mi)
7:      S ← S + {mi}
8:    else
9:      find mi ∈ S with minimal mbf(mi, S)/size(mi)
10:     S ← S − {mi}
11:   end if
12:   if S already appeared once then
13:     exit loop {avoid an infinite loop}
14:   end if
15:   Recalculate(Cands, S)
16: end loop
17: return S
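For illustration, the following Java sketch condenses the greedy search of Procedures 2 and 3 into one routine that starts from an arbitrary initial scheme and adds or removes one map index per iteration; the Recalculate step is folded into the per-candidate mbf/size evaluation, and the representations of candidates, sizes, and the benefit function are our assumptions, not the paper's implementation.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleFunction;

public class IndexSelection {

    // Marginal benefit mbf(mi, S) as in Eq. (6).
    static double mbf(String mi, Set<String> s, ToDoubleFunction<Set<String>> bf) {
        Set<String> other = new HashSet<>(s);
        if (s.contains(mi)) { other.remove(mi); return bf.applyAsDouble(s) - bf.applyAsDouble(other); }
        other.add(mi);
        return bf.applyAsDouble(other) - bf.applyAsDouble(s);
    }

    static double diskSize(Set<String> s, Map<String, Double> size) {
        double d = 0;
        for (String mi : s) d += size.get(mi);
        return d;
    }

    // Chosen index with the smallest mbf(mi, S)/size(mi) ratio (the "poorest" index).
    static String poorest(Set<String> s, ToDoubleFunction<Set<String>> bf, Map<String, Double> size) {
        String worst = null;
        double worstRatio = Double.POSITIVE_INFINITY;
        for (String mi : s) {
            double r = mbf(mi, s, bf) / size.get(mi);
            if (r < worstRatio) { worstRatio = r; worst = mi; }
        }
        return worst;
    }

    /** Back-and-forth greedy search from an arbitrary initial scheme (cf. Procedures 2 and 3). */
    static Set<String> select(Set<String> cands, Set<String> initial, double cons,
                              ToDoubleFunction<Set<String>> bf, Map<String, Double> size) {
        Set<String> s = new HashSet<>(initial);
        Set<Set<String>> seen = new HashSet<>();
        while (seen.add(new HashSet<>(s))) {              // stop as soon as a scheme repeats
            if (diskSize(s, size) <= cons) {
                String best = null;                       // add the most beneficial remaining index
                double bestRatio = Double.NEGATIVE_INFINITY;
                for (String mi : cands) {
                    if (s.contains(mi)) continue;
                    double r = mbf(mi, s, bf) / size.get(mi);
                    if (r > bestRatio) { bestRatio = r; best = mi; }
                }
                if (best == null) break;
                s.add(best);
            } else {
                s.remove(poorest(s, bf, size));           // discard the poorest chosen index
            }
        }
        while (diskSize(s, size) > cons && !s.isEmpty())  // final trim to respect the constraint
            s.remove(poorest(s, bf, size));
        return s;
    }
}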
5 Performance Analysis
In this section, we present preliminary results on the performance of our index selection algorithms. We have implemented these algorithms in an XML-RDBMS system, VXMLR [21], and we conducted experiments to evaluate the effectiveness of our approach. As emphasized in the previous sections, one of the distinguishing aspects of our approach is that we trade off the disk space overhead of map indexes against their benefit to query processing. The results of the experiments presented in this section support our argument and demonstrate that judiciously building part of the map indexes can still significantly improve the query evaluation process.

In our experiments, the data sets and the workload used are from the XML benchmark XMach-1 [17]. The data sets are generated following the description of XMach-1, and their size ranges from 20M to 100M. The XML data is stored in
Fig. 3. Benefits of the optimal map indexes
IBM DB2 UDB 7.1, which acts as the back-end repository. The workload consists of eight queries, which cover a wide range of processing features on XML data. Without loss of generality, the weight of each query is set to the same value, e.g., 1. Our experiments were conducted on a 1.4 GHz Pentium IV machine with 256M of main memory, running Windows 2000 Server.

Fig. 3 depicts the average evaluation time of the eight queries in the workload as a function of the size of the XML data, which ranges from 20M to 100M. The evaluation times without any map index and with the optimal index scheme obtained by our index selection algorithm are both shown in Fig. 3. In this experiment, the disk space constraint for all map indexes is set to 10% of the original size of the databases. As indicated by Fig. 3, although we build only part of the map indexes, query performance still improves considerably over the case without map indexes, while these map indexes do not consume too much storage space.
6 Conclusions
In this paper, we have proposed an index selection algorithm for choosing an optimal set of map indexes, given the XML data and the workload. We present a cost model to approximate the benefit of an index scheme and the marginal benefit of each individual map index. Based on greedy heuristics, we trade off the disk space consumption of the map indexes against their benefit to the specified workload. A preliminary performance study shows that our approach is effective: even though we do not build map indexes for all label paths, query performance still improves considerably.

Acknowledgments. We would like to thank Shihui Zheng of Fudan University and Wei Wang of Hong Kong University of Science and Technology for many helpful discussions.
References
1. S. Abiteboul. Querying Semi-Structured Data. In Proc. of ICDT ’97, pages 1–18.
2. D. Florescu and D. Kossmann. A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database. Technical Report 3680, INRIA, 1999.
3. J. Shanmugasundaram, K. Tufte, C. Zhang, et al. Relational Databases for Querying XML Documents: Limitations and Opportunities. In Proc. of VLDB’99, pages 302–314.
4. F. Tian, D. J. DeWitt, J. Chen, et al. The Design and Performance Evaluation of Alternative XML Storage Strategies. SIGMOD Record Special Issue on Data Management Issues in E-commerce, March 2002.
5. S. Abiteboul, D. Quass, J. McHugh, et al. The Lore Query Language for Semistructured Data. International Journal on Digital Libraries, 1(1):68–88, April 1997.
6. A. Deutsch, M. Fernandez, D. Florescu, et al. XML-QL: A Query Language for XML. W3C Note, 1998. http://www.w3.org/TR/1998/NOTE-xml-ql-19980819.
7. J. Clark and S. DeRose. XML Path Language (XPath). W3C Recommendation, 1999. http://www.w3.org/TR/xpath.
8. S. Boag, D. Chamberlin, M. F. Fernandez, et al. XQuery 1.0: An XML Query Language. W3C Working Draft, 2002. http://www.w3.org/TR/xquery.
9. P. Valduriez. Join Indices. TODS, 12(2):218–246, 1987.
10. A. Kemper and G. Moerkotte. Access Support in Object Bases. In Proc. of SIGMOD’90, pages 364–374.
11. J. Han, Z. Xie, and Y. Fu. Join Index Hierarchy: An Indexing Structure for Efficient Navigation in Object-Oriented Databases. TKDE, 11(2):321–337, 1999.
12. R. Goldman and J. Widom. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In Proc. of VLDB, pages 436–445, 1997.
13. J. McHugh, S. Abiteboul, R. Goldman, et al. Lore: A Database Management System for Semistructured Data. SIGMOD Record, 26(3):54–66, 1997.
14. T. Milo and D. Suciu. Index Structures for Path Expressions. In Proc. of ICDT, pages 277–295, 1999.
15. S. Zheng, A. Zhou, J. X. Yu, et al. Structural Map: A New Index for Efficient XML Path Expression Processing. In Proc. of WAIM, 2002.
16. P. Bohannon, J. Freire, P. Roy, et al. From XML Schema to Relations: A Cost-Based Approach to XML Storage. In Proc. of ICDE’02.
17. T. Bohme and E. Rahm. XMach-1: A Benchmark for XML Data Management. In Proc. of GDC, 2001.
18. M. J. Carey, J. Kiernan, J. Shanmugasundaram, et al. XPERANTO: Middleware for publishing object-relational data as XML documents. In Proc. of VLDB, pages 646–648, 2000.
19. M. F. Fernandez, W. C. Tan, and D. Suciu. SilkRoute: Trading between Relations and XML. WWW9/Computer Networks, 33(1-6):723–745, 2000.
20. C. Y. Chang and M. S. Chen. Exploring Aggregate Effect with Weighted Transcoding Graphs for Efficient Cache Replacement in Transcoding Proxies. In Proc. of ICDE, 2002.
21. A. Zhou, H. Lu, S. Zheng, et al. VXMLR: A Visual XML-Relational Database System. In Proc. of VLDB, 2001.
CX-DIFF: A Change Detection Algorithm for XML Content and Change Presentation Issues for WebVigiL*
Jyoti Jacob, Alpa Sachde, and Sharma Chakravarthy
Computer Science and Engineering Department, The University of Texas at Arlington, Arlington, TX 76019
{jacob,sachde,sharma}@cse.uta.edu
Abstract. The exponential increase of information on the web has affected the manner in which information is accessed, disseminated and delivered. The emphasis has shifted from mere viewing of information to efficient retrieval and monitoring of selective changes to information content. Hence, an effective monitoring system for change detection and notification based on user profiles is needed. WebVigiL is a general-purpose, active-capability-based information monitoring and notification system, which handles the specification, management, and propagation of customized changes as requested by a user. The emphasis of change detection in WebVigiL is on detecting customized changes to the content of a document, based on user intent. As XML is an ordered, semi-structured language, detecting customized changes to part of the value of a text node, and even to portions of the content spanning multiple text nodes of an ordered XML tree, is difficult. In this paper, we propose an algorithm to handle customized change detection on the content of XML documents based on user intent. An optimization to the algorithm is presented that performs better for XML pages with certain characteristics. We also discuss various change presentation schemes to display the computed changes. We highlight change detection in the context of WebVigiL and briefly describe the rest of the system.
1 Introduction

The Internet is evolving as a repository of information, and the user's interest has expanded from querying information to monitoring the evolution of, or changes to, pages. The emphasis is on selective change detection, as users are typically interested in changes to a particular portion or section and not to the entire page. The need to monitor changes to documents of interest is not only true for the Internet but also for other large heterogeneous repositories. WebVigiL [1-3] provides a powerful way to disseminate information efficiently without sending unnecessary or irrelevant information. The emphasis in WebVigiL is on detecting changes to web pages and notifying the users based on their given profiles. The eXtensible Markup Language (XML) is rapidly gaining popularity as the data transportation and exchange

* This work was supported, in part, by the Office of Naval Research, the SPAWAR System Center-San Diego & by the Rome Laboratory (grant F30602-01-2-0543), and by NSF (grants IIS-0123730 and IIS-0097517).
Fig. 1. WebVigiL Architecture
language. The emphasis in this paper is on the selective monitoring of XML content. The self-descriptive nature of XML gives us useful information about the semantics of the document and enables us to detect changes at a finer granularity (e.g., the element or text level) rather than at the document level. Users are often interested in detecting changes at an even finer granularity, especially at the text level (e.g., keywords, phrases). Hence, a mechanism is needed that monitors customized changes on portions of the XML content. The main contribution of this paper is CX-Diff, an approach for detecting customized changes to the content of ordered labeled XML documents.

Fig. 1 shows the overall system architecture of WebVigiL. A web-based user interface is provided for the user to submit his/her profiles, termed sentinels, indicating the pages to monitor, when to monitor them, the types of changes, and the methods for presentation and notification. Sentinels are validated for syntactic and semantic correctness and populated in the Knowledgebase. Once a sentinel is validated semantically, the change detector module generates the ECA rules for the run-time management of that sentinel. The fetch module fetches pages for all active (or enabled) sentinels, forwards them to the version management module for addition to the page repository, and notifies the change detection module. Based on the type of the documents, either the HTML change detection [4] or the XML change detection mechanism is invoked. The presentation module takes the detected changes and presents them in a user-friendly manner.

The remainder of the paper is organized as follows. Section 2 discusses various tools developed for detecting changes to web pages. Section 3 gives the problem overview. In Section 4, we discuss the change operations and the algorithm proposed for change detection on XML documents. Section 5 discusses the various presentation schemes, and Section 6 provides the current status and conclusions.
2 Related Work

Work on change detection for flat files [5] and on detecting differences between two strings [6,7] in terms of inserts and deletes is well established. WordPerfect has a "mark changes" facility that can detect changes based on how documents are compared (on a word, phrase, sentence, or paragraph basis).

Due to the semi-structured nature of XML, it can be conveniently represented in a tree structure, and many algorithms have been proposed for tree-to-tree comparison taking various tree features into consideration [8-10]. Chawathe et al. [10] proposed an algorithm for hierarchically structured data wherein a minimum-cost edit script is generated that transforms tree T1 into T2. This algorithm works for semi-structured documents such as LaTeX, but the assumptions made for LaTeX do not hold for XML documents, as they contain duplicate nodes and sub-trees. X-Diff [11] detects changes on the parsed unordered labeled tree of an XML document: it finds the equivalent second-level sub-trees and compares the nodes using the structural information, denoted as the signature. However, in order to detect move operations, i.e., whether a node moved from position i in the old tree to position j in the new tree, an unordered tree cannot be used. In [12], the authors formulated a change detection algorithm called XyDiff to detect changes between two given ordered XML trees T1 and T2. XMLTreeDiff [13], a tool developed by IBM, is a set of JavaBeans and performs ordered tree-to-tree comparison to detect changes between XML documents. DeltaXML [14,15], developed by Mosnell, provides a plug-in solution for detecting and displaying changes to content between two versions of an XML document; changes are represented in a merged delta file by adding attributes such as insert/delete to the original XML document. The Diff and Merge tool [16] provided by IBM compares two XML files based on node identification. It represents the differences between the base and the modified XML files using a tree display of the combination in the left-hand Merged View pane, with symbols and colors to highlight the differences using the XPath syntax. DOMMITT [17] is a UNIX diff utility that enables users to view differences between the DOM [18] representations of two XML documents. The diff algorithm on these DOM representations produces edit scripts, which are merged into the first document to produce an XML document in which the edit operations are represented as insert/delete tags for a user-interactive display of differences.

Most of the algorithms in the literature detect changes to structure as well as content. WebVigiL handles change management only on the content of XML documents; detecting changes to the structure in addition to the content would be an overhead. In addition, most of the change detection tools for XML do not support customized changes to the nodes (i.e., changes to part of a node or spanning multiple nodes), and hence these algorithms cannot be mapped directly to satisfy the monitoring requirements of WebVigiL.
Fig. 2. XML Document
Fig. 3. Ordered Labeled XML tree
3 Customized Change Detection Issues

As XML was defined for semi-structured documents containing ordered elements [19,20], such documents can be mapped to an ordered labeled tree. The ordered tree for the XML document in Fig. 2 is shown in Fig. 3. We consider the tree representation of the XML document to be similar to the Document Object Model (DOM) [18] representation, and in this paper the nodes of the XML tree are referenced using the defined label of a DOM node. In XML, the context (element nodes) defines the content of the document. In the tree structure of an XML document, a leaf node represents the content of the page for a particular context. Hence, changes are detected on the text nodes and attribute nodes, which constitute the leaf nodes. Change detection for a semi-structured, ordered XML tree is complex because of the following issues:

1. WebVigiL supports customized change detection on the contents, such as phrase and keyword changes. Keywords and phrases can be part of a node or can span multiple nodes. Hence the algorithm should be capable of extracting the required content of interest and detecting changes to it.

2. Change detection for a semi-structured, ordered XML tree is complex on account of duplicate nodes and sub-trees. By duplicate nodes, we mean similar leaf nodes containing the same context. As shown in Fig. 3, the node ‘J K Rowling’ appears twice in the tree for the same context (path), i.e., ‘Books-Section-Book-Author’. Duplicate sub-trees defined for the same context are also possible in XML. Order becomes critical for such duplicate nodes, as a node n existing at position pi in the old tree should be compared to the node existing at the equivalent i-th position in the new tree with respect to their siblings.

3. For an XML tree T1 rooted at R with children pi to pm, a node along with its structural information can be moved from position j, where i ≤ j ≤ m, in T1 to position k in T2, where j ≠ k, when considered with respect to the siblings. The change detection mechanism developed should be capable of detecting such move operations.

The algorithm CX-Diff is proposed, taking into consideration an ordered, labeled XML tree and the position of occurrence of each node with respect to its siblings.
Fig. 4. Change Operations on trees T1 and T2
4 CX-Diff: Customized Change Detection for Ordered Documents

Given two ordered XML trees T1 and T2, consider the change operations from the set E = {insert, delete, move} which, when applied to T1, transform it into the new tree T2. To detect the change operations, the structure is also taken into consideration. The content of a leaf node is defined as its value and is denoted v(x), where x is a leaf node. The operations are defined as follows:

Insert: Insertion of a new leaf node at the i-th position is denoted as insert. Insert of a keyword is defined as the appearance of a keyword k in the i-th leaf node x of the tree T1. Insert of a phrase is defined as the appearance of a complete phrase at position i in tree T1, denoted by (p, i). As structure defines the context for the content in XML, a node of the same value but with different ancestral elements is considered inserted.

Delete: Given two ordered XML trees T1 and T2, T1 will be the same as T2 except that it will not contain leaf node x. Delete of a keyword is defined as the disappearance of the keyword k in the i-th leaf node x of the tree T1. Phrase delete is defined as the disappearance of a phrase p at the i-th position in the tree T1, denoted by (p, i).

Move: For the tree T1, containing leaf nodes n1 to nm, a leaf node x with signature s is shifted from position j in T1 to position k in the new tree T2, where 1 ≤ j ≤ m and j ≠ k, with respect to the siblings. Move is only applicable to a complete node. Keyword and phrase changes are changes detected to part of a node or to the contents of more than one node; hence, move is not applicable to keyword and phrase changes.

CX-Diff detects customized changes on XML documents based on user interest. According to the definition of XML in [20], the text nodes are ordered but the attributes of an element are considered unordered. However, as changes are detected considering the content to be ordered, the attributes are also considered ordered for the proposed change detection algorithm. Attributes defining ID and IDREFS are also treated as simple ordered attributes. The types of changes supported are: i) Any Change: changes to the leaf nodes of the XML document; ii) Keywords Change: appearance or disappearance of a set of unique words, which may constitute part of a node; and iii) Phrase Change: appearance or disappearance of contiguous words from the page. In a tree structure, a phrase can be in a leaf node, part of a node, or even span multiple nodes.
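For illustration, the information carried by these operations (value, signature, and position among siblings) can be captured in a small Java data type; the class and field names below are ours, not part of the WebVigiL implementation.

public final class ChangeOperation {
    public enum Type { INSERT, DELETE, MOVE }

    public final Type type;
    public final String value;      // v(x): text content of the leaf node (or keyword/phrase)
    public final String signature;  // path(x): ancestral element path of the node
    public final int oldPosition;   // position among siblings in T1 (-1 for inserts)
    public final int newPosition;   // position among siblings in T2 (-1 for deletes)

    public ChangeOperation(Type type, String value, String signature,
                           int oldPosition, int newPosition) {
        this.type = type;
        this.value = value;
        this.signature = signature;
        this.oldPosition = oldPosition;
        this.newPosition = newPosition;
    }
}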
The structural information, denoted as the path or signature, is defined as the ancestral path of a leaf node from the parent to the root, denoted path(x) for node x. For attributes, the label of the attribute also becomes part of the signature. In Fig. 3, the path for the node “Harry Potter and the Chamber of Secrets” is Books-Section-Book-Name. In the rest of the section, we discuss the algorithm CX-Diff and its associated definitions for matching and non-matching nodes.

4.1 Phases of CX-Diff

For customized change detection based on user intent, the objects of interest are extracted and the change detection algorithm is applied. CX-Diff extracts the matching nodes, which satisfy the best match definition (given below), between the two given trees. From the non-matching nodes, the change operations are detected.

Best Match: For an ordered, semi-structured tree, the best match for ordered leaf nodes is the one satisfying the following:
1) For all (x, y) ∈ M, where M is the set of matching nodes, if xp ∈ path(x) and yp ∈ path(y), where p is the position in the signature list, then xp is the ancestor of x iff yp is the ancestor of y and xp = yp (ancestor order preservation).
2) For all (x, y) ∈ M, v(x) = v(y).
3) For all (x1, y1) ∈ M, x1, y1 ∈ the common order subsequence L, such that x1 has the same order of occurrence in L as y1.

4.1.1 Object Extraction and Signature Computation

For detecting changes, the objects of interest are extracted from the contents of the XML document. The XML document is first transformed into a Document Object Model (DOM) [18]; the Xerces-J 1.4.4 Java parser [21] for XML is used for this purpose. The tree is traversed and the leaf nodes, consisting of text and attribute nodes, are extracted. The objects of interest and their positions are extracted from the leaf nodes. In addition, the structural (element) information is extracted, and for each leaf node the associated signature is computed. For phrase extraction, the contents of all the leaf nodes are divided into words, which are extracted in the order of their occurrence. The Knuth-Morris-Pratt (KMP) string-matching algorithm is applied to the sequence of words, and the start and end indices of all exact matches for the given phrases are extracted. A range is set for the index, which defines a phrase, and using this range, the sub-tree containing the phrase along with its parent elements is extracted. The newly created sub-tree is inserted into the old tree in its correct order of occurrence, and the tree is realigned.
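A minimal Java sketch of this extraction step is shown below. It uses the standard DOM/JAXP API (Xerces-J is one such parser); the Leaf class, the attribute-label convention in the signature, and the position bookkeeping are our simplifications.

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class ObjectExtraction {

    /** A leaf object of interest: its value, its signature, and its order of occurrence. */
    public static final class Leaf {
        public final String value, signature;
        public final int position;
        Leaf(String v, String s, int p) { value = v; signature = s; position = p; }
    }

    public static List<Leaf> extract(String file) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(file));
        List<Leaf> leaves = new ArrayList<>();
        collect(doc.getDocumentElement(), leaves);
        return leaves;
    }

    private static void collect(Node n, List<Leaf> leaves) {
        if (n.getNodeType() == Node.TEXT_NODE) {
            String text = n.getNodeValue().trim();
            if (!text.isEmpty())
                leaves.add(new Leaf(text, signature(n.getParentNode()), leaves.size()));
            return;
        }
        if (n.getNodeType() == Node.ELEMENT_NODE) {
            NamedNodeMap attrs = n.getAttributes();          // attribute nodes are leaves too
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                // the attribute label becomes part of the signature
                leaves.add(new Leaf(a.getNodeValue(),
                        signature(n) + "-@" + a.getNodeName(), leaves.size()));
            }
            for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling())
                collect(c, leaves);
        }
    }

    /** Signature: the ancestral element path of the node, e.g. Books-Section-Book-Name. */
    private static String signature(Node n) {
        StringBuilder sb = new StringBuilder();
        for (Node p = n; p != null && p.getNodeType() == Node.ELEMENT_NODE; p = p.getParentNode()) {
            if (sb.length() > 0) sb.insert(0, "-");
            sb.insert(0, p.getNodeName());
        }
        return sb.toString();
    }
}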
4.1.2 Filtering Unique Inserts/Deletes

In a given tree T, a node x containing value v(x) can be distinct or can have multiple occurrences. Insertion/deletion of distinct nodes results in unique inserts/deletes unless the nodes are moved, and can be detected on an unordered tree. Similarly, leaf nodes with non-matching signatures can also be considered unique inserts/deletes, as the signature defines the context.

Unique Insert/Delete: For each leaf node x in tree T1, if there is no matching node y in tree T2 such that v(x) = v(y) or path(x) = path(y), then x is a unique insert. For a matching set M for the old tree T1 and the new tree T2: M(x, ∅), where x ∈ T1, is a unique insert, and M(∅, y), where y ∈ T2, is a unique delete.

In this phase, the unique inserts/deletes are filtered out and the matching nodes are extracted using the totalMatch and signatureMatch algorithms.

totalMatch algorithm: For each extracted node, the function totalMatch(old_tree, new_tree) extracts the set of matching nodes, denoted M, such that for the given trees T1 and T2, leaf node x in T1 and leaf node y in T2, (x, y) ∈ M if v(x) = v(y) and path(x) = path(y). For a node n, if no match is found, it is flagged as ‘insert’ or ‘delete’. For a phrase change, the associated phrase for each node is also marked as ‘insert/delete’. To determine inserts/deletes of keywords, further processing is needed: in some cases, although the value of the leaf node does not match, instances of the keyword in the leaf node may match, since a keyword can be part of the leaf node. In order to detect such matches, we compare the signatures of the nodes; if the signatures match, the instances of the keyword in the leaf node are also considered matched. In addition, as XML is a well-defined document, it can be assumed that the structure is generally stable: largely, though the contents change, the structure remains the same, and this information can be used for optimal detection of the common order subsequence between two trees. To detect nodes with a common signature, the signatureMatch algorithm (defined below) is used. The non-matching nodes of the totalMatch algorithm are given to the function signatureMatch to extract common signatures.

signatureMatch algorithm: For leaf node x in tree T1 and leaf node y in tree T2, if path(x) = path(y) and v(x) ≠ v(y), then path(x) and path(y) are included in the match set M.

As shown in Fig. 4, at the end of this phase, all unique inserts (for example, node ‘F’) and unique deletes (for example, node ‘G’) are detected, and the common structural information of such unique inserts/deletes is extracted. For keyword and phrase changes, if all the extracted keywords and phrases result in unique inserts/deletes, then the computation can be considered complete at this stage.

4.1.3 Finding the Common Order Subsequence

For change detection on multiple occurrences of a node with a common signature, and for moved nodes, it is necessary to consider an ordered tree. The leaf nodes are considered matching only if they belong to the common order subsequence. Due to the realignment of nodes and to inserts and deletes, the order of occurrence needs to be considered with respect to the siblings. Hence, the common order subsequence is computed by running the Longest Common Subsequence (LCS) algorithm [22] on the matched nodes of both trees. At the end of the LCS computation on the matched nodes in Fig. 4, the deletion of node ‘D’ at position 2 in T1 as well as the move of node ‘C’ from position 4 in T1 to position 5 in T2 can be detected. This algorithm effectively detects customized changes such as keywords and phrases based on user intent. In addition, changes to duplicate leaf nodes containing common structural information, as well as moves, are also accurately detected.
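The common order subsequence step can be illustrated with a standard dynamic-programming LCS over the matched leaf nodes, where each node is reduced to a key combining its value and its signature so that nodes match only under the same context; the key encoding and method names below are ours.

import java.util.ArrayList;
import java.util.List;

public class CommonOrderSubsequence {

    /** Index pairs (into old/new sequences) of a longest common subsequence of equal keys. */
    static List<int[]> lcs(List<String> oldKeys, List<String> newKeys) {
        int m = oldKeys.size(), n = newKeys.size();
        int[][] len = new int[m + 1][n + 1];
        for (int i = m - 1; i >= 0; i--)
            for (int j = n - 1; j >= 0; j--)
                len[i][j] = oldKeys.get(i).equals(newKeys.get(j))
                        ? len[i + 1][j + 1] + 1
                        : Math.max(len[i + 1][j], len[i][j + 1]);
        // Trace back: nodes on the LCS are the truly matching occurrences;
        // the remaining nodes correspond to inserts, deletes, or moves.
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0, j = 0; i < m && j < n; ) {
            if (oldKeys.get(i).equals(newKeys.get(j))) { pairs.add(new int[]{i, j}); i++; j++; }
            else if (len[i + 1][j] >= len[i][j + 1]) i++;
            else j++;
        }
        return pairs;
    }

    /** Key = signature plus value, so two leaves match only under the same context. */
    static String key(String value, String signature) {
        return signature + "|" + value;
    }
}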
Fig. 5. Effect of increase of matching second level sub-trees on wide trees.
Fig. 6. Effect of increase of matching second level sub-trees on deep trees.
4.2 Optimization

To improve the time taken by the above algorithm, an additional phase that eliminates common second-level sub-trees (i.e., sub-trees formed by the elements just below the root) is introduced. It is observed that the LCS incurs a high cost, which increases the overall cost of change detection. Hence, by eliminating the common second-level sub-trees, the number of nodes given as input to the LCS is reduced, resulting in an overall reduction of the cost of change detection. Sub-trees are compared at the second level because the second level defines the main context of the contents in the document, minimizing the chances of duplicate sub-trees at this level. For given trees T1 and T2, a second-level element node is denoted l(s), where l is the label of node s. If l(s1) is a second-level node of T1 and its equivalent node in T2 is l(s2), the sub-trees of T1 and T2 are considered matched if l(s1) = l(s2) and all the leaf nodes, along with their signatures, in T1 are equal to the leaf nodes and their associated signatures in T2 in the same order of occurrence. All the nodes of the matched sub-trees are removed from the matched set M. Hence, the size of M for the LCS is reduced and the cost of computation is improved. The performance results are discussed in the next section.

4.2.1 Performance

The performance tests were carried out for the following tree characteristics: a) deep trees, to understand the effect of increasing the height of a tree (in this case, the path from the root to a leaf node is lengthened, so the number of element nodes increases considerably); and b) bushy trees, to understand the effect of increasing the number of leaf nodes and the number of second-level sub-trees. An XML page generator was implemented for synthetic test data generation. The synthetic test cases were designed based on observation of actual data on the Internet (e.g., the ACM SIGMOD repository). Each test case was run 4 times using both the ‘with optimization’ and ‘without optimization’ options, and the average of the 4 runs was taken as the final result. Below, we discuss the performance results for the various tree characteristics.

1) Bushy trees: a) leaf nodes: 100–400; b) 2nd-level sub-trees: 10–40; c) common second-level sub-trees: n−3, where n is the number of 2nd-level sub-trees; d) depth: 4. The observed performance is shown in Fig. 5. As observed from the results, the optimization improves performance by reducing the total time taken for change detection for trees
Fig. 7. Effect of increasing the matching 2nd level sub-trees on optimization
having a larger number of leaf nodes and more common second-level sub-trees. For a tree with 5 second-level sub-trees, the performance improvement with optimization was 49 ms, while for a tree with 40 second-level sub-trees the improvement was observed to be 400 ms.

2) Deep trees: a) leaf nodes: 20; b) 2nd-level sub-trees: 5, each with 4 leaf nodes; c) common second-level sub-trees: 2 (hence 8 nodes were common); d) depth: varied from 5–30. The observed performance is shown in Fig. 6. As can be inferred from the results, the performance improvement due to the optimization is not significant for deep trees with a small number of leaf nodes; in some cases the cost after optimization even increases. This increase is due to the additional cost incurred for checking the common second-level sub-trees, which negates the small improvement due to their removal. As the height of the tree increases, the parsing cost increases. Hence, the optimization is not effective for deep trees with a small number of leaf nodes. However, it has been observed that for deep trees containing a large number of leaf nodes, the optimization does improve performance. We can therefore conclude that the optimization works effectively for trees with a large number of leaf nodes and more second-level sub-trees.

The optimization is based on the fact that as the common second-level sub-tree nodes are removed, the time taken for computing the longest common subsequence (LCS) is reduced, decreasing the overall cost of change detection. To test this hypothesis, trees were generated with the following characteristics: a) leaf nodes: 300; b) second-level sub-trees: 30; c) depth: 5; d) common second-level sub-trees: increased in the range 1–27; and e) change operations: increased in a consistent manner as the number of common sub-trees decreased. As the performance observed was for the same tree, the parsing cost was not considered. The observed effect of the optimization with an increasing number of matching second-level sub-trees on the same tree is shown in Fig. 7. From the graph, it can be clearly observed that there is a considerable improvement in performance with optimization as the number of matching second-level sub-trees increases. For the given dataset, a performance improvement of 2% was observed with optimization when only 1 (out of 30) second-level sub-tree was matching; when the number of matching second-level sub-trees was increased to 27 (out of 30), an improvement of 20% was observed.

As shown in Fig. 7, a difference in the cost of change detection was observed for the same tree. This is because, in addition to the optimization, the primary algorithm contains a phase for pruning unique inserts and deletes. This phase is intended to reduce the time taken for change detection by pruning the nodes resulting in unique
Fig. 8. Effect of increase of change operations
inserts/deletes, and hence decreasing the number of nodes for the LCS. To observe the improvement due to pruning, the performance test was carried out for two XML trees with the following characteristics: a) leaf nodes: 300; b) second-level sub-trees: 30; c) depth: 5; and d) change operations: increased in a consistent manner. The performance graph is shown in Fig. 8. As observed from the graph, a performance improvement of 55% in terms of execution time was obtained. This improvement is due to the increase of unique inserts/deletes with the increase in change operations, resulting in a reduction of the time taken for change detection: as more nodes result in unique inserts/deletes, they are filtered out at the second level, so the number of nodes for the LCS decreases considerably, leading to an overall decrease in the time taken for change detection.
5 User Interface and Change Presentation

In order to use the full functionality of WebVigiL, it is important that the user be able to create and manage sentinels as well as retrieve selected changes as needed. To support all of the above functionality in a web-enabled manner, we are developing a “WebVigiL dashboard”. The basic idea is that the user can keep track of his/her sentinels, disable/enable them, and delete them, as well as use previously defined sentinels for creating new ones. It should also be possible to retrieve sentinels based on their attributes (show me all sentinels that expire on 07/15/2003, for example). In addition, one should be able to retrieve changes (both current and past) using the dashboard.

Change presentation is the last phase of web monitoring, where the detected changes are notified to the user. The presentation method selected should highlight the detected differences between two XML documents in a meaningful manner; therefore, the choice of presentation for the detected differences is very important. Unlike HTML, the element names in XML have no presentation semantics but instead define the content. Hence, the presentation of an XML document depends on a stylesheet. For presentation, XML data can be stored inside HTML pages as “data islands”, where HTML is used only for formatting and displaying the data. The standard stylesheet language for XML documents is the eXtensible Stylesheet Language (XSL). Stylesheets are used to express how the content of
XML should be presented. XSLT (XSL Transformations) accepts an XML document and the XSL stylesheet for that document and presents it. To highlight the detected changes in XML documents, we need to modify their stylesheets. In addition, different types of customized changes, such as keywords and phrases, are supported in WebVigiL and need to be represented and displayed in different ways. The existence of duplicate nodes in an XML document makes the presentation more complicated. The notification of the detected changes may have to be sent to different devices that have different storage and communication bandwidths. In addition, apart from push-based notifications, we intend to provide the user with a pull-based mechanism to view the detected changes. At present, we are evaluating various schemes to present the changes in a meaningful manner to the user, taking the above issues into consideration. The various schemes are discussed below:
Only-Change Approach: In this approach only the changes will be presented. The traditional method would be a tabular structure with the types of changes (insert/delete/move) as different columns of the table. The changes can also be presented using the XPath expressions of the nodes that have changed.
Single-Frame Approach: The two versions of the document will be merged and the changes will be presented by adding attributes to the elements that describe the type of change, or by changing the tags of the elements to reflect the type of change (insert/delete/move). The merged document can then be presented using a stylesheet.
Dual-Frame Approach: In this approach we intend to show both documents side by side, highlighting the changes. This approach has the advantage over the other approaches of being easy to interpret.
We are also evaluating techniques to highlight the changes by displaying the XML document in a tree structure. We are planning to use a heuristic model which, depending upon the types of changes, the number of changes, and the notification mechanism (email, fax, PDA), will select the presentation scheme from the above approaches. We are also planning to provide the user with the capability to select his/her preferred approach for presentation.
6 Conclusions
WebVigiL is a change monitoring system for the web that supports the specification and management of sentinels and provides presentation of detected changes in multiple ways (batch, interactive, for multiple devices). The first prototype has been completed and includes the following features [2]: web-based sentinel specification, ECA-rule-based fetch that includes learning to reduce the number of times a page is fetched, population of the Knowledgebase, and detection of changes to HTML and XML pages as discussed in this paper. The various presentation schemes outlined in this paper are currently being evaluated. The individual modules are being integrated to form the first version of a complete WebVigiL system.
References
[1] Chakravarthy, S., et al. WebVigiL: An approach to Just-In-Time Information Propagation In Large Network-Centric Environments. in Second International Workshop on Web Dynamics. 2002. Hawaii.
[2] Chakravarthy, S., et al., WebVigiL: Architecture and Functionality of a Web Monitoring System. http://itlab.uta.edu/sharma/Projects/WebVigil/files/WVFetch.pdf.
[3] Jacob, J., et al., WebVigiL: An approach to Just-In-Time Information Propagation In Large Network-Centric Environments (to be published), in Web Dynamics Book. 2003, Springer-Verlag.
[4] Pandrangi, N., et al. WebVigiL: User Profile-Based Change Detection for HTML/XML Documents. in Twentieth British National Conference on Databases. 2003. Coventry, UK.
[5] J.W. Hunt and M.D. McIlroy, An algorithm for efficient file comparison. 1975, Bell Laboratories: Murray Hill, N.J.
[6] E. Myers, An O(ND) difference algorithm and its variations. Algorithmica, 1986. 1: p. 251–266.
[7] S. Wu, U. Manber, and E. Myers, An O(NP) sequence comparison algorithm. Information Processing Letters, 1990. 35: p. 317–323.
[8] K. Zhang and D. Shasha, Simple Fast Algorithms for the Editing Distance between Trees and Related Problems. SIAM Journal of Computing, 1989. 18(6): p. 1245–1262.
[9] K. Zhang, R. Statman, and D. Shasha, On the Editing Distance between Unordered Labeled Trees. Information Processing Letters, 1992. 42: p. 133–139.
[10] S. Chawathe, et al. Change detection in hierarchically structured information. in Proceedings of the ACM SIGMOD International Conference on Management of Data. 1996. Montréal, Québec.
[11] Y. Wang, D. DeWitt, and J. Cai, X-Diff: An Effective Change Detection Algorithm for XML Documents. 2001, Technical Report, University of Wisconsin.
[12] G. Cobena, S. Abiteboul, and A. Marian, Detecting Changes in XML Documents. Data Engineering, 2002.
[13] F.P. Curbera and D.A. Epstein, Fast Difference and Update of XML Documents. XTech'99, 1999.
[14] Fontaine, R.L. A Delta Format for XML: Identifying Changes in XML Files and Representing the Changes in XML. in XML Europe 2001. May 2001. Berlin.
[15] Fontaine, R.L. Merging XML Files: A New Approach Providing Intelligent Merge of XML Data Sets. in XML Europe 2002. May 2002. Barcelona, Spain.
[16] XMLDiffMerge, http://www.alphaworks.ibm.com/tech/xmldiffmerge.
[17] DOMMITT, http://www.dommitt.com/.
[18] Document Object Model, http://www.w3.org/DOM/.
[19] Extensible Markup Language (XML), World Wide Web Consortium, http://www.w3.org/XML/.
[20] S. Abiteboul, P. Buneman, and D. Suciu, Data on the Web: From Relations to Semistructured Data and XML. 1999: Morgan Kaufmann.
[21] Xerces-J, http://xml.apache.org/xerces2-j/index.html.
[22] Hirschberg, D., Algorithms for the longest common subsequence problem. Journal of the ACM, 1977: p. 664–675.
Storing and Querying XML Documents Using a Path Table in Relational Databases* Byung-Joo Shin and Min Jin Department of Computer Engineering, Kyungnam University, Masan, KOREA {raniman,mjin}@zeus.kyungnam.ac.kr
Abstract. Additional processing is required to store XML documents and to query against them in relational databases due to the discrepancy between the hierarchical structure of XML and the flat structure of relational databases. This paper aims to cope with the issues in storing and querying XML documents with DTD in relational databases. We propose association inlining that extends shared inlining and hybrid inlining to reduce relational fragments and excessive joins. Path expressions of XML documents are stored in the Path table. XML queries written in XQuery are translated into SQL statements by exploiting the schema information that is extracted from the Path table. We also propose a simple method that publishes desired XML data stored in relational databases in XML documents. The structural information is extracted from XML queries written in XQuery and is stored using appropriate structures. Together with the resulting tuples produced by corresponding SQL queries, it is used in publishing relational data in XML documents.
1 Introduction
XML is widely used as a simple yet flexible means for exchanging data since it has a nested, hierarchical, and self-describing structure. This has given rise to the need to store and query increasing volumes of XML data. Relational databases are widely used in conventional data processing and offer inherent services such as transaction management and query optimization that can be exploited in managing XML data. Hence, many studies have focused on the use of relational databases to store and query XML data[2][4][5]. However, data are represented in flat structures in relational databases, whereas XML data are represented in hierarchical structures with nesting and recursion. Additional processing is required for storing and querying XML data in relational databases due to the structural discrepancy between XML and relational databases. Storing hierarchically structured XML data in the flat structures of relational databases is likely to lead to excessive fragmentation and redundancy. Another important issue is how to represent set sub-elements and recursive elements of XML data in relational databases[9]. Many query languages, such as XQL, XML-QL, Quilt, and XQuery, have been proposed for XML data. XQuery is in the process of becoming a standard XML query *
This work was supported by Kyungnam University Research Fund.
language. However, it is not yet fully established as a query language and is not supported by relational database systems. This gives rise to the need for a method of translating XQuery into SQL and a method of publishing data stored in flat structures as XML documents with hierarchical structures[1][3][6][8][12][13]. The rest of this paper is organized as follows. Section 2 briefly overviews related work concerning storing and querying XML data using relational databases. Section 3 describes how to store XML documents with a DTD using relational databases. Section 4 describes how to query and publish XML data stored in relational databases. Section 5 offers conclusions.
2 Related Work
2.1 Model Mapping Approach
Methods for storing XML documents in relational databases can roughly be classified into two categories: the model mapping approach and the structure mapping approach[11]. The approaches are distinguished by whether or not XML documents come with structural information such as a DTD or XML Schema. The model mapping approach is capable of storing XML documents without structural information; relational schemas are defined regardless of the structural information of XML. In this approach, an XML document can be represented as an ordered, labeled, directed graph, in which each element is represented as a node and relationships between elements and sub-elements/attributes are represented as edges. Each node is labeled with a unique identifier and each edge is labeled with the name of the sub-element. Values of XML documents are represented as terminal nodes. There are three alternative ways to store the edges of the graph: Edge Approach, Binary Approach, and Universal Approach[4]. Edge Approach stores all edges in a single table, and Binary Approach groups all edges with the same label into one table. Universal Approach corresponds to the result of a full outer join of all tables in the Binary Approach. There are two alternative ways to store the values of the graph: one is to establish separate value tables for conceivable data types, and the other is to inline values in the edge tables. Therefore, there are six ways to store XML documents in relational databases in the model mapping approach. However, Edge Approach is inefficient with regard to querying XML data since most of the data is stored in a single edge table. Binary Approach gives rise to data fragmentation by generating a large number of relations. Universal Approach leads to a lot of data redundancy due to the generation of many fields with NULL values[4][11].
2.2 Structure Mapping Approach
In contrast to the model mapping approach, the structure mapping approach is based on the structural information of XML data. Relational schemas are generated based on the structural information extracted from a DTD or XML schema. Three techniques, called
Basic Inlining, Shared Inlining, and Hybrid Inlining, have been proposed to generate relational schemas. In Basic Inlining, a relation is created for each element. In order to solve the fragmentation problem, Basic creates, for every element, a relation that inlines as many descendants of the element as possible. Set-valued attributes and recursions are made into separate relations. Basic is good for certain types of queries, is likely to be grossly inefficient for others, and creates a large number of relations. The principal idea behind Shared is to share the element nodes represented in multiple relations in Basic by creating separate relations for these elements. In Shared, relations are created for all elements having an in-degree greater than one in the DTD graph. Nodes with an in-degree of one are inlined in the parent node's relation. Element nodes having an in-degree of zero are also made into separate relations. As in Basic, set sub-elements are made into separate relations. Of mutually recursive elements that all have in-degree one, one is made a separate relation. Representing an element node in exactly one relation in Shared leads to a small number of relations compared to Basic. Shared addresses some shortcomings of Basic; however, it performs worse than Basic in one respect: it increases the number of joins. Hybrid is the same as Shared except that it inlines elements with an in-degree greater than one that are not recursive elements or set sub-elements. Although this property of Hybrid may reduce the number of joins per SQL query, it may also cause more SQL queries to be generated[9][11].
3 Storing XML Documents in Relational Databases
3.1 Creation of Relational Schema
In this paper, we adopt the structure-mapping approach for storing XML documents in relational databases. In this section, we describe how to create relational schemas from XML DTDs to store XML documents in relational databases. Our approach combines the join-reduction properties of Hybrid Inlining with the sharing features of Shared Inlining[9]. First, we create a DTD graph representing the structure of a DTD. A DTD and the corresponding DTD graph are given in Figure 1. In the DTD graph, nodes represent elements, attributes, or operators, and edges represent relationships between elements and sub-elements/attributes. Each element appears only once in the graph, while attributes and operators appear as many times as they appear in the DTD. For each element node with an in-degree of zero, a separate relation is created since it has no parent element. However, although the root element has an in-degree of zero, it is not represented as a relation if all of its sub-elements are mapped to separate relations. The information on the structure of the XML data is represented in an additional table, which will be discussed in the following section. An element node that has only sub-elements and no attributes is called an element-only element[10]. An element-only element is not represented as a separate relation, with one exception. In Figure 1, the authors element is an element-only element and is not represented as a separate relation. If a root element happens to be an element-only element and its child elements are also element-only elements, the root element is
represented as a separate relation when no relations are created for some child elements.
(The DTD graph of the figure contains the element nodes paper, papertitle, authors, author, name, country, university, reference, book, and booktitle, connected by edges annotated with occurrence operators.)
Fig. 1. A DTD specification and the corresponding DTD graph
Element nodes with an in-degree of one are inlined in the parent node's relation. However, if an element doesn't contain any sub-elements and has only attributes, which represent data points, it is called an empty element[10]. An empty element having an in-degree of one isn't represented as an attribute of the parent relation, since it doesn't have any meaningful values and carries only structural information about the XML documents. In general, relations are created for all elements having an in-degree greater than one, with a few exceptions. Hence, we need to analyze the relationship between elements and sub-elements when deciding whether an element node is represented as a relation or not. In Figure 1, if the author element were directly connected to the paper and book elements, without the authors element and the '+' operator, it would have an in-degree of two and three descendant elements, namely name, country, and university. In this case, we might regard the author element as an independent entity; hence, it would be mapped to a separate relation with the name, country, and university attributes. However, if an element node with an in-degree greater than one has no child element, it can't be regarded as an independent entity, and we represent it as an attribute of its parent relations. If an element with an in-degree greater than one has more than one sub-element or attribute, it is classified as an independent entity. When the element has only one sub-element or one attribute, it is classified as a dependent entity of the parent element. In the DTD graph, an edge marked with '*' or '+' indicates that the element of the destination node can occur many times. For such an element, a separate relation is created, since relational databases don't support set-valued attributes. One of the mutually recursive elements having an in-degree of one, such as paper and reference in Figure 1, is made a separate relation. Figure 3 shows the relations created for the DTD graph in Figure 1 for storing corresponding XML documents. Each relation has an ID field that serves as the key of the relation. The Author relation has a parentID field that serves as a foreign key that corresponds to its parent element node. The parentCode field points to the
corresponding parent relation among multiple parent relations. The order field represents the occurrence order within the element. The nested field indicates the degree of recursion on the parentCode table.
(The figure shows a sample XML document conforming to the DTD of Fig. 1: a paper titled "XML Query…" written by B. Shin and M. Jin of Kyungnam, Korea, whose references include the paper "Efficiently Publishing…" by J. Shanmugasundaram and the book "Professional XML…" by K. Williams.)
Fig. 2. An XML document

(The figure shows sample contents of the relations Paper(paperID, parentID, parentCode, order, papertitle, nested, docID), Author(authorID, parentID, parentCode, order, name, country, university, docID), and Book(bookID, parentID, order, booktitle, docID), populated with the data of the document in Fig. 2.)
Fig. 3. Relations for storing XML documents with DTD in Figure 1
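For concreteness, the relations of Fig. 3 can be declared as follows. This is our own sketch: the column types are assumptions (the paper does not specify them), and the order field is written ord here only because ORDER is an SQL reserved word.

CREATE TABLE Paper (
  paperID    INTEGER PRIMARY KEY,
  parentID   INTEGER,        -- ID of the parent Paper tuple (NULL for top-level papers)
  parentCode INTEGER,        -- identifies which parent relation parentID refers to
  ord        INTEGER,        -- occurrence order within the parent element
  papertitle VARCHAR(200),
  nested     INTEGER,        -- degree of recursion via the reference element
  docID      INTEGER
);

CREATE TABLE Author (
  authorID   INTEGER PRIMARY KEY,
  parentID   INTEGER,
  parentCode INTEGER,        -- 1 = Paper, 2 = Book (encoding assumed from the sample data)
  ord        INTEGER,
  name       VARCHAR(100),
  country    VARCHAR(50),
  university VARCHAR(100),
  docID      INTEGER
);

CREATE TABLE Book (
  bookID     INTEGER PRIMARY KEY,
  parentID   INTEGER,
  ord        INTEGER,
  booktitle  VARCHAR(200),
  docID      INTEGER
);

The parentID/parentCode pair is what the view definitions in Section 4.2 join on.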
3.2 Path Table
In this paper, we use XQuery to query XML data stored in relational databases. XQuery is a notable language since it integrates the features of many languages such as XQL, Lorel, XML-QL, Quilt, and XPath. XQuery frequently uses path expressions to represent the path from the root node to an element node[11][12][13]. Unfortunately, relational database systems support neither XQuery nor its path expressions, so XQuery queries need to be translated into SQL statements in order to be processed in relational databases. To expedite translating XQuery queries into SQL queries and publishing relational data as XML documents, we create a Path table that stores the information on all paths from the root node to certain nodes. Figure 4 shows the Path table storing the
information on the path expressions of the XML DTD in Figure 1. The Path table stores all path expressions that occur in XML documents. Note that '#/' is used as a delimiter of steps instead of the '/' that is used in path expressions of XQuery[11]. ParentCode indicates the parent table of a node having an in-degree greater than one. Recursions indicates the number of recursions on the parentCode table for recursive elements. Recursive path expressions give rise to a problem because they can be of arbitrary complexity and might have an infinite path length. To address this problem, we store the information of all path expressions generated via recursive paths.

pathID  pathExp                                                  table   column      parentCode  recursions
1       #/paper                                                  Paper   NULL        NULL        NULL
2       #/paper#/papertitle                                      Paper   papertitle  NULL        NULL
3       #/paper#/authors                                         NULL    NULL        NULL        NULL
4       #/paper#/authors#/author                                 Author  NULL        NULL        NULL
5       #/paper#/authors#/author#/name                           Author  name        1           NULL
6       #/paper#/authors#/author#/country                        Author  country     1           NULL
7       #/paper#/authors#/author#/university                     Author  university  1           NULL
...     ...                                                      ...     ...         ...         ...
16      #/paper#/reference#/paper                                Paper   NULL        NULL        NULL
17      #/paper#/reference#/paper#/papertitle                    Paper   papertitle  1           1
18      #/paper#/reference#/paper#/authors                       NULL    NULL        NULL        NULL
19      #/paper#/reference#/paper#/authors#/author               Author  NULL        NULL        NULL
20      #/paper#/reference#/paper#/authors#/author#/name         Author  name        1           1
21      #/paper#/reference#/paper#/authors#/author#/country      Author  country     1           1
22      #/paper#/reference#/paper#/authors#/author#/university   Author  university  1           1
Fig. 4. Path table
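A minimal declaration of the Path table, again our own sketch with assumed types; the table and column attributes are quoted because they are reserved words in SQL:

CREATE TABLE Path (
  pathID     INTEGER PRIMARY KEY,
  pathExp    VARCHAR(500),   -- path expression with '#/' as the step delimiter
  "table"    VARCHAR(30),    -- relation holding the node (NULL for element-only nodes such as authors)
  "column"   VARCHAR(30),    -- attribute holding the node's value, if any
  parentCode INTEGER,        -- parent relation of a node with in-degree greater than one
  recursions INTEGER         -- number of recursions on the parentCode relation
);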
4 Query Processing and Publishing XML Documents
The XML query-processing system is depicted in Figure 5. It shows the components of the system and how XQuery queries are processed and XML documents are published.
4.1 XQuery Parser
User queries written in XQuery are passed to the XQuery Parser, which extracts the relational-schema information needed to retrieve the appropriate XML data, as well as the structural information of the desired XML documents to be used in publishing them. The former is transferred to the SQL Generator and the latter to the XML Generator.
4.1.1 Schema Information for Extracting XML Data
Path expressions play an important part in processing XQuery queries. We have stored the information on the paths from the root node to certain nodes in the Path table, as shown in Figure 4. We replace occurrences of '/' and '//' by '#/' and '#%/', respectively, in the path expressions of XQuery. This enables us to use LIKE clauses of SQL to find path expressions and obtain the schema information of the desired XML data[11].
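As a small illustration (our own example, not taken from the paper): after this rewriting, the path //paper/reference/book/booktitle becomes the pattern '#%/paper#/reference#/book#/booktitle', where '#%/' plays the role of the descendant axis '//', and the Path-table lookup is an ordinary LIKE query:

SELECT "table", "column", parentCode
FROM   Path
WHERE  pathExp LIKE '#%/paper#/reference#/book#/booktitle';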
We focus on FLWOR expressions, which are the core expressions of XQuery. We get the schema information necessary for extracting the desired XML data represented in XQuery queries by exploiting the Path table. Figure 6 shows an XQuery query that extracts the entire bibliography. The path expressions used to extract the desired XML data represented in Figure 6 are as follows: //paper/papertitle, //paper/authors/author/name, //paper/reference/book/booktitle, //paper/reference/book/authors/author/name, //paper/reference/paper/papertitle, //paper/reference/paper/authors/author/name. These path expressions are used to obtain the schema information of the desired XML data, such as the tables, the columns, and the hierarchical structure of tables, from the Path table. The extracted schema information is passed to the SQL Generator and is used in generating SQL statements to extract the desired XML data stored in relational databases.
(The figure shows the flow of the system: an XQuery query is passed to the XQuery Parser, which sends schema information to the SQL Generator and structure information onward for XML construction; the generated SQL defines views over the Author, Paper, Book, and Path tables, and the resulting tuples are passed to the XML Generator, which produces the final XML document.)
Fig. 5. XML query-processing system

for $p in document("paper.xml")//paper
return
  <paper>
    <papertitle>$p/papertitle
    { for $a in $p/authors/author
      return $a/name }
    for $b in $p/reference/book
    return $b/booktitle
    { for $ba in $b/author
      return $ba/name }
    for $pp in $p/reference/paper
    return
      <paper>
        <papertitle>$pp/papertitle
        { for $ppa in $pp/authors/author
          return $ppa/name }
Fig. 6. A query written in XQuery
(The figure shows a position-numbered sequence of structure tokens: FOR markers, element tags such as <paper> and <papertitle>, Data(type, column) markers such as Data(1,papertitle) and Data(2,name), and matching ENDFOR/FOREND markers such as ENDFOR(6), FOREND(31), and FOREND(0), encoding the nesting structure of the query in Fig. 6.)
Fig. 7. Structure information for publishing XML documents
4.1.2 Structure Information for Publishing XML Documents
The XQuery Parser extracts the structure information as well as the relational schema information of the desired XML document. Here, we focus on publishing the desired XML data written in FLWOR expressions of XQuery. The structure information is passed to the XML Generator to be used in publishing the desired XML document. The structure information extracted from the XQuery query given in Figure 6 is shown in Figure 7.

4.2 SQL Generator
To extract the desired XML data written in XQuery queries from relational databases, XQuery queries should be translated into the corresponding SQL statements. The SQL Generator generates SQL queries for creating views using the extracted schema information. First, the schema information, such as the tables, columns, and hierarchical structure of tables, should be drawn from the path expressions. The following SQL queries obtain the schema information; the extracted schema information is shown in Figure 8. The hierarchical structure of tables is also described in the figure and is represented through the parentID and parentCode attributes. ParentCode is introduced to accommodate multiple parent tables and recursive structures.

SELECT table, column, parentCode FROM Path
WHERE pathExp LIKE '#%/paper#/papertitle'                                  (1)

SELECT table, column, parentCode FROM Path
WHERE pathExp LIKE '#%/paper#/authors#/author#/name'                       (2)

Now, the SQL Generator generates SQL queries for extracting the desired XML data using the extracted schema information. The following are some of the SQL queries necessary for extracting the desired XML data of the query in Figure 6.

CREATE VIEW Paper_View( paperID, papertitle ) AS
SELECT paperID, papertitle FROM Paper                                      (3)

CREATE VIEW Author_View( paperID, authorID, name ) AS
SELECT P.paperID, A.authorID, A.name
FROM Paper_View P, Author A
WHERE P.paperID = A.parentID AND A.parentCode = 1                          (4)
Note that each view has the identifier of the tuple and of its parents, if they exist. We use the Node Outer Union technique[7], which is known to be one of the most efficient strategies for materializing relational data. Figure 9 shows the query execution plan of the Node Outer Union technique for our running example.
(The figure shows the extracted schema information: the Paper, Author, and Book relations with their attributes (e.g., Paper(paperID, parentID, parentCode, order, papertitle, docID) and Author(authorID, parentID, parentCode, order, name, country, university, docID)), arranged in levels and connected through parentID/parentCode edges.)
Fig. 8. Extracted schema information
(The figure shows the query execution plan: Paper_View is joined with the Paper, Author, and Book relations to form Paper_Paper_View, Author_View, and Book_View, which are further joined into Paper_Paper_Author_View and Book_Author_View; an outer union of these views produces tuples of the form (Type, PaperID, PaperTitle, AuthorID, Name, BookID, BookTitle, Paper_PaperID).)
Fig. 9. SQL query execution for extracting the desired XML data
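The outer-union step of Fig. 9 can be sketched in SQL as follows. This is our own illustration: the view names and output columns are taken from the figure, but the exact definitions of Book_View and Paper_Paper_View, as well as the literal type codes, are assumptions in the spirit of queries (3) and (4).

CREATE VIEW Book_View( paperID, bookID, booktitle ) AS
SELECT PV.paperID, B.bookID, B.booktitle
FROM Paper_View PV, Book B
WHERE PV.paperID = B.parentID;

CREATE VIEW Paper_Paper_View( paperID, paper_paperID, papertitle ) AS
SELECT PV.paperID, P.paperID, P.papertitle
FROM Paper_View PV, Paper P
WHERE P.parentID = PV.paperID AND P.parentCode = 1;

-- Node outer union: pad each view to a common schema and tag it with a type code.
SELECT 1 AS type, paperID, papertitle, NULL AS authorID, NULL AS name,
       NULL AS bookID, NULL AS booktitle, NULL AS paper_paperID
FROM   Paper_View
UNION ALL
SELECT 2, paperID, NULL, authorID, name, NULL, NULL, NULL
FROM   Author_View
UNION ALL
SELECT 3, paperID, NULL, NULL, NULL, bookID, booktitle, NULL
FROM   Book_View;
-- ...and similarly for the remaining views (type codes 4-6 in Fig. 10).

Sorting this union by (paperID, paper_paperID, bookID, authorID), as described in Section 4.3, puts the tuples into the document order shown in Fig. 10.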
4.3 XML Constructor
The desired XML data is extracted by executing, in the relational database, the SQL queries generated by the SQL Generator. The result data is represented in flat structures. Hence, we have to put the result into a hierarchical structure in order to publish it as an XML document. We use the Sorted Outer Union technique[7]. In the Sorted Outer
Union technique, the key to structuring the relational data is to order it the way that it needs to appear in the resulting XML document. Thus, in our running example, sorting the result of the node outer union by the sort sequence (paperID, paper_paperID, bookID, authorID) will ensure that the final result is in the desired document order. Figure 10 shows the result tuples of executing the corresponding SQL queries in the Sorted Outer Union technique. The type column in Figure 10 is added to the result of the Sorted Outer Union to indicate the corresponding view that was defined by the SQL Generator. This is useful for the tagging performed by the XML Generator.

type  paperID  papertitle           authorID  name                bookID  booktitle           paper_paperID
1     1        XML Query Proc...    NULL      NULL                NULL    NULL                NULL
2     1        NULL                 1         B.Shin              NULL    NULL                NULL
2     1        NULL                 2         M.Jin               NULL    NULL                NULL
3     1        NULL                 NULL      NULL                1       Professional XML…   NULL
6     1        NULL                 4         K.Williams          1       NULL                NULL
4     1        Efficiently Publis…  NULL      NULL                NULL    NULL                2
5     1        NULL                 3         J.Shanmugasundaram  NULL    NULL                2
1     2        Efficiently Publis…  NULL      NULL                NULL    NULL                NULL
2     2        NULL                 3         J.Shanmugasundaram  NULL    NULL                NULL
Fig. 10. Resulting tuples

XMLGenerator( list_pointer XQuery, recordset record_set ) {
  boolean inner_for = false;          // inside a FOR/LET block?
  list_morpheme w;
  int front = -1, rear = -1;          // queue of pending tags/text
  for( int i = 0; i < MAX_MORPHEME; i++ ) {
    w = XQuery[i];
    if( w->morpheme == "FOR" || w->morpheme == "LET" ) {
      if( inner_for == true && front != rear )
        queue_output( &front, rear ); // flush the tags queued for the enclosing block
      else
        inner_for = true;
    } else if( w->morpheme == "DATA" ) {
      if( w->tuple_type == record_set("type") ) {
        if( front != rear ) queue_output( &front, rear );
        printf( "%s", record_set(*att_name) );   // emit the value from the current tuple
      } else {
        record_set.movenext;                     // try the next result tuple
        if( w->tuple_type == record_set("type") ) {
          if( front != rear ) queue_output( &front, rear );
          printf( "%s", record_set(*att_name) );
        } else {
          i = break_for( i );                    // no matching tuple: skip this FOR/LET block
          front = rear;                          // discard the queued tags
        }
      }
    } else if( w->morpheme == "FOREND" || w->morpheme == "LETEND" )
      i = w->start_for - 1;                      // loop back to repeat the block for the next tuple
    else {
      if( inner_for == false )
        printf( "%s", w->morpheme );             // outside any block: emit directly
      else {
        rear = rear + 1;                         // inside a block: queue until a matching tuple is found
        queue[rear] = w->morpheme;
      }
    }
  }
}

void queue_output( int *front, int rear ) {
  while( *front != rear ) {
    *front = *front + 1;
    printf( "%s", queue[*front] );
  }
}

int break_for( int i ) {
  while( XQuery[i]->morpheme != "FOREND" && XQuery[i]->morpheme != "LETEND" )
    i++;
  return i;
}
Fig. 11. Algorithm to generate the final XML document
4.4 XML Generator
The XML Generator generates the final XML document by using the structure information and the resulting tuples. Figure 11 shows the algorithm used to tag and generate the final XML document. As shown in the figure, the algorithm takes the extracted structure
information and the resulting tuples as inputs, and publishes the final XML document. It generates a tagged XML document by traversing the extracted structure information, such as the one shown in Figure 7. While traversing the structure information, appropriate values from the resulting tuples are chosen and inserted into the final XML document.
5 Conclusion
In this paper, we have proposed a method for storing and querying XML data using relational databases. We proposed association inlining, which extends shared inlining and hybrid inlining to reduce relational fragments and excessive joins. We aim to cope with the problems originating from the discrepancy between the hierarchical structure of XML and the flat structure of relational databases. Additionally, we stored the structure information of XML data in the Path table, which is used in publishing the desired XML documents. We developed a technique to translate XML queries written in XQuery into SQL statements by exploiting the schema information drawn from the Path table. The information on the structure of the XML data and the resulting tuples produced by executing the corresponding SQL queries are exploited in generating the desired XML documents represented in queries written in XQuery FLWOR expressions. Thus, the desired XML documents are published simply. The efficiency of our association inlining technique is to be further verified in terms of relational fragments and the number of joins. The simplicity of our technique for generating XML documents using schema information based on path expressions and structure information is to be evaluated against reasonable datasets.
References
1. Carey, D., Florescu, D., Ives, Z., Lu, Y., Shanmugasundaram, J., Shekita, E., Subramanion, S.: XPERANTO: Publishing Object-Relational Data as XML. Informal Proceedings of the International Workshop on the Web and Databases (2000) 105–110
2. Deutsch, A., Fernandez, M., Suciu, D.: Storing Semi-Structured Data with STORED. Proceedings of the ACM SIGMOD Conference on Management of Data (1999) 431–442
3. Fernandez, M., Tan, W., Suciu, D.: SilkRoute: Trading Between Relations and XML. Proceedings of the 9th W3C Conference (2000) 723–745
4. Florescu, D., Kossmann, D.: Storing and Querying XML Data Using an RDBMS. IEEE Data Engineering Bulletin, Vol. 22, No. 3 (1999) 27–34
5. Funderburk, J.E., Kiernan, G., Shanmugasundaram, J., Shekita, E., Wei, C.: XTABLES: Bridging Relational Technology and XML. IBM Systems Journal (2002) 616–641
6. Shanmugasundaram, J., Kiernan, J., Shekita, E., Fan, C., Funderburk, J.: Querying XML Views of Relational Data. Proceedings of the 27th VLDB Conference (2001) 261–270
7. Shanmugasundaram, J., Shekita, E., Barr, R., Carey, M., Lindsay, B., Pirahesh, H., Reinwald, B.: Efficiently Publishing Relational Data as XML Documents. Proceedings of the 26th VLDB Conference (2000) 65–76
8. Shanmugasundaram, J., Shekita, E., Kiernan, J., Krishnamurthy, R., Viglas, E., Naughton, J., Tatarinov, I.: A General Technique for Querying XML Documents Using a Relational Database System. SIGMOD Record, Vol. 30, No. 3 (2001) 20–26
9. Shanmugasundaram, J., Tufte, K., He, G., Zhang, C., DeWitt, D., Naughton, J.: Relational Databases for Querying XML Documents: Limitations and Opportunities. Proceedings of the 25th VLDB Conference (1999) 302–314
10. Williams, M., Brundage, M., Dengler, P., Gabriel, J., Hoskinson, A., Kay, M., Maxwell, T., Ochoa, M., Papa, J., Vanmane, M.: Professional XML Databases. Wrox Press (2000)
11. Yoshikawa, M., Amagasa, T.: XRel: A Path-Based Approach to Storage and Retrieval of XML Documents Using Relational Databases. ACM Transactions on Internet Technology, Vol. 1, No. 1 (2001) 110–141
12. W3C Recommendation. XML Path Language (XPath) Version 1.0. http://www.w3c.org/TR/xpath/ (1999)
13. W3C Recommendation. XQuery 1.0: An XML Query Language. http://www.w3c.org/TR/xquery/ (2002)
Improving Query Performance Using Materialized XML Views: A Learning-Based Approach Ashish Shah and Rada Chirkova Department of Computer Science North Carolina State University Campus Box 7535, Raleigh NC 27695-7535 {anshah,rychirko}@ncsu.edu
Abstract. We consider the problem of improving the efficiency of query processing on an XML interface of a relational database, for predefined query workloads. The main contribution of this paper is to show that selective materialization of data as XML views reduces query-execution costs in relatively static databases. Our learning-based approach precomputes and stores (materializes) parts of the answers to the workload queries as clustered XML views. In addition, the data in the materialized XML clusters are periodically incrementally refreshed and rearranged, to respond to the changes in the query workload. Our experiments show that the approach can significantly reduce processing costs for frequent and important queries on relational databases with XML interfaces.
1 Introduction
The Extensible Markup Language (XML) [18] is a simple and flexible format that is playing an increasingly important role in publishing and querying data in the World Wide Web. As XML has become a de facto standard for business data exchange, it is imperative for businesses to make their existing data available in XML for their partners. At the same time, most business data are still stored in relational databases. A general way to publish XML data in relational databases is to provide XML interfaces over the stored relations and to enable querying the interfaces using XML query languages. In response to the demand for such frameworks, database systems with XML interfaces over non-XML data are increasingly available, notably relational systems from Oracle, IBM, and Microsoft. In this paper we consider the problem of improving the efficiency of evaluating XML queries on relational databases with XML interfaces. When querying a data source using its XML interface, an application issues a query in an XML query language and expects an answer in XML. If the data source is a relational database, this way of interacting with the database adds new dimensions to the old problem of efficiently evaluating queries on relational data. In the standard scheme for evaluating queries on an XML interface of a relational database, the relational query-processing engine computes a relation that is an answer to the query on the stored relational data; see [9] for an overview. On top of this process, the query-processing engine has to (1) translate the query from an XML query language into SQL (the resulting query is then
posed on the relational data), and (2) translate the answer into XML. To efficiently process a query on an XML interface of a relational database, the query-processing engine has to efficiently perform all three tasks. We propose an approach to reducing the amount of time the query-processing engine spends on answering queries on XML interfaces of relational databases. The idea of our approach is to circumvent the standard query-answering scheme described above, by precomputing and storing, or materializing, some of the relational data as XML views. If the DBMS has chosen the "right" data to materialize, it can use these XML views to answer some or most of the frequent and important queries on the data source without accessing the relational data. We show that our approach can significantly reduce the time to process frequent and important queries on relational databases with XML interfaces. Our approach is not the first view-based approach to the problem of efficiently computing XML data on relational databases. To clarify how our approach differs from previous work, we use the terms (1) view definitions, which are data specifications given in terms of stored data (or possibly in terms of other views), and (2) view answers, which are the data that satisfy the definition of a view on the database. In past work, researchers have looked into the problem of efficiently evaluating XML queries over XML view definitions of relational data (e.g., SilkRoute [8] or XPERANTO [16]). We build on the past work by adding a new component to this framework: We incrementally materialize XML view answers to frequent and important XML queries on a relational database, using a learning approach. To the best of our knowledge, we are the first to propose this approach. The following are the contributions of this paper:
• We develop a learning-based approach to materializing relational data in XML.
• We propose a system architecture that takes advantage of the materialized XML to reduce the total query-execution times for incoming query workloads.
• We show how to transform a purely relational database system to accommodate materialized XML and our system architecture.
Using our approach may result in significant efficiency gains on relatively static databases. Moreover, it is possible to combine our solution with the orthogonal approaches described in [8,16], thus achieving the combined advantages of the two solutions. The remainder of the paper is organized as follows. Section 1.1 discusses related work. In Section 2 we formalize the problem and outline our approach. In Sections 3 and 4, we describe the system architecture and the learning algorithm. Section 5 describes experimental results. We discuss the approach in Section 6, and conclude with Section 7.
1.1 Related Work
The problem of XML query answering has recently received a lot of attention. [11, 13] propose a logical foundation for the XML data model. [3] describes a system for data-model management, with tools to map schemas between XML and relations. [6]
looks into developing XML documents in a normal form that guarantees some desirable properties of the document format. [7] proposes an approach to efficiently representing and querying semistructured Web data. [10] proposes an XML data model and a formal process, to map Web information sources into commonly perceived logical models; the approach provides for easy and efficient information extraction from the World-Wide Web. [14] describes an approach to XML data integration, based on an object-oriented data model. [15] proposes an XML data-management system that integrates relational DBMS, Java and XSLT. [20] reports on a system that manages XML data based on a flexible mapping strategy; given XML data, the system stores data in relations, for efficient querying and manipulation. XCache [2] describes a web-based XML-querying system that supports semantic caching; ACE-XQ [4] is a caching system for queries in XQuery; the system uses sophisticated cache-management mechanisms in the XML context. SilkRoute [8] is a framework for publishing relational data using XML view definitions. The approach incorporates an algorithm for translating queries from XQuery into SQL and an optimization algorithm for selecting an efficient evaluation plan for the SQL queries. XPERANTO [16] is an XML-centric middleware layer that lets users query and structure the contents of a relational database as XML data and thus allows them to ignore the underlying relations. Using the XPERANTO query facility and the default XML view definition of the underlying database, it is possible to specify custom XML view definitions that better suit the needs of the applications. The motivation for using views in query processing comes from information-integration applications; one approach, called data warehousing [17], uses materialized views. [1,5,19,21] propose a unified approach to the problem of view maintenance in data warehouses. In our work, we use a learning method called concept, or rule, learning [12].
2 Problem Specification and Outline of the Proposed Approach
In this section we specify the problem of improving the efficiency of answering queries on XML interfaces of relational databases, and outline our solution. An XML-relational data source ("data source") comprises a relational database system and an XML interface. For a query in an XML query language, to evaluate the query on a data source means to obtain an XML answer to the query via the XML interface of the source. Suppose there is a finite set of important queries, with associated relative weights, that users or applications frequently pose on the data source. We call these queries a query workload. In our cost model, the cost of evaluating a query on a data source is the total time elapsed between posing the query on the source and obtaining an answer to the query in XML. The total cost of evaluating a query workload on a data source is the weighted sum of the costs of evaluating all workload queries, using their relative weights. We consider the problem of improving the efficiency of evaluating a query workload on a data source; the goal here is to reduce the total cost of evaluating a given query workload on a given data source. To improve the efficiency of evaluating a query workload on a data source, we propose an approach based on incrementally materializing XML views of workload-
relevant data. To materialize a view is to compute and store the answer to the view on the database. We materialize views in XML rather than in relations, to reduce or eliminate the time required to translate (1) the workload queries from an XML query language into SQL, and (2) the relational answers to the queries into XML. In the proposed system architecture, when answering a query, the query-processing engine first searches the materialized XML views, rather than the relational tables; if the query can be answered using the views, there is no need to access the underlying relations. Using this approach may result in significant efficiency gains when the underlying relational data do not change very often. In our approach, we need to decide which data to materialize in XML. We use a learning-based approach to materialize only the data that is needed to answer the workload queries on the data source. In database systems, it is common to maintain statistics on the stored data, for the purposes of query optimization [9]. We maintain similar statistics on access rates to the data in the stored relations, and materialize the most frequently accessed tuples in XML. We use learning techniques combined with the access-rate statistics to decide when and how to change, incrementally, the set of records materialized in XML. We manage the materialized data using the concept of clustering. In our approach, clustering means combining related XML records into a single materialized XML structure. These XML structures are stored in a special relation and can be queried using the data source’s XML query language. (In the remainder of the paper we assume that XQuery is the language of choice.) Storing the most frequently accessed tuples in materialized XML clusters increases the probability that future workload queries will be satisfied by the clusters. To answer those queries that are not satisfied by the XML clusters, we use the relational query-processing engine.
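In symbols (our own formalization of the cost model above, with notation that does not appear in the paper): for a workload $Q = \{(q_1, w_1), \ldots, (q_k, w_k)\}$ and a data source $D$,

$$\mathrm{cost}(Q, D) = \sum_{i=1}^{k} w_i \cdot c(q_i, D),$$

where $c(q_i, D)$ is the elapsed time between posing $q_i$ on $D$ and obtaining its answer in XML; the goal is to reduce $\mathrm{cost}(Q, D)$.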
3 The System Architecture
We now discuss the architecture of the system. We describe the query-processing subsystem, the required changes to the schema of the originally relational data source, and the process of generating workload-related XML data from the stored relations.
3.1 The Query-Processing Subsystem
In this section we describe a typical query path taken by an input query; see Fig. 1. The solid lines in Fig. 1 show the primary query path, which is taken for all queries on the data. If a workload query can be answered by the materialized XML clusters, then only the primary path is taken. Otherwise, the query next follows the secondary query path, shown in dotted lines in Fig. 1; here, the input query is pushed down to the relational level and is answered using the stored relations, rather than the materialized XML. The XML clusters are stored as values of an attribute in a special relation. The system queries the relation in SQL to find the most relevant cluster, and then poses the XQuery query on the cluster. The schema for the clusters is specified by the database administrator.
Fig. 1. The Query-Processing subsystem
3.2 Setting Up Materialized XML Clusters
In this section we describe how to set up materialized XML clusters, by transforming the relational-database schema to accommodate XML. For simplicity, we use a schema with just two relations, R(A1,…, An) and S(B1,…, Bm). A1 is the primary key of the relation R.
3.2.1 Modifying the Given Relational Schemas
In our approach, for tuples of certain relations we keep track of how many times each tuple is accessed in answering the workload queries. To enable these access counts, we change the schema of the relational data source, by adding an extra attribute to the schema of one or more of the stored relations. The most likely candidates for this schema change are the relations of interest, which are relations that have high access rates, primarily large relations that are involved in expensive joins. For instance, suppose we have a query that involves a join of the relations R and S. If the relation R is large, the query would be expensive to evaluate; hence, we consider R a suitable candidate for the schema change. (Alternatively, the database administrator can make the choice of the schema to modify.) Suppose we decide to add an attribute A(n+1) to the schema of the relation R; we will store access counts for the tuples in relation R as values of this attribute. R(A1,…,An, A(n+1)) is the schema of the modified relation. Initially, the value of A(n+1) is NULL in all tuples.
3.2.2 Creating the Relations for the Materialized XML Clusters
We now define the schema of the relation T that will store the materialized XML clusters, as T(A1, C). Recall that A1 is the primary key of the relation R; using this attribute in the relation T helps us index the materialized XML clusters in the same way as the relation R. The attribute C is used to store the materialized XML clusters in text format. To summarize, we set up materialized XML clusters by doing the following:
1. Select a relation of interest (R in the example) to modify.
2. Add an access-count attribute to the schema of the selected relation.
3. Create a new relation (T in the example) to hold the materialized XML version of the data in the selected relation of interest (R in the example).
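As a concrete illustration of the three steps above, here is a minimal SQL sketch for the running example; the attribute name access_count, the CLOB type for C, and the exact statements are our assumptions (the paper only names the new attribute A(n+1) and leaves types unspecified):

-- Step 2: add the access-count attribute A(n+1) to the relation of interest R.
ALTER TABLE R ADD access_count INTEGER;     -- initially NULL, as described in Section 3.2.1

-- Step 3: create the relation T(A1, C) that will hold the materialized XML clusters.
CREATE TABLE T (
  A1 INTEGER PRIMARY KEY REFERENCES R(A1),  -- same key as R, so T is indexed like R
  C  CLOB                                   -- the materialized XML cluster, stored as text
);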
4 The Learning Algorithm
In this section we describe a learning algorithm that populates and incrementally maintains the XML clusters. We first describe how to select relational tuples for materialization, and then explain our clustering strategy for building an XML tree of "interesting records." Our general approach is as follows. When answering queries, we first pose each query on the materialized XML clusters in the relation T that we have added to the original stored relations. Whenever a query cannot be answered using the materialized XML clusters (or at system startup, see next paragraph), the query is translated into SQL and pushed down to the stored relations. Each time this process is activated, the system increments access counts for all tuples that contribute to the answer to the SQL query. At system startup, the relation T that holds the materialized XML is empty. As a result, all incoming queries have to be translated into SQL and pushed to the relational query-processing engine. The materialization phase starts when the access counts in the relations of interest exceed an empirically determined threshold value (see Section 4.3); all tuples whose access counts are greater than the threshold value are materialized into XML. The schema for the materialized XML is specified by the input XQuery workload. (Alternatively, it can be specified by the database administrator.) As the learning algorithm executes over an extended time period, the most frequently accessed tuples in the relations of interest are materialized into XML and stored in the relation T.
4.1 Learning I: Discovering Access Patterns in the Relations of Interest
To incrementally materialize and maintain XML clusters of workload-relevant data, the system periodically runs a learning process that translates frequently accessed relational tuples into XML and reorders the resulting records in a hierarchy of clusters. We now describe the first stage of the learning process, where the system discovers access patterns in the relations of interest by using the access-count attribute. Once the access pattern is established, the system translates the most frequently accessed tuples into XML. To obtain the current access pattern, the system needs to execute the following steps.
1. (This step is executed during the system startup.) Input an expected query stream and set up the desired output XML schema.
2. Pose the incoming workload queries on the stored relations; in answering the queries, increment the access counts for those tuples in the relations of interest that contribute to the answers to the queries.
During the system startup we use an expected, rather than real, query stream to determine access patterns in the relations of interest. For example, if each workload
query may use one of the given 250K keywords with given frequencies, then for our expected query stream we select the 1000 most-frequent keywords.
4.2 Learning II: Materializing XML and Forming Clusters
Once the first stage of the learning process has discovered the access patterns in the relations of interest, the system performs, in several iterations, the following steps:
1. To generate the materialized XML records, retrieve from the relations of interest all tuples whose access counts are greater than the predefined threshold value.
2. Translate the data into XML and store it in the materialized XML relation.
3. Form clusters (also see Section 4.3):
a. Find all relational tuples that are related to the materialized XML, w.r.t. the workload queries.
b. Select those of the tuples whose access counts exceed the threshold value, and translate them into XML.
c. Cluster the tuples and the materialized XML into a single XML tree.
4.3 The Clustering Phase
In our selective materialization, we use clustering to increase the scope of materialized XML beyond the relations of interest, by incrementally adding to the XML records "interesting records" from other relations. The criterion for adding these interesting records is the same as the criterion for materializing relational tuples in XML. More precisely, the relations with the most frequently accessed records are selected in descending order of access frequency. For example, if there are three relations R1, R2, R3, in descending order of tuple-access frequencies, then we can form clusters starting with R1 and R2, then R2 and R3, and so on. The relation T now contains a single XML structure, which holds related records with high access rates. In each cluster, the records are sorted in the order of their access counts. In the current implementation, the schema for the cluster is provided as an external input (see Fig. 1). Choosing cluster schemas automatically is a direction of future work. We now explain with an example how to form hierarchies of clusters. Consider a database with four relations, R1-R4, in descending order of tuple-access frequency. We first modify the relation R1, to store the XML clusters generated from the tuples retrieved from a join of R1 and R2 on some attribute. Similarly, we modify R2 to store a join of R2 and R3, and so on. With every join of Rn and Rn+1, we form the most frequently accessed clusters; the clusters form a hierarchy w.r.t. their access rates: for example, the cluster formed from R1 and R2 will have higher access rates than the cluster for R2 and R3. In our experiments, we have explored the first level of clustering for simple queries; see Section 5. We are working on implementing multiple levels of clustering for more complex queries. In our approach we determine the threshold value empirically: at system startup time, we repeat the learning process several times to arrive at a suitable value. The choice of the threshold value is a tradeoff between larger materialized views and better query-execution times: a lower threshold value means more tuples will be
materialized as XML; thus more queries will get satisfied in the XML views. A higher threshold value prevents most of the relational data from being selected for materialization, which limits the number of queries that can be answered using the views. The key is to strike a balance between the point at which the system materializes tuples and the proportion of records to be materialized. In our future work, we intend to make the choice of this threshold value dynamic.
5 Experimental Setup and Results
5.1 The Setup
The CDDB collection [22] is a database that stores information about CDs and CD tracks. The CDDB schema comprises two relations, Disc(cd_id,cd_title,genre,num_of_tracks) and Tracks(cd_id,track_title). (For simplicity, we omit other attributes of the relations in CDDB.) The Disc relation has 250K tuples. Each CD has an average of 10 tracks, stored in the Tracks relation. Fig. 2 shows some tuples in the two relations in CDDB. In our experiments, we used Oracle 9.2 on a Dell Server P4600 with Intel Xeon CPU at 2GHz and 2GB of memory running on Microsoft Windows 2000. We implemented the middleware interface in Java using Sun JDK 1.4, and ran it on an Intel Pentium II 333MHz machine with 128MB of memory on Red Hat Linux 7.3. We conducted a significant number of runs to ensure that the effect of network delays on our experiments is minimal.
Fig. 2. Some tuples in relations Disc and Track in the CDDB database
Fig. 3. Data in the Disc relation with the modified schema
Fig. 4. Schema for the relation that holds the materialized XML and an example of a simple cluster
To determine access patterns for the Disc relation, we added a new attribute, count, to the schema; this attribute holds an access count for each CD record. The rest of the database schema is unchanged. (Section 3.2.1 explains how to choose relations for the schema change.) Fig. 3 shows the tuples in the Disc relation with the modified schema. Fig. 4 shows the table XmlDiscTrack. This new relation holds materialized XML as text data in record format. The process of defining this materialized table is explained in Section 3.2.2. In the XmlDiscTrack relation that we create in the CDDB database, the attributes cd_id and count are the same as in the Disc relation. The value of the count attribute in XmlDiscTrack equals the value of count in the corresponding tuple in the Disc relation at the point in time when that tuple was materialized as XML. The XML attribute in XmlDiscTrack holds the materialized XML. For example, the value of the XML attribute in the tuple for the 'Air Supply' CD in XmlDiscTrack is shown in Fig. 4.
Workload queries: The workload queries in our experiments use CD titles as keywords. In our architecture, the query-processing engine first tries to answer each workload query by searching the XML clusters in the relation XmlDiscTrack; if it fails to find an answer there, the engine then searches the Disc table using SQL. The two query paths are shown in Fig. 1 in Section 3. In learning stage I, whenever the system answers an input query using the original stored relations in the CDDB database, it increments the access count for each answer tuple in the Disc table. For learning stage II to be invoked, the access counts have to reach the threshold value; see Section 4.2. In the second stage of learning, we materialize in XML all the tuples in the Disc relation whose access counts exceed the threshold value. The generation of materialized XML is explained in Section 4. Fig. 4 shows the materialized XML for the CD "Air Supply" generated from Disc and Track.
Cluster formation: This phase is invoked for every tuple in the relation XmlDiscTrack that holds materialized XML. The XML shown in Fig. 4 is only suitable for answering queries that ask for the tracks of that CD; this restriction limits the scope of the approach. Hence, we form clusters. Clusters are formed by identifying related records. The algorithm for selecting these related records is explained formally in Section 4. The tuples in the Disc relation that match 'Air Supply' and that have access counts above the threshold value are chosen to form the clusters. These tuples are converted to XML and merged into the original structure. An example of the merged structure is shown in Fig. 5. Once the clustering phase is completed, the XML shown in Fig. 5 replaces the XML in Fig. 4.
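To make the two learning stages concrete for the CDDB example, here is a hedged SQL sketch; the threshold of 50 and the exact statements are our assumptions, not taken from the paper (and column names such as count may need quoting in some SQL dialects):

-- Learning stage I: increment access counts for Disc tuples that contribute to an answer.
UPDATE Disc
SET    count = count + 1
WHERE  cd_title LIKE '%Air Supply%';

-- Learning stage II: find the CDs whose access counts exceed the (assumed) threshold;
-- for each such cd_id the middleware builds the clustered XML of Fig. 5 and stores it,
-- e.g. INSERT INTO XmlDiscTrack(cd_id, count, XML) VALUES (...).
SELECT cd_id, cd_title, count
FROM   Disc
WHERE  count > 50;

-- Query path: look up a cluster by key before falling back to Disc/Track; how the "most
-- relevant" cluster is located is not specified in the paper, so this lookup is illustrative.
SELECT XML
FROM   XmlDiscTrack
WHERE  cd_id = 1;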
5.2 Experimental Results
In this section we show the results of our experiments on the feasibility of our learning-based materialization approach. Comparing the efficiency of querying materialized XML to the efficiency of getting answers to SQL queries on the stored relational data. The objective of this experiment was to analyze whether XQuery-based querying is effective on materialized XML views, as compared to using SQL on the stored relations.
Fig. 5. An example of materialized clustered XML
Fig. 6. Comparison between using random queries on materialized XML and on relational data
Fig. 7. Average query times for a random set of 1000 repeated queries on relational data and an analysis of the time required to convert this relational data to XML
Fig. 7 shows query-execution times for 5000 XML records, for the query SELECT * FROM Disc, Track WHERE Disc.cd_id = Track.cd_id AND Disc.cd_title LIKE '%Eddie Murphy%'. Interesting tuples in the join of Disc and Track are stored in XML. The relational tables hold 2.5 million tuples (250K CDs times 10 tracks). The cluster records are similar to the XML shown in Fig. 5. The graph is a plot of query-execution times for XQuery queries based on the attribute cd_title of the Disc relation. The experiment shows that processing a query on an XML view is faster than using SQL on the relations and then converting the answer to XML. Fig. 6 shows that executing SQL queries is more time-consuming than executing their XQuery counterparts on materialized XML. In pushing XQuery queries to the relational data, converting the answers into XML is a major overhead. We analyze the overhead in Fig. 7, which shows that the process of converting answer tuples into XML is the most expensive part of answering queries. Hence, it would be beneficial if such data were materialized.
Analyzing the maximum time spent in converting query answers to XML. The objective of this experiment was to analyze the time spent on translating relational query answers into XML. The graph in Fig. 7 is a plot of query-execution times for SQL queries based on the attribute cd_title of the Disc relation. The graph shows, as a solid line, the mean execution times for relational queries plus the times to convert the answers into XML. We see that of the total time of around 190 ms, converting relational data into XML takes around 60ms (the dotted line). While the relational query takes 190 ms – 60 ms = 130 ms to execute, there is an overhead of 60 ms in converting the relational data to XML. These results are the motivation for using materialization techniques.
Simulation runs to show the decrease in total query-execution times when querying the materialized XML alongside the stored relational data.
Fig. 8. Average query-execution times for a randomized set of 1000 repeated queries on a combination of relational and materialized data
Fig. 8 shows 30 simulation runs for a query workload of 1000 randomly selected CD titles. The X-axis shows the query ID, while the Y-axis shows the query-execution times. The vertical dotted lines show the points at which XML materialization took place. (Recall that the system periodically runs the learning
algorithm.) It can be seen in Fig.8 that after every learning stage, the slope of the curve falls. Intuitively, after new learning has taken place, the XML clusters can satisfy a higher number of queries, with higher efficiency.
6 Discussion

The proposed approach is to store materialized XML views in a relational database using learning. One extreme of the approach is to materialize the entire relational database as XML and then use a native XML engine to answer queries. This way, we would be able to avoid the overhead of translating all possible queries on the data source into SQL, and of translating the relational answers to the queries into XML. However, query performance might degrade considerably, as XML query-answering techniques are slower than their relational counterparts. In addition, the system would have to incur a significant overhead of keeping the XML consistent with the underlying relations.

In Section 4 we described the process of grouping together related records in XML clusters. This approach allows a database system to incrementally find an optimal proportion of XML records that can be accessed faster than the relational tables. This optimal proportion can be arrived at by varying the size of the clustered XML and the threshold value. Additional improvements can be made when user applications maintain local caches: it may be beneficial to prefetch the XML data in the application's cache, so that future queries from the application have a higher chance of being satisfied locally.

In our approach, the added counters and flags in the relations have to be updated frequently and thus create an overhead. In our future work, we plan to reduce the overhead by updating tuple-access counts offline or during periods of lower query loads. We materialize only frequently accessed tuples; thus, only a fraction of the database is materialized as XML at any given time. (The clusters are recomputed from scratch every time the learning phase is invoked.) The advantage of the learning approach is to balance the proportion of data in relations and XML, by materializing the tuples that are in the answers to multiple queries. As the materialized XML is generated based on the access counts of relational tuples, there may be queries that need to access both the materialized XML and the relational database. We plan to explore how to handle such queries in our future work.
7 Conclusions and Future Work

We have described a view- and learning-based solution to the problem of reducing total query-execution times in relational data sources with XML interfaces. Our approach combines learning techniques with selective materialization; our experiments show that it can prove beneficial in improving query-execution speeds in relatively static databases. This paper describes an implementation that is external to the database engine. We are currently working on incorporating our approach inside a relational database
management system. We are looking into automating schema definition for materialized XML clusters, by using the information about past query workloads and the relations accessed by these workloads. We are working on developing a learning approach to selecting “interesting records” for XML clusters. We plan to implement a dynamic approach to selecting the threshold value in XML materialization. We plan to devise better strategies for (1) prioritizing XML records within clusters, and (2) automatically dematerializing obsolete XML data. Finally, we plan to automate the choice of the relations of interest, given a query workload.
References

1. J. Chen, S. Chen, and E.A. Rundensteiner. A transactional model for data warehouse maintenance. In Proc. of the 21st Int'l Conference on Conceptual Modeling (ER), 2002.
2. L. Chen, E.A. Rundensteiner, and S. Wang. XCache: A semantic caching system for XML queries. In Proc. 2002 ACM SIGMOD International Conference on Management of Data, 2002.
3. K.T. Claypool, E.A. Rundensteiner, X. Zhang, H. Su, H.A. Kuno, W.C. Lee, and G. Mitchell. Gangam — a solution to support multiple data models, their mappings and maintenance. In Proc. 2001 ACM SIGMOD International Conference on Management of Data, 2001.
4. L. Chen, S. Wang, E. Cash, B. Ryder, I. Hobbs, and E.A. Rundensteiner. A fine-grained replacement strategy for XML query cache. In Proc. Fourth ACM CIKM International Workshop on Web Information and Data Management (WIDM 2002), pages 76–83, 2002.
5. J. Chen, X. Zhang, S. Chen, A. Koeller, and E.A. Rundensteiner. DyDa: Data warehouse maintenance in fully concurrent environments. In Proc. ACM SIGMOD, 2001.
6. D.W. Embley and W.Y. Mok. Developing XML Documents with Guaranteed "Good" Properties. In Proc. 20th International Conference on Conceptual Modeling (ER), pages 426–441, 2001.
7. I.M.R.E. Filha, A.S. da Silva, A.H.F. Laender, and D.W. Embley. Using nested tables for representing and querying semistructured web data. In Proceedings of the Advanced Information Systems Engineering, 14th International Conference (CAiSE 2002), 2002.
8. M. Fernandez, Y. Kadiyska, D. Suciu, A. Morishima, and W.C. Tan. SilkRoute: A framework for publishing relational data in XML. ACM Trans. Database Systems, 27(4):438–493, 2002.
9. Yannis E. Ioannidis. Query optimization. In Allen B. Tucker, editor, The Computer Science and Engineering Handbook, pages 1038–1057. CRC Press, 1997.
10. Z. Liu, F. Li, and W.K. Ng. Wiccap data model: Mapping physical websites to logical views. In Proc. 21st International Conference on Conceptual Modeling (ER), 2002.
11. Liu Mengchi. A logical foundation for XML. In Proc. Advanced Information Systems Engineering, 14th International Conference (CAiSE 2002), pages 568–583, 2002.
12. Tom M. Mitchell. Generalization as search. Artificial Intelligence, 18:203–226, 1982.
13. Liu Mengchi and Tok Wang Ling. Towards declarative XML querying. In Proc. 3rd International Conference on Web Information Systems Engineering (WISE 2002), pages 127–138, 2002.
14. K. Passi, L. Lane, S.K. Madria, B.C. Sakamuri, M.K. Mohania, and S.S. Bhowmick. A model for XML schema integration. In Proc. 3rd Int'l Conf. E-Commerce and Web Technologies, 2002.
15. Giuseppe Psaila. ERX: An experience in integrating entity-relationship models, relational databases, and XML technologies. In Proc. XML-Based Data Management and Multimedia Engineering EDBT Workshop, 2002.
16. J. Shanmugasundaram, J. Kiernan, E.J. Shekita, C. Fan, and J. Funderburk. Querying XML views of relational data. In Proc. 27th Int'l Conference on Very Large Data Bases, 2001.
17. Jennifer Widom. Research problems in data warehousing. In Proc. Fourth International Conference on Information and Knowledge Management, pages 25–30, 1995.
18. Extensible Markup Language (XML). http://www.w3.org/XML.
19. X. Zhang, L. Ding, and E.A. Rundensteiner. Parallel multi-source view maintenance. VLDB Journal: Very Large DataBases, 2003. (To appear).
20. X. Zhang, M. Mulchandani, S. Christ, B. Murphy, and E.A. Rundensteiner. Rainbow: mapping-driven XQuery processing system. In Proc. ACM SIGMOD, 2002.
21. Xin Zhang and Elke A. Rundensteiner. Integrating the maintenance and synchronization of data warehouses using a cooperative framework. Information Systems, 27:219–243, 2002.
22. The CDDB database. http://www.freedb.org.
A Framework for Management of Concurrent XML Markup Alex Dekhtyar and Ionut E. Iacob Department of Computer Science University of Kentucky Lexington, KY 40506 {dekhtyar,eiaco0}@cs.uky.edu Abstract. The problem of concurrent markup hierarchies in XML encodings of works of literature has attracted attention of a number of humanities researchers in recent years. The key problem with using concurrent hierarchies to encode documents is that markup in one hierarchy is not necessarily well-formed with respect to the markup in another hierarchy. The proposed solutions to this problem rely on the XML expertise of the editors and their ability to maintain correct DTDs for complex markup languages. In this paper, we approach the problem of maintenance of concurrent XML markup from the Computer Science perspective. We propose a framework that allows the editors to concentrate on the semantic aspects of the encoding, while leaving the burden of maintaining XML documents to the software. The paper describes the formal notion of the concurrent markup languages and the algorithms for automatic maintenance of XML documents with concurrent markup.
1 Introduction
The problem of concurrent markup hierarchies has recently attracted the attention of a number of humanities researchers [13,6,15]. This problem typically manifests itself when a researcher must encode in XML a large document (book, manuscript, printed edition) with a wide variety of features. A concurrent hierarchy is formed by a subset of the elements of the markup language used to encode the document. The elements within a hierarchy have a clear nested structure. When more than one such hierarchy is present in the markup language, the hierarchies are called concurrent. A typical example of concurrent hierarchies is the XML markup used to encode the physical location of text in a printed edition: book, page, physical line, vs. the markup used to encode linguistic information about the text: sentence, phrase, word, letter. The key problem with using concurrent hierarchies to encode documents is that markup in one hierarchy is not necessarily well-formed with respect to the markup in another hierarchy.
This work has been supported in part by NSF ITR grant 0219924. In addition, the work of the second author has been supported in part by NEH grant RZ-20887-02. The manuscript image [1] appearing in this paper was digitized for the Electronic Boethius project by David French and Kevin Kiernan and is used with permission of the British Library Board.
The study of concurrent XML hierarchies for encoding documents is related to the problem of manipulation and integration of XML documents. However, most of the research on XML data integration addresses the problem of integrating heterogeneous, mostly data-centric XML provided by various applications ([4,9,10,8]). In our case, the data to be integrated has a common denominator: the document content, and the XML encodings are document-centric. Also, the features of the document to be marked up are not (in most cases) heterogeneous, but they might be conflicting in some instances. Management of concurrent markup has been approached in a few different ways. The Text Encoding Initiative (TEI) Guidelines [13] suggest a number of solutions based on the use of milestone elements (empty XML elements) or fragmentation of the XML encoding. Durusau and O’Donnell [6] propose a different approach. They construct an explicit DTD for each hierarchy present in the markup. Then they determine the ”least common denominator” in the markup — the units of content inside which no overlap occurs, in their case, words. They associate attributes indicating the XPath expression leading to the content of each word element for each hierarchy. Other scholars have proposed the use of non-XML markup languages that allow concurrent hierarchies [7]. In their attempts to resolve the problem of concurrent hierarchies, both [13] and [6] rely on the human editor to (i) introduce the appropriate solution to the XML DTD/XSchema, and (ii) follow it in the process of manual encoding of the documents. At the same time, [6] emphasizes the lack of software support for the maintenance of the concurrent markup, which makes, for example, adhering to some of the TEI solutions a strenuous task. While some recent attempts have been made to deal with the problem of concurrent markup from a computer science perspective [14,15], a comprehensive solution has yet to be proposed. This paper attempts to bridge the gap between the apparent necessity for concurrent markup and the lack of software support for it by proposing a framework for the creation, maintenance and querying the concurrent XML markup. This framework relies on the following: – Separate DTDs for hierarchies; – Use of a variant of fragmentation with virtual join suggested by TEI Guidelines [13] to represent full markup; – Automatic maintenance of markup; – Use of a database as XML repository. The ultimate goal of the proposed framework is to free the human editor from the effort of dealing with the validity and well-formedness issues of document encoding and to allow him or her to concentrate on the meaning of the encoding. This goal is achieved in the following way. Durusau and O’Donnell [6] note the simplicity and clarity of DTDs for individual concurrent hierarchies, as opposed to a unified DTD that incorporates all markup elements. Our approach allows the editor to describe a collection of such simple DTDs without having to worry about the need to build and maintain a ”master” DTD. At the same time, existence of concurrent DTDs introduces the need for specialized software to support the editorial process drive it by the semantics of the markup. This
software must allow the editor to indicate the positions in the text where the markup is to be inserted, select the desired markup, and record the results. In this paper we introduce the foundation for such software support. In Section 2 we present a motivating example based on our current project. Section 3 formally defines the notion of a collection of concurrent markup languages. In Section 4 we present three key algorithms for the manipulation of concurrent XML markup. The Merge algorithm builds a single master XML document from several XML encodings of the same text in concurrent markup. The Filter algorithm outputs an XML encoding of the text for an individual markup hierarchy, given the master XML document. The Update algorithm incrementally updates the master XML document given an atomic change in the markup. This paper describes work in progress. A major issue not addressed here is the database support for multiple concurrent hierarchies in our framework. This problem is the subject of ongoing research.
2 Motivating Example
Over the past few years researchers in the humanities have used XML extensively to create readable and searchable electronic editions of a wide variety of literary works [11,12,6]. The work described in this paper originated as an attempt to deal with the problem of concurrent markup in one such endeavor, The ARCHWay Project, a collaborative effort between Humanities scholars and Computer Scientists at the University of Kentucky. This project is designed to produce electronic editions of Old English manuscripts. In this section, we illustrate how concurrent markup occurs in ARCHWay. Building electronic editions of manuscripts. Electronic editions of Old English manuscripts [11,12]combine the text from a manuscript (both the transcript and the emerging edition), encoded in XML using an expressive array of features (XML elements), and a collection of images of the surviving folios of the manuscript. The physical location of text on the surviving folios, linguistic information, condition of the manuscript, visibility of individual characters, paleographic information, and editorial emendations are just some of the features that need to be encoded to produce a comprehensive description of the manuscript. Specific XML elements are associated with each feature of the manuscript. Concurrent hierarchies and conflicts. Most of the features have explicit scopes: the textual content (of the manuscript) that the feature relates to, be it the text of a physical line, or a line of verse or prose, or manuscript text that is missing due to a damage in the folio. Unfortunately, the scopes of different features often overlap, resulting in non-well-formed encoding (we call such a situation a conflict). Consider a fragment of folio 38 verso of British Library Cotton Otho A vi [1] (King Alfred’s Boethius manuscript) shown in Fig.1. The text of the three lines depicted on this fragment is shown in the box marked (0) in Fig.1. The remaining
Fig. 1. A fragment of King Alfred’s Boethius manuscript [1] and different XML encodings
boxes in Fig.1 show the following markup for this fragment: (i) information about physical breakdown of the text into lines ( element); (ii) information about the structure of the text (<w> element encodes words), (iii) information about the damage and text obscured by the damage ( and tags)1 . Some of the encodings of this fragment are in conflict. The solid boxes over parts of the image indicate the scope of the elements and the dotted boxes indicate the scope of the elements. In addition, we indicate the positions of some of the <w> tags. Damage and restoration markup overlaps words in some places: the damaged text includes the end of one word and the beginning of the next word. In addition to that, some words start on one physical line and continue on another. Resolving markup conflicts. The TEI Guidelines [13] suggest a number of possible ways to resolve conflicts. These methods revolve around the use of empty 1
The encodings are simplified. We have removed some attribute values from the markup to highlight the structure of each encoding.
<w>hu <w>iu <w>me <w>hæfst <w>afrefredne <w>ægier <w>ge <w>mid
(a) Milestone elements. ..... <w>æg <w>ier .....
(b) Fragmentation. ..... <w id=”1”>æg <w id =”1”>ier .....
(c) Fragmentation with virtual join (variant with “glue” attribute). Fig. 2. Resolving markup conflicts
milestone tags and the fragmentation of markup. We illustrate the proposed suggestions in Fig.2 on the example of the markup conflict between the <w> and elements at the end of line 22. The first suggested way (Fig.2.(a)) uses milestone (empty) elements. In this case the editor determines the pairs of tags that may be in conflict, and for each such pair declares at least one tag as empty in the DTD/XSchema. The other two ways (Fig.2.(b),(c)) are variants of the fragmentatation technique: one of the conflicting elements is split into two parts by the other one (in Fig.2 we choose to split <w> element). Simple fragmentation, however, may be confusing: encoding in Fig.2.(b) creates the impression that “æg” and “ier” are two separate words. To alleviate this problem, a variety of conventions based on the use of attributes can be proposed to indicate that a specific element encodes a fragment. Fig.2.(c) shows one such convention that uses a “glue” attribute Id. This implied attribute will get the same value for all fragments of the same encoding. Key drawback. The answer lies not only in alleviating the markup conflict problem: a more general problem of maintenance of markup in situations where conflicts are a frequent occurrence must be addressed. Up to this point, such maintenance resided in the hands of human editors who were responsible for specific encoding decisions to prevent markup conflicts. This tended to generate a variety of gimmick solutions in the markup language, such as introduction of tags whose sole purpose was to overcome a specific type of conflict, but which, in the process made the DTD/XSchema of the markup language complex and hard to maintain. Our approach, described in the remainder of this paper allows the software to take over the tasks of markup maintenance, simplifying the work of editors.
3 Concurrent XML Hierarchies
In this section we formally define the notion of the collection of concurrent markup hierarchies. Given a DTD D, we let elements(D) denote the set of all
markup elements defined in D. Similarly, we let elements(d), where d is an XML document, denote the set of all element tags contained in document d.

Definition 1. A concurrent markup hierarchy CMH is a tuple CMH = <S, r, {D1, D2, ..., Dk}> where:
• S is a string representing the document content;
• r is an XML element called the root of the hierarchy;
• Di, i = 1, ..., k, are DTDs such that:
(i) r is defined in each Di, 1 ≤ i ≤ k, and ∀1 ≤ i, j ≤ k with i ≠ j, elements(Di) ∩ elements(Dj) = {r};
(ii) ∀1 ≤ i ≤ k, ∀t ∈ elements(Di), r is an ancestor of t in Di.

In other words, the collection of concurrent markup hierarchies is composed of textual content and a set of DTDs sharing the same root element and no other elements.

Definition 2. Let CMH = <S, r, {D1, D2, ..., Dk}> be a concurrent markup hierarchy. A distributed XML document dd over CMH is a collection of XML documents dd = <d1, d2, ..., dk> where (∀1 ≤ i ≤ k) di is valid w.r.t. Di and content(d1) = content(d2) = ... = content(dk) = S.²

The notion of a distributed XML document allows us to separate conflicting markup into separate documents. However, dd is not an XML document itself; rather, it is a virtual union of the markup contained in d1, ..., dk. Our goal now is to define XML documents that incorporate in their markup exactly the information contained in a distributed XML document. We start by defining a notion of a path to a specific character in content.

Definition 3. Let d be an XML document and let content(d) = S, with S = c1c2...cM. The path to the ith character in d, denoted path(d, i) or path(d, ci), is the sequence of XML elements forming the path from the root of the DOM tree of d to the content element that contains ci. Let D be a DTD such that elements(D) ∩ elements(d) ≠ ∅ and the root of d is a root element in D. Then the path to the ith character in d w.r.t. D, denoted path(d, i, D) or path(d, ci, D), is the subsequence of all elements of path(d, i) that belong to D.

Following XPath notation, we will write path(d, i) and path(d, i, D) in the form a1/a2/.../as. We notice that path(d, i, D) defines the projection of the path to the ith character in d onto a specific DTD. For example, if path(d, i) = col/fol/pline/line/w/dmg and D contains only the elements <col>, <pline>, and <w>, then path(d, i, D) = col/pline/w. We can now use paths to content characters to define "correct" single-document representations of the distributed XML documents.

Definition 4. Let d∗ be an XML document and let D be a DTD, such that elements(d∗) ∩ elements(D) ≠ ∅ and the root of d∗ is a root element in D. Then,
content(doc) denotes the text content of the XML document doc.
the set of filters of d∗ onto D, denoted Filters(d∗, D), is defined as follows:

Filters(d∗, D) = {d | content(d) = content(d∗), elements(d) = elements(d∗) ∩ elements(D), and (∀1 ≤ i ≤ |content(d)|) path(d∗, i, D) = path(d, i)}

Basically, a filter of d∗ on D is any document that contains only elements from D that preserves the paths to each content character w.r.t. D. If we are to combine the encodings of all di of a distributed document dd in a single document d∗, we must make sure that we can "extract" every individual document di from d∗.

Definition 5. Let dd = <d1, d2, ..., dk> be a distributed XML document over the collection of markup hierarchies CMH = <S, r, {D1, ..., Dk}>. The set of mergers of dd, denoted Mergers(dd), is defined as

Mergers(dd) = {d∗ | elements(d∗) ⊆ elements(D1) ∪ ... ∪ elements(Dk) and (∀1 ≤ i ≤ k) di ∈ Filters(d∗, Di)}

Given a distributed XML document dd, we can represent its encoding by constructing a single XML document d∗ from the set Mergers(dd). d∗ incorporates the markup from all documents d1, ..., dk in a way that (theoretically) allows the restoration of each individual document from d∗. A document d∗ ∈ Mergers(dd) is called a minimal merger of dd iff for each content character cj, path(d∗, cj) consists exactly of the elements from all path(di, cj), 1 ≤ i ≤ k.
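The path projection of Definition 3 is straightforward to compute from a DOM-style traversal. The following short Python sketch (not part of the framework itself) reproduces the col/pline/w example on a smaller document.

```python
import xml.etree.ElementTree as ET

def paths_to_characters(xml_string):
    """paths[i] is the list of element tags on the path from the root to
    the ith character of the document content (Definition 3)."""
    root = ET.fromstring(xml_string)
    paths = []

    def walk(elem, ancestors):
        here = ancestors + [elem.tag]
        for _ in (elem.text or ""):
            paths.append(here)
        for child in elem:
            walk(child, here)
            for _ in (child.tail or ""):   # tail text belongs to the parent element
                paths.append(here)

    walk(root, [])
    return paths

def project(path, dtd_elements):
    """path(d, i, D): the subsequence of the path that belongs to DTD D."""
    return [tag for tag in path if tag in dtd_elements]

doc = "<col><pline><line><w>hu</w> <w>iu</w></line></pline></col>"
paths = paths_to_characters(doc)
print("/".join(paths[0]))                                   # col/pline/line/w
print("/".join(project(paths[0], {"col", "pline", "w"})))   # col/pline/w
```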
4 Algorithms
Section 3 specifies the properties that the “right” representations of distributed XML documents (i.e., XML markup in concurrent hierarchies within a single XML document) must have. In this section we provide the algorithms for building such XML documents. In particular, we address the following three problems: – Merge: given a distributed XML document dd, construct a minimal merger d∗ of dd. We will refer to the document constructed by our Merge algorithm as the master XML document for dd. – Filter: given a master XML document for some distributed document dd and one of the concurrent hierarchies Di , construct the document di . – Update: given a distributed XML document dd, its master XML document d∗ and a simple update of the component di of dd, that changes it to di , construct (incrementally) the master XML document d for the distributed document dd =< d1 , . . . , di , . . . , dk >. Fig.3 illustrates the tasks addressed in this section and the relationship between them and the encoding work of editors. In the proposed framework, the
Fig. 3. The framework solution.
editors are responsible for defining the set {D1 , . . . , Dk } of the concurrent hierarchies and for specifying the markup for each component of the distributed document dd. The MERGE algorithm then automatically constructs a single master XML document d∗ , which represents the information encoded in all components of dd. The master XML document can then be used for archival or transfer purposes. When an editor wants to obtain an XML encoding of the content in a specific hierarchy, the Filter algorithm is used to extract the encoding from the master XML document. Finally, we note that MERGE is a global algorithm that builds the master XML document from scratch. If a master XML document has already been constructed, the Update algorithm can be used while the editorial process continues to update incrementally the master XML document given a simple (atomic) change in one of the components of the distributed XML document. Each algorithm is discussed in more detail below. Note that the theorems in this section are given without proofs. The proofs can be found in [5]. 4.1
MERGE Algorithm
The MERGE algorithm takes as input tokenized versions of the component documents d1 , . . . , dk of the distributed document dd and produces as output a single XML document that incorporates all the markup of d1 , . . . , dk . The algorithm resolves the overlap conflicts using the fragmentation with a ”glue” attribute approach described in Section 2. A special attribute link is added to all markup elements that are being split, and the value of this attribute is kept the same for all markup fragments. The algorithm uses the Simple API for XML (SAX)[3] for generating tokens. SAX callbacks return three different types of token strings: (i) start tag token string (ST), (ii) content token string (CT), (iii) end tag token string (ET). If token is the token returned by the SAX parser, then we use type(token) to denote its type (ST, CT, ET) as described above and tag(token) to denote the tag returned by SAX (for ST and ET tokens). The MERGE algorithm works in two passes. On the first pass, the input documents are parsed in parallel and an ordered list is built of ST and ET tokens for the creation of the master XML document. The second pass of the
algorithm scans the token list data structure built during the first pass and outputs the text of the master XML document. The main data structure in the MERGE algorithm is tokenListSet, which is designed to store all necessary markup information for the master XML document. Generally speaking, tokenListSet is an array of token lists. Each array position corresponds to a position in the content string of the input XML documents. In reality, only the positions at which at least one input document has ST or ET tokens have to be instantiated. For each position i, tokenListSet[i] denotes the ordered list of markup entries at this position. At the end of the first pass of the MERGE algorithm, for each i, tokenListSet[i] will contain the markup elements to be inserted in front of ith character of the content string in the master XML document exactly in the order they are to be inserted. The second pass of the MERGE algorithm is a straightforward traversal of tokenListSet, which for each position outputs all the tokens and then the content character. Fig.4 contains the pseudocode for the MERGE algorithm. The algorithm iterates through the positions in the content string of the input documents. For each position i, the algorithm first collects all ET and ST tokens found at this position. It then determines the correct order in which the tokens must be inserted in the master XML document, and resolves any overlaps by inserting appropriate end tag and start tag tokens at position i and adding the link attribute to the start tag tokens. In the algorithm push(Token,List) and append(Token,List) add Token at the beginning and at the end of List respectively. Theorem 1. Let dd =< d1 , . . . , dk > be a distributed XML document. Let d∗ be the output of MERGE(d1 , . . . , dk ). Then d∗ is a minimal merger of dd. 4.2
FILTER Algorithm
The FILTER algorithm takes as input an XML document d∗ produced by the MERGE algorithm and a DTD D, filters out all markup elements in d∗ that are not in D, and merges the fragmented markup. In one pass the algorithm analyzes the ordered sequence of tokens provided by a SAX parser and performs the following operations:
– removes all ST and ET tokens of markup elements not in D;
– from a sequence ST, [CT], ET, [CT], ..., ST, [CT], ET of tokens for a fragmented element in D, removes the "glue" attributes and outputs the first ST token, all possible intermediate CT tokens, and the last ET token in the sequence;
– all other tokens are output without change, in the same order they are received from the SAX parser.
The pseudo-code for FILTER appears in Fig. 5. The following theorem states that FILTER correctly reverses the work of the MERGE algorithm.
Theorem 2. Let dd = <d1, ..., dk> be a distributed XML document, and d∗ be the output of MERGE(dd). Then (∀1 ≤ i ≤ k) FILTER(d∗, Di) = di.
Fig. 4. The MERGE algorithm
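As a deliberately simplified stand-in for the pseudocode of Fig. 4, the following Python sketch produces output in the format MERGE targets: overlaps are resolved by fragmentation, and fragments of one logical element share a glue attribute (written id here). Unlike the real algorithm, it closes and reopens every open element at each span boundary, so it over-fragments, and it omits the shared root element r.

```python
import itertools

def toy_merge(content, hierarchies):
    """Merge element spans from concurrent hierarchies over one content
    string into a single well-formed XML string.  `hierarchies` is a list
    of lists of (start, end, tag) spans (end exclusive); spans within one
    hierarchy are assumed properly nested.  Fragments of the same logical
    element share a glue id so they can be rejoined later."""
    glue = itertools.count(1)
    spans = [(s, e, tag, next(glue))
             for h in hierarchies for (s, e, tag) in h]
    cuts = sorted({0, len(content)}
                  | {s for s, _, _, _ in spans}
                  | {e for _, e, _, _ in spans})
    out = []
    for left, right in zip(cuts, cuts[1:]):
        # elements whose span covers the segment [left, right), longest first
        covering = sorted((sp for sp in spans
                           if sp[0] <= left and right <= sp[1]),
                          key=lambda sp: sp[0] - sp[1])
        for _, _, tag, g in covering:
            out.append('<%s id="%d">' % (tag, g))
        out.append(content[left:right])
        for _, _, tag, _ in reversed(covering):
            out.append('</%s>' % tag)
    return "".join(out)

text = "hu iu"
lines = [(0, 5, "line")]                 # one physical line
words = [(0, 2, "w"), (3, 5, "w")]       # two words
print(toy_merge(text, [lines, words]))
# <line id="1"><w id="2">hu</w></line><line id="1"> </line><line id="1"><w id="3">iu</w></line>
```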
Fig. 5. The FILTER and UPDATE algorithms
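Similarly, only the FILTER half of Fig. 5 is sketched below: a SAX handler that drops tokens of elements outside the chosen hierarchy and rejoins adjacent fragments carrying the same glue id. The dummy root wrapper and the attribute name id are assumptions of this toy version, not part of the algorithm.

```python
import xml.sax

class ToyFilter(xml.sax.ContentHandler):
    """Keep only tags in `keep`; adjacent fragments of the same element
    that carry the same glue id are rejoined by suppressing the seam."""
    def __init__(self, keep):
        super().__init__()
        self.keep = keep
        self.out = []
        self.stack = []      # (tag, glue) of currently open kept elements
        self.pending = None  # fragment just closed that may still continue

    def _flush(self):
        if self.pending is not None:
            self.out.append("</%s>" % self.pending[0])
            self.pending = None

    def startElement(self, name, attrs):
        if name not in self.keep:
            return
        key = (name, attrs.get("id"))
        if key == self.pending and key[1] is not None:
            self.stack.append(key)   # same fragment continues: emit nothing
            self.pending = None
            return
        self._flush()
        self.stack.append(key)
        self.out.append("<%s>" % name)

    def endElement(self, name):
        if name in self.keep:
            self._flush()
            self.pending = self.stack.pop()

    def characters(self, data):
        self._flush()                # content between tags ends any fragment
        self.out.append(data)

    def endDocument(self):
        self._flush()

def toy_filter(master, keep):
    handler = ToyFilter(keep)
    xml.sax.parseString(("<root>%s</root>" % master).encode("utf-8"), handler)
    return "".join(handler.out)

master = ('<line id="1"><w id="2">hu</w></line>'
          '<line id="1"> </line><line id="1"><w id="3">iu</w></line>')
print(toy_filter(master, {"line"}))   # <line>hu iu</line>
print(toy_filter(master, {"w"}))      # <w>hu</w> <w>iu</w>
```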
Fig. 6. The XML document tree model used in the UPDATE algorithm
4.3 UPDATE Algorithm
The UPDATE algorithm updates the master XML document (see Fig.3) with the new markup element. It takes as the input two integers, f rom and to, the starting and ending positions for the markup in the content string and the new markup element, T AG. Due to possible need to fragment the new markup this process requires some care. The goal of the algorithm is to introduce the new markup into the master XML document in a way that minimizes the number of new fragments. The algorithm uses the DOM model [2] for the XML document and performs the insertion of the node in the XML document tree model. In this model, for an element with mixed content, the text is always a leaf. Then f rom and to will be positions in some leaves of the document tree. Let F ROM and T O be the parent nodes of the text leaves containing positions f rom and to respectively. We denote by LCA the lowest common ancestor of nodes F ROM and T O. Let AF ROM be child of LCA that is the ancestor of F ROM , and let AT O be the child of LCA that is the ancestor of T O (see Fig.6). The UPDATE algorithm traverses the path F ROM → . . . → AF ROM → LCA → AT O → . . . → T O and inserts T AG nodes with glue attributes as needed. The pseudo-code description of the algorithm is shown in Fig.5. The following theorem says that the result of UPDATE allows for correct recovery of components of the distributed document. Theorem 3. Let dd =< d1 , . . . , dk > be a distributed XML document and d∗ be the output of MERGE(dd). Let T AG ∈ elements(Di ), (f rom, to, T AG) be an update request and di be a well-formed result of marking up the content between f rom and to positions. Then, FILTER(UPDATE(d∗ , (f rom, to, T AG)), Di ) = di .
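The sketch below illustrates only the fragmentation step of UPDATE, not the DOM traversal from FROM through LCA to TO: it is handed the positions at which the surrounding markup forces a split and stamps every fragment of the new element with the same glue id, reproducing the æg/ier split of Fig. 2(c).

```python
import itertools

_glue = itertools.count(1)

def toy_update(text, split_points, frm, to, tag):
    """Wrap text[frm:to] in <tag>, splitting at any split point that falls
    strictly inside the range and stamping all pieces with one glue id."""
    g = next(_glue)
    cuts = [p for p in sorted(split_points) if frm < p < to]
    starts = [frm] + cuts
    ends = cuts + [to]
    return ['<%s id="%d">%s</%s>' % (tag, g, text[s:e], tag)
            for s, e in zip(starts, ends)]

# the word "ægier" crosses a line break after its second character
print(toy_update("ægier", split_points=[2], frm=0, to=5, tag="w"))
# ['<w id="1">æg</w>', '<w id="1">ier</w>']
```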
5 Future Work
This paper introduces the general framework for managing concurrent XML markup hierarchies. There are three directions in which we are continuing this research. First, we are working on providing the database support for the maintenance of concurrent hierarchies. Second, we are studying the properties of the proposed algorithms w.r.t. the size of the markup generated, optimality of the
markup and computational complexity, and efficient implementation of the algorithms. Finally, we are planning a comprehensive comparison study of a variety of methods for support of concurrent hierarchies.
References 1. British Library MS Cotton Otho A. vi, fol. 38v. 2. Document Object Model (DOM) Level 2 Core Specification. http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/, Nov 2000. W3C Recommendation. 3. Simple API for XML (SAX) 2.0.1. http://www.saxproject.org, Jan 2002. SourceForge project. 4. Serge Abiteboul, Jason McHugh, Michael Rys, Vasilis Vassalos, and Janet L. Wiener. Incremental maintenance for materialized views over semistructured data. In Proc. 24th Int. Conf. Very Large Data Bases, VLDB, pages 38–49, 24–27 1998. 5. Alex Dekhtyar and Ionut E. Iacob. A framework for management of concurrent XML markup. Technical Report TR 374-03, University of Kentucky, Department of Computer Science, June 2003. http://www.cs.uky.edu/∼dekhtyar/publications/TR374-03.concurrent.ps. 6. P. Durusau and M. B. O’Donnell. Concurrent Markup for XML Documents. In Proc. XML Europe, May 2002. 7. C. Huitfeldt and C. M. Sperberg-McQueen. TexMECS: An experimental markup meta-language for complex documents. http://www.hit.uib.no/claus/mlcd/papers/texmecs.html, February 2001. 8. Ioana Manolescu, Daniela Florescu, and Donald Kossmann Kossmann. Answering XML queries over heterogeneous data sources. pages 241–250. 9. Wolfgang May. Integration of XML data in XPathLog. In DIWeb, pages 2–16, 2001. 10. Wolfgang May. Lopix: A system for XML data integration and manipulation. In The VLDB Journal, pages 707–708, 2001. 11. W.B. Seales, J. Griffioen, K. Kiernan, C. J. Yuan, and L. Cantara. The Digital Atheneum: New Technologies for Restoring and Preserving Old Documents. Computers in Libraries, 20(2):26–30, February 2000. 12. E. Solopova. Encoding a transcript of the beowulf manuscript in sgml. In Proc. ACH/ALCC, 1999. 13. C. M. Sperberg-McQueen and L. Burnard(Eds.). Guidelines for Text Encoding and Interchange (P4). http://www.tei-c.org/P4X/index.html, 2001. The TEI Consortium. 14. C. M. Sperberg-McQueen and C. Huitfeldt. GODDAG: A Data Structure for Overlapping Hierarchies, Sept. 2000. Early draft presented at the ACH-ALLC Conference in Charlottesville, June 1999. 15. A. Witt. Meaning and interpretation of concurrent markup. In Proc., Joint Conference of the ALLC and ACH, pages 145–147, 2002.
Object Oriented XML Query by Example Kathy Bohrer, Xuan Liu, Sean McLaughlin, Edith Schonberg, and Moninder Singh {bohrer,xuanliu,ediths,moninder}@us.ibm.com [email protected]
Abstract. This paper describes an XML query language called XML-QBE, which can be used to both query and update XML documents and databases. The language itself has a simple XML form, and uses a query by example paradigm. This language was designed as a middleware layer between UML data models and backend database schemas, as part of a solution to the distributed, heterogeneous data-base problem and legacy database problem. Because the XML layer is derived from UML, XML-QBE is object-oriented. Queries and updates have a very similar form, and the form itself is XML. Therefore this language is also easy to process and analyze. We describe the language, the rationale, and our solution architecture.
1 Introduction The use of XML is now pervasive. At the application level, XML has become the common medium for information exchange. At the middleware level, XML is being incorporated into standard protocols such as SOAP. At the system level, the convenient, self-describing format of XML makes it ideal for the persistent storage of both structured and unstructured data. Consequently, XML query languages are being designed and standardized for managing the growing bodies of XML documents and repositories. In this paper, we present an XML query language, which we call XML-QBE. XML-QBE is itself an XML language, in addition to being a query language for XML. It uses a “query by example” paradigm, which means that the queries look like the data. Typically, query by example languages are very intuitive and easy to use. We designed XML-QBE as part of a solution to integrate distributed, heterogeneous, and legacy databases. In our solution architecture, a unifying XML data model, which is an XML Schema, is defined to integrate the data models across multiple backend databases, thus hiding the details of the potentially heterogeneous and inconsistent backends (see Figure 1). A database backend can be an XML database, a repository of XML documents, a traditional relational or LDAP database, etc. The XML schema itself is derived automatically from a UML data model, using an XMI-to-XML Schema translation tool, written in XSL. For each backend, we provide a mapping table, which specifies how the XML schema is mapped to the target backend schema. Each mapping table is used to translate XML-QBE queries into the appropriate target query language, such as SQL for relational backends, LDAP query language, or XQuery for XML document repositories. M.A. Jeusfeld and Ó. Pastor (Eds.): ER 2003 Workshops, LNCS 2814, pp. 323–329, 2003. © Springer-Verlag Berlin Heidelberg 2003
The design goals for XML-QBE were the following:
– Expressiveness – we are able to express complex queries which retrieve sets of related objects, which are either nested or related through object reference.
– Easy to analyze – queries themselves are XML data.
– Semantics reflect the UML data model, not the target – the user is not aware of an underlying relational implementation, for example, or semantics of SQL.
– Simple and declarative syntax – the query-by-example paradigm was chosen for this reason.
– Uniform syntax for all operations – operations include query, create, delete, and modify.
– Well-defined syntax and semantics – syntax is expressible by an XML Schema.
– Generality – applicable to all UML data models.
– Support for multiple targets – queries can be mapped to a variety of backends.
XM L S chem a
Ma ppin g Tab les X M L S ch em a to B ac ken d Targe t (S Q L , XQ u ery, LD AP ...)
B ackend D atabase Schem a Backend D atabase S chem a
Backend D atabase S chem a
Fig. 1. 1: Solution Overview Figure So lution O verview
2 XML-QBE Features and Examples XML-QBE queries are operations over a collection of objects represented by XML elements. The XML-QBE data models are object-oriented, and they correspond to UML data models (see Figure 1). Objects have properties which are specified in the corresponding UML model. Similarly, objects are related to each other according to the associations defined in the UML model. A sample XML representation for a set of objects describing a personal profile is shown in the Appendix. The conventions used by our XML object representation include the following: 1. All objects have an “oid” property, which is the primary key of the object class. 2. Objects may be nested. For example, the person object with oid PERS1. includes the personDemographics object PDEM1, the occupation object OCCU1, the hobby object HOBB1, and the nationality object NAT1 nested within it.
3. A property whose name ends with “Id” is a reference to another object. For example, the property defaultNameId in person references the personName object PNAM1. (In the corresponding UML data model, there is an association defaultName between the Person class and the PersonName class). 4. Multi-valued properties are always grouped together under an element with a name ending in Group. For example, partyActivityGroup in the person object is the parent of all activities (including hobbies and occupations) of the person. The remaining subsections give an overview of the XML-QBE language. 2.1 Single Object Queries The simplest queries request objects of the same class. To indicate which objects to return, the values of properties are specified. To indicate which object properties to return, property names are specified without any values. If more than one property is specified, then any result objects returned must match all properties specified in the query. The following query returns properties of all personName objects with lastName “Bingh”. Specifically, the use, fullName, firstName, lastName, and title are returned. (The oid of the object is also always returned.) <use/> Bingh
The result of this query when applied to the document specified in the Appendix is: PNAM1 <use>LEGAL Cherry Bingh Cherry Bingh Mrs.
An element in a query may use the attribute “return” to specify more precisely which properties to return. Attribute “return” can have the value “all” (return all properties), “none” (do not return this object), or “specified” (return only those properties which are present in the query). If the “return” attribute is not used, then the default is “specified”. The “return” attribute can apply to property elements within an object. If an object has return value “all”, and a property element of the object has return value “none”, then this property is not returned, while all of the other properties of the object are returned. However, the converse is not true. If an object has return value “none”, and a property element of the object has return value “all”, then no property of the object is returned.
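To make the selection and return rules concrete, the toy evaluator below works on dict-shaped stand-ins for the XML objects. The function and its "@return" key are illustrative only; they are not part of XML-QBE or of its implementation.

```python
def qbe_match(template, obj):
    """Toy query-by-example evaluation over a single flat object.
    In `template`, a property mapped to a value is a selection condition;
    a property mapped to None is requested without a condition.  The
    "@return" key mimics the return attribute: "all", "none" or
    "specified" (the default)."""
    mode = template.get("@return", "specified")
    for prop, want in template.items():
        if prop != "@return" and want is not None and obj.get(prop) != want:
            return None                     # a condition failed: no match
    if mode == "none":
        return None                         # matched, but not returned
    result = {"oid": obj.get("oid")}        # the oid is always returned
    if mode == "all":
        result.update(obj)
    else:                                   # "specified": echo query properties
        for prop in template:
            if prop != "@return":
                result[prop] = obj.get(prop)
    return result

person_name = {"oid": "PNAM1", "use": "LEGAL", "fullName": "Cherry Bingh",
               "firstName": "Cherry", "lastName": "Bingh", "title": "Mrs."}
query = {"use": None, "fullName": None, "firstName": None,
         "lastName": "Bingh", "title": None}
print(qbe_match(query, person_name))
```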
2.2 Nested Object Queries Objects in our data model can be nested. We describe how the rules defined in section 2.1 extend to nested objects. To indicate which objects to return, the values of properties are specified. For nested objects, values may be specified for properties at any level of nesting. Specifying the value of an inner property means return all objects which contain an embedded object with the specified property value. To indicate which object properties to return, property element names are specified without any values. Similarly, embedded element names are specified without any values to indicate which embedded objects to return. In this case, all properties and embedded objects of the nested object are returned. Embedded element names also can have a return attribute, with values “all”, “none”, or “specified”. The following query returns the person objects with birthdate 3/1/1965. It also returns the personDemographics and the nationalityGroup objects embedded in these person objects. Note that the default return value for person and personDemographics is “specified”, since this element has embedded property elements. The default return value for nationalityGroup is “all”, since there are no properties specified for this nested element. 1965-03-01
2.3 Related Object Queries Often, it is necessary to request an object and the other objects which are related to it. For example, consider the query for retrieving all person and personName objects for all persons with defaultName “Cherry Bingh”. One way to write this query is shown below. It uses the oid value PNAM1 of the personName object for Cherry Bingh. <defaultNameId>PNAM1 PNAM1
However, usually the oid of an object requested is not known. Therefore, for querying related objects, we provide attributes for symbolically naming elements and referencing related elements using these symbolic names. The attribute “link” is used to symbolically name any property element. The attribute “linkref” can be used with any property element to reference another symbolically named property element. If the link attribute value of a property and the linkref attribute value of another property are equal, then their property values must be equal in the query result. The value of a link attribute must be unique within a query. The following query returns the birthDate of all persons with fullName “Cherry Bingh”. Since birthDate and fullName are properties of different objects, the query
requests both objects. The objects in the query are linked using the link attribute in the oid property of person and linkref attribute in realPartyId attribute of personName. Cherry Bingh
The result of this query replaces the symbolic link “pers” with the real oid PERS1: PERS1 PDEM1 1965-03-01 PNAM1 Cherry Bingh PERS1
The examples in this section illustrate the simplicity and expressiveness of the XML-QBE language. The queries are simple templates, which are able to specify selection and linking across objects related by nesting and association. Other query features of XML-QBE not described here include the ability to select objects based on property expressions (operators: or, and, lt, le, gt, ge, eq, ne, exists, like), the ability to query multiple unrelated objects, and more complex queries on related objects, which require exists, disjunction and union operations.
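The link/linkref mechanism amounts to an equality constraint between the symbolically named properties. The toy sketch below pairs person and personName objects the way the link="pers"/linkref="pers" example does; it again uses dict-shaped stand-ins and is not the system's implementation.

```python
def resolve_link(persons, person_names):
    """Pair every person with the personName objects whose realPartyId
    equals the person's oid, mimicking link on person.oid and linkref on
    personName.realPartyId."""
    pairs = []
    for p in persons:
        for n in person_names:
            if n.get("realPartyId") == p.get("oid"):
                pairs.append((p, n))
    return pairs

persons = [{"oid": "PERS1", "birthDate": "1965-03-01"}]
names = [{"oid": "PNAM1", "fullName": "Cherry Bingh", "realPartyId": "PERS1"}]
for p, n in resolve_link(persons, names):
    print(p["oid"], p["birthDate"], n["fullName"])
```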
3 Related Work and Conclusion XQuery [7, 8] from the W3C consortium is a functional, expression-based language for querying XML data and documents as well as other kinds of structured data for which XML views can be defined, such as relational data. XQuery is a very rich language that can be used to create arbitrarily complex queries with complete generality. Two systems, XPERANTO [6] and SilkRoute [3] leverage such query languages to provide efficient, full-purpose and general XML query capabilities over relational database systems. While XPERANTO is based on the newer XQuery language, SilkRoute is based on one of its precursors, namely XML-QL [9]. In both cases, however, the focus is on providing frameworks for efficient processing of arbitrarily complex queries on XML views of relational data that are specified using the query language of the system, and not on developing a scheme for efficient
querying by example. XQuery is a procedural language, and the syntax of XQuery is not XML. XQuery can be a target of XML-QBE. Several commercial systems, such as IBM DB2 XML Extender [4] and Oracle XML DB [5], provide the ability to store, manage and access XML as well as relational data by allowing the composition/decomposition of XML documents from/to a relational database system. While these systems are feature-rich and provide a lot of functionality in handling transformations between XML and relational data, their query functionality is extremely limited. This is primarily due to the fact that the focus of these systems is to allow the composition and decomposition of XML documents to/from relational data; not to query relational data per-se. For example, the IBM DB2 Extender allows only the query that is defined via the mapping file (data access definition file) which only allows the entire view to be recovered, not just any part that is the result of a query in question. The only way to allow such queries would be to dynamically update the mapping file with new SQL statements for each query that was to be executed, a cumbersome task requiring SQL knowledge. Zhang et al. [10] describe methods for doing queries-by-example in the context of performing simple queries for data in XML documents; they do not address the issue of performing complex queries over XML views of relational data. We have presented a new XML query language, called XML-QBE, which we believe is useful for retrieving information from both XML documents and backend databases. Its simple query by example form makes it easy to use as well as to process and analyze. The underlying data model is object-oriented, derived from UML. Our implementation is table-driven, based on a description of a backend database schema. Thus we have begun to address the problem of heterogeneous and legacy backends. More work needs to be done in this direction in order to better handle distributed databases with possibly inconsistent schemas.
References 1.
Bohrer, Liu, Kesdogan, Schonberg, Singh, Spraragen, “Personal Information Management and Distribution”, 4th International Conference on Electronic Commerce Research, Nov. 2001. 2. Bohrer, Kesdogan, Liu, Podlaseck, Schonberg, Singh, Spraragen. “How to go Window th Shopping on the World Wide Web without Violating the User’s Privacy”, 4 International Conference on Electronic Commerce Research, Nov. 2001. 3. M. Fernandez. W. Tan, D. Suciu (2000). “SilkRoute: Trading Between Relations and XML”. Proceedings of the 9th International World Wide Web Conference. 4. “IBM DB2 XML Extender”. http://www-3.ibm.com/software/data/db2/extenders/xmlext/ 5. “Oracle XML DB”. http://www.oracle.com/ip/index.html?xmldbcm_intro.html 6. J. Shanmugasundaram, J. Kiernan, E. Shekita, C. Fan and J. Funderburk (2001). “Querying XML Views of Relational Data”. Proceedings of the 27th VLDB conference. 7. “XQuery 1.0: An XML Query Language”. W3C Working Draft November 2002. http://www.w3.org/TR/xquery/. 8. “XML Query”. http://www.w3.org/XML/Query. 9. “XML-QL: A Query Language for XML”. W3C Submission, August 1998. http://www.w3.org/TR/NOTE-xml-ql/. 10. S. Zhang, J. Wang and K. Herbert (2002). “XML Query by Example”. International Journal of Computational Intelligence and applications. 2(3), 329–337.
Appendix
Automatic Generation of XML from Relations: The Nested Relation Approach Antonio Badia Computer Engineering and Computer Science department University of Louisville [email protected]
Abstract. We propose a method to generate XML documents from relational databases by using nested relational algebra. Starting with a specification of the structure of the XML document, and a description of the database schema, we give an algorithm to build automatically a nested algebra query that will generate a nested view from the relational database; this view is isomorphic to the desired XML document, that can easily be generated from it. We discuss limitations of (and extensions to) the framework.
1 Introduction
As the relational and object-oriented technologies are extended to capture XML documents, structured and semistructured data are found side by side in database applications [1]. Hence, a natural line of research is translating between the two environments, that is, producing relational data out of XML documents, and XML documents out of relational data. There has been considerable research on both aspects of the translation process [5,6,7,8,10]. Proposals to create database schemas from XML documents have received special attention, since such procedures can be used to store XML data in relational databases [2,4]. Proposals to create XML documents from relational data, on the other hand, have been less researched [8,10]. Such proposals usually include two components: a specification of the target XML document, and a description of the source relational data, in the form of a view or a SQL query over the relational database. In this paper, we show how it is possible, from the description of the target XML document, and information about the relational database schema (including integrity constraints), to generate the required view or query automatically. However, since XML data is hierarchical and relational data is flat, there is a gap that must be closed somehow. We propose the use of a nested relational algebra as a natural bridge between the two data models (the use of nested algebra, with different purposes, is explored in [4,8]). We give a procedure for generating a Nest-Project-Join (NPJ) algebra expression that generates an XML document with a given structure from a relational database. We investigate some issues involved in the process, like the loss of data from non-matched tuples in a join. ´ Pastor (Eds.): ER 2003 Workshops, LNCS 2814, pp. 330–341, 2003. M.A. Jeusfeld and O. c Springer-Verlag Berlin Heidelberg 2003
Section 2 gives some background in XML and nested relational algebra (to make the paper self-contained). For lack of space, we do not mention related research except that which is directly related to our approach. Section 3 explains the proposed translation, giving the algorithm (in Subsection 3.1) and some examples (in Subsection 3.2). Section 4 lists some issues that come up with this approach and provides solutions to them. Finally, Section 5 gives some conclusions and directions for further research.
2 Background
Because of space constraints, we assume the reader is familiar with the basics of XML and of the (flat) relational algebra and concentrate, in this section, on introducing the ideas of the nested relational data model and algebra. The nested relational model is obtained by getting rid of the first normal form assumption, which states that tuple elements are atomic (without parts) values. By allowing set-based values, we can have relations as members of a tuple.

Formally, let U = {A1, ..., An} be a finite set of attributes. A schema over U (and its depth) is defined recursively as follows (the following is taken from [11]):

1. If A1, ..., An are atomic attributes from U, then R = (A1, ..., An) is a (flat) schema over U with the name R. depth(R) = 0.
2. If A1, ..., An are atomic attributes from U and R1, ..., Rm are distinct names of schemas with sets of attributes (denoted by attr(R1), ..., attr(Rm)) such that {A1, ..., An} and {attr(R1), ..., attr(Rm)} are pairwise disjoint, then R = (A1, ..., An, R1, ..., Rm) is a (nested) schema with the name R, and R1, ..., Rm are called subschemas. depth(R) = 1 + max{depth(R1), ..., depth(Rm)}.

Clearly, we can see a nested schema as a tree, with the depth giving us the height of the tree. If the depth of a nested schema is 1, all of the subschemas in R are flat schemas, i.e., we have the traditional (flat) relational model. Let R denote a schema over a finite set U of attributes. The domain of R, denoted by DOM(R), is defined recursively as follows¹:

1. If R = (A1, ..., An), where Ai (1 ≤ i ≤ n) are atomic attributes, then DOM(R) = DOM(A1) × ... × DOM(An), where "×" denotes Cartesian product.
2. If R = (A1, ..., An, R1, ..., Rm), where Ai (1 ≤ i ≤ n) are atomic attributes and Rj (1 ≤ j ≤ m) are subschemas nested into R, then DOM(R) = DOM(A1) × ... × DOM(An) × 2^DOM(R1) × ... × 2^DOM(Rm), where "×" denotes Cartesian product and 2^DOM(Rj) denotes the power set of the set DOM(Rj) (1 ≤ j ≤ m).
As is customary, we assume that any atomic attribute A has associated a non empty set DOM (A).
A nested tuple over R is an element of DOM(R). A nested relation r over R is a finite set of nested tuples over R. We say that sch(r) = R. The nested relational algebra, like the model, can be seen as an extension of the (flat) relational algebra. Thus, the operations of selection, projection, Cartesian product and (for schema-compatible relations) union and difference can be defined in a similar manner to the flat case. On top of that, the algebra has two additional operators: nest and unnest. The nest operator ν is usually defined as νL(r), where L ⊆ sch(r) is a list of attributes; the intended semantics of the operator is to change the structure of the input relation so that all tuples that coincide in their L values constitute a single tuple. Formally, let N1 and N2 be subsets of sch(r). Then the nest of r with respect to N1, νN1(r), is defined as follows:
νN1(r) := {t | (∃w ∈ r) t[N1] = w[N1] ∧ t[N2] = {u[N2] | u ∈ r ∧ u[N1] = t[N1]}}
We will say that we nest on N2 and group by N1. We point out that it is customary to write the nesting operator using N2 as the explicit argument, i.e., νN2(r). We mention the grouping attributes explicitly because it will simplify the expressions that we need to write. The unnest operator µ can be seen as the opposite of the nest operator (i.e., it has the property that, for any relation r and L ⊆ sch(r), µL(νL(r)) = r). We will not further specify this operator since we do not use it here.
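To make the nest operator concrete, the following small Python sketch (ours; the paper defines ν only algebraically) implements ν over a flat relation represented as a list of dictionaries, grouping by N1 and collecting the remaining attributes into a nested relation. The sample data mirror the department/course example used later in the paper.

def nest(relation, group_by):
    # nu_{group_by}(relation): group tuples by the attributes in group_by (N1)
    # and collect the remaining attributes (N2) of matching tuples into a set.
    groups = {}
    for t in relation:
        key = tuple((a, t[a]) for a in group_by)
        rest = tuple(sorted((a, v) for a, v in t.items() if a not in group_by))
        groups.setdefault(key, set()).add(rest)
    return [dict(key) | {"nested": nested} for key, nested in groups.items()]

r = [
    {"iddept": 1, "deptname": "Computer Science", "coursename": "Database Systems"},
    {"iddept": 1, "deptname": "Computer Science", "coursename": "Compilers"},
    {"iddept": 2, "deptname": "Philosophy", "coursename": "Philosophy I"},
]
for row in nest(r, ["iddept", "deptname"]):
    print(row)   # one tuple per department, with its courses nested inside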
3 The Method
In [10], a method is proposed to create XML documents starting from relational data sources. In this method, a mapping from relational data to XML is specified in the XML/SQL language, which describes the structure of the target document in XML and adds a part specifying how to extract the needed information from the relational database; this part is an SQL query. Unfortunately, because the relational model is flat and XML is hierarchical, a certain impedance mismatch must be bridged. The authors of [10] attempt to bridge it by stipulating, in the SQL query, how the data must be massaged to be transformed. For instance, for groupings, an ORDER BY clause is used, since one can then iterate over the resulting evaluation and produce the desired nested result. However, this mixes implementation details with specification details. In this paper, we propose a method to automate the SQL extraction from the relational database by generating a query in relational algebra from the XML specification of the target. To bridge the data model mismatch, we use nested relational algebra instead of the traditional (flat) algebra. Nested algebra provides a formal framework in which to state and investigate the problem, making it possible to consider questions like expressivity and optimization without committing to a particular system or implementation. The idea of using nested relations to represent and manipulate XML data has already been proposed in [4,8]. The work of [4] extends the nested data model with structural variants, which allow irregular semistructured data to be represented
Fig. 1. CONSTRUCTOR Clause (nested SEQUENCE constructors with tagname "Department" and, inside it, tagname "Course"; the full XML listing of the clause could not be recovered)
within a single relation. [4] also defines QSByE (Querying Semistructured Data by Example), which extends a version of nested algebra to their data model. However, [4] does not discuss using this paradigm to extract XML data from relational data. The work of [8] is to find any nesting that satisfies some general requirement; it proposes to generate all possible nesting from a given (flat) relation, and choose some of them based on heuristics (like avoiding any flattening by the keys, since that leaves a relation unchanged); there is no target XML used. Here, we use nested algebra to generate a restructured relation for a particular target. There are other approaches to generating XML from relational data; the SilkRoute approach ([3]) uses a language called RXL to define an XML view over the relational database; to process queries, XML-QL is used as input, and SilkRoute translates such queries to SQL over the relational database. In [9], a system is described that publishes relational data in XML format by evaluating queries in XQuery. We point out that these approaches take queries as input, while we take a description of the structure of the desired XML document. Our solution is conceived as an extension of the research in [10], and we will use the same examples as [10], to contrast our approach with the one established there. A relational database with schema Professor(IdProf,ProfName,IdDept), Department(IdDept,DeptName,Acr), Course(IdCourse,IdDept,CourseName), and ProfCourse(IdProf,IdCourse) is used as the source of data. The target is an XML document, which is declared in the XML part of XML/SQL in a CONSTRUCTOR clause; Figure 1 is an example of such clause. This clause, applied to the source database described, generates a target document that looks like the one in Figure 2. Several things are to be noted about the CONSTRUCTOR clause. First, the clause uses some particular tags, CONSTRUCT, LIST and ATOM, to indicate the structure of the resulting document. The names of the tags to be used in the XML document are indicated in the tagname attribute. Thus, this is not an XML Schema, but a meta-description from which to generate the document. Second, besides indicating the structure desired, the clause indicates how to obtain it, by indicating how data needs to be nested (with attribute nestby, that indicates an attribute in the source database), and how atoms (simple attributes) relate
Fig. 2. Example of XML Document (a Department entry for department 1, Computer Science, with Courses Database Systems and Compilers, and one for department 2, Philosophy, with Course Philosophy I; the markup itself could not be recovered)
<SQL idsql="v2">
SELECT d.iddept, d.deptname, c.idcourse, c.coursename
FROM Department d, Course c
WHERE d.iddept = c.iddept
ORDER BY d.iddept
Fig. 3. SQL query
to the source data (with attribute source, that also indicates an attribute in the relational database). More importantly, the above clause needs a view, defined in SQL, from where to extract the information. Note that the attributes required come from more than one relation and need to be nested. The query used in the SQL part of XML/SQL for the example is shown in Figure 3. The query specifically joins the tables needed to provide the information, and projects out the required attributes. The ORDER BY clause is used to prepare the data for the nesting needed by the XML data; obviously the intention is for the data to be scanned once and the nesting be obtained by using the fact that the data is physically sorted. It is not difficult to see that this approach mixes implementation with logical details; for instance, having the data sorted is a way to prepare for the nesting (obviously sorting is one algorithm to implement the nest operator; another one would be hashing). Thus, one would like to separate logical from physical description (so that, for instance, a large collection could be nested by using hashing instead of sorting). However, it is possible to do better than that: one could get rid of the SQL specification (view description) completely, and generate one automatically from the XML target description and the database schema. Our goal in this paper is to develop a procedure for doing that.
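To illustrate the sort-then-scan idea behind the ORDER BY clause (a sketch only, not the actual XML/SQL implementation of [10]; the idcourse values are invented for illustration), the rows returned by the query of Figure 3 can be grouped in a single pass in Python because they arrive physically sorted by iddept:

import itertools

# Rows as returned by the query of Figure 3, already sorted by iddept.
rows = [
    (1, "Computer Science", 10, "Database Systems"),
    (1, "Computer Science", 11, "Compilers"),
    (2, "Philosophy", 20, "Philosophy I"),
]

# One pass over the sorted data: a new Department group starts whenever
# the iddept value changes; this is what the ORDER BY clause prepares for.
for (iddept, deptname), courses in itertools.groupby(rows, key=lambda r: (r[0], r[1])):
    print("Department", iddept, deptname)
    for _, _, idcourse, coursename in courses:
        print("  Course", idcourse, coursename)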
3.1 The Algorithm
We take as input a relational database schema and a target specification in XML. In order to make the program more useful, we will allow regular XML Schema descriptions (i.e., describing the target directly, instead of giving a metadescription using keywords CONSTRUCT, LIST and ATOM). However, in order to connect the target with the source, we still require a source attribute in all simple elements in the XML description2 . Note, though, that we do not ask for any extra information (like explicit declarations of nesting attributes), since this information can be inferred from the XML description. Our method can be described in a few steps: first, we will parse the XML Schema description to extract the attributes involved and the nesting of the schema (if any). The nesting of the schema is obtained by annotating each attribute with an integer called the attribute’s level. Intuitively, if we look at the XML data as a tree, the level indicates the attribute’s depth in the tree. Second, we will generate a nested relational algebra expression by using the information yielded by the parsing. Finally, this expression, when applied to the database, yields a (possibly nested) table which is then used to generate the desired XML document. There are a few nuisances to be dealt with using this method, which are explained after the method has been detailed, next. We assume that the XML Schema target definition contains complex elements (with internal structure of type SEQUENCE) and simple elements. Every simple element has an attribute source, the value of which is an attribute name in the source database. The definition corresponds to an XML document, with a single root node, and a tree structure. There are two ways of writing an XML Schema declaration: the first style is an inlined style, in which information about every element (including its attributes and internal structure) is declared when the element is introduced. In the second style, elements are associated with references (names), and later on elements are described, using the reference to associate the description with the described element. Here we assume the first (inlined) style of description, which is more concise and easier to parse (however, it is easy to modify our algorithm to be used with schema declarations in the second style). Finally, we assume that we have the following information about the source relational database: for each relation, the schema (attribute names), and all integrity constraints (i.e., the primary key, any foreign keys, and their associations). We make the assumption that all relations in the database are connected to some other relation through exactly one foreign-key/primary key link (more on this assumption later). In the first step, we go over the XML definition and get a list of attributes and, for each attribute, its level. The level indicates the depth of the attribute in the tree, with the root being 0 and each node having a level of one more than the parent node. Intuitively, attributes at level 1 are flat (i.e., correspond exactly to those in a relational database), while attributes at a level more than 1 have been nested. The information is collected in a Target List, a sequence 2
Technically speaking, since simple elements cannot have attributes in XML, source should be considered a facet ([12]).
INPUT: an XML Schema definition.
OUTPUT: Target List T
PRECOND: the Schema definition is nested, with all complex type definitions inline. Each tag with a simple name has a source attribute that names an attribute in the relational database.

T = empty; LIFO structure = empty; level = 0;
While not end of Schema do
  if (open tag for complex type t is found)
    level = level + 1; add t to LIFO;
  else if (open tag for simple type t is found)
    T = add (source name of t, level) to T; add t to LIFO;
    /* these are the only two choices */
  else if (closing tag is found)
    pop corresponding matching tag off of LIFO;
    if (the popped tag is a complex type) level = level - 1;
  /* else the XML Schema is not well formed */
End While
If (LIFO is empty) return T
/* else the XML Schema is not well formed */

Fig. 4. Algorithm to Generate a Target List
{(a1, l1), . . . , (an, ln)}, where ai (1 ≤ i ≤ n) is an attribute and li is a number > 0 called the level of attribute ai. Let T be a target list; the set At(T) (called the attributes of T) is defined as {ai | (ai, li) ∈ T}, LV(T) (called the levels of T) is defined as {li | (ai, li) ∈ T}, and, finally, AL(T, i) (called the attributes of T at level i) is defined as {a | (a, i) ∈ T}. Using the information in T, we create a Nest-Project-Join (NPJ) expression as follows: we iterate over T, starting at the deepest level, and for each level i and the next level i − 1 an NPJ expression with exactly one nest operator is built. The process is repeated until we reach the uppermost level in T. Intuitively, when there are several levels of depth in the XML document, we must create a nesting for each level; simply joining all relations at once and nesting once over them would not give the required output. We give two examples to clarify the process. Formally, for database schema DB = (R1, . . . , Rm), let Rel(DB, T, i) = {R ∈ DB | ∃A ∈ sch(R) ∩ AL(T, i)}. These are the relations with attributes appearing at level i. As stated above, we assume that all relations in Rel(DB, T, i) can be joined among themselves in exactly one way, for any i. Given a set of relations R, ⋈R will denote the appropriate join of all relations in R; hence our assumption implies that there is exactly one expression ⋈Rel(DB, T, i) (more on this assumption later). Moreover, we assume that there is also a way to join relations in Rel(DB, T, i) and Rel(DB, T, i − 1), perhaps through some other relations in the database. We call such relations Conn(i, i − 1), and the attributes involved in the joins Att(Conn(i, i − 1)). As an example, assume T such that Rel(DB, T, 1) is {Department, Professor};
INPUT: Target List T, schema DB
OUTPUT: an NPJ expression

Let i = max(LV(T));
Let TEMP = ⋈Rel(DB, T, i);
if (i = 1) return πAL(T,i) (TEMP)   /* this case implies no nesting is needed */
while (i != 1) do {
  if (i > 2) Attrs = Att(Conn(i-2, i-1)); else Attrs = ∅;
  TEMP = νAL(T,i-1)∪Attrs (πAL(T,i)∪Attrs∪AL(T,i-1) (LINK(Rel(DB, T, i-1), TEMP)));
  i = i - 1;
} /* end while */
return TEMP;

Fig. 5. Algorithm to produce an NPJ expression
Rel(DB, T, 2) is {Course}; and Rel(DB, T, 3) is {Professor}. Then Conn(2, 3) is {ProfCourse}; Att(Conn(2, 3)) is {IdDept}, and Conn(1, 2) is ∅. Finally, the expression LINK(Rel(DB, T, i), Rel(DB, T, i − 1)) is to denote the join of: all relations in Rel(DB, T, i), joined among themselves; all relations in Rel(DB, T, i − 1), joined among themselves; and the join of all relations in Conn(i, i − 1), joined among themselves. For instance, following our previous example, LINK(Rel(DB, T, 2), Rel(DB, T, 3)) = Course ⋈IdCourse ProfCourse ⋈IdProf Professor; and LINK(Rel(DB, T, 1), Rel(DB, T, 2)) = Department ⋈IdDept Professor ⋈IdProf ProfCourse ⋈IdCourse Course. The algorithm in Figure 5 uses these expressions to iterate over the target list. If the maximum depth in the target list is 1, there is no nesting, and hence we simply return a Project-Join expression. If the maximum depth is ≥ 2, some nesting is needed. We iterate starting at maximum depth i and the next depth i − 1 to create an NPJ expression for these two levels; this expression is reused in creating another NPJ expression for levels i − 1 and i − 2, and the process is repeated until we reach level 1. The variable Attrs is used to see which attributes are needed for the join at the next level (so, when looking at levels i and i − 1, we check the join between levels i − 1 and i − 2), since it is necessary to have those attributes present in the NPJ expression for levels i and i − 1.
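The following Python sketch renders the algorithm of Figure 5 in simplified form; it only assembles the algebra expression as a string, the schema information (relations per level, connecting attributes, and the LINK joins) is supplied by the caller, and the grouping attributes are taken from level i−1, as in the worked examples of the next subsection. It is our illustration, not part of the method's implementation.

def AL(T, i):
    # attributes of the target list T at level i
    return [a for (a, l) in T if l == i]

def npj_expression(T, rel, conn_attrs, link):
    # rel(i): relations whose attributes occur at level i
    # conn_attrs(j): attributes of the relations connecting levels j-1 and j
    # link(i, expr): join expression combining the level i-1 relations with expr
    i = max(l for (_, l) in T)
    temp = " ⋈ ".join(rel(i))
    if i == 1:                                # flat target: plain Project-Join
        return "π" + ",".join(AL(T, 1)) + "(" + temp + ")"
    while i != 1:
        attrs = conn_attrs(i - 1) if i > 2 else []
        group = AL(T, i - 1) + attrs          # group by the next level (plus pushed attributes)
        proj = AL(T, i) + attrs + AL(T, i - 1)
        temp = ("ν" + ",".join(group) + "(π" + ",".join(proj) +
                "(" + link(i, temp) + "))")
        i -= 1
    return temp

# First example of Section 3.2.
T = [("iddept", 1), ("deptname", 1), ("coursename", 2)]
print(npj_expression(
    T,
    rel=lambda i: {1: ["Department"], 2: ["Course"]}[i],
    conn_attrs=lambda j: [],
    link=lambda i, temp: "Department ⋈iddept " + temp,
))
# νiddept,deptname(πcoursename,iddept,deptname(Department ⋈iddept Course))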
3.2 Examples
The algorithm outlined above may look more complicated than the strategy it implements actually is. Let us illustrate the approach with a couple of examples from [10]. The first one is the example that we have used in the previous section; its target list is {(iddept, 1), (deptname, 1), (coursename, 2)}. This example has a maximum depth of 2 and hence will iterate over the loop only once. It is therefore the simplest example using nesting; it requires only one application of the nest operator. Once the relations involved are known
(in this case Rel(DB, T, 1) = {Department} and Rel(DB, T, 2) = {Course}, while Conn(1, 2) = ∅), joins based on the foreign-key/primary key relationship are established. Finally, looking at the attribute levels indicates the arguments needed by the nest operator: in this case, we nest by coursename and group by iddept and deptname. The corresponding NPJ expression is
νiddept,deptname (πiddept,deptname,coursename (Department ⋈iddept Course))
Note the correspondence between this expression and the SQL query used in [10]. A more complex example is example 2.3 of [10], which calls for a document made up of a list of department entries, each one containing the department name, a list of professors (names) in the department, and a list of courses offered by the department, each course element containing a list of the professors that teach the course for the department3. The solution of [10] is to give two SQL views, one grouping the information by department and the other one grouping the information by department and course (the idsql attribute is used to identify the view from which a source should be obtained). Note, however, that this essentially involves taking the Cartesian product of both views. In our approach, taking the join of all relations and then nesting would not work, even if two copies of relation Professor were used. Rather, the translation proceeds in layers, creating an NPJ expression for each pair of adjacent levels. First, the target list is obtained, yielding {(DeptName, 1), (ProfName, 2), (CourseName, 2), (ProfName, 3)}. Observe that ProfName appears twice, with different levels. The translation proceeds bottom-up, by first creating an NPJ expression for levels 2 and 3. However, in doing so it is necessary to take into account that the resulting relation will be used as input in another NPJ expression; the attributes needed for joining and nesting at the outer level are pushed down into this expression, resulting in
νCourseName,IdDept (πCourseName,IdDept,ProfName (Course ⋈ Professor ⋈ ProfCourse))
where IdDept is the attribute pushed down because it is needed at the next level (note also that this join involves going through a many-to-many table). This is the second value of TEMP (after one iteration of the loop; the first one, previous to entering the loop, was simply ProfCourse). Then the process is repeated for the rest of the target list. The outer NPJ expression, then, is
νDeptName (πDeptName,ProfName,Temp (Department ⋈ Professor ⋈ Temp))
The resulting nested relation corresponds exactly to that of example 2.3 of [10], but is obtained as a single query. Once the expression is obtained, generating the XML document consists of two simple steps. First, the expression is applied to the database to yield a single (possibly nested) table. This table is then transformed into a list (sequence) of XML documents, each one of them with the structure given by the target XML Schema. One document is obtained from each row in the table. A simple
Obviously, there is redundancy in such a document, but a user may request the information in such a format!
attribute in the row corresponds to a first level, simple element in the document (specified in the XML attribute source), while a complex attribute in the row corresponds to a complex element in the document. The complex element contents are obtained by matching simple elements inside complex attributes in the relation to simple attributes in the XML element (again, using the source specification), and recursively constructing complex elements in XML from complex attributes in the relation. Given the way the nested relational algebra expression was constructed, there should be a 1-1 correspondence (an isomorphism, really) between each row in the relation (seen as a tree) and the XML document specification (seen as a tree).
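The second step, turning a row of the nested table into markup, is straightforward; the sketch below (ours, using Python's standard ElementTree; the element name used for the nested relation is illustrative) converts one nested row into an XML element recursively.

import xml.etree.ElementTree as ET

def row_to_xml(tag, row):
    # Simple attributes become simple subelements; a nested relation (a list of
    # sub-rows) becomes a sequence of complex subelements, built recursively.
    elem = ET.Element(tag)
    for name, value in row.items():
        if isinstance(value, list):
            for subrow in value:
                elem.append(row_to_xml(name, subrow))
        else:
            child = ET.SubElement(elem, name)
            child.text = str(value)
    return elem

row = {"iddept": 1, "deptname": "Computer Science",
       "course": [{"coursename": "Database Systems"}, {"coursename": "Compilers"}]}
print(ET.tostring(row_to_xml("Department", row), encoding="unicode"))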
4 Extensions
We have made a series of syntactic simplifications in our algorithm which are not problematic. For instance, a (somewhat) significant simplification is to assume that the join of two relations is always based on a foreign key/primary key relationship, and that such a relationship is unique between two relations. Certainly, if one wants to get rid of this assumption it becomes necessary to indicate, somehow, how the underlying relations are to be joined. Three cases are possible: the relations are not joined in any way (in which case one has to use a Cartesian product); or the relations are joined by attributes other than the primary key and foreign key; or the relations are joined by a primary key/foreign key relationship, but more than one such relationship exists between the tables. The first case is very easy to address; our algorithm can be extended so that when no connection is found between two tables, a Cartesian product is used in the NPJ expression instead of a join. For the other two cases, there seems to be no good solution without further information about the database (like metadata of some sort); hence the user should provide such information in the XML document description. We do not address such an extension here, since we consider this situation extremely infrequent. A more substantive issue is the inadequacy of joins in some situations. Sometimes, a value in a primary key may have no matching value in a foreign key. In the example above, a department may not have any related courses in the database. In that case, the question is what to do with the unmatched element: should we show a department with no courses attached as part of the answer (with an empty sequence of courses), or should we not include such a department in the answer at all? XML Schema allows the user to specify the intended result by using minOccurs in the specification of the target, with a value of “0” or “1”. Intuitively, the user can distinguish between a ’*’ and a ’+’ situation (using DTD vocabulary) to define the intended answer. If minOccurs is set to 1, then only departments offering courses should be shown. This is exactly the semantics of the procedure outlined above, since the join operator will only produce tuples from one relation (in this case, Department) that have matches in the other relation (Course)4. If minOccurs is set to 0, then all departments
Note that 1 is the default value for minOccurs in XML Schema.
must be shown, whether they offer any courses or not (if they don’t, an empty sequence should be shown). The problem is how to specify this in the (nested) relational algebra, since the join operator will drop any tuple in a relation that has no match in the relation it is being joined to. The most straightforward answer is to use the outerjoin operator. In the previous example, a left outerjoin between Department (the left input) and Course (the right input) will create a result where departments with no associated courses are preserved in a tuple padded with nulls on the Course attributes. The presence of nulls, however, creates a problem for the nesting operator: if we try to nest by an attribute with nulls, what should the semantics of the operation be? An easy solution to this problem is to observe that, when outerjoin is used to preserve some information, we usually want to nest by the null attributes and group by the non-null ones (as is the case in our example, since we nest by coursename and group by iddept, deptname). Thus, one can stipulate that, in such a case, the nesting should produce a tuple with iddept, deptname values and an empty relation (i.e., an empty set) associated with them. From this, the XML document should contain a department element with values for its simple elements (iddept and deptname) and no value for the complex element that holds the courses. There are several other issues that need to be addressed to overcome the limitations of our method. We are currently working on some of them, which we can only briefly mention here for lack of space. Obviously, by using only NPJ expressions we are limited in the views that we can create. Adding selections would be an obvious extension; the main issue here is how to add selection conditions to the XML Schema specification of the output. On the other hand, our XML documents do not use the choice (|) or optional (“?”) constructors (the original examples in [10] did not use them either). One problem with using such operators is that it is difficult to make them correspond to anything in the relational algebra; proposed translations from XML to relations have difficulties dealing with both ([5]). One possible approach is based on the presence of null values. Finally, the assumption that each simple component in the XML document must come from one and only one source attribute can be relaxed at the cost of some complexity in the XML specification.
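To show the intended minOccurs = "0" behaviour concretely, here is a small Python sketch (ours; the History department and the hard-wired coursename column are invented for illustration): a left outer join preserves unmatched departments with null course values, and the subsequent nesting maps those nulls to an empty nested relation.

def left_outer_join(left, right, key):
    # Unmatched left tuples are padded with a null (None) on the right-hand attribute.
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append({**l, "coursename": None})
    return out

def nest_courses(rows):
    # Group by (iddept, deptname); a null coursename contributes nothing,
    # leaving an empty nested relation (and later an empty course sequence in XML).
    groups = {}
    for t in rows:
        key = (t["iddept"], t["deptname"])
        groups.setdefault(key, [])
        if t["coursename"] is not None:
            groups[key].append({"coursename": t["coursename"]})
    return [{"iddept": k[0], "deptname": k[1], "course": v} for k, v in groups.items()]

departments = [{"iddept": 1, "deptname": "Computer Science"},
               {"iddept": 3, "deptname": "History"}]
courses = [{"iddept": 1, "coursename": "Compilers"}]
print(nest_courses(left_outer_join(departments, courses, "iddept")))
# History is preserved with an empty nested relation of courses.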
5 Conclusions and Further Research
In this paper we have proposed a method to generate XML documents from a (flat) relational database using only a description of the target in XML Schema. The nested relational model is used as the bridge between flat relations and hierarchical XML data. Our algorithm parses the target description in XML to obtain all needed information without the user having to explicitly declare it. One of the advantages of this approach is that it automates the somewhat tedious task of creating a view (or SQL query) for each XML document that we want to define; the system takes care of such task automatically, letting the user concentrate on the specification of the desired output document.
The approach focuses on nest-project-join expressions; clearly, there are several possible extensions, which we are currently exploring, including dealing with choice and optional operators in the XML description, adding selections to the algebra expression, and allowing more complex matches between attributes in the relational tables and elements in XML. Acknowledgments. The author wishes to thank the anonymous reviewers for their helpful feedback.
References
1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 1999.
2. A. Deutsch, M. Fernandez, and D. Suciu. Storing Semistructured Data with STORED. In Proceedings of the ACM SIGMOD Conference, 1999.
3. M. Fernandez, A. Morishima, D. Suciu, and W. Tan. Publishing Relational Data in XML: the SilkRoute Approach. IEEE Data Engineering Bulletin, 24(2), 2001.
4. A. S. da Silva, I. Evangelista Filha, A. H. F. Laender, and D. W. Embley. Representing and Querying Semistructured Web Data Using Nested Tables with Structural Variants. In Proceedings of ER 2002.
5. M. Mani, D. Lee, and R. R. Muntz. Semantic Data Modeling Using XML Schemas. In Proceedings of ER 2001.
6. D. Lee and W. Chu. CPI: Constraint-Preserving Inlining Algorithm for Mapping XML DTD to Relational Schema. Data and Knowledge Engineering, volume 39, 2001.
7. D. Lee and W. Chu. Constraints-Preserving Transformation from XML Document Type Definition to Relational Schema. In Proceedings of ER 2000.
8. D. Lee, M. Mani, F. Chiu, and W. Chu. Nesting-Based Relational-to-XML Translation. In Int’l Workshop on Web and Databases (WebDB), 2001.
9. J. Shanmugasundaram, J. Kiernan, E. J. Shekita, C. Fan, and J. Funderburk. Querying XML Views of Relational Data. In Proceedings of VLDB 2001.
10. C. Vittori, C. Dorneles, and C. Heuser. Creating XML Documents from Relational Data Sources. In Proceedings of EC-Web, 2001.
11. G. Vossen. Data Models, Database Languages and Database Management Systems. Addison-Wesley, 1991.
12. T. Bray, J. Paoli, and C. M. Sperberg-McQueen (eds.). Extensible Markup Language (XML) 1.0, second edition. W3C Recommendation, http://www.w3.org/TR/REC-xml-20001006.
Toward the Automatic Derivation of XML Transformations
Martin Erwig
Oregon State University, School of EECS
[email protected]
Abstract. Existing solutions to data and schema integration require user interaction/input to generate a data transformation between two different schemas. These approaches are not appropriate in situations where many data transformations are needed or where data transformations have to be generated frequently. We describe an approach to an automatic XML-transformation generator that is based on a theory of information-preserving and -approximating XML operations. Our approach builds on a formal semantics for XML operations and their associated DTD transformation and on an axiomatic theory of information preservation and approximation. This combination enables the inference of a sequence of XML transformations by a search algorithm based on the operations’ DTD transformations.
1 Introduction
XML is rapidly developing into the standard format for data exchange on the Internet; however, the combination of an ever growing number of XML data resources on the one hand, and a constantly expanding number of XML applications on the other hand, is not without problems. Of particular concern is the danger of isolated data and application “islands” that can lead users to perceive a prodigious supply of data that is often inaccessible to them through their current applications. This issue has been observed and extensively addressed in previous work on data integration, for example, [8,14,6,7,19,13], and more recently in schema integration and query discovery [21,24,15,16]. So far, however, all the proposed solutions require user input to build a translation program or query. Even more troubling, since each different data source requires a separate transformation, the programming effort grows linearly with the number of data sources. In many cases this effort is prohibitive. Consider the following scenario. An application to evaluate the publication activities of researchers accepts XML input data, but requires the data to be of the form “publications clustered by authors”. A user of this system finds a large repository of bibliographic data, which is given in the format described by the DTD shown in Figure 1 on the left. In the following, we will refer to the corresponding XML data as bib. The application cannot use these data because
<!ELEMENT byAuthor (author*)>
<!ELEMENT author (name,(book|article)*)>
<!ELEMENT book (title)>
<!ELEMENT article (title,journal)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT journal (#PCDATA)>
Fig. 1. DTD of available data and DTD of required data (only the required DTD, shown on the right in the original figure, could be recovered).
bibliographic entries are not grouped by authors. What is needed is a tool that can transform bib into a list of author elements, each containing a sublist of their publications. Such a format is shown in Figure 1 on the right. Although tools are available that support the transformation, they sometimes require non-trivial programming skills. In almost all cases they require some form of user interaction. In any case, users might not be willing to invest their time in generating one-time conversion tools. Moreover, if the integration of several different data sources requires the creation of several different transformations, the programming or specification effort quickly becomes untenable. An intrinsic requirement is that these transformations be “as information preserving as possible”. In the best case the generated transformation preserves the information content completely, but in many instances transformations that lose information are also sufficient. For example, if an application requires only books with their titles, a transformation that “forgets” the author information of an XML document works well. Our solution to the described problem can be summarized as follows: First, identify an algebra of information-preserving and information-approximating XML transformations. In particular, these operations have a precisely defined type, that is, an associated schema transformation for DTDs. By induction it then follows that if we transform a DTD d into a DTD d′ by a sequence of these elementary XML transformations, the same sequence of operations transforms an XML value of DTD d, losslessly or approximately, into an XML value of DTD d′. The second step is then to define a search algorithm that constructs a search space of DTDs by applying algebra operations and finds a path from a source DTD d to the required target DTD d′. The path represents the sequence of operations that realize the sought transformation. There might be, of course, cases in which the automatic inference does not work well. The situation is comparable to that of search engines like Google that do not always find good matches due to a lack of semantics or structure associated with the query keywords. Nevertheless, search engines are among the most valuable and most frequently used tools of the Internet since they provide satisfactory results in practice. For the same reasons, automatic integration tools, although not complete, might be valuable and useful tools in practice. This paper presents the proposed approach through examples. Due to space limitations we have to restrict ourselves to the description of a small number of elementary XML operations that can be employed in generated transformations
and also a subset of axioms for information approximation. Nevertheless, we will be able to demonstrate the automatic generation of an XML transformation within this restricted setting. The rest of this paper is structured as follows. In Section 2 we will discuss related work. In Section 3 we will formally define the problem of XML-transformation inference. In Section 4 we axiomatize the notions of information preservation and approximation. In Section 5 we define what it means for an XML transformation to be DTD correct. In Section 6 we introduce basic XML transformations that will be used as building blocks in Section 7 in the inference of complex XML transformations. Finally, Section 8 presents some conclusions.
2 Related Work
Related work has been performed in two areas: (i) schema matching and query discovery and (ii) data semantics and information content. Schema Matching and Query Discovery. Approaches for matching between different data models and languages are described in [19,2,3]. Data integration from an application point of view is also discussed, for example, in [8,6,14,13]. We will not review all the work on data integration here because data integration is traditionally mainly concerned with integrating a set of schemas into a unified representation [22], which poses different challenges than translating between two generally unrelated schemas. A more specific goal of schema matching is to identify relationships between (elements) of a source and a target schema. Such a mapping can then be used to deduce a transformation query for data. The Cupid system [15] focuses exclusively on schema matching and does not deal with the related task of creating a corresponding data transformation/query. The described approach combines different methods used in earlier systems, such as MOMIS [4] or DIKE [20]. The Clio system [9] is an interactive, semi-automated tool for computing schema matchings. It was introduced for the relational model in [16] and was based on so-called value correspondences, which have to be provided by the user. In [24] the system has been extended by using instances to refine schema matchings. Refinements can be obtained by inferring schema matchings from operations applied to example data, which is done by the user who manipulates the data interactively. User interaction is also needed in [21] where a two-phase approach for schema matching is proposed. The second phase, called semantic translation, is centered around generating transformations that preserve given constraints on the schema. However, if few or even no constraints are available, the approach does not work well. It has been argued in [16] that the computation of schema matchings cannot be fully automated since a syntactic approach is not able to exploit the semantics of different data sources. While this is probably true for arbitrarily complex matches, it is also true that heuristic and linguistic tools for identifying renamings can go a long way [12,5]. Certainly, quality and sophistication of
transformations can be increased by more semantic input. However, there is no research that could quantify the increase/cost ratio. So it is not really known how much improvement is obtained by gathering semantics input. The approach presented in this paper explores the extreme case where users cannot or are not willing to provide input, which means to provide fully automatic support for data transformation. Information Content. A guiding criterion for the discovery of transformations is the preservation (or approximation) of the data sources to which the transformations will be eventually applied. Early research on that subject was performed within relational database theory [10,11] and was centered around the notion of information capacity of database schemas, which roughly means the set of all possible instances that a schema can have. The use of information capacity equivalence as a correctness criterion for schema transformations has been investigated in [17,18]. In particular, this work provides guidelines as to which variation of the information capacity concept should be applied in different applications of schema translation. One important result that is relevant to our work is that absolute information capacity equivalence is too strong a criterion for the scenario “querying data under views”, which is similar in its requirements to data integration. In other words, those findings formally support the use of information approximation in transformation inference.
3 Formalization of Transformation Inference
In the following discussion we make use of the following notational conventions.
Symbols                  denote
x, x′, y, z              XML elements (also called XML values)
ℓ                        lists of XML elements
d, d′                    DTDs
t, u                     tags
t[x1 . . . xk], t[ℓ]     XML elements with tag t and subelements x1 . . . xk (or ℓ)
Sometimes we want to refer to a subelement without caring about the exact position of that element. To this end we employ a notation for XML contexts: C⟨x⟩ stands for an XML element that contains somewhere a subelement x. Similarly, C⟨ℓ⟩ represents an XML element that contains a list ℓ of subelements. This notation is particularly helpful for expressing changes in contexts. To simplify the discussion, we do not consider attributes or mixed content of elements in the following. Now we can describe the problem of XML-transformation inference precisely as follows. We are given an XML data source x that conforms to a DTD d (which is written as x : d), but we need the data in the format described by the DTD d′. Therefore, we are looking for an XML transformation f that, when applied to x, yields an XML value x′ that conforms to the DTD d′ (that is, f(x) : d′) and otherwise contains as much as possible the same information as x. This last
condition can be expressed by defining a partial order ≺ on XML values that formalizes the notion of having less information content. A slight generalization of the problem is to find transformations f with the described property without knowing x. We can express the problem mathematically as follows.
P(d, d′) = {f | ∀x. x : d =⇒ f(x) : d′ ∧ ¬∃f′. f′(x) : d′ ∧ f(x) ≺ f′(x)}
P defines the set of all transformations f that map an XML value conforming to d to a value conforming to d′ and also have the property that there is no other transformation f′ with that property that preserves more information content. The generalized definition reflects the application when the DTD d of the XML data source is known, but the (possibly very large) XML document x has not been loaded (yet). In the following we consider this second case since it subsumes the previous one.
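As a concrete (and purely illustrative) reading of this notation, the sketches accompanying the following sections model an XML value t[x1 . . . xk] in Python as a (tag, children) pair, with text content as plain strings; this representation is our choice, not part of the paper.

def elem(tag, *children):
    # t[x1 ... xk]: an element is a pair of its tag and the list of its children.
    return (tag, list(children))

def tags(x):
    # tags(x): the set of all tags occurring in the element x.
    if isinstance(x, str):          # text content carries no tags
        return set()
    tag, children = x
    result = {tag}
    for c in children:
        result |= tags(c)
    return result

book = elem("book", elem("title", "Principia Math."),
            elem("author", "Russel"), elem("author", "Whitehead"))
print(tags(book))   # the set {'book', 'title', 'author'} (in some order)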
4 Information Preservation and Information Approximation
We formalize the concepts of information preservation and approximation by defining corresponding relations on XML trees. These relations are induced by operations on XML values. We consider here the renaming of tags and regrouping as information-preserving operations and the deletion of elements as an information-approximating operation. This limitation is not really a problem since the whole theory is generic in the axiomatization of information preservation/approximation, which means that the set of chosen operations does not affect the overall approach. Formally, two elements that have non-matching tags, such as x = t[a] and x′ = u[a], are considered to be different. However, if we rename the tag t in x to u, both elements become identical. We write {t → u} for a renaming of t to u and {t → u}(x) for the application of the renaming to the element x. It happens quite frequently that the same data are named differently by different people. For example, we might find bibliographic data sources that wrap the author information in differently named tags. With regard to the information contained in the XML value, the actual choice of individual tag names does not really matter. Therefore, we can consider a broader kind of equality “up to a tag renaming r”, written as ≡r. For example, under the renaming {t → u} the elements x and x′ are equal, which we could express by x ≡{t→u} x′. This is because {t → u}(x) = x′. We must be careful not to rename with a tag that is already in use in the element to be renamed. For example, renaming a tag to one that is already used in the bibliographic data from Section 1 would change the meaning of that data. In general, a renaming r can consist of a set of tag renamings, which means that r is a function from old tags to new tags. The sets of old and new tags can be extracted from a renaming by dom(r) and rng(r), respectively. We can formalize the equivalence of DTDs modulo renamings by a rule like ren≡ shown in Figure 2. In this and the rules to follow, r denotes an arbitrary
(ren≡)   rng(r) ∩ tags(x) = ∅ and r(x) = x′  implies  x ≡r x′
(cong≡)  x1 ≡r y1, . . . , xk ≡r yk  implies  t[x1 . . . xk] ≡r t[y1 . . . yk]
(grp≡)   C⟨t[ℓ1] . . . t[ℓk]⟩ ≡r C⟨t[ℓ1]⟩ . . . C⟨t[ℓk]⟩
(del)    C⟨⟩ ⪯r C⟨x⟩
(cong⪯)  x1 ⪯r y1, . . . , xk ⪯r yk  implies  t[x1 . . . xk] ⪯r t[y1 . . . yk]
Fig. 2. Axiomatic definition of information content and approximation
(set of) renaming(s). The first premise of the rule prevents name clashes by requiring fresh tags in renamings. The function tags computes the set of all tags contained in an XML element. We also have to address the fact that some renamings are more reasonable than others, for example, {name → aname} is more likely to lead to equivalent schemas than, say {name → price}. In the described model, any two structurally identical DTDs can be regarded as equivalent under some renaming. This leads to equivalence classes that are generally too large. In other words, schemas that would not be considered equivalent by humans are treated as equivalent by the model. This will be particularly evident when the tags used in the source and target DTD are completely or mostly different. This problem can be addressed by defining an ordering on renamings that is based on the number and quality of renamings. A cost or penalty can be assigned to each renaming based on its likeliness. For example, names that are “similar” should be assigned a relatively low cost. Measures for similarity can be obtained from simple textual comparisons (for example, one name is the prefix of another), or by consulting a thesaurus or taxonomy like WordNet [1]. Synonyms identified in this way should also have a low penalty. In contrast, any renaming that has no support, such as {name → price}, receives a maximum penalty. With this extension we can measure any equivalence d ≡r d by a number, which is given by the sum of the penalties of all renamings in r. Later, we can use this measure to select the “cheapest” among the different possible transformations by favoring a few, well-matching renamings. Renaming is the simplest form of extending verbatim equality to a form of semantic equivalence. As another example, consider a structural equivalence condition that is obtained from the observation that an element x with tag u containing k repeated subelements with tag t is a grouped or factored representation of the association of each t-element with the rest of x. Therefore, it represents the same information as the corresponding “de-factored” or “ungrouped” representation as k u-elements each containing just one t-element. For instance, the following element on the left represents (in a factored way) the same information as the two elements shown on the right.
<book><title>Principia Math.</title><author>Russel</author><author>Whitehead</author></book>
<book><title>Principia Math.</title><author>Russel</author></book>
<book><title>Principia Math.</title><author>Whitehead</author></book>
In general, an element C⟨t[ℓ1] . . . t[ℓk]⟩ contains the same information as the list of elements C⟨t[ℓ1]⟩ . . . C⟨t[ℓk]⟩. This idea can be captured by the axiom grp≡ shown in Figure 2. Finally, we also need congruence rules to formalize the idea that if elements x and x′ contain the same information, then so do, for example, the elements t[x] and t[x′]. This is achieved by the rule cong≡ shown in Figure 2. This approach of formalizing the notion of information equivalence by a set of axioms and rules provides a sound basis for judging the correctness of inferred transformations. In a similar way, we can axiomatize the notion of information approximation. For instance, deleting a subelement from an element x yields a new element x′ that contains less information than x but agrees otherwise with x. This idea can be expressed by the axiom del shown in Figure 2, where we also give a congruence rule cong⪯ for information approximation. Since the definition of approximation is an extension of equivalence, we also have to account for renamings in the predicate ⪯r.
5 DTD Correctness of XML Transformations
DTDs can be formally defined by extended context-free grammars. Non-recursive DTDs can be represented simply by trees, that is, they can be represented essentially in the same way as XML values. This tree representation simplifies the description of DTD transformations. Note that in this representation * and | occur as tags. For example, the DTD for bib can be represented by the following tree.
bib[*[|[book[title, *[author]], article[title, *[author], journal]]]]
Representing DTDs as trees means that we can re-use the tree operations we have already defined for XML values. The complexity of the resulting notation can be reduced by abbreviating *[e] by e∗ and |[e, e′] by (e|e′) so that we can recover most of the original DTD notation:
bib[(book[title, author∗] | article[title, author∗, journal])∗]
A DTD transformation is given by a function that maps a DTD d to another DTD d′. For each XML transformation f, we can consider its corresponding DTD transformation, for which we write f̂. Depending on the language in which
f is defined and on the formalism that is used to describe DTDs and DTD transformations, there might exist zero, one, or more possible DTD transformations for f. The DTD transformation f̂ that corresponds to an XML transformation can also be considered as f’s type, which is expressed by writing f : d → d′ if f̂(d) = d′. Formally relating DTD transformations to the transformations of the underlying XML values is achieved by the notion of DTD correctness, that is, an XML operation f : d → d′ is defined to be DTD correct if
∀x : d. f applies to x =⇒ f(x) : d′
In other words, DTD correctness means that the DTD transformation f̂ that is associated with an operation f is semantically meaningful, that is, it correctly reflects the DTD transformation for each underlying XML value. (We can write the condition also as: ∀x : d. f(x) : f̂(d).)
6 Basic XML Transformations
The feasibility of the automatic XML-transformation inference hinges to a large part on the ability to express complex XML transformations as compositions of a small set of simple operations, which we call basic operations. The design of these basic operations is guided by the following criteria. All basic operations must (a) be information preserving or information approximating, (b) have a clearly specified DTD transformation, and (c) be DTD correct. Why do we require these properties? Item (a) ensures that inferred transformations do not change the information contained in XML data or at most lose information, but never introduce new information. Properties (b) and (c) will ensure that the inference, which is directed by DTDs, yields transformations of XML values that conform to these DTDs. The notion of DTD transformations and correctness will be explained below. Next we consider three basic XML transformations that have been designed guided by the criteria just mentioned: renaming, product, and deletion.
Renaming. The rename operation α takes a renaming r = {t1 → u1, . . . , tk → uk} with ui ≠ ti for 1 ≤ i ≤ k and applies it to all tags in an XML element x. We require that the new tags ui do not occur in x.
αr(x) = r(x) if rng(r) ∩ tags(x) = ∅, and αr(x) = x otherwise.
Let us check the design constraints for this operation. For information preservation we require that the XML value obtained by the operation in question is equivalent to the original XML value. In the case of renaming we therefore require αr(x) ≡r x, which follows directly from the axiom ren≡ shown in Figure 2. The DTD transformation that corresponds to renaming can be described by:
αr : d → r(d)
which means that α transforms an XML value conforming to a DTD d into a value whose DTD is obtained by renaming tags according to r. The proof of DTD correctness can be performed by induction over the syntactic structure of the DTD transformation.
Product. Another basic operation is the operation π for de-factoring XML elements. We also call this operation product since it essentially computes a combination of an element with a list of its subelements. The tag t of the subelement to be considered is a parameter of π.
πt(u[C⟨t[ℓ1] . . . t[ℓk]⟩]) = u[C⟨t[ℓ1]⟩ . . . C⟨t[ℓk]⟩]
The additional root tag u is needed in the definition to force the repetition to apply below the root element. We assume implicitly in this and all other definitions that operations leave unchanged all XML values that do not match the pattern of the definition. In the case of π this means that for any element x that does not contain repeated t-subelements we have πt(x) = x. Again we can check the properties of the operation π. First, information preservation follows from the axiom grp≡ and the congruence rule cong≡ shown in Figure 2. The type of π is:
πt : u[C⟨t∗⟩] → u[C⟨t⟩∗]
DTD correctness can again be shown by induction.
Deletion. As an example of an information-approximating operation, consider the XML transformation δt that deletes a sequence of t-subelements (on one level) from an XML element. It can be defined as follows.
δt(C⟨t[ℓ1] . . . t[ℓk]⟩) = C⟨⟩
Obviously, δ is not information preserving, but it is information approximating, which can be proved using the axiom del from Figure 2. The type of δ can be described succinctly by re-using the context notation for XML trees.
δt : C⟨t∗ | t⟩ → C⟨⟩
As for the other XML transformations, DTD correctness can be proved by induction. To summarize, for all the basic operations ω defined, we have the following property.
∀x. x : d =⇒ ω(x) : ω̂(d) ∧ (∃r. x ≡r ω(x) ∨ ω(x) ⪯r x)
That is, each basic operation ω is: (1) DTD correct and (2a) information preserving or (2b) information approximating (recall that ω̂ denotes the DTD transformation of ω).
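Using the (tag, children) representation sketched at the end of Section 3, the three basic operations could be realized roughly as follows; this is our hedged sketch (it covers only the simple cases of the definitions, e.g. δ deletes direct children and π de-factors below a root with a single complex child), not the paper's implementation.

def tags(x):
    if isinstance(x, str):
        return set()
    t, cs = x
    out = {t}
    for c in cs:
        out |= tags(c)
    return out

def rename(r, x):
    # alpha_r: apply the tag renaming r (a dict) everywhere,
    # unless one of the new tags already occurs in x.
    if set(r.values()) & tags(x):
        return x
    if isinstance(x, str):
        return x
    t, cs = x
    return (r.get(t, t), [rename(r, c) for c in cs])

def product(t, x):
    # pi_t: u[c<t[l1] ... t[lk]>] becomes u[c<t[l1]> ... c<t[lk]>]
    # (simple case: the repeated t-subelements are direct children of the root's single child).
    tag, children = x
    if len(children) != 1 or isinstance(children[0], str):
        return x
    ctag, cchildren = children[0]
    ts = [c for c in cchildren if not isinstance(c, str) and c[0] == t]
    rest = [c for c in cchildren if isinstance(c, str) or c[0] != t]
    if not ts:
        return x
    return (tag, [(ctag, rest + [one]) for one in ts])

def delete(t, x):
    # delta_t: delete the t-subelements that occur as direct children of x.
    if isinstance(x, str):
        return x
    tag, children = x
    return (tag, [c for c in children if isinstance(c, str) or c[0] != t])

x = ("bib", [("book", [("title", ["Principia Math."]),
                       ("author", ["Russel"]), ("author", ["Whitehead"])])])
print(product("author", x))        # two book copies, one author each
print(rename({"bib": "byAuthor"}, x))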
7 Transformation Inference
A very simple, although effective, initial approach is to build a search space of DTDs starting from the DTD of the source document, say d, by repeatedly applying all matching operations until the target DTD, say d′, is reached. By “matching operations” we mean basic operations whose argument type has d as an instance. In the search we always favor following paths along information-preserving operations over information-approximating operations. Whenever we apply α we take tags(d′) as a pool from which to draw new names. We also have to ensure not to repeatedly apply inverse renamings to prevent running into infinite search paths. Once we have reached d′ by this procedure, the path from d to d′ in this search space corresponds to a sequence of basic XML transformations ω1, . . . , ωk whose composition f = ωk · . . . · ω1 is the sought transformation of type d → d′. This is because we are using only DTD-correct transformations. If all basic operations ωi are information preserving, then so is the transformation f. If at least one ωi is information approximating, then so is f. If we are not able to generate d′, the algorithm stops with an error. To illustrate the transformation inference by an example, consider the task of creating a list of title/author pairs for books from the bib element. This means to find a transformation from the DTD d for bib
bib[(book[title, author∗] | article[title, author∗, journal])∗]
into the following DTD d′.
bookAuthors[book[title, author]∗]
First, since the tag bookAuthors is not contained in the source DTD d, we know that we have to apply αr with r = {bib → bookAuthors}. Next, we can apply δarticle because its type matches with the context
C1 = bookAuthors[(book[title, author∗] | )∗]
However, we might also apply πauthor by choosing, for example, the following context (note that u = bookAuthors).
C2 = (book[title, author∗] | article[title, , journal])∗
(Alternatively, we could also match author∗ in the book element.) Nevertheless, we choose to apply δ because it is simpler, which is somehow indicated by the smaller context C1. We could also try to apply δbook to delete the book element, which, however, does not seem to make any sense because we then “lose” a tag of the target DTD. After having applied δarticle, we have reached the DTD described by the context C1. Now it makes sense to apply πauthor. Before we do this, however, we simplify C1 according to a rule d| = d to remove the now
unnecessary | constructor. So the context for the application of πauthor is (with u = bookAuthors): C3 = book[title, ∗ ]∗ The resulting DTD after the application of πauthor is bookAuthors[(book[title, author]∗ )∗ ] A final simplification through the rule (d∗ )∗ = d∗ [23] yields the target DTD. The inference process has therefore generated the transformation f = πauthor · δarticle · α{bib→bookAuthors} The description is a bit simplified, because in order to apply the operations in f to some XML value, we need all the contexts that were determined during the inference process. Treating these contexts here like implicit parameters, we can now apply f to bib and obtain the desired XML value. With two additional operations for lifting elements upward in XML trees and grouping elements according to common subelements, we can describe the XML transformation that is required for the example given in Section 1. Designing these operations so that they are DTD correct and information preserving/approximating and making transformation inference powerful enough to discover them is part of future work.
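A bare-bones realization of this search could look as follows (our sketch in Python; the paper leaves the concrete search strategy and DTD representation open, so DTDs are modelled here as plain strings and the three operation stand-ins are simple string rewrites that only mimic α, δ and π on this one example):

from collections import deque

def infer_transformation(source_dtd, target_dtd, operations):
    # Breadth-first search over the DTDs reachable from source_dtd.
    # operations: (label, function) pairs; a function returns the transformed DTD
    # or None when the operation does not apply. Returns the label sequence or None.
    frontier = deque([(source_dtd, [])])
    seen = {source_dtd}
    while frontier:
        dtd, path = frontier.popleft()
        if dtd == target_dtd:
            return path
        for label, op in operations:
            new = op(dtd)
            if new is not None and new not in seen:
                seen.add(new)
                frontier.append((new, path + [label]))
    return None

ops = [
    ("alpha{bib->bookAuthors}",
     lambda d: "bookAuthors" + d[3:] if d.startswith("bib[") else None),
    ("delta_article",
     lambda d: d.replace(" | article[title, author*, journal]", "") if "article" in d else None),
    ("pi_author",
     lambda d: d.replace("book[title, author*]", "book[title, author]") if "book[title, author*]" in d else None),
]
src = "bib[(book[title, author*] | article[title, author*, journal])*]"
tgt = "bookAuthors[(book[title, author])*]"
print(infer_transformation(src, tgt, ops))
# e.g. ['alpha{bib->bookAuthors}', 'delta_article', 'pi_author']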
8 Conclusions
The fast growing number of Web applications and available information sources carries the danger of creating isolated data and application islands because the distributed nature of the Internet does not enforce the use of common schemas or data dictionaries. Our approach aims at avoiding these data islands and to promote the free flow and integration of differently structured data by developing a system for the automatic generation of XML transformations. Our approach differs from previous efforts since we aim at a fully automated transformation discovery tool where user interaction is not required a priori. It will not, however, rule out any additional input the user is willing to provide. As one example, user-defined renamings can be easily integrated into our approach by setting penalties for these renamings to zero. In other words, users can interact if they want to, but are not required to do so.
References 1. WordNet: A Lexical Database for the English Language. http://www.cogsci.princeton.edu/˜wn/. 2. S. Abiteboul, S. Cluet, and T. Milo. Correspondence and Translation for Heterogeneous Data. In 6th Int. Conf. on Database Theory, LNCS 1186, pages 351–363, 1997.
3. P. Atzeni and R. Torlone. Schema Translation between Heterogeneous Data Models in a Lattice Framework. In 6h IFIP TC-2 Working Conf. on Data Semantics, pages 345–364, 1995. 4. S. Bergamaschi, S. Castano, and M. Vincini. Semantic Integration of Semistructured and Structured Data Sources. SIGMOD Record, 28(1):54–59, 1999. 5. M. W. Bright, A. R. Hurson, and S. Pakzad. Automated Resolution of Semantic Heterogeneity in Multidatabases. ACM Transactions on Database Systems, 19(2):212–253, 1994. 6. V. Christophides, S. Cluet, and J. Sim`eon. On Wrapping Query Languages and Efficient XML Integration. In ACM SIGMOD Conf. on Management of Data, pages 141–152, 2000. 7. S. Cluet, C. Delobel, J. Sim´eon, and K. Smaga. Your Mediators Need Data Conversion! In ACM SIGMOD Conf. on Management of Data, pages 177–188, 1998. 8. A. Eyal and T. Milo. Integrating and Customizing Heterogeneous E-Commerce Applications. VLDB Journal, 10(1):16–38, 2001. 9. L. M. Haas, R. J. Miller, B. Niswonger, M. T. Roth, P. M. Schwarz, and E. L. Wimmers. Transforming Heterogeneous Data with Database Middleware: Beyond Integration. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 22(1):31–36, 1999. 10. R. Hull. Relative Information Capacity of Simple Relational Database Schemata. SIAM Journal of Computing, 15(3):856–886, 1986. 11. T. Imielinski and N. Spyratos. On Lossless Transformation of Database Schemes not Necessarily Satisfying Universal Instance Assumption. In 3rd ACM SIGACTSIGMOD-SIGART Symp. on Principles of Database Systems, pages 258–265, 1984. 12. P. Johannesson. Linguistic support for Analysing and Comparing Conceptual Schemas. IEEE Transactions on Knowledge and Data Engineering, 21(2):165–182, 1997. 13. A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying Heterogeneous Information Sources Using Source Descriptions. In 22nd Int. Conf. on Very Large Databases, pages 251–262, 1996. 14. B. Lud¨ ascher, Y. Papakonstantinou, and P. Velikhov. Navigation-Driven Evaluation of Virtual Mediated Views. In 7th Int. Conf. on Extending Database TechnologyEuropean, LNCS 1777, pages 150–165, 2000. 15. J. Madhavan, P. A. Bernstein, and E. Rahm. Generic Schema Matching with Cupid. In 27th Int. Conf. on Very Large Databases, pages 49–58, 2001. 16. R. J. Miller, L. M. Haas, and M. A. Hern` andez. Schema Mapping as Query Discovery. In 26th Int. Conf. on Very Large Databases, pages 77–88, 2000. 17. R. J. Miller, Y. Ioannidis, and R. Ramakrishnan. The Use of Information Capacity in Schema Integration and Translation. In 19th Int. Conf. on Very Large Databases, pages 120–133, 1993. 18. R. J. Miller, Y. Ioannidis, and R. Ramakrishnan. Schema Equivalence in Heterogeneous Systems: Bridging Theory and Practice. Information Systems, 19(1):3–31, 1994. 19. T. Milo and S. Zohar. Using Schema Matching to Simplify Heterogeneous Data Translation. In 24th Int. Conf. on Very Large Databases, pages 122–133, 1998. 20. L. Palopoli, G. Terracina, and D. Ursino. Towards the Semi-Automatic Synthesis of Cooperative Information Systems and Data Warehouses. In ADBIS-DASFAA Symp. on Advances in Databases and Information Systems, pages 108–117, 2000. 21. L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hern` andez, and R. Fagin. Translating Web Data. In 28th Int. Conf. on Very Large Databases, 2002.
354
M. Erwig
22. S. Ram and V. Ramesh. Schema Integration: Past, Current and Future. In A. Elmagarmid, M. Rusinkiewicz, and A. Sheth, editors, Management of Heterogeneous and Autonomous Database Systems, pages 119–155. Morgan Kaufman, 1999. 23. J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton. Relational Databases for Querying XML Documents: Limitations and Opportunities. In 25th Int. Conf. on Very Large Databases, pages 302–314, 1999. 24. L. L. Yan, R. J. Miller, L. M. Haas, and R. Fagin. Data-Driven Understanding and Refinement of Schema Mappings. In ACM SIGMOD Conf. on Management of Data, 2001.
VACXENE: A User-Friendly Visual Synthetic XML Generator

Khoo Boon Tian 1, Sourav S Bhowmick 1, and Sanjay Madria 2

1 School of Computer Engineering, Nanyang Technological University, Singapore
[email protected]
2 Department of Computer Science, University of Missouri-Rolla, Rolla, MO 65401
[email protected]
Abstract. Synthetic collections of valid XML documents are useful in many XML applications. However, creating test cases for XML applications manually from DTDs can be tedious and may not cover all possible or required instances. In this paper, we present VACXENE (VisuAl synthetiC Xml gENErator), a Java-based tool that creates test cases for XML applications by generating random instances of valid XML documents from a single DTD. The generator provides a user-friendly GUI that allows the user to control the appearance of the XML output by imposing user-defined constraints. The paper presents an overview of the various features supported by VACXENE and reports some preliminary results regarding performance.
1 Introduction

XML has emerged as the dominant standard for representing and exchanging data over the Internet. When compared with other mark-up languages such as HTML, the main advantage of XML is that each XML document can have a Document Type Definition (DTD) associated with it. A DTD serves as an implicit semantic schema for the XML document and makes it possible to define much more powerful queries than what is possible with simple, keyword-based text retrieval. Also, XML's nested, self-describing structure provides a simple yet flexible means for applications to model and exchange data. For example, a business can easily model complex structures such as purchase orders in XML form. As another example, all of Shakespeare's plays can be marked up and stored as XML documents. Overall, XML can serve at least two roles [6]. First, as a new markup language, XML files can be browsed by a web browser in the same way as HTML files. Second, XML can serve as a standard way of storing semi-structured data sets. XML makes it possible for users to ask very powerful queries against the web. Consequently, a great deal of research has recently focused on storing, indexing, and querying XML data [6].

One of the critical steps in XML research is the performance evaluation of new techniques for storing, indexing and querying XML data. It is imperative to have access to large XML data sets with widely varying characteristics in order to gain insights into the performance of proposed techniques on different kinds of XML data. However, using only real XML data can be very limiting for three reasons [2]. First, there is not much publicly available XML data at this time. Second, all the real XML
data that we have encountered has relatively simple structure. Using more complex XML data can provide better insights, even if this data is synthetic. Third, like all real data, we have very little control over the characteristics of real XML data.

Synthetically generated data has always been important for evaluating and understanding new ideas in database research. Synthetic data generators allow users to generate large volumes of data with well-understood characteristics. One can easily vary the characteristics of the generated data by varying the input parameters of the data generator. This allows users to systematically cover much more of the space of possible data sets than relying solely on real data over which users have little or no control. As such, using synthetic data for evaluating research ideas and testing the performance of database systems can provide users with deeper insights and stronger conclusions than relying solely on real data. Of course, while experimenting with synthetic data is an ideal way to explore the behavior of different solutions on data with different characteristics, an additional validation step may be necessary to ensure that the conclusions drawn from synthetic data extend to real-world applications.

In this paper, we describe a data generator for generating synthetic XML data called VACXENE (VisuAl synthetiC Xml gENErator); the name also implies that our data generator is a "vaccine" for the "cure" of the lack of diverse characteristics in real XML data. VACXENE allows a high level of control over the properties of the generated XML data using a small number of parameters. The simple and intuitive nature of the data generation parameters means that the characteristics of the generated XML data will be easy to understand, even though this data may not necessarily resemble any available real data. This data generator is certainly not the ultimate solution to the problem of generating synthetic XML data, but we have found it very useful in our research on XML data management, and we believe that it can also be useful to other researchers. Given a DTD, VACXENE generates tree-structured XML documents of arbitrary complexity. It uses the information provided by the user to generate one or more XML documents with a variety of characteristics, and it also generates values for the elements and attributes in these documents. It can generate both data-centric and document-centric XML data. The data generator has a user-friendly GUI for inputting the various input parameters and visualizing the generated synthetic data set.
2 Related Work

Recently, there have been several works in the area of generating synthetic XML data. In [5], synthetic XML data is used to evaluate different strategies for storing XML in relational database systems. The XML data used is extremely simple in its characteristics and consists of elements at one level with no nesting. The elements are randomly connected in a graph structure using IDREF attributes. This graph-structured view of XML data is useful in some contexts, but XML data is by nature tree-structured, and it may often be useful to have a tree-structured view of this data. Furthermore, the data generation process of [5] has very few opportunities for varying the structure and distribution of the generated data.

In [4] and [7], two benchmarks are proposed for evaluating the performance of XML data management systems. Both benchmarks use synthetic XML data that
models data from high-level applications: a database of structured text documents and a directory of these documents in [4], and data about on-line auctions in [7]. The structure of the data in both cases is fixed and simple, and there is very little opportunity for varying it. This kind of data may be adequate for a benchmark that serves as a standard yardstick for comparing the performance of XML data management systems. However, if we wish to evaluate a particular XML data management system and gain insights into its performance, then using XML data with widely varying structure over which we have more control can be more useful.

IBM provides a data generator that generates XML data conforming to an input DTD [1]. Like the previous approaches, the IBM data generator is limited in the control it provides over the data generation process. For example, we cannot control the number of words nested in an element. This is important because document-centric XML, as opposed to data-centric XML, is ordered and may contain several sentences in an element, and one may wish to generate such documents.

A general-purpose synthetic XML document generator is presented in [2]. The data generator can generate XML documents of arbitrary complexity. It generates XML elements and values within these elements, but it does not currently handle the generation of attributes. The data generator starts by generating a tree called the path tree that represents the structure of the XML data. The data generator assigns tag names to the nodes of this tree, and specifies the frequency distribution of the XML elements represented by these nodes. It uses the information in this tree to generate one or more XML documents, and it also generates values for the elements in these documents. It does not use a DTD or XML Schema to generate the documents and hence may not always produce meaningful XML documents.

ToXgene [3] is a template-based generator for large, consistent collections of synthetic XML documents, developed as part of the ToX (the Toronto XML Server) project. It was designed to be declarative, and produces fairly complex XML content. The ToXgene Template Specification Language (TSL) is a subset of the XML Schema notation augmented with annotations for specifying certain properties of the intended data, such as value distributions, the vocabulary for CDATA content, etc. It also allows different elements (or attributes) to share CDATA literals, thus allowing the generation of references among elements in the same (or in different) documents. This enables the generation of collections of correlated documents (i.e., documents that can be joined by value). ToXgene also allows the specification of most common integrity constraints (e.g., uniqueness) over the data in such lists; thus, one can generate consistent ID, IDREF and IDREFS attributes.

In contrast to the proposals for generating synthetic XML data in [4, 7], our data generator can generate much more complex data, and it provides much more control over the characteristics of the generated data. Unlike [2], we use a DTD to generate synthetic data and hence we can generate meaningful elements and attributes. VACXENE also allows us to choose element and attribute values from a specific dictionary or domain knowledge. This enables us to generate XML data that contains keywords from a particular domain. Compared to [3], we provide a mechanism to create data-centric as well as document-centric XML.
We also provide a user-friendly GUI so that novice users can easily specify input parameters and generate synthetic documents and visualize them effectively. Nevertheless, it may be possible to use ideas from these proposals to extend our data generator. For example, IDREF attributes may be used to connect the elements of the generated documents as in [5].
Also, different probability distributions and element sharing can be supported as in [3]. Next, we describe the different steps of generating synthetic XML data, and we point out the input parameters that control each step.
3 Specifying Input Parameters

In this section, we discuss the various input parameters specified in VACXENE for generating synthetic XML documents. The XML Generator has many input features that make it powerful yet easy to use. These features provide the user with a wide range of functionality. We now elaborate on these functionalities.

3.1 Specifying the DTD

The first step for creating synthetic valid XML documents in VACXENE is DTD specification. The generated synthetic document set will satisfy the input DTD. VACXENE can parse any DTD file provided by the user. The DTD will be validated to ensure that it conforms to the recommendations provided by the World Wide Web Consortium (W3C). If the DTD is not well-formed, an error message will inform the user of the error detected and the parsing process will be terminated. If the DTD is correctly validated, its content will be displayed. Upon completion of the parsing process, the XML Generator will generate the tree structure of the DTD. As shown in Figure 1, the tree shows the hierarchical view of the contents of the DTD. It allows the user to have a clearer picture of the DTD structure by displaying the relationships between different nodes, the cardinality of the various nodes and their corresponding attributes. The user can decide on the amount of detail to view by expanding or collapsing the tree. From the DTD tree, the user can also specify the node parameters for each individual node.
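To make this first step concrete, the following minimal Python sketch (our own illustration, not VACXENE's Java implementation; the regular-expression parsing and the sample element names are assumptions) extracts element declarations from a DTD string and prints the implied hierarchy as an indented tree:

```python
import re

# A toy DTD with hypothetical element names; a real tool would read this from a file.
DTD = """
<!ELEMENT catalog (item*)>
<!ELEMENT item (title, note?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT note (#PCDATA)>
"""

ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\w+)\s+(.+?)>", re.S)

def parse_elements(dtd_text):
    """Map each element name to the raw text of its content model."""
    return {name: model.strip() for name, model in ELEMENT_RE.findall(dtd_text)}

def print_tree(defs, name, depth=0, seen=None):
    """Depth-first print of the hierarchy implied by the content models."""
    seen = set() if seen is None else seen
    print("  " * depth + name + "  " + defs.get(name, "EMPTY"))
    if name in seen:          # stop on recursive element definitions
        return
    seen.add(name)
    for child in re.findall(r"\w+", defs.get(name, "")):
        if child in defs and child != name:
            print_tree(defs, child, depth + 1, seen)

print_tree(parse_elements(DTD), "catalog")
```

A production tool would of course use a full DTD parser, collect attribute declarations as well, and report validation errors rather than silently skipping malformed declarations.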
3.2 Specifying the Level of Synthesis

Input parameters control the level of complexity of the generated XML. Whenever a DTD is parsed, a set of default parameter values will be generated. The user can control these default parameters through the use of the "level of synthesis" option. These options are catered to suit the different needs of various users. The three available options are "none", "moderate" and "high", with the "high" option producing the most complex documents. The semantics of these three types of complexity is given in Table 1. The default option is "moderate".

Table 1. Level of Synthesis

  Complexity   Document Depth   Document Width   Node Cardinality
  Low          Fixed            Fixed            Maximum 1
  Moderate     Varies           Fixed            Maximum 1
  High         Varies           Varies           No limit
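The mapping from a chosen level of synthesis to concrete default parameters is not spelled out here; the sketch below shows one plausible way such presets could be encoded and sampled (all names and numeric ranges are assumptions for illustration):

```python
import random

# Hypothetical presets: (min, max) ranges for depth and width, plus a cardinality cap.
PRESETS = {
    "low":      {"depth": (3, 3), "width": (4, 4), "max_cardinality": 1},
    "moderate": {"depth": (2, 5), "width": (4, 4), "max_cardinality": 1},
    "high":     {"depth": (2, 8), "width": (2, 6), "max_cardinality": None},  # None = no limit
}

def draw(bounds):
    """Pick a random value inside the inclusive (min, max) bounds of a preset."""
    lo, hi = bounds
    return random.randint(lo, hi)

level = PRESETS["high"]
print("depth:", draw(level["depth"]), "width:", draw(level["width"]),
      "cardinality cap:", level["max_cardinality"])
```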
Fig. 1. DTD Tree Structure (screenshot annotating the root node, node names, node cardinalities, node attributes and attribute properties)
3.3 Parameters to Control Characteristics of XML Documents

A range of input parameters is available to allow users to vary the characteristics of the generated XML documents. These parameters play a very important role as they affect the level of complexity and increase the randomness of the generated documents. The parameters are divided into two groups, Main Parameters and Node Parameters. Main Parameters consist of parameters that affect the structure of the XML document as a whole. Node Parameters consist of parameters that affect only a specified node and all its children nodes. Figure 3 shows a screenshot of the GUI for input parameter specification. The following is a list of the Main Parameters currently supported by VACXENE:

o Minimum and Maximum Depth: Depth measures the number of levels of the XML document, starting from the root element. The root level is considered as level 0. The depth value of each document is a random number generated between the minimum and maximum depth values specified by the user.
o Minimum and Maximum Width: Width measures the maximum number of children permitted at any level of the XML document. This width value is a random number generated between the minimum and maximum width values specified by the user. A value of zero for width is not allowed.
o Scaling: This parameter specifies the number of children that the root element can have. In other words, scaling decides the number of level 1 elements. Intuitively, the scaling factor denotes the length of the XML document.
o XML Files: This parameter indicates the number of synthetic XML documents to be created.
o Enable Dictionary: Random dictionary words are generated to act as values for the elements in the XML documents. This parameter defines whether the generated document should contain any random dictionary words. The length of the generated values and their frequency can be controlled. This feature can be disabled.
Fig. 2. Node Parameters of Recursive Node (screenshot highlighting a recursive node and its recursion rate parameter)
o Size of Long Value: Element values that exceed a certain length are considered Long Values. This parameter specifies the size beyond which element values are considered Long Values. For example, if the value specified is 10, any element value of more than 10 words will be considered a Long Value.
o Percentage of Elements with Long Value: This parameter defines the percentage of elements in the document that will contain Long Values.
o Document Name: This parameter allows the user to define the file name of the generated XML documents.
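The interplay between the dictionary, the Size of Long Value and the Percentage of Elements with Long Value parameters can be pictured with the following sketch (the word list and thresholds are made up; VACXENE reads words from a dictionary file instead):

```python
import random

WORDS = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]  # stand-in dictionary

def element_value(long_value_size=10, pct_long=20):
    """Return a random element value; roughly pct_long percent exceed long_value_size words."""
    if random.uniform(0, 100) < pct_long:
        n_words = long_value_size + random.randint(1, 5)      # produce a Long Value
    else:
        n_words = random.randint(1, long_value_size)
    return " ".join(random.choice(WORDS) for _ in range(n_words))

print(element_value())
```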
Fig. 3. Example of Node Depth (node 5 roots a subtree; node depths of 1 and 2 are marked)

We can also specify constraints at the node level of XML documents by inputting Node Parameters. By clicking on a particular node in the DTD tree structure in Figure 2
and filling in the Node Parameters, we may control a subtree of the XML data. The following is a list of the Node Parameters supported in VACXENE:

o Minimum and Maximum Cardinality: Cardinality measures the number of times an element will appear. This cardinality value is a random number generated between the minimum and maximum cardinality parameter values. The cardinality of any element is restricted by the cardinality symbol specified in the DTD.
o Minimum and Maximum Fan Out: Fan out determines the maximum number of children a particular element can have. It is similar to scaling except that it can be applied to any element in the XML. The fan out value is a random number generated between the minimum and maximum fan out parameter values.
o Minimum and Maximum Node Depth: Node depth determines the maximum number of levels in each subtree and affects the complexity of the synthetic XML document. This node depth value is a random number generated between the minimum and maximum node depth parameter values. For example, in Figure 3, node 5 has a node depth of 2.
Fig. 4. Example of Recursive Node Recursion (node 5 of the DTD tree recurses, producing nested occurrences of node 5)
3.4 Specifying Recursive Nodes

VACXENE supports the recursion of nodes. However, only recursive nodes with a cardinality symbol of "?" or "*" are accepted. This avoids the possibility of infinite recursion. Recursive nodes have different node parameters compared to ordinary nodes. They do not have fan out or node depth parameters. Instead, they have a recursion rate parameter. The recursion rate determines the number of times a recursive node can iterate. Recursion of nodes can be demonstrated by the example in Figure 4. Node 5 of the DTD tree is a recursive node and we can control the number of times it recurses using the recursion rate parameter. Figure 2 is a screenshot of a recursive node and its node parameters. As seen in the figure, node Staff is a recursive node. Upon the selection of node Staff, the recursion rate parameter will
appear as one of the node parameters and allow the user to specify the number of times this recursive node can iterate.

3.5 Conflict Detection and Rectification

In certain situations the constraints imposed on a node (using Node Parameters) may conflict with the parameters set on the whole XML document (using the Main Parameters). For example, if the maximum depth of the node is set to 5 and the maximum height of the tree is set to 3, then these two input parameters obviously conflict with one another. Such a conflict may also occur due to the constraints imposed by the input DTD. Hence, it is necessary to detect such conflicts and ask the user to rectify the problem. VACXENE can automatically detect such conflicts and inform users of conflicts that may occur due to their choice of input parameters. Whenever a conflict has occurred, an interactive status message will appear to allow the user to respond to the conflict.

In the previous section we discussed how to specify various input parameters in VACXENE to vary the characteristics of the synthetic XML documents. Given a well-formed DTD and a set of user-specified input parameters, the XML Generator will generate synthetic XML documents. The generated documents will comply strictly with the DTD and the various input parameters. The user can also specify the name of the XML documents and the number of documents to be generated as part of the input.
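The kind of check described above can be sketched as follows (the parameter names and the two rules shown are illustrative assumptions, not VACXENE's actual rule set):

```python
def check_conflicts(main_params, node_params, node_level):
    """Return human-readable conflict messages; an empty list means no conflict."""
    conflicts = []
    # A node's subtree cannot be deeper than what the whole document allows.
    if node_level + node_params["max_node_depth"] > main_params["max_depth"]:
        conflicts.append("node depth %d at level %d exceeds document maximum depth %d"
                         % (node_params["max_node_depth"], node_level, main_params["max_depth"]))
    # A node cannot have more children than the document-wide width permits.
    if node_params["max_fan_out"] > main_params["max_width"]:
        conflicts.append("node fan out %d exceeds document maximum width %d"
                         % (node_params["max_fan_out"], main_params["max_width"]))
    return conflicts

print(check_conflicts({"max_depth": 3, "max_width": 4},
                      {"max_node_depth": 5, "max_fan_out": 2}, node_level=1))
```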
4 Visualizing the Synthetic Document Set

Upon the generation of the XML documents, the user can view the contents of any of these documents and their corresponding tree structure. As in the case of the DTD tree structure, the XML tree can also be expanded or collapsed. To enable the user to personalize the XML documents, the contents of the displayed documents can be edited. However, if the editing of the contents leads to a conflict with the DTD, an error will occur and the changes will not be saved.

VACXENE also provides the user with the option to validate the generated documents so as to ensure that the documents follow the structure specified by the DTD and satisfy all the input conditions. The screenshot in Figure 5 shows the Document Validation Panel. Document validation is an optional feature that allows the user to validate the generated documents against their DTD and the input parameters. The user can choose a range of documents to be validated or use the "Validate All" checkbox to indicate that all generated documents are to be validated. The status of the validation is then displayed on the status screen.
Fig. 5. Document Validation Panel (screenshot showing the range of XML documents to be validated, the checkbox for validating all documents, the validation status area, and the button to start validation)
5 Experimental Results

In this section, we discuss the results of preliminary experiments with VACXENE. We explore how three major factors (the complexity of the documents, the number of documents generated and the use of the dictionary) affect the performance of the system. For our experiments we use 8 different data sets for the same input DTD. We vary the following three input parameters: level of synthesis, number of documents to be generated, and the usage of the dictionary. We compute the run time of our tool. Table 2 summarizes the results.

Table 2. Results

  Number of Documents   Complexity   Use of Dictionary   Run-Time (sec)
  10                    Moderate     No                  0.841
  10                    High         No                  0.900
  10                    Moderate     Yes                 1.632
  10                    High         Yes                 2.374
  100                   Moderate     No                  3.836
  100                   High         No                  4.517
  100                   Moderate     Yes                 12.197
  100                   High         Yes                 17.335
Fig. 6. Result Graph (run-time in seconds versus the number of documents, plotted for the four combinations of complexity (Moderate, High) and dictionary use (No, Yes))
The results show that the run-time of VACXENE increases as the complexity of the documents increases. The use of the dictionary and an increase in the number of documents generated also increase the run-time. From the results, we can also conclude that the use of the dictionary plays a very big role in the run-time of the XML Generator. The use of the dictionary increases the run-time by at least a factor of two. The effect of the dictionary is even higher if the complexity of the documents is high. For moderate-complexity documents, the use of the dictionary increases the run-time by 94%; for high-complexity documents it leads to a 163% increase in run-time.

The results obtained from the tests were as expected. Whenever the dictionary is enabled, the XML Generator accesses the dictionary file and randomly picks a word. This process takes up a certain amount of execution time. For example, if a document consists of 20 elements and each element has values of length 15 words, a total of 20 x 15 = 300 random words will have to be generated. This leads to 300 accesses to the dictionary. If 10 documents are generated, a total of 300 x 10 = 3000 dictionary accesses will be made. The number of dictionary accesses increases further when the complexity of the documents increases, as higher-complexity documents usually have more elements.
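The access-count estimate above reduces to a one-line formula; the snippet below simply reproduces the numbers used in the example:

```python
def dictionary_accesses(elements_per_doc, words_per_value, num_docs):
    """Estimated number of dictionary lookups, as reasoned about above."""
    return elements_per_doc * words_per_value * num_docs

print(dictionary_accesses(20, 15, 1))    # 300 accesses for a single document
print(dictionary_accesses(20, 15, 10))   # 3000 accesses for ten documents
```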
6 Conclusions and Future Work

In this paper we introduced VACXENE, a data generator for generating synthetic complex-structured XML data, which can be of use to researchers in XML data management. VACXENE has been implemented using Java. The data generator has several input parameters that control the characteristics of the generated data. The parameters all have simple and intuitive meanings, so it is easy to understand the structure of the generated data. We presented an overview of our tool, which is based
on DTDs, and discussed how various input parameters can easily be specified in VACXENE to generate XML documents with a wide variety of characteristics. Finally, we reported on preliminary experiments we conducted with our tool.

Development of VACXENE is continuing. It can easily be extended and modified to allow for different methods of data generation not covered in this paper. Areas for possible extension include, among others, generating data that conforms to a given XML Schema, element sharing, and support for various probability distributions. We also intend to provide a mechanism to allow the generation of text according to different vocabularies, grammars, and character encoding schemes. This would be of great importance for generating test data for text-intensive applications [3].
References
1. IBM XML generator. http://www.alphaworks.ibm.com/tech/xmlgenerator.
2. A. Aboulnaga, J. F. Naughton, and C. Zhang. Generating synthetic complex-structured XML data. In Proceedings of the Fourth International Workshop on the Web and Databases (WebDB 2001), pages 79–84, Santa Barbara, CA, USA, May 24–25, 2001.
3. Denilson Barbosa, Alberto Mendelzon, John Keenleyside, and Kelly Lyons. ToXgene: a template-based data generator for XML. In Proceedings of the Fifth International Workshop on the Web and Databases (WebDB 2002), Madison, Wisconsin, June 6–7, 2002.
4. Timo Böhme and Erhard Rahm. XMach-1: A benchmark for XML data management. In Proc. German Database Conference (BTW 2001), Oldenburg, Germany, March 2001.
5. Daniela Florescu and Donald Kossmann. Storing and querying XML data using an RDBMS. IEEE Data Engineering Bulletin, 22(3):27–34, September 1999.
6. Feng Tian, David DeWitt, Jianjun Chen, and Chun Zhang. The Design and Performance Evaluation of Alternative XML Storage Strategies. Technical Report, University of Wisconsin, Madison, USA.
7. Albrecht Schmidt, Florian Waas, Martin Kersten, Daniela Florescu, Ioana Manolescu, Michael J. Carey, and Ralph Busse. The XML benchmark project. Technical Report INS-R0103, CWI, April 2001.
A New Inlining Algorithm for Mapping XML DTDs to Relational Schemas

Shiyong Lu, Yezhou Sun, Mustafa Atay, and Farshad Fotouhi

Department of Computer Science, Wayne State University, Detroit, MI 48202
{shiyong,sunny,matay,fotouhi}@cs.wayne.edu
Abstract. XML is rapidly emerging on the World Wide Web as a standard for representing and exchanging data. It is critical to have efficient mechanisms to store and query XML documents to exploit the full power of this new technology. While one approach is to develop native XML repositories that support XML data models and query languages directly, the other approach is to take advantage of the mature technologies that are provided by current relational or object-relational DBMSs. There is active research along both approaches and it is still not clear which one is better than the other. We continue our effort on the second approach. In particular, we have developed an efficient algorithm which takes an XML DTD as input and produces a relational schema as output for storing and querying XML documents conforming to the input DTD. Our algorithm features several significant improvements over the shared-inlining algorithm including overcoming its incompleteness, eliminating redundancies caused by shared elements, performing optimizations and enhancing efficiency.
1 Introduction
With the increasing amount of XML documents on the World Wide Web, it is critical to have efficient mechanisms to store and query XML documents to exploit the full power of this new technology. As a result, various XML query languages have been proposed, such as XML-QL [7], XQL [14], Lorel [12] and XML-GL [5], and more recently XQuery [6], and XML has become one of the most active research fields, attracting researchers from various communities.

Currently, two approaches are being investigated for storing and querying XML data. One approach is to develop native XML repositories that support XML data models and query languages directly. This includes Software AG's Tamino [2] and eXcelon's XIS [1], among others. The other approach is to take advantage of the mature technologies that are provided by current relational or object-relational DBMSs. The major challenges of this approach include: (1) the XML data model needs to be mapped into the target model such as the relational model; (2) queries posed in XML query languages need to be translated into ones
in the target query languages such as SQL or OQL; and (3) the query results from the target database engines need to be published back to XML format. Recently, Kurt and Atay performed an experimental study to compare the efficiency of these two approaches [13]. However, since both approaches are still under active research and development, it is too early to conclude which one is better than the other.

Related work. Several mechanisms have been proposed to store XML data in relational or object-relational databases [8] [10] [16] [11] and to publish relational or object-relational data as XML data [15] [4] [9]. Some of them use XML DTDs [16] and others consider situations in which DTDs are not available [8] [10] [11]. Two recent evaluations [19] [11] of different XML storage strategies indicate that the shared-inlining algorithm [16] outperforms other strategies in data representation and performance across different datasets and different queries when DTDs are available. In this paper, we propose a new inlining algorithm that maps XML DTDs to relational schemas. Our algorithm is inspired by the shared-inlining algorithm [16] but features several improvements over it. We will discuss these improvements in Section 3.3.

Organization. The rest of the paper is organized as follows. Section 2 gives a brief overview of XML Document Type Definitions (DTDs). Section 3 describes our new inlining algorithm that maps an input DTD to a relational schema in terms of three steps: (1) simplifying input DTDs (Section 3.1); (2) creating and inlining DTD graphs (Section 3.2); (3) generating relational schemas (Section 3.3). The section ends with a discussion of the improvements we have made over the shared-inlining algorithm, which is considered the best strategy when DTDs are available [19] [11]. A full evaluation and comparison is underway and will be presented in the near future. Section 3.4 illustrates the three steps of our algorithm using a real input DTD, and demonstrates how XML documents conforming to the DTD can be stored. Finally, Section 4 concludes the paper and provides some directions for future work.
2 XML DTDs
XML Document Type Definitions (DTDs) [3] describe the structure of XML documents and are considered the schemas for XML documents. In this paper, we model both XML elements and XML attributes as XML elements, since XML attributes can be considered XML elements without further nesting structure. A DTD D is modeled as a set of XML element definitions {d1, d2, · · ·, dk}. Each XML element definition di (i = 1, · · ·, k) is of the form ni = ei, where ni is the name of an XML element, and ei is a DTD expression. Each DTD expression is composed from XML element names (called primitive DTD expressions) and other DTD subexpressions using the following operators:

– Tuple operator. (e1, e2, · · ·, en) denotes a tuple of DTD subexpressions. In particular, we consider (e) to be a singleton tuple. The tuple operator is denoted by ",".
– Star operator. e* represents zero or more occurrences of subexpression e.
– Plus operator. e+ represents one or more occurrences of subexpression e.
– Optional operator. e? represents an optional occurrence (0 or 1) of subexpression e.
– Or operator. (e1 | e2 | · · · | en) represents one occurrence of one of the subexpressions e1, e2, · · ·, en.

We ignore the encoding mechanisms that are used in the data types PCDATA and CDATA and model both of them as data type string. The DOCTYPE declaration states which XML element will be used as the schema for XML documents. This XML element is called the root element. However, we assume that arbitrary XML elements defined in the DTD might be selected, inserted, deleted and updated individually. We define a DTD expression formally as follows.

Definition 1. A DTD expression e is defined recursively in the following BNF notation, where n ranges over XML element names and e1, · · ·, en range over DTD expressions:

e ::= string | n | e+ | e* | e? | (e1, · · ·, en) | (e1 | · · · | en)

where the symbol "::=" should be read as "is defined as" and "|" as "or".
3 Mapping XML DTDs to Relational Schemas
In this section, we propose a new inlining algorithm that maps an input DTD to a relational schema. The algorithm contains the following three steps:

1. Simplifying DTDs. Since a DTD expression might be very complex due to its hierarchical nesting capability, this step greatly simplifies the mapping procedure.
2. Creating and inlining DTD graphs. We create the corresponding DTD graph based on the simplified DTD, and then inline as many descendant elements as possible into an XML element. In contrast to the shared-inlining algorithm, our inlining rules eliminate the redundancy caused by shared elements in the generated relational schema and can deal with arbitrary input DTDs, including those that contain arbitrary cycles.
3. Generating relational schemas. After a DTD graph is inlined, we generate a relational schema based on it.

We describe these three steps in Sections 3.1, 3.2 and 3.3, respectively, and conclude the section with a discussion of the improvements we have made over the shared-inlining algorithm. Finally, Section 3.4 illustrates these steps using a real XML DTD and demonstrates how XML documents conforming to this DTD can be stored based on the generated schema.
1. e+ → e*.
2. e? → e.
3. (e1 | · · · | en) → (e1, · · ·, en).
4. a) (e1, · · ·, en)* → (e1*, · · ·, en*).  b) e** → e*.
5. a) · · ·, e, · · ·, e, · · · → · · ·, e*, · · ·, · · ·.  b) · · ·, e, · · ·, e*, · · · → · · ·, e*, · · ·, · · ·.  c) · · ·, e*, · · ·, e, · · · → · · ·, e*, · · ·, · · ·.  d) · · ·, e*, · · ·, e*, · · · → · · ·, e*, · · ·, · · ·.

Fig. 1. DTD simplification rules
3.1 Simplifying DTDs
Most of the complexity of a DTD comes from the complexity of the DTD expressions that appear in element definitions. However, as far as an XML query language is concerned, what matters is the sibling and parent-child relationships between elements. We apply the transformation rules listed in Figure 1 in the given order:

1. Apply rule 1 recursively; the resulting DTD will not contain +.
2. Apply rule 2 recursively; the resulting DTD will not contain + and ?.
3. Apply rule 3 recursively; the resulting DTD will not contain +, ? and |.
4. Apply rules 4(a) and 4(b) recursively; the resulting DTD will take the form (e1, e2, · · ·, en), where each ei = e or e* (i = 1, · · ·, n) and e is an element name. Therefore, a DTD is in a flattened form after this step.
5. Apply rules 5(a), 5(b), 5(c) and 5(d) recursively; the resulting DTD will take the form (e1, e2, · · ·, en) such that each ei contains a distinct element name.
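To make the effect of these rules concrete, the following minimal Python sketch applies them to a small expression under an assumed AST encoding (an element name is a string; composite expressions are tuples such as ("seq", [...]), ("or", [...]), ("star", e), ("plus", e), ("opt", e)). This is our illustration, not the authors' implementation:

```python
def simplify(e):
    """Reduce a DTD expression to a list of (element name, starred?) pairs."""
    if isinstance(e, str):
        return [(e, False)]
    kind = e[0]
    if kind == "plus":                            # rule 1: e+ -> e*
        return [(n, True) for n, _ in simplify(e[1])]
    if kind == "opt":                             # rule 2: e? -> e
        return simplify(e[1])
    if kind == "star":                            # rules 4(a)/4(b): push * inward
        return [(n, True) for n, _ in simplify(e[1])]
    if kind in ("seq", "or"):                     # rule 3 turns | into ,
        merged = {}
        for sub in e[1]:
            for name, starred in simplify(sub):   # rules 5(a)-(d): merge duplicates into e*
                merged[name] = starred or (name in merged)
        return list(merged.items())
    raise ValueError("unknown expression kind: %r" % kind)

# (a, (b | c)+, b?, d*)  simplifies to  (a, b*, c*, d*)
expr = ("seq", ["a", ("plus", ("or", ["b", "c"])), ("opt", "b"), ("star", "d")])
print(simplify(expr))   # [('a', False), ('b', True), ('c', True), ('d', True)]
```

The result corresponds to the canonical form (a, b*, c*, d*) guaranteed by Theorem 1 below.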
From an XML query language's point of view, two pieces of information are essential: (1) the parent-child relationships between XML elements; and (2) the relative order relationships between siblings. The above transformation maintains the former but not the latter. Fortunately, we can introduce an ordinal attribute for each generated relation to encode the order of XML elements when an XML element (and its contained subelements) is inserted into the database, so that any XML query conforming to the input DTD can be evaluated over the generated relational schema.

Example 1. Using the above simplification procedure, one can transform a complex XML element definition into a simplified version in the canonical form described above.

The following theorem indicates that our simplification procedure is complete and that, in addition, the resulting DTD expression is a tuple of element names or their stars.
Theorem 1. Our DTD simplification procedure is complete in the sense that it accepts every input DTD and each resulting DTD expression is in the form (e1, e2, · · ·, en), where ei = e or e* (i = 1, · · ·, n), e is an element name and each ei contains a distinct XML element name.

Proof. We omit the proof since it is obvious.

Discussion. Compared to the transformation rules defined in the shared-inlining algorithm [16], we have made several improvements:

– Completeness. Our rules consider all possible combinations of operators and XML elements, whereas the shared-inlining algorithm only lists some important combinations. For example, there is no rule that corresponds to (e1 | · · · | en)? in the shared-inlining algorithm.
– Efficiency. We enforce the application of the rules in the order given. Earlier rules totally transform away some operators from the input DTD, and in each step the number of rules to be matched is greatly reduced. This improves the efficiency of the simplification procedure significantly.
– Further simplification. We observe that the role of "?" corresponds to the notion of a nullable column in a relational table. We transform away "?", and this greatly simplifies the resulting DTD graph (to be described in the next subsection) since it does not contain "?" any more.

3.2 Creating and Inlining DTD Graphs
In this step, we create the corresponding DTD graph based on the simplified DTD, and then inline as many descendant elements as possible into an element. The rationale is that these inlined elements will eventually produce a relation. Therefore, we only inline a child c into a parent p when p can contain at most one occurrence of c, in order to avoid introducing redundancy into the generated relation. Theorem 1 indicates that after the simplification procedure, any input DTD is in a canonical form, i.e., each DTD expression is a tuple of distinct element names or their stars. As a result, in the corresponding DTD graph, each node represents an XML element, and each edge represents an operator of ',' or '*'. Our inlining procedure considers the following three cases.

1. Case 1: Element a is connected to b by a ,-edge and b has no other incoming edges. In other words, b is a non-shared node. In this case, a can contain at most one occurrence of b, and we will combine node b into a while maintaining the parent-child relationships between b and its children.
2. Case 2: Element a is connected to b by a ,-edge but b has other incoming edges. In other words, b is a shared node. We do not combine b into a in this case since b has multiple parents.
3. Case 3: Element a is connected to b by a *-edge. In this case, each a can contain multiple occurrences of element b, and we do not combine b into a.

Only case 1 allows us to inline an element into its parent. We define the notion of an inlinable node as follows.
Fig. 2. Inlining DTD graphs (the DTD graphs A and C are inlined into the graphs B and D, respectively)
Definition 2. Given a DTD graph, a node is inlinable if and only if it has exactly one incoming edge and that edge is a ,-edge.

Definition 3. Given a DTD graph and a node e in the graph, node e and all other inlinable nodes that are reachable from e by ,-edges constitute a tree (since we assume a DTD graph is consistent, there is no ,-edge cycle in the graph). This tree is called the inlinable tree for node e (it is rooted at e).

Example 2. In Figure 2.A, nodes b and d are inlinable but nodes a and c are not inlinable. The inlinable tree for a contains nodes a and b, whereas the inlinable tree for c contains nodes c and d. In Figure 2.C, nodes b, c, d and f are inlinable, but nodes a, e and g are not inlinable. The inlinable tree for a contains nodes a, b, c and d, and the inlinable tree for node e contains nodes e and f.

The notion of an inlinable tree formalizes the intuition of "inlining as many descendant elements as possible into an element". We illustrate our inlining algorithm in pseudocode in Figure 3. Essentially, it uses a depth-first-search strategy to identify the inlinable tree for each node and then inline that tree into its root. A field inlinedSet of set type is introduced for each node e to represent the set of XML element nodes that have been inlined into this node e (initially e.inlinedSet = {e}). For example, in Figure 2.C, after the inlining procedure, a.inlinedSet = {a, b, c, d}. The algorithm is efficient, as indicated in the following theorem.

Theorem 2 (Complexity). Our inlining algorithm can be performed in O(n) time, where n is the number of elements in the input DTD.

Proof. This is obvious since each node of the DTD graph is visited at most once.
Algorithm Inline(DTDGraph G)
Begin
  For each node e in G do
    If not visited(e) then
      InlineNode(e)
    End If
  End For
End

Algorithm InlineNode(Node e)
Begin
  Mark e as "visited"
  For each child c of e do
    If not visited(c) then
      InlineNode(c)
    End If
  End For
  For each child c of e do
    If inlinable(c) then
      e.inlinedSet ∪= c.inlinedSet;
      assign all children of c as the children of e and then delete c from G
    End If
  End For
End

Fig. 3. The inlining procedure
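The following Python rendering of the procedure in Figure 3 may help in reading the pseudocode; the graph encoding (',' edges labelled "seq", '*' edges labelled "star") and the sample graph, which mimics Figure 2.A, are assumptions made for this sketch:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.children = []            # list of (edge_label, Node)
        self.parents = 0              # number of incoming edges
        self.inlined_set = {name}
        self.visited = False

def inlinable(node, label):
    return label == "seq" and node.parents == 1

def inline_node(e):
    e.visited = True
    for _, c in e.children:           # visit descendants first (depth-first search)
        if not c.visited:
            inline_node(c)
    kept = []
    for label, c in e.children:
        if inlinable(c, label):
            e.inlined_set |= c.inlined_set    # absorb c into e
            kept.extend(c.children)           # c's children become e's children
        else:
            kept.append((label, c))
    e.children = kept

def inline(nodes):
    for e in nodes:
        if not e.visited:
            inline_node(e)

# Assumed sample graph: a -,-> b, a -*-> c, c -,-> d, d -*-> a
a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
a.children = [("seq", b), ("star", c)]
c.children = [("seq", d)]
d.children = [("star", a)]
for n in (a, b, c, d):
    for _, child in n.children:
        child.parents += 1
inline([a, b, c, d])
print(a.inlined_set, c.inlined_set)   # a absorbs b, and c absorbs d
```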
Example 3. Using our inlining procedure given in Figure 3, the DTD graph shown in Figure 2.A will be inlined into the one shown in Figure 2.B, and the DTD graph shown in Figure 2.C will be inlined into the one shown in Figure 2.D.

We observe that after our inlining algorithm is applied, a DTD graph has the following property: nodes are connected by ,-edges or *-edges, and a ,-edge must point to a shared node. This observation is the basis of the final step of the algorithm: generating relational schemas.

3.3 Generating Relational Schemas
After a simplified DTD graph is inlined, the last step is to generate a relational schema based on this inlined DTD graph. The generated schema supports the select, insert, delete and update [18] of an arbitrary XML element declared in the input DTD. The following four steps will be performed on the inlined DTD graph to generate a set of relations. 1. For each node e, a relation e is generated with the following relational attributes.
   a) ID is the primary key, and for each XML attribute A of e, a corresponding relational attribute A is generated with the same name.
   b) If |e.inlinedSet| ≥ 2, we introduce an attribute nodetype to indicate the type of the XML element stored in a tuple.
   c) The names of all the terminal XML elements in e.inlinedSet. Since a non-terminal XML element is stored with values for ID and nodetype and the storage of the XML subelements it contains, no additional attribute is needed for it (this will become clearer later).
   d) If there is a ,-edge from e to node c, then introduce c.ID as a foreign key of e referencing relation c.
2. If there are at least two relations t1(ID) and t2(ID) generated by step 1, then we combine all the relations of the form t(ID) into one single relation table1(ID, nodetype), where nodetype indicates which XML element is stored in a tuple.
3. If there are at least two relations t1(ID, t1) and t2(ID, t2) generated by step 1, then we combine all the relations of the form t(ID, t) into one single relation table2(ID, nodetype, pcdata), where nodetype indicates which XML element is stored in a tuple.
4. If there is at least one *-edge in the inlined DTD graph, then we introduce a relation edge(parentID, childID, parentType, childType) to store all the parent-child relationships corresponding to *-edges. The domains of parentType and childType are the set of XML element names defined in the input DTD.

Essentially, step 1 converts each node e in the inlined DTD graph into a separate relation e. If some other XML element nodes have been inlined into it (i.e., |e.inlinedSet| ≥ 2), relation e will be used to store all these XML elements, and the attribute nodetype will be introduced to indicate which XML element is the root for each tuple. Since step 1 might produce a set of relations of the forms t(ID) and t(ID, t), steps 2 and 3 optimize them by performing a horizontal combination of them into table1(ID, nodetype) and table2(ID, nodetype, pcdata). These optimizations reduce the number of target relations and will facilitate the mapping from XML operations to relational SQL operations. Finally, one single relation edge(parentID, childID, parentType, childType) stores all the many-to-many relationships between any two XML elements.

Although our inlining algorithm is inspired by the shared-inlining algorithm, we have made several significant improvements over it:

– Completeness. Our algorithm is complete in the sense that it can deal with any input DTDs, including arbitrary cyclic DTDs. The shared-inlining algorithm defines a rule to deal with two mutually recursive elements, and it is not clear how a DTD with a cycle involving more than two elements is handled (see Figure 4.A for such an example). In addition, the shared-inlining algorithm checks the existence of recursion explicitly; we do not need to do this checking, and cycles are dealt with naturally.
– Redundancy elimination for shared nodes. A node is shared if its in-degree is more than one. Our algorithm deals with shared nodes differently from the
Fig. 4. Four inlined DTD graphs (A: a DTD graph with a cycle involving more than two elements; B: a DTD graph with the shared node author; C: a DTD graph with elements such as literature, book, part, chapter and section that have no attributes; D: a DTD graph with terminal nodes name and telephone shared by several elements)
shared-inlining algorithm. For example, for the shared node author in Figure 4.B, the shared-inlining algorithm will generate a separate relation author(authorID, author.parentID, author.parentCODE, author.name.isroot, author.name, author.institute.isroot, author.institute). This schema implies a great deal of redundancy if an author writes hundreds of conference or journal papers. In contrast, we create a relation author(ID, nodetype, name, institute) for author, and translate its parent *-edges (and all other *-edges) into another separate relation edge(parentID, childID, parentType, childType). Our strategy eliminates the above redundancy and bears the same spirit as the rule of mapping many-to-many relationships into separate relations when translating Entity-Relationship (ER) diagrams into relational schemas.
– Optimizations. Two situations are very common in XML documents: (1) there are XML elements which do not have any attributes and whose single purpose is to provide a tag name (e.g., Figure 4.C) for supporting nested structure; and (2) there are terminal nodes that are shared by several XML elements (such as name and telephone in Figure 4.D). If we created a separate relation for each such kind of element, then we would produce a set of relations of the form t(ID) (case 1) or t(ID, t) (case 2). Hence, instead, we create two relations table1(ID, nodetype) and table2(ID, nodetype, pcdata) which conceptually combine all these relations. These optimizations greatly reduce the number of relations in the generated schema and facilitate the translation of XML queries into relational queries.
Fig. 5. A publication DTD
– Efficiency. The shared-inlining algorithm introduces an attribute parentID for each node under the * operator, while the * operator itself is never translated into a separate relation. This facilitates the traversal of XML documents upwards (from children to parents) but not downwards (from parents to children). For example, in Figure 4.D, the shared-inlining algorithm will generate relations dept, faculty, staff, etc. Given a faculty, it is very easy to locate which department he is from based on an index on facultyID and faculty.parentID of relation faculty. However, it would be difficult to navigate downwards for path expressions such as dept//name (get all the names reachable from element dept), since one needs to consider the fact that dept actually has three kinds of children (faculty, staff, and student), and all three ways of reaching a name have to be combined. In contrast, we translate all *-edges into one single relation edge(parentID, childID, parentType, childType), and create two indices on parentID and childID, respectively. In this way, both upward navigation and downward navigation are supported efficiently.
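The downward-navigation argument can be illustrated with a small in-memory stand-in for the edge relation; the tuples below are invented sample data and the traversal is a sketch, not the query-translation machinery of the paper:

```python
EDGES = [  # (parentID, childID, parentType, childType) -- made-up sample data
    (1, 2, "dept", "faculty"), (1, 3, "dept", "staff"),
    (2, 4, "faculty", "name"), (3, 5, "staff", "name"),
]

def descendants(root_type, target_type, edges):
    """Return IDs of all target_type nodes reachable from any root_type node."""
    children = {}
    for pid, cid, ptype, ctype in edges:
        children.setdefault((pid, ptype), []).append((cid, ctype))
    frontier = [(pid, ptype) for pid, _, ptype, _ in edges if ptype == root_type]
    found, seen = set(), set()
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        for cid, ctype in children.get(node, []):
            if ctype == target_type:
                found.add(cid)
            frontier.append((cid, ctype))
    return found

print(descendants("dept", "name", EDGES))   # the two name nodes, IDs 4 and 5
```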
3.4 A Complete Example
In this section, we illustrate the different steps of our algorithm with a real DTD example, and demonstrate how XML documents conforming to this DTD can be stored based on the generated schema.
An XML DTD for publications is shown in Figure 5. After the simplification step (using the rules defined in Figure 1), the input DTD is simplified into one with the following new XML element definitions. The definitions for the other XML elements remain the same.

– <!ELEMENT journal (name, editors, paper*)>
– <!ELEMENT conference (name, paper*)>
– <!ELEMENT paper (ptitle, authors, volume, number)>
– <!ELEMENT editors (person*)>
– <!ELEMENT authors (person*)>
– <!ELEMENT references (paper*)>
Due to space limits, we omit the DTD graph for the simplified DTD and the inlined DTD graph and leave them as an exercise for the reader. Finally, the following eight relations will be generated.

– publication(ID) stores XML element publication.
– conference(ID, name.ID) stores XML element conference.
– journal(ID, nodetype, name.ID) stores XML elements journal and editors.
– name(ID, PCDATA) stores XML element name.
– paper(ID, nodetype, ptitle, volume, number, year) stores XML elements ptitle, authors, volume, number and year.
– person(ID, nodetype, pname, institute) stores XML elements person, pname and institute.
– techreport(ID, nodetype, title) stores XML elements techreport, title and references.
– edge(parentID, childID, parentType, childType) stores all the parent-child relationships between two XML elements.
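As a hedged illustration of how the generated schema could be materialized, the sketch below emits SQL DDL for the eight relations; the column types, the primary keys and the renaming of name.ID to name_ID are our assumptions, since the paper does not prescribe them:

```python
SCHEMA = {
    "publication": ["ID"],
    "conference":  ["ID", "name_ID"],
    "journal":     ["ID", "nodetype", "name_ID"],
    "name":        ["ID", "PCDATA"],
    "paper":       ["ID", "nodetype", "ptitle", "volume", "number", "year"],
    "person":      ["ID", "nodetype", "pname", "institute"],
    "techreport":  ["ID", "nodetype", "title"],
    "edge":        ["parentID", "childID", "parentType", "childType"],
}

def create_table(name, columns):
    """Build a CREATE TABLE statement with assumed column types."""
    cols = []
    for c in columns:
        sql_type = "INTEGER" if c.endswith("ID") else "VARCHAR(255)"
        cols.append("  %s %s" % (c, sql_type))
    if "ID" in columns:
        cols.append("  PRIMARY KEY (ID)")
    return "CREATE TABLE %s (\n%s\n);" % (name, ",\n".join(cols))

for table, columns in SCHEMA.items():
    print(create_table(table, columns))
```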
4 Conclusions and Future Work
We have developed a new inlining algorithm that maps a given input DTD to a relational schema. Our algorithm is inspired by the shared-inlining algorithm but features several improvements over it, including overcoming its incompleteness, eliminating redundancies caused by shared elements, performing optimizations and enhancing efficiency. Future work includes a full evaluation of the performance of our approach versus other approaches and adapting our algorithm to one that maps XML Schemas [17] (an extension of DTDs) to relational schemas. Based on this schema mapping scheme, the mappings from XML data to relational data, and from XML queries to relational queries, need to be investigated.
References
1. eXtensible Information Server (XIS). eXcelon Corporation. http://www.exln.com.
2. Tamino XML Server. Software AG. http://www.softwareag.com/tamino.
3. T. Bray, J. Paoli, C. Sperberg-McQueen, and E. Maler. Extensible Markup Language (XML) 1.0, October 2000. http://www.w3.org/TR/REC-xml.
4. M. J. Carey, D. Florescu, Z. Ives, Y. Lu, J. Shanmugasundaram, E. Shekita, and S. Subramanian. XPERANTO: Publishing object-relational data as XML. In WebDB (Informal Proceedings), pages 105–110, 2000.
5. S. Ceri, S. Comai, E. Damiani, P. Fraternali, S. Paraboschi, and L. Tanca. XML-GL: a graphical language for querying and restructuring WWW data. In International World Wide Web Conference (WWW), Toronto, Canada, May 1999.
6. D. Chamberlin, D. Florescu, J. Robie, J. Simeon, and M. Stefanescu. XQuery: A Query Language for XML, February 2001. http://www.w3.org/TR/xquery.
7. A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A Query Language for XML, August 1998. http://www.w3.org/TR/NOTE-xml-ql/.
8. A. Deutsch, M. Fernandez, and D. Suciu. Storing semistructured data with STORED. In Proc. of ACM SIGMOD International Conference on Management of Data, pages 431–442, Philadelphia, Pennsylvania, June 1999.
9. M. Fernandez, W. Tan, and D. Suciu. SilkRoute: Trading between relations and XML. In Proc. of the Ninth International World Wide Web Conference, 2000.
10. D. Florescu and D. Kossmann. Storing and querying XML data using an RDBMS. IEEE Data Engineering Bulletin, 22(3), 1999.
11. D. Florescu and D. Kossmann. A performance evaluation of alternative mapping schemes for storing XML data in a relational database. In Proc. of the VLDB, 1999.
12. R. Goldman, J. McHugh, and J. Widom. From Semistructured Data to XML: Migrating the Lore Data Model and Query Languages, 1999.
13. A. Kurt and M. Atay. An experimental study on query processing efficiency of native-XML and XML-enabled relational database systems. In Proc. of the 2nd International Workshop on Databases in Networked Information Systems (DNIS 2003), Lecture Notes in Computer Science, Volume 2544, pages 268–284, Aizu-Wakamatsu, Japan, December 2002.
14. J. Robie, J. Lapp, and D. Schach. XML Query Language (XQL), 1998. http://www.w3.org/TandS/QL/QL98/pp/xql.html.
15. J. Shanmugasundaram, E. Shekita, R. Barr, M. Carey, B. Lindsay, H. Pirahesh, and B. Reinwald. Efficiently publishing relational data as XML documents. VLDB Journal, 10(2–3):133–154, 2001.
16. J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. DeWitt, and J. Naughton. Relational databases for querying XML documents: Limitations and opportunities. The VLDB Journal, pages 302–314, 1999.
17. C. Sperberg-McQueen and H. Thompson. W3C XML Schema, April 2000. http://www.w3.org/XML/Schema.
18. I. Tatarinov, Z. Ives, A. Halevy, and D. Weld. Updating XML. In Proc. of ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, 2001.
19. F. Tian, D. DeWitt, J. Chen, and C. Zhang. The design and performance evaluation of alternative XML storage strategies. ACM SIGMOD Record, 31(1), March 2002.
From XML DTDs to Entity-Relationship Schemas

Giuseppe Psaila

Università degli Studi di Bergamo, Facoltà di Ingegneria
Viale Marconi 5 - I-24044 Dalmine (BG), Italy
[email protected]
Abstract. The need for managing large repositories of data coming from XML documents is increasing; in fact, XML is emerging as the standard format for documents exchanged over the Internet. At the University of Bergamo, we recently developed the ERX Data Management System to study issues concerning the management of data coming from XML documents; its data model, called ERX (Entity-Relationship for XML), is an extension of the classical ER model that allows one to deal with concepts coming from XML documents at the conceptual level, and to reason about the integration of data coming from different XML document classes. This paper focuses on the problem of automatically deriving Entity-Relationship Schemas (ERX Schemas) from DTDs (Document Type Definitions). In fact, the derivation of such schemas from DTDs can be hard work to do by hand, since real DTDs are very complex and large.
1 Introduction
The need for managing large repositories of data coming from XML documents is increasing; in fact, XML is emerging as the standard format for documents exchanged over the Internet. For this reason, several systems to store XML documents have been developed (see Lore [6], Tamino [1], etc.). In particular, database researchers have studied the problem of building such systems on top of relational databases (see [10]).

At the University of Bergamo, we recently developed the ERX Data Management System [9,7,8] to study issues concerning the management of data coming from XML documents, as well as the integration of different technologies, such as databases, Java and XML technology (XSLT), in the same framework. The data model provided by the system is an extension of the classical Entity-Relationship model [2], named ERX (Entity-Relationship for XML); this data model allows one to deal with concepts coming from XML documents at the conceptual level. During the design of the system, we decided on an ER data model due to its independence of the particular database technology used to develop the system. The system stores data obtained from processed XML documents, and provides a query language that rebuilds XML documents [7].
Using the system, we found that this data model can be effective independently of the system. In fact, by means of ER schemas it is possible to gain an in-depth understanding of the concepts described by XML documents and their correlations, as well as to reason about the integration of data coming from XML documents belonging to different classes (i.e., valid for different DTDs, using the XML terminology). Thus, the ERX data model can be used independently of the actual exploitation of the ERX System. However, the derivation of an ER schema from a real DTD (Document Type Definition) may become hard work to do by hand, since real DTDs are very complex and large.

This paper focuses on the problem of automatically deriving Entity-Relationship Schemas (ERX Schemas) from DTDs. This work is a first attempt toward a technical solution to this problem: we consider (and propose a solution to deal with) the basic concepts provided by DTDs and the ER model (e.g., XML entities are not considered, nor are hierarchies of entities in the ER model).

We proceed as follows. First, we identify a set of DTD rewriting rules; by applying them, it is possible to obtain a new version of the original DTD which is more suitable for mapping DTD concepts into ER concepts (Section 4). Then a derivation technique, and the corresponding algorithm, is defined to derive the ER schema from the rewritten DTD (Section 5). We assume the reader is familiar with basic XML concepts [3].
2
Preliminaries
Case Study. Suppose we are building the ER model for lists of products; a sample XML document might be the one reported in Figure 1. In the document, tag Product describes a single product, characterized by attributes ID (the product identifier), Description (the product description) and Brand (the product brand). As content of tag Product, we find one Technical tag (a technical description of the product) and one Note tag (generic notes about the product). Both tags Technical and Note contain a Text tag, whose content is a mixed composition of generic text and occurrences of HyperLink tags. Document Type Definition (DTD). Suppose now that the document in Figure 1 is valid for the DTD (Document Type Definition) reported in the figure. Recall that the basic tags in a DTD are the !ELEMENT tag and the !ATTLIST tag: the former defines tags (also called elements) and the structure of their content; the latter defines attributes for tags. In particular, the syntax of !ELEMENT is <!ELEMENT TagName Structure>, where TagName is the name of the tag under definition. Structure is a regular expression, based on the iteration operators * (zero to many repetitions) and + (one to many repetitions), the optionality operator ? (zero or one occurrence), the sequence operator (a comma) and the alternative operator (or choice operator) |. A special case of content is mixed content, specified as (#PCDATA | TagName1 | . . . | TagNamen)*, where #PCDATA denotes generic text. In practice, the content is a mixed combination of generic text and occurrences of the listed tag names TagNamei. Finally, the keyword EMPTY (tag without content) can be used for Structure.
Fig. 1. XML document (on the left) and DTD (on the right) for products.
The syntax of !ATTLIST is <!ATTLIST TagName AttrDef1 . . . AttrDefn>, where TagName is the name of the tag for which attributes are defined. In the simplest version (considered in this paper), each AttrDefi is a triple AttrName CDATA RI, where AttrName is the name of the attribute under definition; CDATA denotes that the attribute value is a string; RI can be either #REQUIRED or #IMPLIED (mandatory or optional attribute, resp.). Consider now the DTD for our example, reported in Figure 1. We can see that the structure of the document is as follows. The root element ProductList must contain a non-empty list of Product tags, and has a mandatory attribute Date. Tag Product can contain an occurrence of the Technical tag, followed by a possibly empty sequence of Note tags; furthermore, Product tags have three mandatory attributes (ID, Description and Brand) and a possibly missing attribute (Price). Tags Technical and Note have the same structure, i.e., a mandatory and unique occurrence of tag Text. The content of this latter tag is a typical case of mixed content: generic text and occurrences of tag HyperLink. Finally, tag HyperLink is an empty tag, with a mandatory attribute URL and a possibly missing attribute Text (the text associated with the URL).
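Since Figure 1 is not reproduced here in full, the following sketch reconstructs the case-study DTD from the description above; the declarations follow the prose, but the exact layout of the original figure may differ in minor details.

<!ELEMENT ProductList (Product)+>
<!ATTLIST ProductList Date CDATA #REQUIRED>
<!ELEMENT Product (Technical?, Note*)>
<!ATTLIST Product ID CDATA #REQUIRED
                  Description CDATA #REQUIRED
                  Brand CDATA #REQUIRED
                  Price CDATA #IMPLIED>
<!ELEMENT Technical (Text)>
<!ELEMENT Note (Text)>
<!ELEMENT Text (#PCDATA | HyperLink)*>
<!ELEMENT HyperLink EMPTY>
<!ATTLIST HyperLink URL CDATA #REQUIRED
                    Text CDATA #IMPLIED>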
3
The ERX Data Model
We now introduce the basic concepts of the ERX Data Model [9,7,8]. Entities. An entity describes a complex (structured) concept of the source XML documents. Entities are represented as solid line rectangles; the entity name is inside the rectangle. An instance of an entity X is a particular occurrence of the concept described by entity X in a source document. It is identified by a unique, system generated, numerical property named OID. The ERX Data Model provides the concept of Hierarchy as well (see [9]); for the sake of space, we do not consider this concept here.
Fig. 2. ERX Schema for the case study
Relationships. A relationship describes correlations existing between entities X and Y. A relationship is represented as a diamond labeled with the name of the relationship. The diamond is connected to X and Y by solid lines; these lines are labeled with a cardinality constraint (l:u), which specifies for each instance of entity X (resp. Y) the minimum number l and the maximum number u of associated instances of Y (resp. X). An instance of the relationship describes a particular association between two instances of the connected entities. A complex form of relationship is the relationship with alternatives: an instance of an entity X is associated with instances of alternative entities Y1, Y2, ..., Yn; the cardinality constraint for X considers all associations of an instance of X with instances of any entity among Y1, Y2, ..., Yn. Orthogonally, a relationship can be a containment relationship. Given two entities X and Y, a containment relationship from X to Y denotes that an instance of X structurally contains instances of Y. Containment relationships are represented as normal relationships, except that lines from X to Y are arrows, oriented from X to the relationship diamond and from the diamond to Y. The cardinality constraint on the contained side is always (1:1) (thus, it is omitted). Instances of containment relationships have an implicit property, named order: this property denotes the position occupied by each contained entity instance (this concept is well suited to ordered lists, such as XML mixed content). Attributes. Entities can have attributes: they represent elementary concepts associated with an entity. Attributes are represented as small circles, labeled with the name, and connected to the entity they belong to by a solid line. Entity attributes are always string valued. Furthermore, ERX does not provide the concept of key attribute. Attribute names are associated with a qualifier, which indicates specific properties of the attribute. Qualifiers (R) and (I) denote that the attribute is required or implied (i.e., optional), respectively.
Table 1. Simple rewriting rules

 1  ((item))   ≡  (item)        8  (item+)+  ≡  (item)+
 2  (item?)    ≡  (item)?       9  (item*)+  ≡  (item)*
 3  (item*)    ≡  (item)*      10  (item?)+  ≡  (item)*
 4  (item+)    ≡  (item)+      11  (item+)?  ≡  (item)*
 5  (item+)*   ≡  (item)*      12  (item*)?  ≡  (item)*
 6  (item*)*   ≡  (item)*      13  (item?)?  ≡  (item)?
 7  (item?)*   ≡  (item)*      14  (item)op  ≡  item op
Consider now the ERX Schema in Figure 2. Observe that there is an entity for each tag in the DTD discussed in Section 2 and, for each attribute defined in the DTD, there is a corresponding attribute in the schema. Furthermore, notice attribute Content on entity Par: it reports the generic text specified in a mixed content (see the definition of tag Text). Considering relationships, first notice relationship Contain Text: this is a relationship with alternatives, and means that an instance of entity Text can be associated either with an instance of entity Technical or with an instance of entity Note (this corresponds to the fact that tag Text appears in the content of two distinct tags). Finally, relationship Text Contains is a containment relationship with alternatives: this means not only that instances of entities Par and HyperLink are alternatively associated with instances of entity Text, but also that it is necessary to keep these associations ordered, since this relationship derives from a mixed content in the DTD.
4
DTD Rewriting
We are now ready to introduce our technique to derive ER schemas from DTDs. In this section, we introduce DTD rewriting rules. The goal of this step is the following. A given DTD might be in a form not suitable for deriving ER schemas, e.g., because it is not sufficiently simplified (extra parentheses), or because some DTD constructs have no suitable ER counterpart. Thus the main goal of the DTD rewriting rules is to obtain an equivalent version (when possible), or a slightly more general one (not equivalent to the original), which is suitable for deriving ER schemas. These rules are illustrated in Section 4.1. However, it might not be a good idea to derive ER schemas which are too close to the original DTD, in particular when the final goal of the process is the integration of data coming from documents belonging to different classes (thus, specified by different DTDs). In practice, we want to avoid the problem known as over-fitting. To address this problem, Section 4.2 introduces rules that perform a deeper rewriting: the rewritten DTD is significantly more general than the original one, but it is also more distant from it than the version obtained by means of the basic rewriting rules alone. We will show in Section 5 that ER schemas derived after deeper rewriting are simpler and more compact.
Table 2. Rewriting rules for choice and sequence (15 to 19) and for deeper rewriting

15  (item1 | ... | itemi | ... | itemn) ≡ (item1 | ... | itemi,1 | ... | itemi,h | ... | itemn)
    where itemi = (itemi,1 | ... | itemi,h)
16  (item1 | ... | itemi? | ... | itemn) ≡ (item1 | ... | itemi,1? | ... | itemi,h? | ... | itemn)
    where itemi = (itemi,1 | ... | itemi,h)
17  (item1 op1 | ... | itemi opi | ... | itemn opn) ≡ (item1 | ... | itemi | ... | itemn)opex
    such that: opex = ? if opi = NULL or opi = ?, with at least one opi = ?
18  (item1 op1 | ... | itemi opi | ... | itemn opn) ⇒ (item1 | ... | itemi | ... | itemn)opex
    such that: opex = * if there exists one opi = *, or one opi = + and one opj = ?
               opex = + if opi = NULL or opi = +, with at least one opi = +
               opex = ? if opi = NULL or opi = ?, with at least one opi = ?
19  (item1 op1, ..., itemi opi, ..., itemn opn)? ≡ (item1 op1, ..., itemi opi, ..., itemn opn)
    where opi = ? or opi = *
20  (item1 op1, ..., itemi opi, ..., itemn opn) → (item1 | ... | itemi | ... | itemn)opex
    such that: opex = + if every opi = + or opi = NULL
               opex = * if there exists one opi = ? or opi = *
21  (item1 op1, ..., itemi opi, ..., itemn opn)ope → (item1 | ... | itemi | ... | itemn)opex
    such that: opex = + if ope = + and every opi = + or opi = NULL
               opex = * if ope = ?, or ope = *, or there exists one opi = ? or opi = *
4.1
Basic Rewriting
Let us start with the basic rewriting rules. We can distinguish them into three categories: simple rewriting rules, choice rewriting rules and sequence rewriting rules. Simple rewriting rules simplify the DTD, always obtaining an equivalent version. In particular, they reduce the number of parentheses and move the regular expression operator (* or + or ?) outside the external parentheses. Table 1 shows these rules (in rule 14, and also in rules 15 to 21, with op we denote any operator among *, +, ?). Observe that they must be applied from left to right, so that they actually simplify the DTD. Choice rewriting rules operate on choice expressions appearing in DTDs (see Table 2, rules 15 to 18). They replace a choice expression with another choice expression. In particular, observe that rules 15, 16 and 17 obtain equivalent, but simplified, expressions in which regular expression operators (* or + or ?) appear only outside the external parentheses.
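As a small DTD-level illustration (the element and tag names X, A and B are invented for this sketch and do not come from the case study), the simple rules rewrite a declaration in place; a longer worked derivation follows below:

<!ELEMENT X ((A*)*, (B+)?)>

is rewritten, by rules 6, 11 and 14, into

<!ELEMENT X (A*, B*)>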
This result is obtained also by rule 18, but the right-hand side expression is not equivalent to the left-hand side expression; indeed, the right-hand side expression is more general (notice that the symbol ⇒ is used between the two expressions). Although an equivalent DTD is not obtained, this is not a problem: the rewritten DTD is only slightly more general, and this limited loss of precision makes it possible to obtain a simpler ER schema (in the end, a choice expression is transformed into another choice expression). Finally, rule 19 in Table 2 is the only rule for sequence expressions considered here. This rule rewrites the left-hand side expression, obtaining an equivalent expression; it pushes the external regular expression operator inside the parentheses; the right-hand side expression is then simplified w.r.t. the left-hand side. Deeper, but not equivalent, rules for sequences are introduced in Section 4.2.
Examples. Consider an element whose content model is the expression in row 1 below. It can be rewritten by applying the rewriting rules discussed so far. The sequence of rewriting steps is reported below, where the parenthesized subscript on the left-hand side of the symbols ≡ and ⇒ denotes the number of the applied rules. Rows 6 and 7 are obtained by applying rule 18, which does not produce an equivalent expression. Furthermore, the final expression is certainly simplified, but not far from the original one. This means that the loss of precision is minimal.
1. (((E1? | E2*)?, (E3)+) | ((E4, E5)?))
2. (14)≡ (((E1? | E2*)?, E3+) | ((E4, E5)?))
3. (14)≡ (((E1? | E2*)?, E3+) | (E4, E5)?)
4. (16)≡ ((((E1?)? | (E2*)?), E3+) | (E4, E5)?)
5. (13,12)≡ (((E1? | E2*), E3+) | (E4, E5)?)
6. (18)⇒ (((E1 | E2)*, E3+) | (E4, E5)?)
7. (18)⇒ (((E1 | E2)*, E3+) | (E4, E5))?
4.2
Deeper Rewriting
Apart from rule 18, the rewriting rules discussed above do not change the structure of DTDs. This way, the ER schema derived from the rewritten DTD closely represents the source DTD. However, this may cause over-fitting: the ERX schema is too close to a specific DTD, and it is not general enough for the integration of documents valid for several DTDs. To overcome this problem, Table 2 introduces two further rewriting rules, numbered 20 and 21. Both rules substitute sequence expressions with choice expressions; consequently, the resulting DTD is significantly more general. However, this loss of precision allows further simplification of the DTD, and makes it possible to derive a simpler and more general ER schema, more suitable for integration purposes. Observe that in rules 20 and 21 we use the symbol →; this way, we denote that the rewriting significantly changes the DTD.
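As a DTD-level sketch of what rules 20 and 21 do (the element and tag names X, A, B and C are invented for this illustration), a sequence declaration is turned into an iterated choice:

<!ELEMENT X (A?, B*, C+)>

is rewritten, by rule 20, into

<!ELEMENT X (A | B | C)*>

since at least one operator in the sequence is ? or *, so opex = *.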
Examples. Consider expression number 7, the result of the rewriting process discussed in the previous example. The rewriting can be carried on by applying the deeper rewriting rules, obtaining a much simpler and more general expression.
8. (20)→ (((E1 | E2) | E3)* | (E4, E5))?
9. (15)≡ ((E1 | E2 | E3)* | (E4, E5))?
10. (20)→ ((E1 | E2 | E3)* | (E4 | E5)+)?
11. (18)⇒ (((E1 | E2 | E3) | (E4 | E5))*)?
12. (12)≡ ((E1 | E2 | E3) | (E4 | E5))*
13. (15,1)≡ (E1 | E2 | E3 | E4 | E5)*
5
From DTD Concepts to ERX Concepts
We are now ready to discuss how the ER schema is derived from the rewritten DTD. The main derivation rules are the following.
1. Each XML element defined in the DTD corresponds to an entity; attributes defined for each XML element are assigned to the corresponding entity, and are mandatory or optional depending on whether they are defined as #REQUIRED or #IMPLIED in the DTD.
2. Generic text elements (#PCDATA) correspond to entities with one single mandatory attribute named Content, whose value represents the contained text.
3. Choice expressions are translated as containment relationships. If the choice expression is a top-level expression, the left-hand side entity is the entity derived from the XML element whose content is defined by the choice expression. The right-hand side is an alternative of all entities derived from the XML elements listed in the choice expression; complex items appearing in the choice expression are dealt with by a dummy entity1; the process is then recursively repeated. The left-hand side cardinality of the generated relationship depends on the iteration operator applied to the choice expression ((0:1) for ?, (0:N) for *, (1:N) for +, and (1:1) if missing).
4. Sequence expressions to which no iteration operator is applied are translated as a series of relationships, one for each item appearing in the sequence. If the sequence expression is a top-level expression, the left-hand side of these relationships is the entity derived from the XML element whose content is defined by the sequence expression. The right-hand side is the entity derived from one element in the sequence expression; complex items appearing in the sequence expression are dealt with by a dummy entity; the process is then recursively repeated. The left-hand side cardinality and the right-hand side cardinality are (1:1).
5. Sequence expressions to which an iteration operator is applied are translated as a relationship whose left-hand side is the same as in point 4, while the right-hand side is a dummy entity. The left-hand side cardinality of the generated relationship depends on the iteration operator applied to the sequence expression ((0:1) for ?, (0:N) for *, (1:N) for +); the right-hand side cardinality is (1:1).
A dummy entity is an entity that does not derive from any XML element defined in the DTD. Its name is generated by the system.
Fig. 3. Generalized ERX Schema for the case study
Then the process is recursively repeated considering the dummy entity and the sequence expression without the iteration operator.
6. Finally, after all DTD specifications have been processed, it is necessary to collapse relationships. In particular, for each entity that appears in multiple right-hand sides with cardinality (1:1), these relationships are collapsed into one single relationship with the same right-hand side; the left-hand side is the set of all left-hand sides of the relationships which are collapsed together.
For example, the sample DTD described in Section 2 is already minimal w.r.t. the basic rewriting rules. Hence, the ERX schema reported in Figure 2 is derived directly by means of our technique. Notice the relationships named Technical in Product and Note in Product: they are derived from the DTD line <!ELEMENT Product (Technical?, Note*)>; notice the cardinalities (0:1) and (0:N), which derive from the operators ? and *, resp. Also note that entity Text and relationship Contain Text are obtained by applying item 6 of the derivation rules; finally, relationship Text Contains is derived from a mixed content specification (#PCDATA | HyperLink)*. If we consider the application of the deeper rewriting rules, in our sample DTD only the specification for element Product changes; it becomes <!ELEMENT Product (Technical | Note)*>. Observe that the sequence is changed into an iterated choice expression. Figure 3 shows the resulting ERX schema: notice that it is more general, in that relationships Technical in Product and Note in Product have been replaced by a single relationship named In Product.
5.1
The Algorithm
We briefly describe the core of the derivation algorithm, which is Procedure DeriveRel shown in Figure 4. This procedure is called for each element definition
in the DTD, and corresponds to items 2 to 5 of the derivation rules. Here, we introduce some useful notation. With structure we denote a structure definition in the DTD, such as (E1, (E2 | E3))*. This is a sequence structure, which contains a simple element E1 and a complex item (E2 | E3), which is in turn a choice structure. With structure[i] we denote the i-th element in the structure (structure[1] = E1, structure[2] = (E2 | E3)). With structure.card we denote the iteration (also called cardinality) operator applied to the structure (in this case, structure.card = *). With structure.length we denote the number of elements in the structure (structure.length = 2 in the sample).
The procedure makes use of some auxiliary procedures and functions. Function CreateEntityPar creates an entity for textual paragraphs (see entity Par in Figure 2). Function CreateDummyEntity creates new entities with no attributes, whose name is system generated. Procedure CreateRelationship creates a new relationship, while procedure CreateContainmentRelationship creates containment relationships; both of them have three parameters, which specify, resp., the name of the left-hand side entity, the set of alternative entities on the right-hand side, and the iteration operator used to establish the cardinality constraint on the left-hand side ((0:1) for ?, (0:N) for *, (1:N) for +).
Procedure DeriveRel is recursive, since it has to deal with nested regular expressions, as for <!ELEMENT E ((E1, E2)*, E3+)>. Figure 5.a shows the derived ERX schema (for simplicity, we do not worry about attributes); notice the presence of two relationships Rel1 and Rel2, which directly derive from the two elements of the sequence structure (thus, the left-hand side cardinality constraints correspond to the iteration operators, i.e., (0:N) for * and (1:N) for +). To represent the nested sequence structure, the algorithm creates a dummy entity D1, which represents the overall nested sequence structure; then, the algorithm recursively derives two relationships Rel3 and Rel4, whose left-hand side is the dummy entity D1. If we apply the deeper rewriting rules, we obtain the simplified element definition <!ELEMENT E (E1 | E2 | E3)*>, from which the algorithm derives the ERX schema in Figure 5.b. Observe that this is a very simple schema.
6
Conclusions
In this paper, we considered the problem of deriving Entity-Relationship schemas from XML DTDs; we adopt the ERX Data Model, a variation of the classical ER model specifically designed to cope with XML. The problem has been dealt with as follows. First, a set of rewriting rules for DTDs has been defined; the goal of these rules is to simplify and generalize DTDs. Then a derivation technique has been developed to derive the Entity-Relationship schema from the rewritten DTD. We compared the schemas obtained for our case study by applying only the basic rewriting rules and by also applying the deeper rewriting rules. Schemas obtained by means of the deeper rewriting rules are more general, and more suitable for integrating documents valid for different DTDs.
Procedure DeriveRel(Entity, structure)
begin
  if structure is a choice structure then
    list = { };
    for i = 1 to structure.length do
      if structure[i] is a simple element then
        list = list ∪ structure[i].name;
      if structure[i] is a #PCDATA element then
        list = list ∪ CreateEntityPar();
      if structure[i] is a sequence or a choice structure then
        DummyEnt = CreateDummyEntity();
        DeriveRel(DummyEnt, structure[i]);
        list = list ∪ DummyEnt;
      end if
    end for
    CreateContainmentRelationship(Entity, list, structure.card);
  end if
  if structure is a sequence structure then
    if structure.card ≠ NULL then
      DummyEnt = CreateDummyEntity();
      NewStruct = structure; NewStruct.card = NULL;
      DeriveRel(DummyEnt, NewStruct);
      CreateRelationship(Entity, { DummyEnt }, structure.card);
    else
      for i = 1 to structure.length do
        if structure[i] is a choice structure then
          DeriveRel(Entity, structure[i]); continue;
        if structure[i] is a simple element then
          Ent = structure[i].name;
        if structure[i] is a sequence structure then
          NewStruct = structure[i]; NewStruct.card = NULL;
          DummyEnt = CreateDummyEntity(); Ent = DummyEnt;
          DeriveRel(Ent, NewStruct);
        end if
        CreateRelationship(Entity, { Ent }, structure[i].card);
      end for
    end if
  end if
end
Fig. 4. Procedure DeriveRel
In effect, the ERX Data Model, from which this work originated, is very suitable for studying the concepts present in XML documents and their correlations. Although design considerations led us to adopt this data model as the one provided by the ERX Data Management System, its use is not restricted to that system;
Fig. 5. Sample ER schemas
indeed, it is very useful for understanding the content of XML documents and for performing integration tasks. Future Work. This is the first work on this topic. We plan to continue the research in two main directions. The first is to consider the concept of hierarchy, a very useful concept provided by ERX and implicitly provided by DTDs, e.g., by means of the concept of XML entity. The second is to move to XML Schema, the standard that will replace DTDs in the near future.
References
1. Tamino XML Database. Software AG, http://www.softwareag.com/tamino.
2. C. Batini, S. Ceri, and S. Navathe. Conceptual Database Design: An Entity-Relationship Approach. Benjamin Cummings, Menlo Park, California, 1992.
3. T. Bray, J. Paoli, and C. M. Sperberg-McQueen. Extensible Markup Language (XML). Technical Report PR-xml-971208, World Wide Web Consortium, Dec. 1997.
4. M. Kay. XSLT Programmer's Reference. Wrox Press, 2000.
5. M. Liu and T. W. Ling. A data model for semistructured data with partial and inconsistent information. In Intl. Conf. on Extending Database Technology, Konstanz, Germany, March 2000.
6. J. McHugh and J. Widom. Query optimization for XML. In Proc. 25th VLDB Conference, Edinburgh, Scotland, September 1999.
7. G. Psaila. ERX-QL: Querying an entity-relationship DB to obtain XML documents. In Proceedings of DBPL-01 Intl. Workshop on Database Programming Languages, Monteporzio Catone, Rome, Italy, September 2001.
8. G. Psaila. ERX: An experience in integrating entity-relationship models, relational databases and XML technologies. In Proceedings of XMLDM-02 Intl. Workshop on XML Data Management, Prague, Czech Republic, March 2002.
9. G. Psaila and D. Brugali. The ERX data management system. In Proc. of IC-2001, Second Int. Conference on Internet Computing, Las Vegas, USA, June 2001.
10. J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In Proc. 25th VLDB Conference, Edinburgh, Scotland, September 1999.
Extracting Relations from XML Documents Eugene Agichtein1 , C.T. Howard Ho2 , Vanja Josifovski2 , and Joerg Gerhardt2 1
Columbia University, New York, NY, USA [email protected] 2 IBM Almaden, San Jose, CA, USA {ho,vanja}@almaden.ibm.com
Abstract. XML is becoming a prevalent format for data exchange. Many XML documents have complex schemas that are not always known, and can vary widely between information sources and applications. In contrast, database applications rely mainly on the flat relational model. We propose a novel, partially supervised approach for extracting user-defined relations from XML documents with unknown schema. The extracted relations can be directly used by an RDBMS, or utilized for information integration or data mining tasks. Our method attempts to automatically capture the lexical and structural features that indicate the relevant portions of the input document, based on a few user-annotated examples. This information can then be used to extract the relation of interest from documents with schemas potentially different from the training examples. We present preliminary experiments showing that our method could be capable of extracting the target relation from XML documents even in the presence of significant variations in the document schemas.
1
Introduction
XML provides a standardized format for data exchange where relationships between entities are encoded by nesting of the elements. XML documents can have complex nested structure, while many applications prefer a simple and flat representation of the relevant information. Extracting information from XML documents into relations is of special interest, since the resulting relations would allow the use of SQL and the full power of RDBMS query processors. In such a scenario, a mapping is needed to specify the extraction of the required portions of XML documents to relations. Mapping specification is usually performed by an experienced user with knowledge of the content of the input document and resulting relations. If detailed description of the document structure is available in advance, a mapping can be defined once and used over all of the input documents. In the case when the XML documents originate from a number of different sources with variations in their schema, or when the schema evolves over time,
Work done while visiting IBM Almaden.
the mapping specification process can be long and labor-intensive. The user needs to provide a mapping for each new source, and update the queries as the document structures change. In order to relieve the user of this tedious task, we propose a system for mapping from XML to relations by generalizing from user-provided examples and applying the acquired knowledge to unseen documents, with the flexibility to handle variations in the document input structure and terminology (tag names). Such documents may be derived from HTML pages, or from business objects exported to XML. For example, consider the task of compiling a table of product prices and descriptions from different vendors, where each vendor exports their product catalogs as XML documents. These documents may encode the prices of products in a variety of ways, using different tag names and structures. With current technology, each vendor source would have to be wrapped manually to extract the tuples for the target table. Being able to extract key relations from such XML documents using a few user-specified examples would reduce the system setup time and allow for improved robustness in the case of schema changes. Our partially supervised approach is an adaptation of the general nearest neighbor classification strategy [1]. In this approach, the candidate objects are compared with a set of "prototype" objects. The candidates that are the closest to a prototype p are classified into the same class as p. In our setting, the goal of the classifier is to recognize the nodes (if any) in a given XML document that are needed to extract (map) the information in the document to the given target relation. The prototype objects are constructed based on the user-annotated example XML documents, and correspond to the nodes that contain the attributes to be mapped to the target relation. The similarity between the candidate nodes and the prototype nodes is computed using signatures that represent the position, internal structure, and the data values of the document nodes. Preliminary experiments indicate that our method can be used to reliably detect relevant elements in unseen documents with similar, but different, structure. The use of signatures as opposed to queries allows more flexibility in the structure captured from the training examples. For example, the terms in the signature can be related in a way that does not match the XQuery axes, or weights can be assigned to individual terms, which would allow specifying increased importance for some of the terms. Such features are not available in today's XML query and transformation languages.
Related Work. Several commercial and research databases support mapping XML documents into user-defined relations. These mappings are specified by using XPath expressions and apply only to documents with schemas compatible with the expressions. If the schema is not available, a system such as XTRACT [2] can be used to infer a DTD. The documents in the collection are assumed to have the same structure, and elements can be described independently. Some systems allow building a summary of several XML documents, as for example the
DataGuides [3] techniques that emerged from the LORE project. A related approach, taken by STORED [4], uses data mining techniques to store the information in an RDBMS for efficient querying. In contrast, we assume a given user-defined relation to which we want to map XML documents with variable schema. Several interactive tools have emerged that allow mapping from XML documents to relational schemas, as for example [5]. The mappings produced by these tools are used both for shredding and storing XML documents and for view generation for online querying [6]. The techniques we use in our work draw on methods developed for the extraction of structured information from HTML and plain text, notably [7,8,9,10,11,12,13]. The rest of the paper proceeds as follows: In Section 2 we present an overview of our system and describe our data model. In Section 3 we describe our method for generating the signatures that are then used for extracting a relation from new XML documents (Section 4). We then present preliminary experimental results in Section 5 and conclude the paper in Section 6 with a description of our current activities and future work.
2
System Overview and Data Model
We use a partially supervised approach to extract a user-specified relation from XML documents based on a few user-tagged examples. The system works in two phases, Training and Extraction, shown in Figure 1. In the Training phase, the system is trained on a set of XML documents where the user has indicated the correct mapping of the XML elements to the attributes of the target table. The result of the training stage is a set of signatures that are used in the subsequent Extraction stage. During the Extraction phase, the target table is extracted from new XML documents that may have different tag names or structure than the example documents. As the first step of the extraction stage, the nodes of the input documents are merged in order to generate a “canonical” representation of the input document. Then, the signatures generated during training are used to find the candidate nodes in the canonical representation of each input document that are most likely to contain attributes for the target relation tuples. Finally, the mapping from the descendants of this node to the target attributes is derived. The resulting mapping can be translated trivially into XPath expressions to extract the tuples from the input document, or from any document with the same structure. 2.1
Data Model
Our system extracts a single target relation, T (a1 , a2 , ..., an ), from a collection of XML documents. Representing the input XML document as a tree of nodes, each tuple t ∈ T is extracted from a document subtree rooted at a node called instance node. More formally, we define an instance node I as a document element such that:
Fig. 1. Overview of our system: In the Training stage the system derives instance and attribute signatures used for extracting the target relation from a new XML document.
1. Children of I contain complete information needed to extract exactly one tuple of the target table.
2. I is maximal, i.e., any ancestor node of I will contain complete information for more than one tuple in T.
Figure 2 illustrates the role of the instance node in a document representing a set of books, such as one that may be exported by a book vendor. The target relation is defined as NewBooks(ISBN, BookTitle, Author, Price). The Item node in the Books category contains, in its descendants, all the information needed for the attributes of a tuple of the target relation. Therefore all the Item elements shown in this example are instance nodes. The extraction of the relation from a new document d consists of first identifying a node in d that corresponds to I, and then mapping the descendants of I to the attributes of the target relation. We now present our approach for automatically generating flexible signatures that can be used to recognize instance nodes in new XML documents with variations in label names and structure.
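As an illustration of the discussion above, a vendor document of the kind depicted in Figure 2 might look as follows; the tag names mirror those mentioned in the text and figure, while the values and the exact nesting of New, Used, Price and Num_Copies are invented for this sketch. Each Item element is an instance node yielding one NewBooks tuple.

<Products>
  <Books>
    <Item>
      <ISBN>0-123-45678-9</ISBN>
      <Title>Sample Book</Title>
      <Author>A. Writer</Author>
      <Publisher>Example Press</Publisher>
      <New>
        <Price>29.95</Price>
      </New>
    </Item>
    <Item>
      <ISBN>0-987-65432-1</ISBN>
      <Title>Another Book</Title>
      <Author>B. Author</Author>
      <Used>
        <Num_Copies>3</Num_Copies>
      </Used>
    </Item>
  </Books>
</Products>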
3
System Training: Generating Instance and Attribute Signatures
Our approach to deriving instance and attribute signatures to extract a target relation uses as input a set of user-supplied example XML documents. First, we pre-process the input documents to derive a merged document representation (Section 3.1). As we will discuss, the merged representation allows us to describe the example documents more completely. We then generate instance signatures (Section 3.2) that capture the position and internal structure of the instance nodes. Then we describe attribute signatures (Section 3.3) that capture
Fig. 2. Example of Extracting a Table from XML documents.
the structural and data characteristics of the nodes that correspond to the attributes of the tuples in the target relation.
The training begins with a set of annotated example XML documents, with special tags specifying the instance nodes and the attribute mappings.1 For each example we solicit two types of input from the user:
1. Identify the instance node I (e.g., the Item node).
2. Identify the descendants of I that contain the values for the attributes in the target relation.
In all machine learning approaches, one of the major problems is data sparsity, where the manually annotated examples do not fully represent the underlying distribution. Some of the nodes in the initial examples may be optional (and therefore missing), and the data values may not be repeated enough across the remaining attributes to generate a reliable signature. Therefore, we propose merging the nodes in the input documents to create a "canonical" representation of the document tree, as we describe next.
3.1
Merging Nodes in the Input Document
A relational table usually represents a set of similar entities, such that each entity corresponds to one tuple. We can therefore expect that an XML document mapped to a table will also contain a set of nodes for a set of related entities. 1
In our prototype we use reserved XML element names to specify the mapping. This allows for use of XML parsing and processing over the training documents. The annotated documents can be produced from the original documents by a user using a GUI tool.
Fig. 3. Operation of the Merge algorithm for merging similar nodes in the input document.
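The prose below walks through Figure 3; as a concrete illustration, an input fragment of the kind the figure depicts might look like this (element names follow the text; the original figure shows only tags, so no values are given, and the exact element order in the figure may differ):

<Products>
  <Item> <Book/> <Author/> <Title/> </Item>
  <Item> <Book/> <Author/> <Year/> </Item>
  <Item> <CD/> <Artist/> <Length/> </Item>
  <Item> <CD/> <Name/> <Artist/> </Item>
</Products>

Merging the four Item siblings yields a single Item* supernode whose children are the union of Book, Author, Title, Year, CD, Artist, Length and Name; the split step described below then separates the Book-related and CD-related subsets into Item1* and Item2*.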
Intuitively, the nodes representing the same class of entities will have a similar structure and relative position in the document. Often such XML documents will be produced by a single source and will have some regularity in their structure. We can exploit this regularity within a single document by merging "similar" nodes. As a result, we obtain richer signatures and reduce the complexity of the subsequent extraction phase. More importantly, merging nodes in the input document allows us to reduce noise (e.g., missing optional nodes), resulting in more complete signatures (Sections 3.2 and 3.3) that will later be used for extraction. The Merge Algorithm: Our procedure for merging nodes is shown graphically in Figure 3. Intuitively, sibling nodes with the same tag name and with similar internal structure can be assumed to represent similar objects. Using this observation, we merge sibling nodes that share the same prefix path from the root and have similar internal structure. The user-annotated instance nodes are merged just like any other nodes, resulting in more complete examples of the instance nodes for signature generation. Our algorithm proceeds in two stages: first we Merge all nodes that share the same prefix path from the root, and then we Split the nodes in the resulting tree that are too heterogeneous.2 Merge. We traverse the input tree in a top-down fashion, recursively merging siblings with the same label into one supernode. In the example, all Item siblings that have the same label are merged into Item*. The children of Item* are the union of the children of each original Item node. Currently we only merge nodes at the same level. In the future, we may want to merge nodes that have the same
2
In practice, it would be more efficient to avoid merging nodes with completely heterogeneous internal structures. For clarity and generality, we present a two-step implementation.
label, but occur at slightly different depths in the input tree. It is not clear whether this is desired behavior; most likely it depends on the application. Split. In some XML documents, sibling nodes with the same tag might have completely different structure. The goal of the split phase is to correct the merged nodes generated in the previous phase. This process allows us to distinguish between nodes that are semantically equivalent but happen to have missing information, and nodes having the same label but which are actually heterogeneous. The main criterion for splitting is whether there are disjoint subsets among the set of children of the merged node. In the example above, the merged node Item* contains two disjoint subsets, (Book, Title, Author, Year) and (CD, Artist, Length, Name). Thus, Item* would be split into two nodes, Item1* and Item2*. The split procedure splits the nodes in the merged tree in a top-down fashion. At each node, the set of children is examined. If the set contains at least two disjoint sets of children, the current node is split, and the children are allocated accordingly. Finding the disjoint sets of children can be done efficiently by using the matrix shown in Figure 3. In this matrix, a "1" in position i, j indicates that node i contains a child with the label j. Using the matrix we can quickly find the connected (and disjoint) entries. This approach can be extended to splitting nodes that are weakly connected, and not completely disjoint. As we discussed, the purpose of merging is to create a more complete representation of the input document. We now describe how we use this representation to generate the instance and attribute signatures that we will subsequently use for extracting the target relation from new, previously unseen documents. 3.2
Instance Signatures
Recall that our goal is to generate signatures that will allow us to find instance nodes in new documents whose structure and label names are potentially different from the example documents. To support such flexibility, we need to capture both the position in the document and the internal structure of the instance node. Further, the representation of the signature should allow finding the instance node in documents with structure and tag names different from the example documents observed in the Training stage. To accomplish this, we divide the document tree into four regions:
1. A: Ancestors of I (some number of levels up the tree).
2. S: Siblings of I.
3. C: Descendants of I.
4. I: Self: the tag of instance node I itself.
The Siblings and the Ancestor nodes intuitively describe the position of the instance node in the document. The Descendants component allows us to describe the internal structure of the instance node. From these tree regions, we build the instance signature S of each example. We represent S as a set of vectors S = {A, S, C, I} where each vector
represents the respective tree region. More specifically, we represent each tree region using the tag names of the nodes in the region, just as the vector-space model of information retrieval represents documents and queries [14]. Recall that in this representation each unique tag name corresponds to a dimension in the vector space, and therefore the order of the tag names in the input document can be ignored. For example, the vector A generated as part of the signature to represent the Ancestors region for the Item node in Figure 2 would contain the terms Products and Books. In future work, we plan to investigate different weighting schemes for the terms in the vector. We could also use other reported techniques for representing XML structures in a high dimensional vector space, e.g., [15], but it is not clear which representation would work best for our application. Therefore, for our initial experiments we chose the minimal representation described above.
3.3
Attribute Signatures
So far we have discussed the characterization of the position and internal structure of the instance node. These signatures allow us to find instance nodes despite variations in the structure of the document. Similarly, we want to support variations in the internal structure of the descendants of the instance node that map to the attribute values in the target relation. To capture the characteristics of the attributes of the target relation as they appear in the example documents, we build an attribute signature AS({D}, S{A, S, C, I}) for each attribute of the target relation, which consists of two components:
– a data signature D for the column, computed over all known instances of the attribute, to represent the distribution of values expected to be found in the attribute (we can use a technique similar to the one described in [5]);
– a structure signature S(A, S, C, I), defined equivalently to the instance signature S, where the current instance node is used as the document root, and I refers to the set of tags of all elements in the example documents that map to this attribute.
We will use these signatures to map descendants of instance nodes found in test documents to attributes in the table.
3.4
Signature Similarity
We now define the similarity of signatures that we will use to extract the target relation from new documents (Section 4). Intuitively, signatures of nodes that are located in similar positions in the document tree and have similar internal structures should have a high similarity value. For this, the similarity measure should consider all components of the signature.
More formally, we define the Similarity between signatures Sigi(A, S, C, I) and Sigj(A, S, C, I) as:
Similarity(Sigi, Sigj) = wA · Sim(Ai, Aj) + wS · Sim(Si, Sj) + wC · Sim(Ci, Cj) + wI · Sim(Ii, Ij)    (1)
where Sim(a, b) is defined as (a · b) / (|a| · |b|), i.e., the cosine of the angle between vectors a and b, which is a common way to quantify similarity in information retrieval. The Similarity function combines the Sim values between the positional and structural components of the signatures. Currently, all the components of the signature are weighted equally. However, depending on the application needs, the relative importance of the different tree regions (as reflected by the weights of their respective vector components, e.g., wA) may be tuned either by the user or by using machine learning techniques. We define the similarity between attribute signatures equivalently to the way we define the similarity between instance signatures (Equation 1). The only difference is that we also add the similarity of the respective data components (vector D in the attribute signature definition). The relative importance of the structural and data components of AS has been studied previously in the context of relational schema mapping in [5].
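For intuition, here is a small worked computation with invented tag-name vectors (not taken from the paper's experiments). Suppose the Ancestors regions of two nodes are represented over the tag dimensions (Products, Books, Publications) as a = (1, 1, 0) and b = (1, 0, 1). Then
Sim(a, b) = (a · b) / (|a| · |b|) = 1 / (√2 · √2) = 0.5,
so the two regions are judged moderately similar even though only one tag name overlaps.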
4
Extraction
Having derived the sets of instance signatures (IS) and attribute signatures (AS), we proceed to extract the target relation from new, previously unseen XML documents. The extraction proceeds in three stages. First, similar nodes of the input document are merged using the Merge algorithm (Section 3). Then, the instance nodes in the merged document representation are identified. Finally, the descendants of the discovered instance node are mapped to the attributes of the target relation. Identifying Instance Nodes. To discover the most likely instance node, we traverse the merged document tree in a bottom-up fashion. For each node X we generate the instance signature SX. We then compute the similarity of SX and each instance signature in IS that was generated during training. The score of X is then computed as the maximum of these similarities. The node with the highest score is then chosen as the candidate instance node. Mapping Attributes. For each target column Ti, we compute the similarity between the attribute signature ASi and the value of each descendant of the candidate instance node. Since merged instance nodes are expected to have a small number of descendants, and the target table a relatively small number of attributes, this exhaustive approach is feasible. The mapping that maximizes the total similarity, computed as the product of similarities over all the target attributes, is chosen as the best mapping.
Fig. 4. The merged representation of the training document (a) and of a test document (b), with scores for each potential instance node. The node Book, the target instance node, is assigned the highest score by the system.
We can use the results of this step as feedback to the previous step of identifying instance nodes. For example, if we cannot find a satisfactory attribute mapping from the best candidate instance node, we then try the candidate instance node with the next highest score.
5
Preliminary Experiments
For our exploratory experiments we have considered a scenario where the target relation is NewBooks, described in Section 2. We want to extract this relation from XML documents such as may be exported by book publishers and vendors. We used the same tagged example document as shown in Figure 2 for training our system. The instance node in the example is the Item node, as displayed by our system prototype in Figure 4(a). Our system uses this example to generate the instance and attribute signatures to be used for extraction. The original XML document structure and the tag names were modified significantly to create test documents. A sample modified document is shown in the merged representation (Figure 4(b)). The instance node that contains all the information needed to extract a tuple for NewBooks now has the tag name Book, and the Products node now has a new tag name Publications. Additionally, internal structure of the instance node was changed. Such variations in structure would break standard XPath expressions that depend on the element tags in the document to find the instance node. However, our system prototype consistently assigned the highest score to the correct instance nodes in all tested variations, including the test structure shown in Figure 4(b). These exploratory results are encouraging and we are currently working on a more extensive empirical evaluation of our approach.
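To make the kind of variation concrete, a test document like the one described above (and shown merged in Figure 4(b)) might look roughly as follows; apart from the Publications and Book tags named in the text, the remaining tag names and values are invented for this sketch:

<Publications>
  <Book>
    <ISBN>0-123-45678-9</ISBN>
    <Name>Sample Book</Name>
    <Writer>A. Writer</Writer>
    <Cost>29.95</Cost>
  </Book>
</Publications>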
6
Conclusions and Future Work
We have presented a novel approach for partially supervised extraction of relations from XML documents without consistent structures (schemas) and terminologies (tag names). These XML documents may be derived from HTML pages, or obtained from exporting business objects to XML format. Extracting relations from schema-less XML documents using the approach presented in this paper can speed up the deployment of web-based systems and make their maintenance easier in the presence of evolving schemas. We introduced the concept of the instance node, which is crucial in identifying the target node (object) that contains the information for the attributes of the target relation. Second, we partitioned the neighboring nodes of the instance node into three different regions (siblings, ancestors, and descendants) and derived their respective signatures. We then defined a classification model based on the spatial proximity of these tree regions to the instance node, each region having different semantic associations with the instance node. Third, the relative influence of these regions in finding the instance node in new documents can be adjusted simply by "turning a knob", i.e., by changing the weights of the corresponding components in the similarity calculation. Finally, the Merge algorithm described in Section 3.1 enables our system to capture the notion of semantically equivalent XML nodes, which have stronger semantics than simply having the same tag names. We are currently exploring extending our model to allow the user to identify hint nodes – nodes that do not contain information for attributes in the target relation, yet may indicate the presence of the instance node. We also plan to experiment with different signature representations and weighting schemes, and with alternative similarity definitions.
References
1. Gates, G.W.: The reduced nearest neighbor rule. In: IEEE Transactions on Information Theory. (1972)
2. Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., Shim, K.: XTRACT: A system for extracting document type descriptors from XML documents. In: Proceedings of ACM SIGMOD Conference on Management of Data. (2000) 165–176
3. Goldman, R., Widom, J.: DataGuides: Enabling query formulation and optimization in semistructured databases. In: Twenty-Third International Conference on Very Large Data Bases. (1997) 436–445
4. Deutsch, A., Fernández, M., Suciu, D.: Storing semi-structured data using STORED. In: SIGMOD. (1999)
5. Miller, R.J., Hernandez, M.A., Haas, L.M., Yan, L.L., Ho, C.T.H., Fagin, R., Popa, L.: The Clio project: Managing heterogeneity. SIGMOD Record 30 (2001) 78–83
6. Josifovski, V., Schwarz, P.: XML Wrapper – reuse of relational optimizer for querying XML data. Submitted for publication. (2002)
7. Knoblock, C.A., Lerman, K., Minton, S., Muslea, I.: Accurately and reliably extracting data from the web: A machine learning approach. IEEE Data Engineering Bulletin 23 (2000) 33–41
8. Grishman, R.: Information extraction: Techniques and challenges. In: Information Extraction (International Summer School SCIE-97), Springer-Verlag (1997)
9. Agichtein, E., Gravano, L.: Snowball: Extracting relations from large plain-text collections. In: Proceedings of the 5th ACM International Conference on Digital Libraries. (June 2000)
10. Sahuguet, A., Azavant, F.: Building light-weight wrappers for legacy web data-sources using W4F. In: Proceedings of the International Conference on Very Large Databases (VLDB). (1999) 738–741
11. Liu, L., Pu, C., Han, W.: XWRAP: An XML-enabled wrapper construction system for web information sources. In: Proceedings of ICDE. (2000) 611–621
12. Baumgartner, R., Flesca, S., Gottlob, G.: Visual web information extraction with Lixto. In: VLDB. (2001)
13. Crescenzi, V., Mecca, G., Merialdo, P.: Towards automatic data extraction from large web sites. In: VLDB. (2001)
14. Salton, G.: Automatic Text Processing: The transformation, analysis, and retrieval of information by computer. Addison-Wesley (1989)
15. Kha, D.D., Yoshikawa, M., Uemura, S.: An XML indexing structure with relative region coordinate. In: ICDE. (2001) 313–320
Extending XML Schema with Nonmonotonic Inheritance Guoren Wang and Mengchi Liu School of Computer Science, Carleton University, Canada {wanggr,mengchi}@scs.carleton.ca
Abstract. Nonmonotonic inheritance is a fundamental feature of object-oriented data models. In this paper, we extend XML Schema with nonmonotonic inheritance, owing to its powerful modeling ability: the extension supports multiple inheritance, overriding of elements or attributes inherited from super-elements, blocking of the inheritance of elements or attributes from super-elements, and conflict handling. Another key feature of object-oriented data models is polymorphism. We introduce it into XML to support polymorphic elements and polymorphic references.
1
Introduction
Several XML schema languages have been proposed, such as DTD [2], SOX [3] and XML Schema [5], to constrain and define a class of XML documents. However, apart from XML Schema and SOX [6], they do not support inheritance at all. Nonmonotonic multiple inheritance is a fundamental feature of object-oriented data models [4,7]. In object-oriented languages with multiple inheritance, a class may inherit attributes and methods from more than one superclass. For example, class TA might inherit attributes and methods directly from classes teacher and student. In a multiple inheritance hierarchy, users can explicitly override the inherited attributes or methods and block the inheritance of attributes or methods from superclasses [7]. One of the problems with multiple inheritance is that ambiguity may arise when the same attribute or method is defined in more than one superclass. Therefore, conflict resolution is important in object-oriented database systems with multiple inheritance, and most systems use the superclass ordering to solve the conflicts [4,7]. In this paper, we extend XML Schema with nonmonotonic inheritance, owing to its powerful modeling ability: the extension supports multiple inheritance, overriding of elements or attributes inherited from super-elements, blocking of the inheritance of elements or attributes from super-elements, and conflict handling. Another key feature of object-oriented data models is polymorphism. We introduce it into XML to support polymorphic elements and polymorphic references.
Guoren Wang’s research is partially supported by the NSFC of China (60273079) and Mengchi Liu’s research is partially supported by the NSERC of Canada.
(01) <xsd:element name="univ" type="univType"/>
(02) <xsd:complexType name="univType">
(03)   <xsd:sequence>
(04)     <xsd:element name="person" type="personType"
(05)       minOccurs="0" maxOccurs="unbounded"/>
(06)     <xsd:element name="course" type="courseType"
(07)       minOccurs="0" maxOccurs="unbounded"/>
(08)   </xsd:sequence>
(09) </xsd:complexType>
(10) <xsd:complexType name="personType">
(11)   <xsd:sequence>
(12)     <xsd:element name="name" type="xsd:string"/>
(13)     <xsd:element name="birthdate" type="xsd:date"/>
(14)     <xsd:element name="addr" type="addrType"/>
(15)     <xsd:element name="homephone" type="xsd:string"/>
(16)   </xsd:sequence>
(17)   <xsd:attribute name="pid" type="xsd:ID" use="required"/>
(18) </xsd:complexType>
(19) <xsd:complexType name="addrType">
(20)   <xsd:sequence>
(21)     <xsd:element name="street" type="xsd:string"/>
(22)     <xsd:element name="city" type="xsd:string"/>
(23)     <xsd:element name="state" type="xsd:string"/>
(24)     <xsd:element name="zip" type="xsd:string"/>
(25)   </xsd:sequence>
(26) </xsd:complexType>
Fig. 1. Type definitions for elements univ, person and addr
2 Extensions to XML Schema
Figure 1 shows the type definitions for elements univ, person and addr. Although they have the same syntax as in the original XML Schema, some of them (for example, lines (04)-(07) of Figure 1) impose different semantic constraints on XML instance documents due to the introduction of polymorphism. Figure 2 shows the type definition for element student, which inherits from personType. Because the inheritance mechanism provided by XML Schema is neither flexible nor powerful enough, we extend it as follows: (1) In a type hierarchy, a subtype may have more than one supertype, to support nonmonotonic multiple inheritance. Therefore, the attribute base of the extension mechanism is generalized to bases, as in line (03) of Figure 2. (2) In the original XML Schema, a subtype inherits all elements but not attributes from its supertype. Although attributes differ from elements, from the user's point of view they are just another kind of information. Therefore, in the Extended XML Schema, a subtype inherits not only elements but also attributes from its supertypes. Note that no other ID attribute may be declared in the subtype, since pid is an ID attribute inherited by the subtype. (3) In the Extended XML Schema, a component element or attribute in the subtype may override the element or attribute defined in the supertype. For example, the component element addr in student, inherited from personType, is overridden with a new simple type, as shown in line (05) of Figure 2. Note that no special syntax extension is needed for overriding elements and attributes. Sometimes it is necessary to allow a subtype to block the inheritance of attributes and elements from its supertypes. For example, teachers usually prefer
(01) <xsd:complexType name="studentType">
(02)   <xsd:complexContent>
(03)     <xsd:extension bases="personType">
(04)       <xsd:sequence>
(05)         <xsd:element name="addr" type="xsd:string"/>
(06)         <xsd:element name="dept" type="xsd:string"/>
(07)         <xsd:element name="takes">
(08)           <xsd:attribute name="courses" type="IDREFS"
(09)             target="courseType" use="implied"/>
(10)         </xsd:element>
(11)       </xsd:sequence>
(12)       <xsd:attribute name="sno" type="xsd:string" use="required"/>
(13)     </xsd:extension>
(14)   </xsd:complexContent>
(15) </xsd:complexType>

Fig. 2. Type definition for element Student

(01) <xsd:complexType name="teacherType">
(02)   <xsd:complexContent>
(03)     <xsd:extension bases="personType">
(04)       <xsd:sequence>
(05)         <xsd:element name="workphone" type="xsd:integer"/>
(06)         <xsd:element name="salary" type="xsd:float"/>
(07)         <xsd:element name="dept" type="xsd:string"/>
(08)         <xsd:element name="teaches">
(09)           <xsd:attribute name="courses" type="IDREFS"
(10)             target="courseType" use="implied"/>
(11)         </xsd:element>
(12)       </xsd:sequence>
(13)       <xsd:block from="personType">
(14)         <xsd:element name="homephone"/>
(15)       </xsd:block>
(16)       <xsd:attribute name="tno" type="xsd:string" use="required"/>
(17)     </xsd:extension>
(18)   </xsd:complexContent>
(19) </xsd:complexType>

Fig. 3. Type definition for element Teacher
to use workphone rather than homephone as their contact phone, so it is reasonable for the definition of the subtype teacherType to block the inheritance of homephone from its supertype personType. Therefore, a blocking mechanism is introduced, as shown in lines (13)-(15) of Figure 3. The blocking mechanism has an attribute from, specifying the type from which the inheritance is blocked, and a set of components specifying the attributes and elements to be blocked. Another extension to XML Schema is the typing of IDREF and IDREFS. As pointed out in [1], neither XML Schema nor DTD supports typing of IDREF and IDREFS, so a reference may point to any kind of element instance; one cannot require a reference to point only to an expected kind of element instance. For example, it is possible that the attribute @courses of the element takes in student references a person rather than a course. We therefore extend the attribute declaration to specify the target type of an IDREF or IDREFS attribute, as in lines (08)-(09) of Figure 2 and lines (09)-(10) of Figure 3. In Figure 4, type TAType inherits elements and attributes from both supertypes studentType and teacherType. There are two conflicts to be resolved, since elements addr and dept are declared in both supertypes studentType and
(01) <xsd:complexType name="TAType">
(02)   <xsd:complexContent>
(03)     <xsd:extension bases="studentType teacherType">
(04)       <xsd:rename>
(05)         <xsd:element name="dept" from="studentType" as="student-dept"/>
(06)         <xsd:element name="dept" from="teacherType" as="teacher-dept"/>
(07)       </xsd:rename>
(08)       <xsd:block from="studentType">
(09)         <xsd:element name="addr"/>
(10)       </xsd:block>
(11)     </xsd:extension>
(12)   </xsd:complexContent>
(13) </xsd:complexType>
Fig. 4. Type definition for element TA
(01) <xsd:complexType name="courseType">
(02)   <xsd:sequence>
(03)     <xsd:element name="name" type="xsd:string"/>
(04)     <xsd:element name="desc" type="xsd:string"/>
(05)     <xsd:element name="takenBy">
(06)       <xsd:attribute name="students" type="xsd:IDREFS"
(07)         target="studentType" use="implied"/>
(08)       <xsd:attribute name="teachers" type="xsd:IDREFS"
(09)         target="teacherType" use="implied"/>
(10)     </xsd:element>
(11)   </xsd:sequence>
(12)   <xsd:attribute name="cid" type="xsd:ID" use="required"/>
(13) </xsd:complexType>
(14) <xsd:complexType name="underCourseType">
(15)   <xsd:complexContent>
(16)     <xsd:extension bases="courseType"/>
(17)   </xsd:complexContent>
(18) </xsd:complexType>
(19) <xsd:complexType name="gradCourseType">
(20)   <xsd:complexContent>
(21)     <xsd:extension bases="courseType"/>
(22)   </xsd:complexContent>
(23) </xsd:complexType>
(24) <xsd:element name="student" type="studentType"/>
(25) <xsd:element name="teacher" type="teacherType"/>
(26) <xsd:element name="TA" type="TAType"/>
(27) <xsd:element name="underCourse" type="underCourseType"/>
(28) <xsd:element name="gradCourse" type="gradCourseType"/>
Fig. 5. Type definitions for elements course, underCourse and gradCourse
teacherType. In our Extended XML Schema, conflicts can be handled in three ways. In the first way, a conflict resolution declaration is specified explicitly to indicate from which supertype an element or attribute is inherited; for example, the block construct in lines (08)-(10) of Figure 4 indicates that the declaration of addr is inherited from the supertype teacherType rather than from studentType. In the second way, the elements or attributes causing conflicts are explicitly renamed in the inheriting type declaration; for example, in the declaration of subtype TAType, the rename construct in line (05) of Figure 4 renames the element dept inherited from supertype studentType to student-dept, while the rename construct in line (06) of Figure 4 renames the element dept inherited from teacherType to teacher-dept. Finally, if there is a conflict and no conflict resolution declaration, then the element or attribute is inherited from the first supertype that declares it, in the order in which the supertypes are listed in the extension construct of the type definition.
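To illustrate the combined effect of overriding, blocking and renaming at the instance level, a TAType element instance might look as follows. This is only a sketch: the element content is taken from the TA instance of Figure 6, while the pid, sno and tno values are hypothetical, the element order is assumed, and some inherited elements (e.g. salary, takes and teaches) are omitted for brevity.

  <TA pid="p3" sno="s3" tno="t3">    <!-- hypothetical identifier values -->
    <name>Alice Bumbulis</name>
    <birthdate>1976-08-29</birthdate>
    <addr>                           <!-- addr from studentType is blocked, so the -->
      <street>440 Albert</street>    <!-- complex addrType version inherited via   -->
      <city>Ottawa</city>            <!-- teacherType is used                      -->
      <state>Ontario</state>
      <zip>K1R 6P6</zip>
    </addr>
    <homephone>2915318</homephone>
    <workphone>2502600</workphone>
    <student-dept>CS</student-dept>  <!-- dept from studentType, renamed -->
    <teacher-dept>SE</teacher-dept>  <!-- dept from teacherType, renamed -->
  </TA>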
[Figure 6 contains sample instances of the person, student, teacher, TA, course, underCourse and gradCourse elements defined above; the student instance, for example, has pid="200" and takes courses "CS200" and "CS300".]
Fig. 6. An XML instance document
Figure 5 shows the type definitions for elements course (lines (01)-(13)), underCourse (lines (14)-(18)) and gradCourse (lines (19)-(23)), and other element declarations (lines (24)-(28)).
3 Extensions to XML Instance Document
Consider the examples described above: type personType has three direct or indirect subtypes, studentType, teacherType and TAType, and type courseType has two direct subtypes, underCourseType and gradCourseType. When polymorphism is introduced into XML, an element instance of personType in a valid instance document can be substituted with an instance of an element of one of its subtypes, and the instance document remains valid. If the type of an element has at least one subtype, the element is polymorphic. For example, element person is polymorphic since type personType has three direct or indirect subtypes, so a person element instance can be substituted by instances of student, teacher,
or TA, since their types are all subtypes of personType. Similarly, an instance of course can be substituted by an instance of underCourse or gradCourse. The substituting element instances are referred to as polymorphic instances. From lines (04)-(07) of Figure 1, we can see that element univ can contain a number of person and course element instances, that is, univ → person*, course*. Therefore, due to polymorphism, element univ can contain seven kinds of component element instances: person, student, teacher, TA, course, underCourse and gradCourse instances. We further extend XML Schema with polymorphic references, which are analogous to polymorphic elements. A slightly more complicated example of a polymorphic reference is that a teacher may teach several courses, including underCourses and gradCourses as well; see the definition of element teacher in Figure 3 and its instance in Figure 6. In the definition, teaches is an IDREFS attribute targeted at course. If polymorphic references are supported by the system (that is, teaches can also be used to refer to underCourse or gradCourse elements, since their types are all subtypes of the type of element course), the following six combinations are valid in the instance document: (1) a teacher teaches courses; (2) a teacher teaches underCourses; (3) a teacher teaches gradCourses; (4) a TA teaches courses; (5) a TA teaches underCourses; and (6) a TA teaches gradCourses. Polymorphic references are introduced to meet these requirements: an IDREF or IDREFS attribute of a given element may point to instances of the substituting elements of its target element.
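The following sketch of an instance document illustrates both polymorphic elements and polymorphic references, reusing the student and course content of Figure 6; the sno value, the cid values and the pairing of CS200 and CS300 with particular course elements are our own assumptions.

  <univ>
    <student pid="200" sno="s200">     <!-- substitutes a person instance -->
      <name>Jones Gillmann</name>
      <birthdate>1976-02-25</birthdate>
      <addr>708D Somerset St</addr>    <!-- overridden simple addr -->
      <homephone>6185708</homephone>
      <dept>Computer Science</dept>
      <takes courses="CS200 CS300"/>   <!-- polymorphic reference -->
    </student>
    <underCourse cid="CS200">          <!-- substitutes a course instance -->
      <name>Introduction to DBS</name>
      <desc>Basic concepts</desc>
      <takenBy students="200"/>
    </underCourse>
    <gradCourse cid="CS300">
      <name>DBMS</name>
      <desc>Impl. Techniques</desc>
      <takenBy students="200"/>
    </gradCourse>
  </univ>

A teacher or TA instance can likewise use its teaches attribute to reference underCourse or gradCourse instances, which yields the six combinations listed above.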
4 Conclusions
In this paper, we extend XML Schema to support key object-oriented features such as nonmonotonic inheritance, overriding, blocking, and conflict handling. Moreover, we extend XML instance documents with polymorphism, including typing of references, polymorphic elements and polymorphic references.
References

1. Lewis, P.M., Bernstein, A., Kifer, M.: Databases and Transaction Processing: An Application-Oriented Approach. Addison-Wesley (2002)
2. Bray, T., Paoli, J., Sperberg-McQueen, C.M., Maler, E.: Extensible Markup Language (XML) 1.0. 2nd edn. Available at http://www.w3.org/TR/REC-xml (2000)
3. Davidson, A., Fuchs, M., Hedin, M.: Schema for Object-Oriented XML 2.0. W3C Note. Available at http://www.w3.org/TR/NOTE-SOX (1999)
4. Dobbie, G., Topor, R.W.: Resolving ambiguities caused by multiple inheritance. In: Proceedings of the 4th DOOD International Conference, Singapore (1995) 265–280
5. Fallside, D.C.: XML Schema Part 0: Primer. W3C Recommendation. Available at http://www.w3.org/TR/xmlschema-0/ (2001)
6. Lee, D., Chu, W.W.: Comparative analysis of six XML schema languages. ACM SIGMOD Record 29 (2000) 76–87
7. Liu, M., Dobbie, G., Ling, T.W.: A logical foundation for deductive object-oriented databases. ACM Transactions on Database Systems 27 (2002) 117–151
Author Index
Adam, Emmanuel 168
Agichtein, Eugene 390
Al-Muhammed, Muhammed 311
Albert, Manoli 40
Atay, Mustafa 250, 366
Augusto, Juan C. 17
Badia, Antonio 330
Bergholtz, Maria 180
Bhowmick, Sourav S. 355
Bohrer, Kathy 323
Bresciani, Paolo 217
Carroll, John M. 241
Chakravarthy, Sharma 273
Chalmeta, Ricardo 65
Chirkova, Rada 297
Dedene, Guido 105
Dekhtyar, Alex 311
Dindeleux, Régis 5
Donzelli, Paolo 217
Embley, David W. 244
Erwig, Martin 342
Estrella, Florida 5
Ferreira, Carla 17
Fettke, Peter 80
Fons, Joan 40
Fotouhi, Farshad 250, 366
Garzotto, Franca 92
Gaspard, Sébastien 5
Genero, Marcela 79, 118
Gerhardt, Joerg 390
Giorgini, Paolo 167
Grangel, Reyes 65
Gravell, Andy M. 17
Guo, Zhimao 261
Hawryszkiewycz, Igor T. 195
Henderson-Sellers, Brian 167, 195
Heuvel, Willem-Jan van den 3
Ho, C.T. Howard 390
Iacob, Ionut E. 244
Jaakkola, Hannu 129
Jacob, Jyoti 273
Jayaweera, Prasad 180
Jin, Min 285
Johannesson, Paul 180
Josifovski, Vanja 390
Kim, Kibum 241
Ko, Su-Jeong 29
Leuschel, Michael A. 17
Li, Ming 261
Lin, Aizhong 195
Liu, Mengchi 402
Liu, Xuan 323
Loos, Peter 80
Lu, Shiyong 250, 366
Madria, Sanjay 249, 355
Mandiau, René 168
Mayr, Heinrich C. 3
McClatchey, Richard 5
McLaughlin, Sean 323
Michiels, Cindy 105
Miranda, David 118
Nelson, Jim 79
Ng, Karen M.Y. 17
Orriëns, Bart 52
Ortiz, Ángel 65
Papazoglou, Mike P. 52
Pastor, Óscar 40
Pelechano, Vicente 40
Perrone, Vito 92
Piattini, Mario 79, 118
Poels, Geert 79, 152
Poler, Raúl 65
Prat, Nicolas 136
Psaila, Giuseppe 378
Rosson, Mary Beth 241
Sachde, Alpa 273
Schonberg, Edith 323
Shah, Ashish 297
Shin, Byung-Joo 285
Si-Said Cherfi, Samira 136
Singh, Moninder 323
Snoeck, Monique 105
Sun, Yezhou 250, 366
Thalheim, Bernhard 129
Tian, Khoo Boon 355
Tulba, Florin 205
Wagner, Gerd 205
Wang, Guoren 402
Weiss, Michael 229
Wohed, Petia 180
Xu, Zhengchuan 261
Yang, Jian 52
Zhou, Aoying 261
Zhou, Shuigeng 261