Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
5530
Stefano Spaccapietra Esteban Zimányi Il-Yeol Song (Eds.)
Journal on Data Semantics XIII
13
Volume Editors Stefano Spaccapietra École Polytechnique Fédérale de Lausanne EPFL-IC Database Laboratory 1015 Lausanne, Switzerland E-mail:
[email protected] Esteban Zimányi Université Libre de Bruxelles Department of Computer and Decision Engineering 50 av. F.D. Roosevelt, 1050 Bruxelles, Belgium E-mail:
[email protected] Il-Yeol Song Drexel University College of Information Science and Technology Philadelphia, PA 19104, USA E-mail:
[email protected]
CR Subject Classification (1998): H.3, H.4, H.2, C.2, D.3, F.3, D.2
ISSN 0302-9743 (Lecture Notes in Computer Science)
ISSN 1861-2032 (Journal on Data Semantics)
ISBN-10 3-642-03097-1 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-03097-0 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12660797 06/3180 543210
The LNCS Journal on Semantics of Data
Computerized information handling has changed its focus from centralized data management systems to decentralized data exchange facilities. Modern distribution channels, such as high-speed Internet networks and wireless communication infrastructure, provide reliable technical support for data distribution and data access, materializing the new, popular idea that data may be available to anybody, anywhere, anytime. However, providing huge amounts of data on request often turns into a counterproductive service, making the data useless because of poor relevance or an inappropriate level of detail. Semantic knowledge is the essential missing piece that allows the delivery of information that matches user requirements. Semantic agreement, in particular, is essential to meaningful data exchange.

Semantic issues have long been open issues in data and knowledge management. However, the boom in semantically poor technologies, such as the Web and XML, has boosted renewed interest in semantics. Conferences on the Semantic Web, for instance, attract big crowds of participants, while ontologies on their own have become a hot and popular topic in the database and artificial intelligence communities.

Springer’s LNCS Journal on Data Semantics aims at providing a highly visible dissemination channel for remarkable work that in one way or another addresses research and development on issues related to the semantics of data. The target domain ranges from theories supporting the formal definition of semantic content to innovative domain-specific application of semantic knowledge. This publication channel should be of the highest interest to researchers and advanced practitioners working on the Semantic Web, interoperability, mobile information services, data warehousing, knowledge representation and reasoning, conceptual database modeling, ontologies, and artificial intelligence. Topics of relevance to this journal include:
– Semantic interoperability, semantic mediators
– Ontologies
– Ontology, schema and data integration, reconciliation and alignment
– Multiple representations, alternative representations
– Knowledge representation and reasoning
– Conceptualization and representation
– Multimodel and multiparadigm approaches
– Mappings, transformations, reverse engineering
– Metadata
– Conceptual data modeling
– Integrity description and handling
– Evolution and change
– Web semantics and semi-structured data
– Semantic caching
– Data warehousing and semantic data mining
– Spatial, temporal, multimedia, and multimodal semantics
– Semantics in data visualization
– Semantic services for mobile users
– Supporting tools
– Applications of semantic-driven approaches
These topics are to be understood as specifically related to semantic issues. Contributions submitted to the journal and dealing with semantics of data will be considered even if they are not from the topics in the list.

While the physical appearance of the journal issues resembles the books from the well-known Springer LNCS series, the mode of operation is that of a journal. Contributions can be freely submitted by authors and are reviewed by the Editorial Board. Contributions may also be invited, and nevertheless carefully reviewed, as is the case for issues that contain extended versions of the best papers from major conferences addressing data semantics issues. Special issues, focusing on a specific topic, are coordinated by guest editors once the proposal for a special issue is accepted by the Editorial Board. Finally, it is also possible that a journal issue be devoted to a single text.

The Editorial Board comprises an Editor-in-Chief (with overall responsibility), a Co-editor-in-Chief, and several members. The Editor-in-Chief has a four-year mandate. Members of the board have a three-year mandate. Mandates are renewable and new members may be elected anytime.

We are happy to welcome you to our readership and authorship, and hope we will share this privileged contact for a long time.

Stefano Spaccapietra
Editor-in-Chief
http://lbd.epfl.ch/e/Springer/
JoDS Volume XIII – Special Issue on Semantic Data Warehouses
Data warehouses have been established as a fundamental and essential component of current decision-support systems. Many organizations have successfully used data warehouses to collect essential indicators that help them improve their business processes. Furthermore, the combination of data warehouses and data mining has allowed these organizations to extract strategic knowledge from raw data, allowing them to design new ways to perform their operations. In recent years, research in data warehouses has addressed many topics ranging from physical-level issues, aiming at increasing the performance of data warehouses in order to deal with vast amounts of data, to conceptual-level and methodological issues, which help designers build effective data warehouse applications that address the needs of decision makers better.

Nevertheless, globalization and increased competition pose new challenges to organizations, which need to dynamically and promptly adapt themselves to new situations. This brings new requirements to their data warehouse and decision-support systems, particularly with respect to (1) heterogeneity, autonomy, distribution, and evolution of data sources, (2) integration of data from these data sources while ensuring consistency and data quality, (3) adaptability of the data warehouse to multiple users with multiple and conflicting requirements, (4) integration of the data warehouse with the business processes of the organization, and (5) providing innovative ways to interact with the data warehouse, including advanced visualization mechanisms that help to reveal strategic knowledge. In addition, data warehouses are increasingly being used in non-traditional application domains, such as biological, multimedia, and spatio-temporal applications, which demand new requirements for dealing with the particular semantics of these application domains. Therefore, building next-generation data warehouse systems and applications requires enriching the overall data warehouse lifecycle with semantics in order to support a wide variety of tasks including interoperability, knowledge reuse, knowledge acquisition, knowledge management, reasoning, etc.

The papers in this special issue address several of the topics mentioned above. They all provide different insights into the multiple benefits that can be obtained by envisioning data warehouses from a new semantic perspective. As this is a relatively new domain, these papers open many new research directions that need to be addressed in future work. This research will definitely have a huge impact on the next generation of data warehouse applications and tools.

January 2009
Esteban Zimányi
Il-Yeol Song
Referees for the Special Issue

We would like to thank all the reviewers for their excellent work in evaluating the papers. Without their commitment the publication of this special issue of JoDS would not have been possible.

Alberto Abelló, Universitat Politècnica de Catalunya, Spain
Omar Boussaïd, Université du Lyon 2, France
Matteo Golfarelli, University of Bologna, Italy
Panagiotis Kalnis, National University of Singapore, Singapore
Jens Lechtenbörger, University of Münster, Germany
Wolfgang Lehner, Dresden University of Technology, Germany
Tok Wang Ling, National University of Singapore, Singapore
Sergio Luján Mora, University of Alicante, Spain
Elzbieta Malinowski, Universidad de Costa Rica, Costa Rica
Svetlana Mansmann, University of Konstanz, Germany
Rokia Missaoui, Université du Québec en Outaouais, Canada
Ullas Nambiar, IBM India Research Lab, India
Torben Bach Pedersen, Aalborg University, Denmark
Mario Piattini, Universidad de Castilla La Mancha, Spain
Stefano Rizzi, University of Bologna, Italy
Markus Schneider, University of Florida, USA
Alkis Simitsis, Stanford University, USA
Dimitri Theodoratos, New Jersey Institute of Technology, USA
Juan-Carlos Trujillo Mondéjar, Universidad de Alicante, Spain
Panos Vassiliadis, University of Ioannina, Greece
Robert Wrembel, Poznan University of Technology, Poland
Previous Issues of the Journal
JoDS I
Special Issue on Extended Papers from 2002 Conferences, LNCS 2800, December 2003 Co-editors: Sal March and Karl Aberer
JoDS II
Special Issue on Extended Papers from 2003 Conferences, LNCS 3360, December 2004 Co-editors: Roger (Buzz) King, Maria Orlowska, Elisa Bertino, Dennis McLeod, Sushil Jajodia, and Leon Strous
JoDS III
Special Issue on Semantic-Based Geographical Information Systems, LNCS 3534, August 2005 Guest Editor: Esteban Zimányi
JoDS IV
Normal Issue, LNCS 3730, December 2005
JoDS V
Special Issue on Extended Papers from 2004 Conferences, LNCS 3870, February 2006 Co-editors: Paolo Atzeni, Wesley W. Chu, Tiziana Catarci, and Katia P. Sycara
JoDS VI
Special Issue on Emergent Semantics, LNCS 4090, September 2006 Guest Editors: Karl Aberer and Philippe Cudre-Mauroux
JoDS VII
Normal Issue, LNCS 4244, November 2006
JoDS VIII
Special Issue on Extended Papers from 2005 Conferences, LNCS 4830, February 2007 Co-editors: Pavel Shvaiko, Mohand-Saïd Hacid, John Mylopoulos, Barbara Pernici, Juan Trujillo, Paolo Atzeni, Michael Kifer, François Fages, and Ilya Zaihrayeu
JoDS IX
Special Issue on Extended Papers from 2005 Conferences (continued), LNCS 4601, September 2007 Co-editors: Pavel Shvaiko, Mohand-Saïd Hacid, John Mylopoulos, Barbara Pernici, Juan Trujillo, Paolo Atzeni, Michael Kifer, François Fages, and Ilya Zaihrayeu
JoDS X
Normal Issue, LNCS 4900, February 2008
JoDS XI
Special Issue on Extended Papers from 2006 Conferences, LNCS 5383, December 2008 Co-editors: Jeff Z. Pan, Philippe Thiran, Terry Halpin, Steffen Staab, Vojtech Svatek, Pavel Shvaiko, and John Roddick
JoDS XII
Normal Issue, in press, March 2009
JoDS Editorial Board
Editor-in-Chief: Stefano Spaccapietra, EPFL, Switzerland
Co-editor-in-Chief: Lois Delcambre, Portland State University, USA
Members
Carlo Batini, Università di Milano Bicocca, Italy
Alex Borgida, Rutgers University, USA
Shawn Bowers, University of California Davis, USA
Tiziana Catarci, Università di Roma La Sapienza, Italy
David W. Embley, Brigham Young University, USA
Jérôme Euzenat, INRIA Alpes, France
Dieter Fensel, University of Innsbruck, Austria
Fausto Giunchiglia, University of Trento, Italy
Nicola Guarino, National Research Council, Italy
Jean-Luc Hainaut, FUNDP Namur, Belgium
Ian Horrocks, University of Manchester, UK
Arantza Illarramendi, Universidad del País Vasco, Spain
Larry Kerschberg, George Mason University, USA
Michael Kifer, State University of New York at Stony Brook, USA
Tok Wang Ling, National University of Singapore, Singapore
Shamkant B. Navathe, Georgia Institute of Technology, USA
Antoni Olivé, Universitat Politècnica de Catalunya, Spain
José Palazzo M. de Oliveira, Universidade Federal do Rio Grande do Sul, Brazil
Christine Parent, Université de Lausanne, Switzerland
Klaus-Dieter Schewe, Massey University, New Zealand
Heiner Stuckenschmidt, University of Mannheim, Germany
Pavel Shvaiko, Informatica Trentina, Italy
Katsumi Tanaka, University of Kyoto, Japan
Yair Wand, University of British Columbia, Canada
Eric Yu, University of Toronto, Canada
Esteban Zimányi, Université Libre de Bruxelles, Belgium
Table of Contents
Multidimensional Integrated Ontologies: A Framework for Designing Semantic Data Warehouses . . . . . 1
Victoria Nebot, Rafael Berlanga, Juan Manuel Pérez, María José Aramburu, and Torben Bach Pedersen

A Unified Object Constraint Model for Designing and Implementing Multidimensional Systems . . . . . 37
François Pinet and Michel Schneider

Modeling Data Warehouse Schema Evolution over Extended Hierarchy Semantics . . . . . 72
Sandipto Banerjee and Karen C. Davis

An ETL Process for OLAP Using RDF/OWL Ontologies . . . . . 97
Marko Niinimäki and Tapio Niemi

Ontology-Driven Conceptual Design of ETL Processes Using Graph Transformations . . . . . 120
Dimitrios Skoutas, Alkis Simitsis, and Timos Sellis

Policy-Regulated Management of ETL Evolution . . . . . 147
George Papastefanatos, Panos Vassiliadis, Alkis Simitsis, and Yannis Vassiliou

Author Index . . . . . 179
Multidimensional Integrated Ontologies: A Framework for Designing Semantic Data Warehouses

Victoria Nebot1, Rafael Berlanga1, Juan Manuel Pérez1, María José Aramburu1, and Torben Bach Pedersen2

1 Universitat Jaume I, Av. Vicent Sos Baynat, s/n, E-12071 Castelló, Spain
{romerom,berlanga,juanma.perez,aramburu}@uji.es
2 Aalborg University, Selma Lagerløfs Vej 300, DK-9220 Aalborg Ø, Denmark
[email protected]

Abstract. The Semantic Web enables organizations to attach semantic annotations taken from domain and application ontologies to the information they generate. The concepts in these ontologies could describe the facts, dimensions and categories implied in the analysis subjects of a data warehouse. In this paper we propose the Semantic Data Warehouse to be a repository of ontologies and semantically annotated data resources. We also propose an ontology-driven framework to design multidimensional analysis models for Semantic Data Warehouses. This framework provides means for building a Multidimensional Integrated Ontology (MIO) including the classes, relationships and instances that represent interesting analysis dimensions, and it can also be used to check the properties required by current multidimensional databases (e.g., dimension orthogonality, category satisfiability, etc.). In this paper we also sketch how the instance data of a MIO can be translated into OLAP cubes for analysis purposes. Finally, some implementation issues of the overall framework are discussed.

Keywords: Data warehouses, Semantic Web, Multi-ontology integration.
1 Introduction

The Semantic Web is a rich source of knowledge whose exploitation will open new opportunities to the academic and business communities. One of these opportunities is the analysis of information resources for decision support tasks such as the identification of trends, and the discovery of new decision variables. Semantic annotations are formal descriptions of information resources which usually rely on widely accepted domain ontologies. The main reason for using domain ontologies is to set up a common terminology and logic for the concepts involved in a particular domain. Semantic annotations are especially useful for describing unstructured, semi-structured and text data, which cannot be managed properly by current database systems. Nowadays many applications (e.g., medical applications) attach metadata and semantic annotations to the information they produce, for example medical images, laboratory tests, etc. In the near future, large repositories of semantically annotated data will be available, opening new opportunities for enhancing current decision support systems.
Data warehouse systems are stores of information aimed at analysis tasks. This information is extracted from existing databases and is pre-processed to harmonize its syntax and semantics. Thus, one of the main purposes of data warehouse systems is the integration of information coming from several sources. Afterwards, OLAP systems can be applied to efficiently exploit the stored information. Both types of systems rely on multidimensional data models, which distinguish the stored measures from the analysis dimensions that characterize them.

In this paper we tackle the problem of combining data warehouse and Semantic Web technologies. Our proposal is a framework for designing multidimensional analysis models over the semantic annotations stored in a Semantic Data Warehouse (SDW). In our approach, an SDW is conceived as an XML repository that includes web resources, domain ontologies and the semantic annotations made with them. Being a data warehouse, this repository is subject oriented, and therefore it is aimed at recording only data that is relevant for specific analysis tasks. Our work is being carried out in the context of a larger research project about the integration and exploitation of biomedical data provided by clinicians for research tasks. The framework presented here is based on the specification of a Multidimensional Integrated Ontology (MIO) over the SDW ontologies in order to retrieve the ontology classes and instances that will later be used in the multidimensional analysis. To the best of our knowledge, our approach is the first one to address the following requirements:

• Multi-ontology design. Much semantic data is generated in the context of very complex scenarios involving several domain ontologies. The framework proposed in the paper allows the selection of the concepts needed for the analysis through different ontologies.
• Scalability. As domain ontologies usually have a considerably large size, the method for building MIOs must be scalable. We will achieve these scalability requirements by extracting only those modules or fragments that are necessary from the source ontologies.
• Formally well-founded approach. In order to keep the semantics and inference mechanisms of the source ontologies, the proposed design process relies on formalisms that have been widely accepted for the Semantic Web (e.g., Description Logics).

The main contributions of the paper can be summarized as follows:
1. A framework for designing and building Semantic Data Warehouses.
2. An application scenario and a running use case to establish the requirements and to illustrate the usefulness of our techniques.
3. A methodology for the design, automatic generation and validation of Multidimensional Integrated Ontologies. By integrating the concepts and properties of several ontologies coming from the same application domain, a MIO establishes the topics, measures, dimensions and hierarchies required by a specific data analysis application.
4. The automatic construction of a multidimensional cube, according to the specifications of a MIO, starting from the annotated data stored in the SDW, in order to allow the analysis of this data by using traditional OLAP operators.
5. The study of several alternatives for implementing the proposed SDW.
The rest of the paper is organized as follows. Section 2 describes an application scenario that motivates our approach. Section 3 reviews the related work including: Description Logics, OWL and OLAP; the existing approaches to annotate biomedical data; the combination of Semantic Web and data warehouse technologies; and different alternatives for exploiting knowledge from multiple ontologies. Section 4 introduces our approach to a Semantic Data Warehouse. Section 5 explains the methodology proposed for designing Multidimensional Integrated Ontologies and Section 6 gives some implementation guidelines. Finally, Section 7 presents some conclusions and future work.
Fig. 1. Generation of semantic annotations in the biomedical domain
2 Application Scenario and Use Case

In this section we describe an application scenario for an SDW along with a use case that will serve to define the examples of the rest of the paper. By defining this application scenario, we will identify a list of requirements that can be considered common to many applications of SDWs, and that, therefore, can be applied to prove the usefulness of the framework proposed in this paper.

Our application scenario is Biomedicine in which, at the moment, vast amounts of semantically annotated data are being generated by many different types of data management systems (see section 3.2). In order to guide the process of semantically annotating the data, current data management systems adopt specific application ontologies relying on one or more widely accepted domain ontologies. A domain
ontology is a very large corpus of semantically related data that describe the knowledge and vocabularies agreed by the relevant biomedical community. The reader can find a good review of the main biomedical ontologies in (Rubin et al., 2007). Figure 1 shows the usual process of generating semantic annotations for the data elements that biomedical activities produce. The application ontologies that rule the structure of the semantic annotations are located in the core of the data management system. At the cortex part, we find the different types of complex data elements, coming from very different biomedical activities and departments, that need to be annotated before being exploited in the context of an SDW. Typically, semantic annotations are expressed in XML or RDF formats.
Fig. 2. A fragment of an application ontology for Rheumatology
In the biomedical scenario, semantically annotated data consists of many different types of data (e.g. lab test reports, ultrasound scans, images, etc.) originating from heterogeneous data sources. This data also presents complex relationships that evolve rapidly as new biomedical research methods are applied. As a consequence, this data cannot be properly managed by current data warehouse technology, mainly because it is complex, semi-structured, dynamic and highly heterogeneous. Figure 2 illustrates an ontology fragment for the Rheumatology domain. As the figure shows, a patient may have different rheumatology reports, authored by some clinicians, consisting of the results of some blood tests and rheumatologic exams, the diagnosis of a disease (defined in the domain NCI ontology) and the proposed treatment. The objective of these examinations is to estimate an overall damage index by performing some ultrasonography tests. The treatment is modelled as a collection of drug therapies, sometimes applied in the affected joints. The joint set is compiled
from the GALEN domain ontology. The patient has a genetic profile. The cells and genes involved in the genetic profiles are described by the GALEN and GO domain ontologies, respectively. Although in Figure 2 we have used UML to graphically represent the ontology fragment, the actual representation formalism will in practice rely on standard languages such as RDF/S and OWL. External concepts coming from domain ontologies are represented in the UML diagram with shaded boxes, indicating the source ontology within the attribute section (e.g. NCI, GO, etc.). Domain ontologies can be used to control the vocabulary and to bring further semantics to the annotated data. Table 1 shows an example of semantically annotated data generated from the application ontology of Figure 2 and stored as RDF triples.

Table 1. Application ontology instances stored as RDF triples
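Table 1 itself is not reproduced here. As a rough, hypothetical illustration of what such instance triples look like (our sketch, not the paper's actual data), the following Python/rdflib fragment builds a few triples that mimic the application ontology of Figure 2; all URIs, identifiers and values are invented:

```python
# Hypothetical instance triples in the spirit of Table 1; all names and values are invented.
from rdflib import Graph, Literal, Namespace, RDF

APP = Namespace("http://example.org/rheuma-app#")  # placeholder application-ontology namespace
NCI = Namespace("http://example.org/nci#")          # placeholder for an NCI-style namespace

g = Graph()
g.add((APP.patient01, RDF.type, APP.Patient))
g.add((APP.patient01, APP.age, Literal(54)))
g.add((APP.patient01, APP.has_report, APP.report01))
g.add((APP.report01, RDF.type, APP.RheumatologyReport))
g.add((APP.report01, APP.DateOfVisit, Literal("2008-11-03")))
g.add((APP.report01, APP.has_diagnosis, NCI.Rheumatoid_Arthritis))
g.add((APP.report01, APP.has_therapy, APP.therapy01))
g.add((APP.therapy01, RDF.type, APP.DrugTherapy))
g.add((APP.therapy01, APP.dosage, Literal("20 mg/week")))

# Print the triples one per line, roughly as they would be stored in the SDW repository.
for s, p, o in g:
    print(s, p, o)
```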
In the context of this application scenario, our aim is to build a warehouse where semantically annotated data can be analysed with OLAP-based techniques. As a use case, we propose to analyse the efficacy of different drugs in the treatment of several types of inflammatory diseases, mainly rheumatic ones. The analysts of this use case should define the dimensions, measures and facts that will allow the analysis of the semantic annotations, gathered from several hospitals and, therefore, expressed with different application ontologies. Notice that at this point, the analyst knows neither the values nor the roll-up relationships that will eventually be used in the resulting cube. As we will show, the framework presented in this paper will capture this information from the application and the domain ontologies involved in the analysis. Figure 3 shows the seven dimensions that we have selected in order to study this use case from different points of view, including: the patient's age and gender, the subtype of disease (diagnosis), the biomarkers taken from the patient, the damage index of the patient's joints and the drugs administered during the follow-up visits of the patient. Since we consider that the relation that exists between disease symptoms and affected body parts is very relevant for the analysis, we have introduced the category Anatomy in the disease dimension. The biomarkers of interest include blood cells, blood factors and genes. The category Tissue has been similarly introduced in the biomarkers dimension in order to relate biomarkers with their associated tissues.
Fig. 3. Dimensions defined for analyzing rheumatology patients. We use the letter D for dimensions, F for facts, M for measures and L for dimension levels.
In this use case, OLAP technologies can be applied to perform useful analysis operations over the gathered data, as for example:

• By applying roll-up operations, we can aggregate data into coarser granularities such as drug families, active principles, types of diseases, and so on. On the contrary, by means of the available drill-down operations, we can refine each of the analysis dimensions to obtain data with a finer granularity. This kind of operation can give useful information to the clinicians about the relation between diagnosis and treatment efficacy.
• By applying selection and projection operations, we can restrict the analysis to patient subsets according to criteria based on age, sex, affected body parts, etc. (An illustrative sketch of these operations is given at the end of this section.)

In this section we have defined an application scenario and a use case for the SDWs we want to achieve. In this scenario we identify the following set of application requirements:
1. Integration of biomedical data, information and knowledge to gain a comprehensive view of patients.
2. Scalable data storage functionalities to store the collected semantic information as well as the relevant application and domain ontologies.
3. Flexible ways of specifying analysis dimensions, measures and facts based on medical criteria.
4. Easy exploration of large domain ontologies considering their implicit semantics, and the possible overlapping in their concepts (e.g. mappings).

In the context of other application scenarios these requirements should not be much different, so from our point of view, they can be considered as a basic set of requirements for a generic analysis application of an SDW. It is worth mentioning that the contributions of this paper described in the introduction are aimed at covering all these requirements.
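The sketch announced above follows. It is purely illustrative (ours, not part of the paper): it simulates a roll-up and a selection with pandas over a small, flattened fact table whose column names and values are all invented.

```python
# Illustrative roll-up and selection over a hypothetical flattened fact table.
import pandas as pd

facts = pd.DataFrame({
    "drug_family":  ["TNF inhibitor", "TNF inhibitor", "Corticosteroid", "Corticosteroid"],
    "disease_type": ["Rheumatoid Arthritis", "Juvenile Arthritis",
                     "Rheumatoid Arthritis", "Juvenile Arthritis"],
    "patient_age":  [54, 11, 63, 9],
    "damage_index": [2.1, 1.4, 3.5, 2.8],
})

# Roll-up: aggregate the damage index measure to the (drug family, disease type) granularity.
rollup = facts.groupby(["drug_family", "disease_type"])["damage_index"].mean()

# Selection: restrict the analysis to paediatric patients only.
children = facts[facts["patient_age"] < 16]

print(rollup)
print(children)
```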
3 Background and Related Work

In this section we review the basic concepts involved in the representation, generation and storage of semantic annotations of data, as well as some related work about the analysis of semantic data.

3.1 OWL, Description Logics and OLAP

The Web Ontology Language (OWL) is a language for the specification of ontologies, whose definition by the W3C Consortium has empowered the biomedical community to develop large and complex ontologies like the NCI thesaurus, GALEN, etc. OWL provides a powerful knowledge representation language that has a clean and well-defined semantics based on Description Logics (DL). Description Logics are a family of knowledge representation formalisms devised to capture most of the requirements of conceptual modelling. These formalisms are decidable subsets of First Order Logic that are expressive enough to capture interesting conceptual modelling properties. The main purpose of DLs is to provide a formal theory that can be used to validate conceptual schemata (Franconi & Ng, 2000) of heterogeneous databases (Mena et al., 2000), data warehousing design and multidimensional aggregation modelling (Baader & Sattler, 2003). It is worth mentioning that Baader & Sattler (2003) and Franconi & Ng (2000) apply DLs in the context of a traditional warehouse. Our proposal is different; we propose to design the warehouse starting from a collection of semantically annotated data. We use DLs for helping the warehouse designer to transform ontology fragments into analysis dimensions, by testing if these dimensions satisfy a set of properties desirable for OLAP applications.

Let us briefly introduce the basic constructors of Description Logics through the basic language ALC (Schmidt-Schauss & Smolka, 1991), whose concept expressions are summarised as follows:

C, D ::= ⊤ | ⊥ | A | ¬C | C ⊓ D | C ⊔ D | ∃R.C | ∀R.C

The basic elements of ALC are concepts (classes in OWL notation), which can be either atomic (A) or derived from other concepts (expressions C and D). Complex concepts are built by using the classical Boolean operators over concepts, namely: and (⊓), or (⊔) and not (¬). Value restrictions on the concept individuals (instances in OWL notation) are represented through roles (object properties in OWL notation), which can be either existential (∃R.C) or universal (∀R.C). The universal concept is denoted with ⊤, whereas the empty concept is denoted with ⊥. The empty concept is usually associated with inconsistencies and contradictions in the ontology.

Currently there exist several reasoners that deal with some Description Logic languages1, although most of them do not fully support the retrieval of large sets of asserted instances. Indeed, the complexity of these reasoners is PSpace-complete, which does not guarantee scalability for large domains.
1 See http://www.cs.man.ac.uk/~sattler/reasoners.html for an exhaustive list. More information about DLs can be found at http://dl.kr.org/
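To make the ALC constructors above concrete, consider a few illustrative axioms for the rheumatology scenario of Section 2. This is our own sketch, loosely based on the entities of Figure 2; the axioms themselves do not appear in the paper.

```latex
% Illustrative ALC axioms (our own example); the axioms are hypothetical.
\begin{align*}
\mathit{RheumatologyReport} &\sqsubseteq \exists \mathit{has\_diagnosis}.\mathit{Disease}
      \sqcap \forall \mathit{has\_therapy}.\mathit{DrugTherapy}\\
\mathit{Patient} &\sqsubseteq \exists \mathit{has\_report}.\mathit{RheumatologyReport}\\
\mathit{Disease} \sqcap \mathit{Drug} &\sqsubseteq \bot
\end{align*}
```

The last axiom states that nothing can be both a disease and a drug; an instance asserted to belong to both would make the conjunction collapse to the empty concept ⊥, which is how a reasoner detects such an inconsistency.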
Additionally, several DL constructors have been proposed to capture the main elements of conceptual modelling for databases. For example, concrete domains were introduced to account for the usual data types in a conceptual database schema. It has been demonstrated that domains like the integers and strings can be easily introduced into a DL without losing decidability2 (Lutz et al., 2005). Furthermore, users can state features (i.e., relations between instances and values from these domains) with predicates expressing value comparisons. OWL languages support these constructors via the so-called data type properties.

Another interesting constructor for OLAP applications is that of role composition, R◦P, which has recently been introduced in OWL. Role composition allows us to express joined relationships making the intermediate involved concepts implicit. Reasoning over role compositions has been shown to be decidable (Horrocks & Sattler, 2003), but it is not fully supported by current reasoners yet.

Concerning data warehouse operations, Baader & Sattler (2003) introduced aggregates over concrete domains. The resulting language, called ALC(Σ), extends the basic language ALC with concrete domains and a limited set of aggregation functions, namely: sum, min, max and count. Aggregates are introduced through complex features of the form Γ(R ◦ u), which relate each instance with the aggregate Γ over all the values reachable from R followed by the feature u. For example, we can define the complex feature sum(month ◦ income) to relate instances with their annual incomes. With this complex feature we can ask for employees having annual incomes greater than 100,000 Euros by means of the concept:
Employee ⊓ ∃year.>(sum(month ◦ income), 100000)

However, DL formalisms present important limitations for representing complex measures and aggregations. Baader & Sattler (2003) also demonstrate that handling aggregates in DLs usually leads to undecidability problems, even for very simple aggregates such as sum and count. Moreover, decidable cases present a level of computational complexity too high for practical real-world applications. Baader and Sattler indicate that some interesting inference problems for multidimensional models, such as summarizability, have not been treated by the proposed DLs. Finally, there are no reasoners able to deal with the advanced features required by these new constructors. For these reasons, we propose a new framework to define an integrated ontology that will be used to build a multidimensional data schema over which to apply the OLAP operations required by the analysis tasks. In this way, summarizability will be ensured by building a valid cube from this multidimensional schema so that aggregations are performed over it, out of the DL formalism.

3.2 Annotating Biomedical Data

In the biomedical scenario there exist a large number of initiatives for annotating biomedical databases for the Semantic Web. For example, in the SEMEDA project, Köhler et al. (2003) use a controlled vocabulary and an RDF-like ontology to annotate
2 This occurs whenever the introduced domain satisfies the so-called admissibility property.
tables, attributes and their domains to derive cross-references between databases. ONTOFUSION (Pérez-Rey et al., 2005) is another approach based on the integration of local conceptual schemata into a global biomedical ontology. A good review of semantic-based approaches for biomedical data integration can be found in (Louie et al., 2006). It is worth mentioning that most of the current work in biomedical applications uses OWL as the representation language for ontologies and semantic annotations. Currently, there are several ongoing international projects that are aimed at the interchange of massive biomedical data, for example caBig3, openEHR4 and Health-e-Child5, to mention a few. These projects also concern the semantic annotation of data through well-established biomedical ontologies.

Other previous works propose to use OLAP techniques to analyse biomedical data. In (Wang et al. 2005) OLAP operations are applied to discover new relations between diseases and gene expressions as well as to find out new classification schemes for patients. They also propose the use of well-known domain ontologies (e.g., GO6 for classifying genes and OpenGalen7 for classifying diseases) to define analysis dimensions. However, the authors do not explain how these ontologies can be translated into OLAP dimensions and how factual data can be semantically annotated for analysis.

From all the previous works and projects, three logical data layers can be identified for the application scenario, namely: the domain ontologies, the data schemata and the generated data. All these data and knowledge pieces are eventually expressed in XML, using the different standards best suited for each layer: RDF/S and OWL for the first one, RDF/S and XML Schema for the second one, and XML for the third one. We also follow this logical structure in our approach to designing an SDW (see Section 4).

3.3 Data Warehousing and Semantic Web Technologies

In this section we review the work that combines data warehouse and Semantic Web technologies. We start with two papers that extend the functionality of a data warehouse with Semantic Web technologies, and then we consider previous works on analysing semantic data with multidimensional data models.

Priebe & Pernul (2003) propose to use a global ontology to annotate OLAP reports and other Web resources such as textual documents. Then, users can contextualise OLAP reports by retrieving the documents related to the metadata (search keywords) attached to them. Here, the global ontology is expressed in RDF/S and it contains domain-specific information along with the values of the hierarchies used in the OLAP database.

Skoutas & Simitsis (2006) work on the automation of the data warehouse’s ETL process by applying Semantic Web technologies. They propose to build an ontology that uses OWL constructs to describe and relate the source and target data source schemata. Afterwards, a reasoner is used for identifying the sequence of operations
3 caBig project: https://cabig.nci.nih.gov/
4 openEHR project: http://www.openehr.org/
5 Health-e-Child project: http://www.health-e-child.org/
6 GO (Gene Ontology): http://www.geneontology.org/
7 Galen ontology: http://www.opengalen.org/open/crm/crm-anatomy.html
needed to load the warehouse. In a more recent paper, Simitsis et al. (2008) present a template-based natural language generation mechanism to transform both the formal description of the data sources expressed in the ontology, and the inferred ETL operations into a narrative textual report more suitable for the user. The works by Priebe & Pernul (2003) and Skoutas & Simitsis (2006) apply the Semantic Web infrastructure to extend the functionality of the “traditional” data warehouses, but they do not address the analysis of data gathered from semantic sources. In contrast, our proposal consists of a method for designing multidimensional analysis models over the semantic annotations stored in the SDW.

To the best of our knowledge, there are only two recent papers aimed at analysing semantic data with multidimensional models, (Romero & Abelló, 2007) and (Danger & Berlanga, 2008). Romero & Abelló (2007) address the design of the data warehouse multidimensional analysis schema starting from an OWL ontology that describes the data sources. They identify the dimensions that characterize a central concept under analysis (the fact concept) by looking for concepts connected to it through one-to-many relationships. The same idea is used for discovering the different levels of the dimension hierarchies, starting from the concept that represents the base level. In this work the input ontology indicates the multiplicity of each role in the relationships, and a matrix keeps, for each concept, all the concepts that are related by means of a series of one-to-many relationships. The output of Romero & Abelló’s method is a star or snowflake schema that guarantees the summarizability of the data, suitable to be instantiated in a traditional multidimensional database. The application of this work is valid in scenarios where a single ontology of reduced size, with multiplicity restrictions, is used for annotating the source data. However, as discussed in Section 2, a real application will usually involve different domain ontologies of considerably large size; and unfortunately, the multiplicity information is rarely found in the source ontologies.

Danger & Berlanga (2008) propose a multidimensional model specially devised to select, group and aggregate the instances of an ontology. The result of these operations is a set of tuples, whose members are instances of the ontology concepts. They also present the adaptation of a feature selection algorithm to discover interesting potential analysis dimensions. This algorithm builds the dimension hierarchies by selecting the relationships in the ontology that maximize the information gain. Like Romero & Abelló (2007), Danger & Berlanga only consider scenarios with a single ontology.

As can be observed, both papers are more concerned with the extraction of interesting dimensions from isolated ontologies than with analysing a large set of stored SDW annotations. Moreover, in a real-world scenario, the SDW can contain annotations defined in several large inter-linked ontologies. Our contribution in this context is twofold. First, we define the Semantic Data Warehouse as a new semi-structured repository consisting of the semantic annotations along with their associated set of ontologies. Secondly, we introduce the Multidimensional Integrated Ontology as a method for designing, validating and building OLAP-based cubes for analysing the stored annotations.

The development of the Semantic Web relies on current XML technology (e.g., XML Schemas and Web Services).
In Perez et al. (2008), we surveyed the combination of XML and data warehouses. The work on the construction of XML
repositories (Xyleme, 2001) is particularly relevant to the SDW, since the ontologies and their instance data are typically expressed in XML-like formats. Xyleme (2001) addresses the problems of gathering, integrating, storing and querying XML documents. In order to deal with the high level of dynamicity of web data sources, the Xyleme system allows users to subscribe to changes in an XML document (Nguyen et al., 2001), and applies a versioning mechanism (Marian et al., 2001) to compute the differences between two consecutive versions of an XML document. However, XML techniques for change control are not useful for ontologies, as we must keep track of non-explicit (i.e. inferred) semantic discrepancies between versions. Although some preliminary tools exist, like OWLDiff8, further research must be carried out to study the impact of these changes in the SDW design and its derived OLAP cubes. In this paper we will not treat ontology versioning as it is out of the scope of this work. Thus, we assume that the ontologies stored in the SDW are static.

3.4 Multi-ontology Scenarios

The application scenario presented in this paper reveals new data acquisition tools being applied in the biomedical domain. These tools are increasingly incorporating ontology services that allow end-users (e.g. clinicians) to properly annotate data in a standard and controlled way. This task is fulfilled by browsing and selecting terms from domain ontologies and vocabularies (Garwood et al., 2004, Jameson et al., 2008). In order to integrate and analyze the large amounts of semantic annotations generated by these tools, we propose the construction of a MIO that gathers only the right amount of knowledge from the different domain ontologies that were used to annotate the data.

Many research works have dealt with multi-ontology scenarios, which is the key feature of a distributed environment like the Semantic Web. The relevant terms and approaches encountered in the literature include: mapping, alignment, merging, articulation, fusion, integration and so on (Kalfoglou and Schorlemmer, 2003). The aim of this paper is not to provide a new framework for ontology integration and mapping. Instead, we propose the construction of MIOs specifically designed to meet the requirements and restrictions of the application scenario presented. However, since there is an extensive literature concerning ontology modularization and mapping, we will highlight the main approaches devised to deal with several ontologies along with their suitability for our application scenario. Finally, we will justify the approach followed to build our MIO framework.

OBSERVER (Mena et al., 2000) and OIS (Calvanese et al., 2001) are some of the first approaches that tackle the problem of semantic information integration between domain-specific ontologies. The former system is based on a query strategy where the user specifies queries in one ontology's terms and then these queries are expanded to other ontologies through relationships such as synonymy, hyponymy and hypernymy. The latter also uses the notion of queries, which allow for mapping a concept in one ontology into an integrated view. However, these approaches are not suitable for our application scenario since our aim is to construct a new stand-alone ontology composed of pieces or fragments from several ontologies. Therefore, we have studied the developments in modular ontologies, since they seem to suit our purposes better.
8 OWLDiff: http://sourceforge.net/projects/owldiff
E-connections (Grau et al., 2005) is a formalism that was designed for combining different logics in a controlled way. It introduces a new family of properties called “link” properties which are associated with domains (component ontologies). Each domain can declare which foreign ontologies it links to. However, E-connections do not allow the specification of subsumption relationships between concepts coming from different ontologies, and they work only under disjoint domains. Moreover, E-connections are realised by extending OWL with new non-standard syntax and semantics.

The Distributed Description Logics (DDL) formalism (Borgida and Serafini, 2003) provides mechanisms for referring to ontology concepts and for defining “bridge rules” that encode subsumption between concepts of different ontologies. Context OWL (C-OWL) (Bouquet et al., 2003) is an extension of DDLs that suggests several improvements, such as a richer family of bridge rules, allowing bridging between roles, etc. C-OWL also extends OWL syntax and semantics. In contrast to E-connections, however, C-OWL does not allow the reuse of foreign concepts in restrictions. There is yet another approach called Package-based Description Logics (P-DL) (Bao et al., 2006) that tries to overcome the limitations introduced by E-connections and C-OWL by allowing both subsumption between different ontologies, and foreign concepts in restrictions. However, as in the above-mentioned approaches, another non-standard syntax and semantics is introduced and reasoning support is very restricted.

In all previous approaches we can observe serious limitations that prevent us from using them in the construction of our MIO framework. In the first place, they all introduce changes to the syntax and semantics of OWL; therefore, all the available infrastructure such as OWL parsers and reasoners would need to be extended. Moreover, they severely restrict reuse by other organizations and only accept customized, non-standard toolsets. Concerning reasoning aspects, reasoning with multiple distributed ontologies can raise some problems with respect to completeness and performance. Completeness depends on the availability of each local reasoner, which in a distributed network could be unreachable. Moreover, the communication costs between nodes in the system can become a bottleneck, since communication problems can arise. Borgida and Serafini (2003) also establish a connection between DL and DDL that allows them to transfer theoretical results and reasoning techniques from the classical DL literature under certain circumstances. Unfortunately, their approach to constructing a global DL ontology implies copying all the axioms of the local ontologies. In our application scenario, this approach is not scalable since domain ontologies are usually very large and complex.

In order to address the problems of previous approaches, Stuckenschmidt and Klein (2007) define modular ontologies in terms of a subset of DDL and provide rationales for the restrictions applied. They compute subsumption relations between external concepts offline and store them as explicit axioms in the local ontologies. However, this modular approach can be computationally very expensive because in the worst case it has exponential cost. We address the previous limitations by proposing the use of alternative techniques to extract fragments and modules from ontologies and combine them in the resulting MIO framework, namely: OntoPath (Jiménez-Ruiz et al., 2007) and Upper Modules (UM) (Jiménez-Ruiz et al., 2008).
The application of these tools provides a viable alternative without changing the current Semantic Web infrastructure. In this way, ontologies can be expressed using standard OWL syntax and semantics, and external tools implementing different modularization algorithms extract a fragment or module according to the specific requirements of the target application. As a result, module
extraction algorithms do not require any change to the OWL semantics. Moreover, we overcome the scalability problems that may arise when reasoning with several large ontologies by building a MIO that only comprises the relevant knowledge (e.g. relevant modules or fragments). Both techniques will be further explained in Section 5.3.
4 An Approach to Semantic Data Warehouses

We conceive a Semantic Data Warehouse as a semi-structured data warehouse that stores ontology-based semantic annotations along with the mechanisms that allow the execution of analysis operations over the stored data. The special features of this kind of semantically-rich data will require the application of OWL and general XML technologies when building and managing the warehouse.

In Figure 4, we can distinguish several components of the framework proposed for designing and analysing the SDW. As we have already stated, the core part of the framework uses the SDW ontologies to specify a Multidimensional Integrated Ontology suitable for analysis purposes. On the left side of the figure, we can see the processes in which the user of the framework (e.g. analyst) actively participates during the design of the MIO. In the centre of the figure, we show the tools needed to come up with the MIO and with the subsequent multidimensional cube. Finally, the right side of the figure shows the logical organization of the data and the schemata of the SDW. We will begin by explaining the latter.

In a real-world scenario, an SDW requires storing the huge amount of annotated data to be analysed together with the application ontologies used to generate it. However, given the complexity of many applications, application ontologies are usually based on one or more community-agreed ontologies, also denoted domain ontologies, which should also be part of the warehouse. In this way, the resulting SDW would include all the data and knowledge necessary for processing complex analysis queries. The four types of data sets that an SDW stores and their relationships (right side of Figure 4) are explained in turn:

• A set of domain ontologies that will contain the agreed terminology and knowledge about the subject of analysis. In our biomedical scenario, this set consists of the ontologies that could be useful for annotating patient data, such as UMLS, NCI Ontology, etc.
• A set of application ontologies needed for generating the data that will be stored in the intended data warehouse. These ontologies resemble database schemata but they are more flexible in the sense that they allow incomplete, imprecise and implicit definitions for the generated data. These ontologies will use the domain ontologies to bring proper meaning to their concepts. In a real-world scenario, application ontologies should be tailored to the requirements of the users that will share activities over the generated data. For example, in the biomedical scenario, the application ontology defined by Rheumatology clinicians will be quite different from that defined by Cardiology specialists.
• The set of ontological instances generated from the previous application ontologies. This constitutes the main repository of the data warehouse, and it is assumed to be the largest part of it. The analysis of ontological instances is the main purpose of the SDW, and new tools able to process complex analysis operations over them need to be developed.
• The set of MIO ontologies generated during the design process. These ontologies are the core feature of the SDW. They gather together only the relevant external knowledge so that later analysis can be performed over the ontological instances. MIOs can be thought of as alternative analysis perspectives over the ontological instances. Further details about their definition, generation and validation are given in the next section.

In order to generate the MIOs for the SDW, we also need a set of mappings between the ontologies whose domains overlap. This is necessary because different application ontologies can be using different domain ontologies to denote similar concepts, for example NCI or Galen for disease concepts. It is also possible that the analyst specifies dimensions with category levels that involve different domain ontologies. Therefore, we need mechanisms that reconcile the overlapping concepts borrowed from different ontologies. In our work, we represent mappings as 7-tuples 〈id, s1, s2, O1, O2, R, φ〉, where id is the unique identifier of the mapping, s1 and s2 are symbols from ontologies O1 and O2 respectively, R is the mapping relationship between these symbols, namely: equivalence (≡), subsumption (⊑) and disjointness (⊥), and φ is a confidence value in the range [0,1], which is usually estimated by the tool that discovered the mapping. From now on, an ontology symbol s that is transformed from the ontology O1 to O2 by using a mapping m is denoted with s_m^{O1→O2}.
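As an illustration of how such mapping tuples might be represented and used (our sketch, not the authors' implementation; the class, function and example values are hypothetical):

```python
# Sketch of the 7-tuple mapping representation described above; purely illustrative.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Mapping:
    id: str
    s1: str      # symbol in ontology O1
    s2: str      # symbol in ontology O2
    o1: str      # source ontology O1
    o2: str      # target ontology O2
    r: str       # "equivalence", "subsumption" or "disjointness"
    phi: float   # confidence value in [0, 1]

def translate(symbol: str, source: str, target: str,
              mappings: List[Mapping]) -> Optional[str]:
    """Return the target-ontology counterpart s_m^{O1->O2} via an equivalence mapping, if any."""
    for m in mappings:
        if m.r == "equivalence" and m.s1 == symbol and m.o1 == source and m.o2 == target:
            return m.s2
    return None

# Hypothetical example: the same disease concept denoted in NCI and GALEN.
m1 = Mapping("m1", "NCI:Rheumatoid_Arthritis", "GALEN:RheumatoidArthritis",
             "NCI", "GALEN", "equivalence", 0.92)
print(translate("NCI:Rheumatoid_Arthritis", "NCI", "GALEN", [m1]))
```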
Fig. 4. Proposed framework for the design of SDWs
The next section is completely devoted to describing the framework presented in Figure 4, which comprises the workflow for building the MIO and performing analysis operations over the stored data.
5 Multidimensional Integrated Ontologies (MIOs)

In this section we describe a technique for defining multidimensional conceptual schemata as a first step to analyze SDWs. A Multidimensional Integrated Ontology (MIO) can be considered as a customized ontology whose concepts and roles represent dimensions, categories, measures and facts. This ontology must also include all the axioms and assertions necessary for validating the intended multidimensional data model. As a result, MIOs can be used for both guiding designers in the definition of the analysis dimensions, and checking the resulting model for some interesting properties which ensure that valid final cubes will result.

Figure 5 shows the intermediate role played by a MIO during the design and analysis of an SDW. On one hand, the MIO represents a consistent subset of the data from the SDW which covers the requirements stated by the analyst. On the other hand, this subset of data is used to build well-formed OLAP cubes for the multidimensional analysis.
Fig. 5. Analyzing an SDW through a MIO
Following the notation of Figure 4, the framework workflow has the following four phases: Phase 1. MIO Definition: In this step, the analyst manually describes the topic of analysis, measures and dimensions that constitute the multidimensional conceptual schema. In order to accomplish this task, the analyst uses a Symbol Searcher, which retrieves the symbols (concepts and properties) to be used in the analysis from the SDW domain ontologies.
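For illustration, a simple keyword search over the rdfs:label annotations of the stored domain ontologies could play the role of such a Symbol Searcher; the sketch below uses rdflib, and the function name and file names are hypothetical.

from rdflib import Graph

# Hypothetical stand-in for the Symbol Searcher: keyword search over the
# rdfs:label annotations of the domain ontologies stored in the SDW.
def search_symbols(ontology_files, keyword):
    # Naive string interpolation of the keyword into the regex; fine for a sketch.
    query = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?symbol ?label WHERE {
            ?symbol rdfs:label ?label .
            FILTER regex(str(?label), "%s", "i")
        }""" % keyword
    hits = []
    for path in ontology_files:
        g = Graph()
        g.parse(path)  # rdflib guesses the RDF serialization from the file
        hits.extend((str(row.symbol), str(row.label)) for row in g.query(query))
    return hits

# e.g. search_symbols(["nci.owl", "galen.owl"], "rheumatoid")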
Phase 2. MIO Generation: Once the analyst has defined the MIO, the Module Extractor tool automatically generates the corresponding ontology with the following elements: 1. A set of modules having the necessary knowledge to make inferences with the external symbols included in the MIO. For example, the application ontology of Figure 2 uses external symbols to define the diagnosis of a disease. These symbols come from the NCI ontology, which contains axioms that are necessary for reasoning. 2. A top-level ontology with the knowledge required to integrate the previous modules. The top-level ontology is the result of the union of the upper modules (UM) extracted from the domain ontologies used in the MIO. Additionally, a set of axioms derived from the ontology mappings is included to reconcile overlapping concepts. 3. The local axioms derived from the definitions given by the analyst when defining the MIO (Phase 1). Phase 3. MIO Validation: To conclude the design phase, the resulting ontology is validated by the Consistency Checker tool in order to ensure that it will be able to generate the target cubes. If there is any inconsistency, the user is allowed to change axioms of the MIO so that a valid cube can be obtained. Phase 4. Analysis Phase: During this phase, all the instances that will be used for generating the facts and dimensions for an OLAP cube are retrieved by the Instance Extractor. Furthermore, there is a complex process called OWLtoMDS which must take into account the restrictions of the target OLAP tool (e.g. if it only allows strict hierarchies, level stratification, covering hierarchies, etc.) to transform the hierarchies of the MIO into suitable ones for analysis. Due to the inherent complexity of this task, it is beyond the scope of this paper to provide a description of these transformations. However, we believe a formal method should be designed, which should take into account the analyst's preferences regarding the definition of category levels, stratification and so on while making the process as easy and automatic as possible (Pedersen et al., 1999). Finally, facts are generated by applying some Transformations to the retrieved instances so that they conform to the multidimensional schema. In this step, ontology mappings may be required to transform instances that are non-compliant with the MIO. As a result, the final OLAP cube is generated. In the next subsections we will discuss the details of the design phase of the proposed methodology by means of a running example. 5.1 Phase 1: Defining the MIO A MIO definition specifies a set of dimensions and measures that can be extracted from the Semantic Data Warehouse ontologies. The design process proposed here consists of five steps in which the analyst uses the available ontologies to design a new one with the elements needed for the analysis task. The five steps are as follows: to select the topic of analysis, to specify the dimensions of analysis, to select the measures, to define potential roll-up relationships, and finally, to specify the instances to be analysed. We will develop the use case specified in Section 2 in order to illustrate the steps of the design process of the MIO. Remember that the objective of
the use case is to analyse the efficacy of different drugs in the treatment of several types of inflammatory diseases, mainly rheumatic ones. Step 1. In the first step, the topic of analysis is defined by selecting the concepts that are the focus of the analysis from the application ontologies. We denote by C^O a concept C taken from the ontology O. In our running example, the chosen concept is Patient^Rheuma. Notice that Patient^Rheuma represents all the patients defined in the Rheumatology application ontology. Step 2. Next, the concepts that will be used in the dimensions of analysis must be specified (see Table 2). In this step, the local concepts of the categories in each dimension are first defined and then related to the external concepts coming from the ontologies used for the stored annotations. The following table shows the concepts selected for defining the dimensions included in the MIO specified for the analysis case of our running example. Table 2. Concepts associated to the ontology dimensions and external concepts they relate to
In order to relate these local concepts to external ones, a set of axioms has been stated (see col. 4 of Table 2). For example, the axiom Disease^LOCAL ⊑ Rheumatoid_Arthritis^NCI states that the symbols used for the disease dimension will be the same as those used in the
domain ontology NCI under the concept Rheumatoid_Arthritis. Then, it will be possible to do the same inferences over these symbols as over the original ontology. In other words, the semantics given by the NCI ontology is assumed for our Disease dimension. As for dimension D5, the analyst wants to relate biomarkers (e.g. blood indicants and genes) to tissues. We have performed a review of the main biomedical ontologies searching for this kind of information and we have found GALEN to contain information about blood indicants and their relation to tissues (this relation is trivial since blood indicants measure blood cells, which are found in blood tissue). However, we have not found one or more ontologies that explicitly relate genes to specific cells or tissues. Thus, we have decided to define a tailored ontology that contains this information. Both the classification of genes and that of cells have been taken from UMLS. Then, we have manually established the corresponding relations based on the literature. We have named this ontology UMLS_GENES. It is important to notice that in the application ontology of our example, there are no concepts associated with Age, so the dimension D3 must be derived from the data type property age. In this case, we have created the new concept Age whose instances will be derived from age range values. The concept AgeGroup is defined locally to account for the different patient age groups, for example: newborn, child, juvenile, adult and elderly people. The transformation of numerical values into Age instances is performed during the construction of the OLAP cube. Step 3. The next step of the process consists of selecting the candidate measures coming from the data type properties existing in the application ontology. In our running use case, the DamageIndex could be a measure. The measure that counts the number of affected cases, like the other aggregation measures (e.g. sum, avg, etc.), cannot be specified at this stage due to the DL expressivity limitations. These kinds of measures will be defined and calculated during the analysis phase over the cube built from the MIO. As a consequence, measures are treated as dimensions in the MIO, as in (Pedersen et al., 2001). Step 4. Roll-up relationships are the next elements to be defined. Local roll-up properties are represented as R_Ci_Cj, denoting that instances of the concept Ci will be rolled up to instances of the concept Cj. As Table 2 shows, the local concepts Disease^LOCAL and Anatomy^LOCAL have been defined to represent the categories of dimension D1. Then, the roll-up relationship R_Disease_Anatomy^LOCAL is created and relates both categories through the following local axiom:
Disease^LOCAL ⊑ ∃R_Disease_Anatomy^LOCAL.Anatomy^LOCAL
which restricts the local concept Disease^LOCAL to roll up to an Anatomy^LOCAL concept. Analogous axioms are added for the rest of dimension categories. In Step 2, both local concepts have been associated with external ones (Rheumatoid_Arthritis^NCI and Anatomy_Kind^NCI, respectively). Therefore, the system will try to find an external roll-up relationship (e.g.
path of subsequent concepts and properties) in external ontologies that connects both external concepts. In this case, the following path has been found:
Rheumatoid_Arthritis^NCI / Disease_Has_Associated_Anatomic_Site^NCI / Anatomy_Kind^NCI
Therefore, the following axiom associates the local roll-up property defined with the external roll-up property found in the ontology:
Disease_Has_Associated_Anatomic_Site^NCI ⊑ R_Disease_Anatomy^LOCAL
Table 3 shows the set of local roll-up relationships along with their corresponding external ones defined for each dimension of the running example. The local axioms that represent roll-up relationships are defined, when possible, by composing roles (object properties) from the external ontologies. The external roll-up relationship found for R_Biomarker_Tissue^LOCAL involves two different ontologies (UMLS_GENE and NCI). We have made use of mappings in order to relate cells of both ontologies. Table 3. Roll-up axioms defined for the MIO of the use case. We use the DL constructor ∘ to represent role composition. Additionally, we use s_m^(O1→O2) to denote a transformation of symbol s from ontology O1 to O2 by using a mapping m.
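For illustration, the roll-up axioms of Table 3 can be rendered mechanically from a discovered path by composing the object properties found along it; the small helper below is ours and simply produces the DL-syntax string.

# Illustrative helper: given the object properties found on an external path,
# emit the role-composition axiom  r1 ∘ r2 ∘ ... ∘ rn ⊑ R_Ci_Cj  as a string.
def rollup_axiom(external_properties, local_rollup):
    return " ∘ ".join(external_properties) + " ⊑ " + local_rollup

# Axioms of the running example (dimension D1):
print(rollup_axiom(["Disease_Has_Associated_Anatomic_Site^NCI"],
                   "R_Disease_Anatomy^LOCAL"))
print(rollup_axiom(["has_Report^Rheuma", "has_diagnosis^Rheuma"],
                   "hasDim_D1^LOCAL"))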
Step 5. In the last step of the MIO design process, the instances to be analyzed are specified through a local concept that involves all the dimensions and measures previously defined:
Patient^LOCAL ≡ ∃hasDim_D1^LOCAL.Disease^LOCAL ⊓ ∃hasDim_D2^LOCAL.Drug^LOCAL ⊓ ∃hasDim_D3^LOCAL.Age^LOCAL ⊓ ∃hasDim_D4^LOCAL.Sex^LOCAL ⊓ ∃hasDim_D5^LOCAL.Biomarkers^LOCAL ⊓ ∃hasDim_D6^LOCAL.DamageIndex^LOCAL ⊓ ∃hasDim_D7^LOCAL.NumberOfVisit^LOCAL
Additionally, a set of local axioms must be stated to relate dimension properties to external properties. Table 4 shows the axioms proposed for the running example. It is worth mentioning that D5 (biomarkers) involves three different parts of the application ontology, namely: blood cell, factors and genes. Table 4. Axioms associated with the intended facts of the target cube
5.2 Phase 2: MIO Generation After completing the design of the MIO, the analyst has defined the topic of the analysis, the external concepts associated with dimensions, the roll-up relationships between dimension concepts and their links to external properties. Next, the system will automatically generate the MIO. This will consist of the following three elements:
MIO = ⋃∀Di LocalAxioms(Di) ∪ TopicAxioms ∪ ExternalAxioms
The set of local axioms for each dimension Di, denoted LocalAxioms(Di), will be built as the union of all the relevant specifications of the design process. For example, for the dimension D1 we have:
LocalAxioms(D1) = {
Disease^LOCAL ⊑ Rheumatoid_Arthritis^NCI,
Disease^LOCAL ⊑ ∃R_Disease_Anatomy^LOCAL.Anatomy^LOCAL,
Anatomy^LOCAL ⊑ Anatomy_Kind^NCI,
Disease_Has_Associated_Anatomic_Site^NCI ⊑ R_Disease_Anatomy^LOCAL,
has_Report^Rheuma ∘ has_diagnosis^Rheuma ⊑ hasDim_D1^LOCAL
}
The TopicAxioms will also be built from the specifications previously made for the topic of analysis and the measures. In our example, we will have:
TopicAxioms = {
Patient^LOCAL ≡ ∃hasDim_D1^LOCAL.Disease^LOCAL ⊓ ∃hasDim_D2^LOCAL.Drug^LOCAL ⊓ ∃hasDim_D3^LOCAL.Age^LOCAL ⊓ ∃hasDim_D4^LOCAL.Sex^LOCAL ⊓ ∃hasDim_D5^LOCAL.Biomarkers^LOCAL ⊓ ∃hasDim_D6^LOCAL.DamageIndex^LOCAL ⊓ ∃hasDim_D7^LOCAL.NumberOfVisit^LOCAL
}
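A minimal sketch of this assembly step, treating axioms as opaque strings, could look as follows; all function and variable names are ours and only illustrate the union defined above.

# The MIO as the union of per-dimension local axioms, topic axioms and
# external axioms (axioms are handled here as plain strings).
def generate_mio(local_axioms_per_dim, topic_axioms, external_axioms):
    mio = set()
    for axioms in local_axioms_per_dim.values():   # LocalAxioms(Di)
        mio.update(axioms)
    mio.update(topic_axioms)                        # TopicAxioms
    mio.update(external_axioms)                     # ExternalAxioms
    return mio

local_axioms = {
    "D1": {
        "Disease^LOCAL ⊑ Rheumatoid_Arthritis^NCI",
        "Disease^LOCAL ⊑ ∃R_Disease_Anatomy^LOCAL.Anatomy^LOCAL",
        "Anatomy^LOCAL ⊑ Anatomy_Kind^NCI",
        "Disease_Has_Associated_Anatomic_Site^NCI ⊑ R_Disease_Anatomy^LOCAL",
        "has_Report^Rheuma ∘ has_diagnosis^Rheuma ⊑ hasDim_D1^LOCAL",
    },
}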
Therefore, at this stage it only remains to generate the ExternalAxioms element. The following section deals with this issue, and the subsequent section explains how to validate the resulting ontology. 5.3 Bringing External Knowledge to the MIO Concepts that will be used in the different dimensions are defined locally, but the user defines them in terms of the concepts located in external ontologies. Thus, a MIO consists of all the local axioms asserted by the user plus external knowledge that can affect the symbols of the MIO. It is desirable to integrate this external knowledge because of three reasons: 1. Semantic annotations made with symbols from domain ontologies can imply definitions and relationships that are implicit. Thus, by enriching the MIO with new hierarchical dimensions relying on the relationships provided by domain ontologies, we can discover implicit knowledge. In other words, bringing in the knowledge related to the symbols of the warehouse semantic annotations, allows us to infer implicit fact-dimension relationships useful for analysis. 2. Given that a MIO contains a set of external axioms that provides a consistent and simplified version of the original ontologies focused on a topic of analysis, it constitutes a piece of knowledge that can be reused. For example, this MIO can be a good starting point to guide users in the definition of a multidimensional cube for analysis purposes. There exists some preliminary work in this line that could benefit from MIOs (e.g. Romero & Abelló 2007). 3. A MIO is a new consistent ontology that derives from the SDW ontologies. This means that it can contain new concepts and roles that must be satisfiable with respect to the semantics of the original ontologies. We assume that the original ontologies are already consistent, and therefore satisfiability must be checked only
for the MIO local concepts. In this way, although large MIOs can be defined by reusing existing knowledge, the cost of checking them for consistency is limited to the new concepts introduced by the analyst. The construction of the MIO with external knowledge coming from the domain and application ontologies is carried out by using both the query language OntoPath (Jimenez-Ruiz et al., 2007) and some module extraction approaches recently proposed in (Jimenez-Ruiz et al., 2008). OntoPath is a novel retrieval language for specifying and retrieving relevant ontology fragments. This language is intended to extract customized stand-alone ontologies from very large, general-purpose ones. In a typical OntoPath query, the desired detail level in the concept taxonomies as well as the properties between concepts that are required by the target applications are easily specified. The syntax and aims of OntoPath resemble XPath's in the sense that they are simple and they are designed to be included in other XML-based applications (e.g. transformation sheets, semantic annotation of web services, etc.). In our approach for building the MIO, OntoPath is used to retrieve the different dimension hierarchies along with the corresponding roll-up properties from the domain ontologies used to annotate patients. The retrieval of these ontology fragments is based on the analysis dimensions proposed by the analyst. Following the running example, the following queries would be run in order to extract the dimension hierarchies:
D1 → Rheumatoid_Arthritis^NCI / Disease_Has_Associated_Anatomic_Site^NCI / Anatomy_Kind^NCI
D2 → Drug^UMLS
D5 → Gene^UMLS_GENE / Located_In^UMLS_GENE / Cell_m^(UMLS_GENE→NCI) / Anatomic_Structure_Is_Physical_Part_Of^NCI / Tissue^NCI
D5 → AbsoluteMeasurement^GALEN / isCountConcentrationOf^GALEN / Cell^GALEN / isInSuspensionWithin^GALEN / Tissue^GALEN
As can be observed, through simple path queries of subsequent concepts and properties, we obtain the fragments corresponding to the different dimension hierarchies. Notice that we make use of mappings in D5 in order to connect overlapping concepts in different ontologies. OntoPath is also used for extracting the part of the application ontology schema relevant for analysis purposes, that is, the concepts and properties that define the facts of analysis. In our example, the OntoPath query shown in Figure 6 is evaluated to determine the relevant elements of the application ontology involved in the analysis task. Moreover, we use a logic-based approach to the modular reuse of ontologies to extract the upper knowledge of all the external symbols that appear in the MIO. This modular approach is safe, since the meaning of the imported symbols is not changed, and economic, since only the module relevant for a given set of symbols (called signature) is imported. It also guarantees that no entailments are lost compared to the import of the whole ontology. We particularly extract Upper Modules (UM), which are based on ⊥-locality and are suitable for refinement. That is, we extract the upper knowledge of all the external symbols of the MIO.
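The actual upper modules are computed with the ⊥-locality based method of (Jiménez-Ruiz et al., 2008); the simplified sketch below only illustrates the intuition of gathering the upward subclass closure of a signature, using the NCI fragment of Figure 7 as example data.

# Simplified illustration of "upper knowledge" extraction: starting from a
# signature, collect the upward closure over an asserted subclass relation
# (a dict mapping each concept to its direct superclasses).  This is NOT the
# ⊥-locality module extraction itself, only an approximation of its effect.
def upper_module(signature, superclasses):
    module, pending = set(), list(signature)
    while pending:
        concept = pending.pop()
        for parent in superclasses.get(concept, ()):
            axiom = (concept, "⊑", parent)
            if axiom not in module:
                module.add(axiom)
                pending.append(parent)
    return module

nci_taxonomy = {   # fragment taken from Figure 7
    "Rheumatoid_Arthritis": ["Autoimmune_Disease"],
    "Autoimmune_Disease": ["Immune_System_Disorder"],
    "Immune_System_Disorder": ["Non-Neoplastic_Disorder_by_Special_Category"],
    "Non-Neoplastic_Disorder_by_Special_Category": ["Non-Neoplastic_Disorder"],
}
print(upper_module({"Rheumatoid_Arthritis"}, nci_taxonomy))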
The OntoPath query of Figure 6 starts from Patient^Rheuma and selects the patient's age and sex, the genes related through has_Profile^Rheuma, and the report sections reachable through has_Report^Rheuma: the date of visit, the diagnosis, the damage index, the therapies with their drugs, and the measured indicants (blood cells and blood factors).
Fig. 6. OntoPath query for the application ontology of the use case. In OntoPath, the symbol “*” denotes any concept, and nested expressions (e.g. tree branches) are in brackets like in XPath.
In our use case, we group the external symbols according to the external ontologies they are pointing to. Then, a module containing the upper knowledge of each signature is extracted. The external signatures for our use case are the following ones:
Sig^Rheuma = { Patient^Rheuma }
Sig^NCI = { Disease_Has_Associated_Anatomic_Site^NCI, Cell^NCI, Tissue^NCI, Anatomy_Kind^NCI, Rheumatoid_Arthritis^NCI, Anatomic_Structure_Is_Physical_Part_Of^NCI }
Sig^UMLS_GENE = { Gene^UMLS_GENE, Located_In^UMLS_GENE, Cell^UMLS_GENE }
Sig^GALEN = { AbsoluteMeasurement^GALEN, isCountConcentrationOf^GALEN, Cell^GALEN, isInSuspensionWithin^GALEN, Tissue^GALEN }
The top knowledge ontology is composed of the union of the extracted upper modules plus some additional axioms derived from the stored mappings that allow merging the upper knowledge of overlapping concepts. Mappings are stored in the data warehouse as 7-tuples 〈id, s1, s2, O1, O2, R, φ〉, where s1 and s2 are symbols from ontologies O1 and O2 respectively, φ is a confidence value and R is the mapping relationship between these symbols, namely equivalence (≡), subsumption (⊑) or disjointness (⊥). For each pair of top knowledge concepts s1, s2 for which a mapping is recorded, we add the corresponding axiom according to the mapping relationship: equivalentTo(s1, s2) for (≡), subClassOf(s1, s2) for (⊑) and disjoint(s1, s2) for (⊥).
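Under the mapping representation sketched in Section 4 (illustrative, not the actual implementation), these bridge axioms could be generated as RDF triples with rdflib, assuming s1 and s2 carry full URIs:

from rdflib import Graph, URIRef
from rdflib.namespace import OWL, RDFS

RELATION_TO_PREDICATE = {
    "equivalence": OWL.equivalentClass,
    "subsumption": RDFS.subClassOf,
    "disjointness": OWL.disjointWith,
}

def bridge_axioms(mappings, threshold=0.8):
    """Turn stored mapping tuples into OWL bridge axioms (one triple each)."""
    g = Graph()
    for m in mappings:
        if m.confidence >= threshold:
            g.add((URIRef(m.s1), RELATION_TO_PREDICATE[m.relation], URIRef(m.s2)))
    return g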
As an example of the type of knowledge extracted with the previous approaches, Figure 7 shows a fragment of the axioms extracted with the UM approach and the OntoPath tool about the concept Rheumatoid_Arthritis under Disease^NCI.
Upper Module:
Rheumatoid_Arthritis ⊑ Autoimmune_Disease
Autoimmune_Disease ⊑ Immune_System_Disorder
Immune_System_Disorder ⊑ Non-Neoplastic_Disorder_by_Special_Category
Non-Neoplastic_Disorder_by_Special_Category ⊑ Non-Neoplastic_Disorder
OntoPath-based Module:
Rheumatoid_Arthritis ⊑ ∃Disease_Has_Associated_Anatomic_Site.Connective_and_Soft_Tissue
Stills_Disease ⊑ Rheumatoid_Arthritis
Oligoarticular_Stills_Disease ⊑ Stills_Disease
Synovial_Membrane ⊑ Connective_and_Soft_Tissue
Fig. 7. External knowledge involved in Rheumatoid_Arthritis
Finally, in the current implementation, the MIO is composed of a set of OWL files connected through “import” statements gathering together the local axioms, topic axioms and external axioms. 5.4 Phase 3: MIO Validation The MIOs are validated at two levels: schema and instance. At the former level, we check that the generated ontology is consistent with respect to all the asserted axioms: local and external ones. If the ontology is not consistent, then we cannot generate a valid OLAP cube for it and the ontology should be fixed. For this purpose, it is necessary to detect invalid dimensions that constitute potentially invalid cubes. At the second level, once the multidimensional ontology is validated, it must be populated with instances from the data warehouse. The issues of this process will be explained in the following section. A MIO is a formal ontology in which all the knowledge has been included in order to perform the appropriate inferences and queries. This knowledge can also be used for checking certain properties, in this way ensuring that invalid final cubes will not result. In (Hurtado & Mendelzon, 2002) a set of structural constraints is applied to check some interesting properties of heterogeneous dimensions. These properties could be checked over the MIO ontology to indicate to the analyst that potential problems could arise in the final OLAP-based cube. Unfortunately, some of these properties can only be checked once the cube is formed (e.g. summarizability) as they depend on the specific dimension values and aggregation functions defined for the target cube. The set of properties that we can check in the multidimensional ontology is the following: • Disjointness. The member sets of two categories belonging to the same dimension must be disjoint. Notice that with this constraint Stratification is also achieved, as any instance of a category can only roll up to an upper category instance.
• Category satisfiability. Another inference problem stated in (Hurtado & Mendelzon, 2002) is the satisfiability of a category in a dimension schema. Basically, this means that there exists at least one instance of the schema in which the member set of the category is not empty. This is equivalent to the problem of checking the satisfiability of the dimension classes with respect to the axioms of the MIO. • Shortcut free. This property is also known as “non-covering” in the OLAP literature (Pedersen et al., 2001). A shortcut occurs when a fact can be rolled up from a category Ci to another Cj without passing through an intermediate category Cx that connects both of them. This is true when the MIO contains the roles R_Ci_Cx, R_Cx_Cj and R_Ci_Cj. In other words, the graph formed by the concepts (nodes) and the set of roll-up relationships (edges) of each dimension must not contain redundant edges. Moreover, ensuring that this graph is connected, and assuming that every instance can roll up to an instance of the concept Thing (⊤), we also ensure the Up-Connectivity property. • Orthogonality. This is the property of having a set of dimensions without dependency relationships. Dimension dependencies produce sparse cubes, as many combinations of dimension values are disallowed. Having dependent dimensions is considered a bad conceptual design (Abelló, 2002), although sometimes this is desired by the designer. In our case, we have to check when two categories of different dimensions are somehow related. Thus, first it must be ensured that the concepts of two different dimensions are all disjoint, and second that there does not exist any chain of properties relating two concepts of different dimensions (Romero & Abelló, 2007). • Summarizability (Lenz & Shoshani, 1997). The only way to achieve this property is by ensuring all the previous properties plus the functionality of all the roll-up properties. As it is difficult to ensure functionality from the original ontologies, this property will be checked over the final generated facts and dimensions. Notice that some multidimensional models (e.g. Pedersen et al., 2001) are able to deal with many-to-many relationships. This means that forcing functionality will depend on the features of the target multidimensional model. In the running example, disjointness is achieved by asserting the following axiom:
alldisjoint(Disease^LOCAL, Anatomy^LOCAL, Drug^LOCAL, Age^LOCAL, AgeGroup^LOCAL, Sex^LOCAL, Biomarker^LOCAL, Tissue^LOCAL, DamageIndex^LOCAL, DamageIndexGroup^LOCAL, Follow-up^LOCAL)
The resulting MIO is satisfiable and shortcut free. However, it can be demonstrated by using the axioms of the MIO that dimensions D1 and D5 are dependent, and therefore not completely orthogonal. For example, the following axioms show a dependency between the disease RA and the biomarker IL6:
Rheumatoid_Arthritis ⊑ Disease ⊓ ∃Disease_Has_Associated_Anatomic_Site.Connective_and_Soft_Tissue
Connective_and_Soft_Tissue ⊑ Tissue
IL6 ⊑ Biomarker ⊓ ∃Expressed_In_Cell.Synovial_Cell
Synovial_Cell ⊑ Cell ⊓ ∃Anatomic_Structure_Is_Physical_Part_Of.Synovial_Membrane
Synovial_Membrane ⊑ Connective_and_Soft_Tissue
Here, we can conclude that both concepts are somehow related to Connective_and_Soft_Tissue. Similarly, we can find some dependency between RA and blood sample biomarkers, as RA is an autoimmune disease that mainly affects macrophage cells in the blood. Indeed, the original definition of a biomarker is that it provides clues to diagnose a disease, hence the strong dependency between both concepts. 5.5 Phase 4: OLAP-Based Analysis Before building the target OLAP-based cube, the MIO must be properly populated with the instances from the Semantic Data Warehouse that satisfy both the MIO and the set of specific roll-up relationships between them. This process consists of two phases: (1) the retrieval of ontological instances from the data warehouse, and (2) the transformation of the instances with an appropriate granularity for the OLAP cube. Additionally, the cube dimensions and their possible categories must also be built from the MIO concepts and roles. Subsequent sections describe these aspects in detail. Instance Retrieval. Application ontology instances are stored in an RDF triple store such as 3store (Harris and Gibbins, 2003), as shown in Table 1. The objective of this phase is to retrieve the appropriate instances that can populate the MIO. In order to accomplish this task we have considered two approaches. The first one seems the most straightforward and consists of using the triple store reasoning capabilities in order to extract all the required instances. A triple store such as 3store claims to support efficient processing of RDQL queries and RDF(S) entailments (RDF(S) entailments are not implemented in SparQL, the successor of RDQL). Therefore, it is trivial to translate the OntoPath query of Figure 6 into a set of RDQL queries that use the reasoning capabilities provided to extract the instances. However, some experiments have demonstrated that this kind of triple store is not scalable when dealing with RDF(S) entailments over ontologies of considerable size (e.g. a few thousand concepts and properties). Thus, a longer-term solution must be devised. The second approach consists of leaving the RDF(S) entailments to OntoPath and using the triple store with the inference capabilities off. The OntoPath query of Figure 6, used for extracting the part of the AO schema relevant for analysis purposes, is the one that dictates the instances to be retrieved from the SDW. The result of the above-mentioned query is twofold. On the one hand, OntoPath returns the sub-ontology that matches the query in the form of OWL primitives. This feature is useful when extracting the AO schema as well as the different fragments corresponding to the dimension hierarchies from domain ontologies in order to build the MIO. On the other hand, OntoPath can present the result of a query as a result set consisting of all the different sub-graphs of an ontology that match the query (RDFS entailments). Then, every OntoPath sub-graph from the result set can be translated into an appropriate RDF query language, such as SparQL. That is, every possible sub-graph returned by OntoPath corresponds to a SparQL query without RDFS entailments. Figure 8 shows part of the OntoPath query for our use case, the OntoPath result set and the translation of each sub-graph into SparQL.
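A sketch of this sub-graph-to-SparQL translation step is given below. The helper function and variable names are ours; prefixes are omitted, as in Figure 8, so the generated strings are schematic rather than directly executable against a concrete store.

# Translate a matched sub-graph, given as the (property, class) steps leading
# away from the root concept, into a SparQL-style SELECT query.
def subgraph_to_sparql(root_class, steps):
    lines = ["?x0 type %s ." % root_class]
    for i, (prop, cls) in enumerate(steps):
        lines.append("?x%d %s ?x%d ." % (i, prop, i + 1))
        lines.append("?x%d type %s ." % (i + 1, cls))
    return "SELECT * WHERE { %s }" % " ".join(lines)

# First sub-graph of Figure 8:
print(subgraph_to_sparql("Patient",
        [("has_Report", "Rheumatology_Report"),
         ("has_Section", "Treatment"),
         ("has_therapy", "Drug_Therapy"),
         ("has_drug", "Drug")]))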
The OntoPath query fragment shown in Figure 8 is Patient^Rheuma has_Report^Rheuma / * / has_Section^Rheuma / * / has_therapy^Rheuma / * / has_drug^Rheuma / *. It matches two sub-graphs in the application ontology, Patient^Rheuma / has_Report^Rheuma / Rheumatology_Report / has_Section^Rheuma / Treatment / has_therapy^Rheuma / Drug_Therapy (respectively Joint_Injections) / has_drug^Rheuma / Drug^UMLS, and each sub-graph is translated into a SparQL query without RDFS entailments:
SELECT * WHERE { ?person type Patient . ?person has_Report ?report . ?report type Rheumatology_Report . ?report has_Section ?section . ?section type Treatment . ?section has_therapy ?t . ?t type Drug_Therapy . ?t has_drug ?drug . ?drug type Drug^UMLS }
SELECT * WHERE { ?person type Patient . ?person has_Report ?report . ?report type Rheumatology_Report . ?report has_Section ?section . ?section type Treatment . ?section has_therapy ?t . ?t type Joint_Injections . ?t has_drug ?drug . ?drug type Drug^UMLS }
Fig. 8. Translating from OntoPath sub-graphs into SparQL. Notice that the OntoPath query results in two sub-graphs, since the range of has_therapy^Rheuma matches Drug_Therapy and also Joint_Injections, which is a subclass of Drug_Therapy.
Instance Transformations. There are two kinds of transformations that must be applied to the retrieved instances and values in order to obtain consistent MIO instances, namely: 1) to convert data type values (or data type property ranges) into new instances and, 2) to change instance identifiers and instance types according to the existing mappings. The first kind of transformation is applied when a roll-up property is required over values instead of instances. For example, to roll up the feature hasAge into ageGroup we first need to convert ages (integer numbers) into instances; for example, the value 32 is converted into the instance Age_32. This instance belongs to the class Age^LOCAL, which has been defined in the MIO. Now, we can assert that Age_32 rolls up to the instance adult through the role R_Age_AgeGroup. The second kind of transformation allows instances coming from different application ontologies to be expressed in the same terms within the MIO. This is
performed by applying the existing mappings between the domain ontologies. For example, in our use case we have adopted NCI to represent disease concepts. If we want to include instances from an application ontology that uses GALEN for representing diseases, then we need to translate their instances to NCI terminology. This means changing their names as well as their types to NCI vocabulary. Notice that mapping-based transformations can produce both incomplete and imprecise facts. Incomplete facts can be generated if the class of an instance has no (direct or inferred) mapping associated to the target ontology. Imprecise facts are generated when the mapping is inherited (i.e. it occurs for some super-class of the instance’s class), and therefore the instance must be expressed with a broader concept. Another required transformation for instances consists of changing the detail level at which they are expressed in the ontologies. For example, in the application ontology shown in Figure 2, all the instances related to drugs are borrowed from the domain ontology UMLS, but their type within the application ontology will always be Drug. This is because when the clinician is prescribing a drug to the patient, she is not concerned with the whole taxonomy in which the drug is placed but just with the drug’s name. However, when analyzing patient data, the UMLS taxonomy for drugs is necessary to define dimension D2, and therefore the instances must have their actual type associated. For example, in Table 5, the instance Infliximab will change its type from Drug^Rheuma to AntiRheumaticAgent^UMLS. Considering our use case, Table 5 shows a subset of the instances that populate the local concept Patient^LOCAL. In this case, the dimension D3 has been generated by transforming the values of the data type property hasAge of the Rheumatology application ontology. Instances in dimensions D1, D2 and D5 have changed their type to that of the domain ontologies from which they are taken.
Table 5. Example of instances that populate the concept Patient^LOCAL in the MIO of the proposed use case. For biomarker instances (D5), we use the symbols +/– to denote presence/absence and ↑/↓ for high/low levels.
ID      D1    D2            D3      D4      D5            D6   D7
8787u   RA1   Infliximab    Age32   Male    Neutrophil↑   12   1
8991u   JIA1  Etanercept    Age15   Male    RF–           7    1
8991u   JIA1  Etanercept    Age15   Male    CProtein+     7    3
8882u   RA2   Naproxen      Age27   Female  HLA+          14   1
8882u   RA2   Naproxen      Age27   Female  HLA–          1    2
9912u   AS1   Methotrexate  Age34   Male    ESR↓          12   1
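The two kinds of instance transformations described above can be sketched as follows. The age-group boundaries are invented for illustration, the function names are ours, and the mapping table stands in for the stored ontology mappings.

# (1) value-to-instance transformation: turn a datatype value into a MIO
# instance and the age-group member it rolls up to, e.g. 32 -> ("Age_32", "adult").
AGE_GROUPS = [(0, 1, "newborn"), (1, 12, "child"), (12, 18, "juvenile"),
              (18, 65, "adult"), (65, 200, "elderly")]   # illustrative ranges

def age_value_to_instance(age):
    instance = "Age_%d" % age
    group = next(name for lo, hi, name in AGE_GROUPS if lo <= age < hi)
    return instance, group

# (2) mapping-based re-typing: express an instance type in the MIO vocabulary;
# when no mapping exists the original type is kept (an incomplete fact).
def retype_instance(instance_type, source_ont, target_ont, mapping_table):
    return mapping_table.get((instance_type, source_ont, target_ont),
                             instance_type)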
Generating cube dimensions During the generation of the final analysis cube, the symbols of the MIO are interpreted as elements of the target multidimensional data model. Thus, concepts, properties and instances of the MIO will be interpreted as dimensions, categories, members, attributes and facts of the multidimensional model. Depending on the restrictions of the target multidimensional model, it can be necessary to transform some of the MIO symbols with the purpose of obtaining the proper interpretation. Moreover, many symbols of the ontology could be interpreted in different ways, resulting in very different cubes.
A dimension concept (e.g. Disease) is usually interpreted as a dimension category of the multidimensional data model. However, the members of these categories can be either the instances or the subclasses of the dimension concept. In the second case, as subclasses can also be hierarchically organised, they can produce further categories in the dimension. Figure 9 shows examples of these two interpretations. The members of the category Anatomy are the different anatomical instances (e.g. different body parts of each patient), whereas the members of the category Disease are the names of the sub-classes of Disease. Notice that two sub-categories are defined due to the hierarchical relationships between these sub-classes. Concerning the cube roll-up relationships between dimension categories, we also have different interpretations depending on the interpretation adopted for the involved categories. Thus, we have three possible interpretations, namely: 1. If both categories have instance members, then R_Ci_Cj is interpreted at instance level too, and therefore each asserted triple (i1, r, i2) associated to R_Ci_Cj defines a roll-up relation RU(i1, i2). 2. If the lower category contains instance members and the upper one contains class names, then we interpret R_Ci_Cj as before, but the roll-up relation is set to RU(i1, Cx), with Cx∈Type(i2) and Cx ⊑ Cj. 3. If the related categories Ci and Cj contain class names, and they are connected with a roll-up role R_Ci_Cj, then we have two possible situations: • If there are no asserted instances associated to R_Ci_Cj, for each R ⊑ R_Ci_Cj such that C’i∈domain(R) and C’j∈range(R), a roll-up relation RU(C’i, C’j) is set. • Otherwise, the asserted triples (i1, r, i2) associated to R_Ci_Cj define a roll-up relation RU(Cx, Cy) where Cx∈Type(i1) and Cx ⊑ Ci and Cy∈Type(i2) and Cy ⊑ Cj. It is worth mentioning that the selection of the interpretation is done by the analyst. Figure 10 shows examples of these three interpretations for some categories defined in the use case.
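A minimal sketch of how the three interpretations could be materialized from asserted triples and instance types is given below; the subclass filters (Cx ⊑ Ci, Cy ⊑ Cj) of the definitions above are omitted for brevity, and all names are illustrative.

# Asserted triples are (i1, role, i2); types maps an instance to its classes;
# subroles is a list of (role, domain, range, superrole) entries.
def rollup_instance_instance(role, triples):                       # case 1
    return {(i1, i2) for (i1, r, i2) in triples if r == role}

def rollup_instance_class(role, triples, types):                   # case 2
    return {(i1, cx) for (i1, r, i2) in triples if r == role
                     for cx in types.get(i2, ())}

def rollup_class_class(role, triples, types, subroles):            # case 3
    asserted = [(i1, r, i2) for (i1, r, i2) in triples if r == role]
    if not asserted:   # no asserted instances: use domains/ranges of sub-roles
        return {(dom, rng) for (r, dom, rng, sup) in subroles if sup == role}
    return {(cx, cy) for (i1, r, i2) in asserted
                     for cx in types.get(i1, ())
                     for cy in types.get(i2, ())}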
Fig. 9. Two different interpretations for defining a dimension category
Another relevant aspect to take into consideration when building roll-up hierarchies is the multiplicity between related categories. Ideally, each roll-up relationship should have a predominant multiplicity of many-to-one in order to properly aggregate data. In our use case, however, the role R_Disease_Anatomy has a predominant one-to-many multiplicity, which means that it is not useful for aggregating data in the resulting cube. In order to include anatomical information in
Fig. 10. Different interpretations for roll-up relationships: instance-instance, instance-class and class-class roll-ups
the cube, we can either use the inverse role R_Anatomy_Disease or include Anatomy data in some attribute of the Disease members. The former solution is not valid in our use case because, in the application ontology, Anatomy concepts (e.g. SynovialJoint) and Disease concepts are not related to each other, and therefore we cannot state reliable roll-up relations. In the second solution, we can only use anatomical data to restrict the diseases that the clinician wants to analyze. Finally, it is worth mentioning that Disease and Anatomy cannot be defined as two different dimensions because they are dependent on each other. In order to complete the cube definition, additional member attributes can be taken from any of the properties associated to the MIO concepts that do not participate in the roll-up relationships. The whole translation process from MIO to the target cube is a very complex task that will determine the possible analysis tasks to be performed through OLAP operations. As a consequence, this process deserves more attention in future work in order to automate it as much as possible. A good starting point is provided by the methods presented in (Pedersen et al., 1999).
6 Implementation Issues Currently we have partially implemented the proposed framework for SDWs. In this section we describe the main issues we have addressed during this preliminary implementation. In our first approach we have adopted the tried-and-tested “data warehousing” approach. Here, all source data is first extracted from the data sources (in our case both external, web-based sources and internal sources). Then, the data is transformed and various validation checks are performed. Some checks are completed before transformations are performed, and some after transformations (e.g., into a dimension) are performed, as described in Section 5. In order for the data to comply
with the constraints, some data cleansing will be performed, e.g., new dimension members may be added in order to balance the hierarchy to achieve summarizability. Finally, the transformed data is stored in the SDW database. Because of the complex RDF-based structure of the ontologies, we have chosen an RDF triplestore, specifically 3store (Harris and Gibbins, 2003). Although 3store provides a limited form of logical reasoning based on the RDFS subClassOf hierarchies, it does not scale well. The reason is that it makes explicit all the entailments of the ontology. In this way, we have used 3store only for storing large sets of instances generated by the application ontologies, assuming that these ontologies do not contain large concept hierarchies and therefore do not require large sets of entailments. Regarding the domain ontologies, the SDW must also provide the storage and querying mechanisms for them. Currently, there are a few approaches to store and query large OWL ontologies (Lu et al., 2007, Roldán-García et al., 2008). The main difference between these approaches and triplestores is that OWL stores must allow entailments with the same expressivity as the stored ontologies, which goes beyond the hierarchies defined in RDFS. Unfortunately, current OWL stores are not able to handle very large expressive ontologies, nor do current reasoners support secondary storage. In our current implementation we have used both OntoPath and a series of labelling-based indexes specially designed to handle very large OWL-based ontologies (Nebot and Berlanga, 2008). These indexes allow the fast retrieval of sub-graphs and the fast construction of upper modules such as those required by our methodology. It is worth mentioning that with these indexes we are able to check if one concept subsumes another by simply comparing two intervals. We have evaluated these indexes over the UMLS meta-thesaurus, which contains 1.5 million concepts and 13 million relationships. By using OntoPath indexes, we are able to build upper modules for signatures of hundreds of concepts in a few minutes. In this way, we achieve the scalability of the system by efficiently building customized modules, which can be handled by current reasoners. Following the running example, in Table 6 we show some statistics about the different fragments extracted from external domain ontologies in order to enrich the dimension hierarchies. As can be seen, the relative size of the fragments compared to the whole ontologies is drastically reduced, which shows the scalability of the MIOs used for analysis purposes. Similarly, Table 7 shows statistics about the top knowledge ontology, which is also part of the MIO. The top knowledge ontology is composed of the union of the extracted upper modules plus some additional axioms derived from the stored mappings that allow merging the upper knowledge of overlapping concepts. Once more, scalability is assured since the size of the top knowledge is insignificant compared to the size of the original ontologies. Table 6. Statistics about fragments extracted for dimension hierarchies
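The interval-based subsumption test can be illustrated as follows; the intervals below are invented for the example and are not those produced by the actual indexes of (Nebot and Berlanga, 2008).

# Each concept is assumed to carry a pre/post interval assigned during a
# depth-first traversal of the concept hierarchy; subsumption then reduces
# to interval containment.
INTERVALS = {
    "Disease":              (1, 100),
    "Autoimmune_Disease":   (10, 40),
    "Rheumatoid_Arthritis": (12, 20),
}

def subsumes(super_concept, sub_concept, intervals=INTERVALS):
    lo1, hi1 = intervals[super_concept]
    lo2, hi2 = intervals[sub_concept]
    return lo1 <= lo2 and hi2 <= hi1      # containment of intervals

assert subsumes("Disease", "Rheumatoid_Arthritis")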
Table 7. Statistics about top knowledge extracted from every ontology
Concerning the ontology mappings, despite the large number of semi-automatic approaches that exist to generate them (see the surveys presented in (Choi et al., 2006) and (Euzenat, 2007)), current precision results are not good enough to make the automatic transformations proposed in this paper reliable. Moreover, most ontology matchers can only handle small ontologies (Hu et al., 2008), which limits their usefulness in our scenario. Fortunately, in our biomedical application scenario, there exists a great interest in integrating existing knowledge resources. As a result, most ontologies are being annotated with UMLS terms and other standard vocabularies (e.g. NCI), which notably eases the mapping problem. Our preliminary experiments using these vocabularies to link domain ontologies are promising.
7 Conclusions and Future Work In this paper we have laid the foundations for the multidimensional analysis of Semantic Web data in a data warehouse. We have reviewed the work that combines data warehouse and semantic web technologies. From this review we conclude that XML-related technologies are becoming mature enough to enable the construction of semi-structured web data repositories. We have also highlighted the promising usage of the Semantic Web languages to integrate distributed data warehouses and to describe and automate the ETL process of a data warehouse. Regarding the analysis of semantically annotated data, the existing alternatives are only valid for single and small ontologies. Unfortunately, many real applications involve several large interlinked ontologies. As a solution, we have defined the Semantic Warehouse as an XML repository of ontologies and semantically annotated data of a particular application domain; and we have proposed a new framework to design conceptual multidimensional models starting from a set of application and domain ontologies. Our approach has a number of advantages. For example, the users can easily state facts and dimensions of analysis by selecting the relevant concepts from the ontologies. The methodology’s underlying multidimensional model is very simple: only facts, measures, dimensions, categories and roll-up relationships need to be identified. This will allow us to implement the model in almost any existing multidimensional database by performing the proper transformations. Regarding the scalability of the approach, we are able to manage large-sized ontologies by selecting fragments representing semantically complete knowledge modules.
Modeling diagrams such as those proposed in (Abelló et al., 2006; Franconi & Ng, 2000) can be very helpful to guide users when defining a MIO. As future work, we plan to study how they can be coupled with ontology editors and reasoners to facilitate the creation of MIOs. Another interesting research line is to define appropriate indexing schemes for SDWs that enable the interaction of reasoners with OLAP tools. Finally, we consider that addressing the temporal aspects of the semantic annotations, and the incremental consistency checking and reasoning with our MIO-based approach, are also very attractive challenges. In future work we plan to carry out a deeper study of alternative implementations of SDWs. The main drawbacks of the current implementation include that the data may become outdated due to source updates and that the extraction and validation process takes a long time to perform. A problematic issue which is particular to SDWs is that especially external data may be of such bad quality that the validation checks may disallow its integration in the materialized data warehouse, even if some parts of the data have sufficient quality. In this way, the options are either to allow bad data quality or to refuse some data to be admitted into the SDW. An alternative to the materialized approach consists of a virtual implementation. That is, the SDW only exists as a collection of metadata, pointing to the underlying (external and internal) data sources. The actual extraction of data from the sources is not done until query time. This also means that the validation and other constraint checks will have to be done at query time. Here, the main difference from the materialized implementation is that only the data items and ontology parts directly related to the specific query being executed are extracted, transformed, and validated. This approach is quite similar to the virtual OLAP-XML integration engine (Pedersen et al., 2002). During query processing, a triplestore can be used for intermediate storage and processing (validation, inference, etc.). Again, it will be better in the long term to develop a dedicated query engine for this particular scenario. Because of the smaller data volumes, both a triplestore-based and a dedicated solution will be able to perform almost all processing in main memory. The advantages include that data is always up-to-date, and that the initial processing cost is lower. Additionally, data that has partially bad quality can be handled easily as long as the problems do not affect the queries at hand. The main drawback is that queries will be much slower. To avoid this, a mixed implementation can be the solution. Acknowledgements. This work was supported by the Danish Research Council for Technology and Production, through the framework project “Intelligent Sound” (FTP No. 26-04-0092), and the Spanish National Research Project TIN2008-01825/TIN.
References [1] Abelló, A.: YAM2: A Multidimensional Conceptual Model. PhD thesis, Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Spain (2002) [2] Abelló, A., Samos, J., Saltor, F.: Yam2: a multidimensional conceptual model extending UML. Information Systems 31(6), 541–567 (2006) [3] Baader, F., Sattler, U.: Description logics with aggregates and concrete domains. Information Systems 28(8), 979–1004 (2003)
[4] Bao, J., Caragea, D., Honavar, V.: Package-based description logics - preliminary results. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 967–969. Springer, Heidelberg (2006) [5] Borgida, A., Serafini, L.: Distributed description logics: Assimilating information from peer sources. Journal on Data Semantics 1, 153–184 (2003) [6] Bouquet, P., Giunchiglia, F., van Harmelen, F., Serafini, L., Stuchenschmidt, H.: COWL: Contextualizing ontologies. In: Fensel, D., Sycara, K.P., Mylopoulos, J. (eds.) ISWC 2003. LNCS, vol. 2870, pp. 164–179. Springer, Heidelberg (2003) [7] Calvanese, D., Giacomo, G.D., Lenzerini, M.: A framework for ontology integration. In: Semantic Web Working Symposium, pp. 303–316 (2001) [8] Choi, N., Song, I.-Y., Han, H.: A survey on ontology mapping. SIGMOD Record 35(3), 34–41 (2006) [9] Cuenca-Grau, B., Parsia, B., Sirin, E., Kalyanpur, A.: Automatic partitioning of OWL ontologies using E-connections. In: Description Logics. CEUR Workshop Online Proceedings, vol. 147 (2005) [10] Danger, R., Berlanga, R.: A Semantic Web approach for ontological instances analysis. Communications in Computer and Information Science 22, 269–282 (2008) [11] Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg (2007) [12] Franconi, E., Ng, G.: The i.com tool for intelligent conceptual modeling. In: Proc. of the 7th International Workshop on Knowledge Representation Meets Databases, pp. 45–53 (2000) [13] Garwood, K., McLaughlin, T., Garwood, C., Joens, S., Morrison, N., Taylor, C.F., Carroll, K., Evans, C., Whetton, A.D., Hart, S., Stead, D., Yin, Z., Brown, A.J., Hesketh, A., Chater, K., Hansson, L., Mewissen, M., Ghazal, P., Howard, J., Lilley, K.S., Gaskell, S.J., Brass, A., Hubbard, S.J., Oliver, S.G., Paton, N.W.: PEDRo: a database for storing, searching and disseminating experimental proteomics data. BMC Genomics 5(68) (2004) [14] Harris, S., Gibbins, N.: 3store: Efficient Bulk RDF Storage. In: Proc. of the First International Workshop on Practical and Scalable Semantic Systems. CEUR Workshop Online Proceedings, vol. 89 (2003) [15] Horrocks, I., Sattler, U.: Decidability of SHIQ with complex role inclusion axioms. In: International Joint Conference on Artificial Intelligence, pp. 343–348 (2003) [16] Hu, W., Qu, Y., Cheng, G.: Matching large ontologies: A divide-and-conquer approach. Data and Knowledge Engineering 67, 140–160 (2008) [17] Hurtado, C.A., Mendelzon, A.O.: OLAP dimension constraints. In: Proc. ACM SIGACTSIGMOD-SIGART Symposium on Principles of Database Systems, pp. 169–179 (2002) [18] Jiménez-Ruiz, E., Berlanga, R., Nebot, V., Sanz, I.: OntoPath: A Language for Retrieving Ontology Fragments. In: Meersman, R., Tari, Z. (eds.) Proc. of On the Move to Meaningful Internet Systems, pp. 897–914 (2007) [19] Jiménez-Ruiz, E., Cuenca-Grau, B., Sattler, U., Schneider, T., Berlanga, R.: Safe and economic re-use of ontologies: A logic-based methodology and tool support. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 185–199. Springer, Heidelberg (2008) [20] Kalfoglou, Y., Schorlemmer, M.: Ontology mapping: the state of the art. The Knowledge Engineering Review 18(1), 1–31 (2003) [21] Köhler, J., Philippi, S., Lange, M.: Semeda: ontology based semantic integration of biological databases. Bioinformatics 19(18), 2420–2427 (2003) [22] Lenz, H., Shoshani, A.: Summarizability in OLAP and statistical data bases. 
In: Ninth International Conference on Scientific and Statistical Database Management, pp. 132–143 (1997)
[23] Louie, B., Mork, P., Martin-Sanchez, F., Halevy, A., Tarczy-Hornoch, P.: Data integration and genomic medicine. Journal of Biomedical Informatics 10(1), 5–16 (2006) [24] Lu, J., Ma, L., Zhang, L., Brunner, J.-S., Wang, C., Pan, Y., Yu, Y.: SOR: A practical system for ontology storage, reasoning and search. In: Proc. of the 33th International Conference on Very Large Data Bases, pp. 1402–1405 (2007) [25] Lutz, C., Areces, C., Horrocks, I., Sattler, U.: Nominals, and Concrete Domains. Journal of Artificial Intelligence 23, 667–726 (2005) [26] Marian, A., Abiteboul, S., Cóbena, G., Mignet, L.: Change-centric management of versions in an XML warehouse. In: Proc. of the 27th International Conference on Very Large Data Bases, pp. 581–590 (2001) [27] Mena, E., Illarramendi, A., Kashyap, V., Sheth, A.P.: Observer: An approach for query processing in global information systems based on interoperation across pre-existing ontologies. Distributed and Parallel Databases 8(2), 223–271 (2000) [28] Nebot, V., Berlanga, R.: Building Ontologies from Very Large Knowledge Resources. In: Proc. Of 11th International Conference on Enterprise Information Systems (submitted, 2009) [29] Nguyen, T.B., Abiteboul, S., Cóbena, G., Preda, M.: Monitoring XML data on the web. In: Proc. of the 2001 ACM SIGMOD International Conference on Management of Data, pp. 437–448 (2001a) [30] Nguyen, T.B., Tjoa, A.M., Mangisengi, O.: Meta Cube-X: An XML Metadata Foundation of Interoperability Search among Web Data Warehouses. In: Nguyen, T.B. (ed.) Proc. of the Third International Workshop on Design and Management of Data Warehouses. CEUR Workshop Online Proceedings, vol. 39 (2001b) [31] Pedersen, D., Riis, K., Pedersen, T.B.: XML-extended OLAP querying. In: Proc. of the 14th International Conference on Scientific and Statistical Database Management, pp. 195–206 (2002) [32] Pedersen, T.B., Jensen, C.S., Dyreson, C.E.: Extending practical pre-aggregation in online analytical processing. In: Proc. of the 25th International Conference on Very Large Data Bases, pp. 663–674 (1999) [33] Pedersen, T.B., Jensen, C.S., Dyreson, C.E.: A foundation for capturing and querying complex multidimensional data. Information Systems 26(5), 383–423 (2001) [34] Pérez, J.M., Berlanga, R., Aramburu, M.J., Pedersen, T.B.: Integrating data warehouses with web data: A survey. IEEE Transactions on Knowledge and Data Engineering 20(7), 940–955 (2008) [35] Pérez-Rey, D., Maojo, V., García-Remesal, M., Alonso-Calvo, R., Billhardt, H., MartinSánchez, F., Sousa, A.: Ontofusion: Ontology-based integration of genomic and clinical databases. Compututers in Biology and Medicine 36(7-8), 712–730 (2005) [36] Priebe, T., Pernul, G.: Ontology-based integration of OLAP and information retrieval. In: Proc. of the 14th International Workshop on Database and Expert Systems Applications, pp. 610–614 (2003) [37] Roldán-García, M., del, M., Aldana-Montes, J.F.: DBOWL: Towards a scalable and persistent OWL reasoner. In: The Third International Conference on Internet and Web Applications and Services, pp. 174–179 (2008) [38] Romero, O., Abelló, A.: Automating multidimensional design from ontologies. In: Proc. of the 10th International Workshop on Data Warehousing and OLAP, pp. 1–8 (2007) [39] Rubin, D.L., Shah, N.H., Noy, N.F.: Biomedical ontologies: a functional perspective. Briefings in Bionformatics 9(1), 75–90 (2007) [40] Schmidt-Schauss, M., Smolka, G.: Attributive concept descriptions with complements. Artificial Intelligence 48(1), 1–26 (1991)
[41] Simitsis, A., Skoutas, D., Castellanos, M.: Natural language reporting for ETL processes. In: Proc. of the ACM 11th International Workshop on Data Warehousing and OLAP, pp. 65–72 (2008) [42] Skoutas, D., Simitsis, A.: Designing ETL processes using Semantic Web technologies. In: Proc. of the ACM 9th International Workshop on Data Warehousing and OLAP, pp. 67–74 (2006) [43] Stuckenschmidt, H., Klein, M.C.A.: Reasoning and change management in modular ontologies. Data and Knowledge Engineering 63(2), 200–223 (2007) [44] Wang, L., Zhang, A., Ramanathan, M.: Biostar models of clinical and genomic data for biomedical data warehouse design. International Journal of Bioinformatics Research and Applications 1(1), 63–80 (2005) [45] Xyleme: A dynamic warehouse for XML data of the Web. IEEE Data Engineering Bulletin 24(2), 40–47 (2001)
A Unified Object Constraint Model for Designing and Implementing Multidimensional Systems
François Pinet¹ and Michel Schneider¹,²
¹ Cemagref, 24 Avenue des Landais, 63172 Aubière Cedex, France
² LIMOS, Complexe des Cézeaux, 63173 Aubière Cedex, France
[email protected], [email protected]
Abstract. Models for representing multidimensional systems usually consider that facts and dimensions are two different things. In this paper we propose a model based on UML which unifies the representations of fact and of dimension members. Since a given element can play the role of a fact or of a dimension member, this model allows for more flexibility in the design and the implementation of multidimensional systems. Moreover this model offers the possibility to express various constraints to guarantee desirable properties for data. We then show that this model is able to handle most of the hierarchies which have been suggested to take real situations into account and to characterize certain properties of summarizability. Using this model we propose a complete development cycle of a multidimensional system. It appears that this cycle can be partially automated and that an end user can control the design and the implementation of his system himself.
1 Introduction Numerous works have been devoted to the elaboration of models for multidimensional systems. Their objective has been to find an organization of facts and dimensions which can be used to implement the various operations of analysis and which secures the strict control of aggregations along dimensions. In particular it is necessary to forbid double-countings or additions of non-additive data. A great effort has been made to handle complex organisations of data encountered in reality, and different propositions have been formulated to manage complex hierarchies in dimensions: non-covering and non-strict hierarchies, specialisations, etc. The majority of these models presuppose that facts and members of dimensions are fixed. But in real applications, needs are evolutionary and multiple. So, more flexible models are required that can integrate various points of view, change perspectives of analysis, cross analyses, share data in multiple ways and assemble existing structures dynamically. In this context, the possibility of handling facts and members of dimensions symmetrically, of inverting their roles, of reorganizing dimensions, of declaring sharable and reusable structures becomes a question of prime importance. A second issue which remains very open is that of the integrity constraints. To avoid the use of incoherent data and to control the property of summarizability, it is important to be able to specify a certain number of constraints on a multidimensional model. Relatively few works have shown an interest in grasping these aspects as a whole.
We would also like to raise the issue of the design and implementation of multidimensional systems. Several works have expressed an interest in these aspects, but there is at present no complete and recognized approach to designing and implementing multidimensional systems. Ideally it should be possible to base such an approach on standards and existing platforms (UML, MDA, Relational OLAP, decisional tools). In this paper we propose a model which can contribute to these issues. First, it can be used to model facts and dimension members in a unified way, so that a given element can be a fact for one analysis and a member for another analysis. It can also be used to share dimensions in various ways and to combine the results of different analyses. In this model we introduce the possibility of specifying various types of constraints which can be applied to facts and to elements of hierarchies. We then show how these constraints can be used to handle complex hierarchies in dimensions and to characterize certain situations where the property of summarizability is respected. We also use this model to propose a complete development cycle of a multidimensional system starting from a relational data store. It appears, after all, that the unified vision provided by this model contributes to the flexibility which is needed in the design and implementation tasks of a multidimensional system. The paper is organized as follows: Section 2 is a review of the literature related to the subject; Section 3 presents a motivating example; Section 4 presents our unified model for facts and members; Section 5 illustrates the different multidimensional structures we can represent with this model; Section 6 deals with the extension of our model to specify constraints; Section 7 shows how our model can handle different kinds of hierarchies; Section 8 is concerned with properties of summarizability in the presence of constraints; Section 9 is devoted to the design and the implementation of multidimensional structures; Section 10 concludes and draws a number of perspectives.
2 Related Works

Many models have been suggested for multidimensional systems. A majority of them concentrate especially on the organization of dimensions. Generally the members of a dimension are organized in hierarchies where aggregation paths are composed of many-to-one relationships. The semantics of these relationships varies considerably: containment function [43], functional dependencies [14, 26], part-whole relationships [1], drilling relationships [50]. In [14], functional dependencies are also used to relate facts to dimensions. In [47] a dimension is viewed as a lattice, and two functions anc (ancestor) and desc (descendant) are used to perform the roll-up and drill-down operations. Different works [24, 32, 34, 38] have proposed more elaborate models to take into account situations which are met in reality such as unbalanced and non-ragged hierarchies, non-strict hierarchies, and specialization/generalization hierarchies (also called heterogeneous hierarchies). In [24] it is shown that attributes occurring in sub-classes can be considered as optional dimension levels. Normal forms of dimensions are also proposed which guarantee summarizability. In [34] solutions for handling heterogeneous and mixed-granularity hierarchies are suggested. Mixed granularity occurs when a class supports aggregations at different levels. This situation is often encountered in heterogeneous hierarchies. In [38] different data
hierarchies (balanced and non-balanced, ragged and non-ragged) are considered and their influence on roll-up paths and on the expressive power of the OLAP cube is studied. In [41] an extended multidimensional data model is proposed. It is also based on a lattice structure and it provides a solution for modelling non-strict hierarchies where many-to-many relationships can occur between two levels in a dimension. Classifications of complex hierarchies are proposed in [32, 34]. The work of [32] discusses their mapping to the relational model. In [28, 29] normal forms are suggested for the star and the snowflake relational models. Some works have investigated more precisely the interconnections between facts and dimensions. In [41] a set of facts and a dimension are linked through a fact-dimension relation. In [46] different solutions are discussed to deal with many-to-many relationships between facts and dimensions in a relational context. The YAM model [4] allows the use of O-O semantic relationships between different star structures. The possibility of making several analyses simultaneously and of combining their results is also an important feature of this model. The relative roles of facts and dimensions have also been the subject of a number of investigations. In [43], the authors propose PUSH and PULL operators for swapping the roles of measures and dimensions. In [42], the requirement of symmetric treatment of dimensions and measures is formulated. The paper of [3] demonstrates the convertibility of fact and dimension roles in multi-fact multidimensional schemes. The MultiDim model [33] is an extension of the entity-relationship formalism and can be used to deal with spatial multidimensional structures. It is important to note various propositions [5, 9, 14, 30] for cubic models where the primary objective is the definition of an algebra for multidimensional analysis. Expressiveness of the algebra is the main topic of these works. It is not easy to compare the possibilities of these various models and algebras. Several authors have contributed to this objective. In [7] requirements for OLAP applications are defined and four multidimensional models are compared. No model meets all the requirements. The work of [2] compares sixteen models of different levels (conceptual, logical, physical). It appears that conceptual models offer the possibility of representing much more semantics, but they do not incorporate an algebra for analysing the data. The authors of [35] collect, clarify and classify the different extensions to the multidimensional model proposed in recent years. Constraints are not often considered in these models. Among the preliminary works dealing with constraints we can mention those of [8, 17, 24, 26]. As is underlined in [12], it is important to be able to specify structural and also semantic constraints. Structural constraints are related to the structure, while semantic constraints are related to the analysis context. The authors propose a rich set of semantic constraints used to impose restrictions within a dimension or between several dimensions. Other works concentrate on constraints to handle the problem of the summarizability of measures along hierarchies. In [27] and [45] general conditions are established for summarizability in multidimensional structures. In [41], summarizability is linked to the distributivity of the aggregate functions along the partitions of the categories of a dimension.
In [18] a powerful framework is proposed to specify constraints involving data and metadata and to use these constraints to characterize summarizability properties.
In order to deal simultaneously with the structural aspects and the dynamic aspects, several authors have proposed object-oriented models, some of them based on an extension of UML. The models of [16, 39, 44, 49] are basic models which incorporate and formalize essential notions (fact classes, dimension classes, roll-up associations, measures) and offer a number of specific features. The work of [49] introduces the notion of cube classes with a set of possible operations to define the analysis. In [16], a fact class can be aggregated from several other fact classes. A UML profile is suggested. It can be manipulated through an extension of the CASE tool Rational Rose. The models of [4] and [31] are much more sophisticated. In [4], there are 6 types of nodes in the multidimensional graph. Various types of associations are available. It is possible to change the dimensions of a cube. In [31], typical multidimensional structures are defined through packages. 14 stereotypes are suggested for packages, classes, associations and attributes. Each of these two models is supported by a specific UML profile. Since this profile is complex to manage, its correct use is controlled through the definition of constraints in natural language or in OCL expressions. Concerning the design and the implementation of multidimensional systems, various propositions have been made depending on a given context or platform. In [15] an environment is suggested which is able to generate the implementation of a star or snowflake multidimensional structure from a conceptual schema. The generation process takes into account the limitations of the OLAP target system (Cognos Powerplay or Informix Metacube). In [13], a solution is proposed to derive multidimensional structures from E/R schemas. The work of [37] also suggests a method for developing dimensional models from E/R schemas. Different options for the resulting schema can be chosen (flat, star, snowflake, constellation). In [44] a multidimensional structure is derived from a UML schema and in [20] from a relational one. The work of [48] addresses the problem of integrating the data from heterogeneous databases and storing it in the repository of the multidimensional structure. In this work, the multidimensional structure is seen as a set of materialized views, so the problem becomes one of view selection. Different algorithms are proposed and compared for solving it. The work of [36] describes how to align the whole data warehouse development process with an MDA framework.
3 A Motivating Example

In this section we introduce a significant example to motivate our proposition. This example will be used throughout the paper. We suppose that a manufacturing organization has stored different information in a data store about its production over a long period (several years). The UML class diagram of this data store is represented in Figure 1. Each series of production is stored in the Manufacturing class. A manufacturing concerns a given product and is made up of different operations. An operation is performed by a machine. There is a many-to-many association between the Manufacturing class and the Machine class. An occurrence of this association corresponds to an operation of a given manufacturing performed by a given machine. The same operation can occur several times in a manufacturing with the same machine or with different machines. A
serial number is used to mark the order of an operation in the sequence of a manufacturing. For example a welding operation can be associated with the manufacturing M100, the first time with serial number 2 and machine MA2, and the second time with serial number 4 and machine MA6. This means that this operation occurs in position 2 and in position 4 in the sequence of operations for the manufacturing M100. Stock movements are stored in the Stock_movement class. A stock movement concerns a given product. The corresponding quantity can be positive (input) or negative (output).
[Figure: UML class diagram with classes Manufacturing (manu_number, manu_date, quantity), Machine (machine_name, unit_cost) specialized {disjoint, complete} into Manual (qualification) and Automatic (loading_duration), Product (product_number, product_name), Category (category_name), Family (family_name), Stock_movement (movement_number, movement_date, quantity), Warehouse (warehouse_name), Factory (factory_name, size), Town (town_name, population) and Region (region_name); the many-to-many association between Manufacturing and Machine carries operation_name, duration and serial_number.]
Fig. 1. The schema of a data store for a manufacturing organization
Different multidimensional structures can be derived from this schema depending on the needs for analysis. First, simple analyses such as counting the number of instances can be performed on any class independently of the others. Grouping on attributes of this class may also be possible. For example it is possible to count the factories for each size. Grouping by using attributes of associated classes is also possible. For example it is possible to count the factories for each town or each region. This corresponds to a more sophisticated analysis. One can also compute the average manufactured quantity for each product and each factory. Only classes associated through a many-to-one association can be used. These classes define the members of the dimensions which can be used in the analysis.
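To make the kinds of analyses just listed concrete, here is a minimal Python sketch over invented in-memory records (the sample data and attribute values are ours, not taken from an actual data store); it counts factories per region by following the many-to-one chain Factory → Town → Region:

from collections import defaultdict

# Hypothetical instances (illustrative sample data only).
towns = {"T1": {"region": "R1"}, "T2": {"region": "R1"}, "T3": {"region": "R2"}}
factories = [
    {"factory_name": "F1", "size": "large", "town": "T1"},
    {"factory_name": "F2", "size": "small", "town": "T2"},
    {"factory_name": "F3", "size": "small", "town": "T3"},
]

# Count factories per region: each many-to-one link yields a unique parent,
# so every factory is counted exactly once.
count_per_region = defaultdict(int)
for f in factories:
    region = towns[f["town"]]["region"]
    count_per_region[region] += 1

print(dict(count_per_region))  # {'R1': 2, 'R2': 1}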
From these preliminary observations, we can conclude that: i) any class can be viewed as a fact; ii) any class which can be reached through a many-to-one association (directly stated or derived from the composition of several associations) can be viewed as a level of a dimension.

Example 1. In the schema of Figure 1, the class Category can be reached from the class Manufacturing by composing two many-to-one associations. This means that an instance of Manufacturing is linked to a unique instance of Category. So Category plays the role of a member for Manufacturing. In other words, one can analyse Manufacturing by using information coming from Category.

It results from the previous observations that any class can play the role of a fact or a member depending on the needs for analysis. Saying that a class A is a fact means that we can apply the count operator on the instances of A or we can apply any aggregation operator on an attribute of A. This attribute can be of different types (numeric, string, date, etc.); the only condition is that the aggregation operators useful for the decisional applications can be applied to it. Such an attribute is often called a fact attribute or a measure attribute, or more simply a measure. Saying that a class B is a member of a dimension means that there is an attribute of B whose values can be used to aggregate the instances of another class C connected to B through a many-to-one association. Another important observation must be underlined. We have implicitly made the hypothesis that the members of the dimensions which are used in the analysis are derived from the data store itself. In some situations, dimensions can be obtained from external databases. For example the Product dimension can be loaded from an external database memorizing all the characteristics of the different products.
4 A Unified Model for Multidimensional Structures

We first propose a single element type to represent facts and members. We then suggest a graph representation of a multidimensional structure, based on an extended UML model, which can be used to exhibit desirable properties of such structures.

4.1 A Unique Element Type for Modelling Facts and Members

We show that the two structures of a fact type and of a member type are very close. So we suggest a unique unified representation. A fact type typically has the following structure:

Fact_name[(fact_key), (list_of_reference_attributes), (list_of_other_attributes)]
where
- Fact_name is the name of the type;
- fact_key is the name of the key for the type; it identifies each instance of the type;
- list_of_reference_attributes is a list of attribute names; each attribute has a value which is a reference to a member instance in a dimension or a reference to another fact instance;
- list_of_other_attributes is a list of attribute names; each attribute is a measure attribute (or more simply a measure) for the fact type or a degenerated dimension.
The set of referenced dimensions comprises the dimensions which are directly referenced through the list_of_reference_attributes. A degenerated dimension [21] is an attribute which itself represents a given dimension; in other terms, all the values we need for the analysis along this dimension are associated with this attribute. Each fact attribute can be analyzed along each of the referenced dimensions or the degenerated dimensions. Analysis is achieved through the computing of aggregate functions on the values of this attribute. There may be no measure. Such a case is called a factless fact [21]. In this case a fact records the occurrence of an event or a situation, and an analysis consists in counting the occurrences satisfying a certain number of conditions. Only the fact_key is mandatory. In this case, the analysis possibilities are very limited: we can only count the instances of the type. A member type has the following structure:

Member_name[(member_key), (list_of_reference_attributes), (list_of_property_attributes)]
where
- Member_name is the name of the type;
- member_key is the name of the key for the type; the values of member_key are used for the analysis (typically member_key is used as a parameter of the aggregation operators);
- list_of_reference_attributes is a list of attribute names where each attribute is a reference to the successors of the member instance in the dimension;
- list_of_property_attributes is a list of attribute names where each attribute is a property for the member.

Only the member_key is mandatory. A property attribute [19], also called a weak attribute, is used to describe a member. It is linked to its member through a functional dependency, but it does not introduce a new member and a new level of aggregation. For example a member town in a dimension may have property attributes such as population, administrative position, etc. These attributes are not of interest for specifying groupings, but they can be useful in the selection predicates of queries to filter certain groups. Fact type and member type have a very similar structure. Moreover, property attributes of a member type can be considered as measures and can be analyzed along the successors of this member acting as roots of partial dimensions. For example, suppose that in a dimension we have a member town with a property attribute population which references the member region. One can analyze population by using region: one can calculate aggregates on population with groupings on region_name. So, we can represent a fact or a member with a unique type called element type. An element type has the following structure:

Element_name[(element_key), (list_of_references), (list_of_specific_attributes)]
where
- Element_name is the name of the type;
- element_key identifies each instance of the type;
- list_of_references is composed of attribute names; each attribute references another element; these references determine the multidimensional structure;
- list_of_specific_attributes is a list of attribute names; each attribute is a measure for a fact or a property for a dimension member.

The list of references determines the dimensions along which the element can be analyzed. Each specific attribute can be analyzed along each of these dimensions. Analysis is achieved through the computing of aggregate functions on the values of this attribute. The aggregation is specified through the values of the chosen members in the dimensions.

Example 2. As an example let us consider the following type which corresponds to the Manufacturing class in Figure 1:

Manufacturing[(manu_number), (product_number, factory_name), (manu_date, quantity)]
The key is (manu_number). There are two references to members in dimensions: product_number and factory_name. There are two other attributes: manu_date and quantity. The first can be considered as a degenerated dimension. The second can be considered as a measure; it can be analyzed through aggregate operations by using the two references and the different members which can be reached. But in some circumstances manu_date can be considered as a measure and quantity as a degenerated dimension.

4.2 Multidimensional Schema Based on an Extended UML Representation

For representing our model with UML we need very few extensions: fact class and member class, external identified class (which is a generalisation of the two previous classes), measure, aggregating association (also called roll-up association), and constraint. We want to avoid overloading the multidimensional graph so as to facilitate interactions with the users (designer, developer, end user). We thus propose to represent these notions through UML tags (fact class and member class, external identified class, measure) or through graphic notations (aggregating association, constraint; see Section 6 for the representation of constraints). These propositions can easily be inserted into existing UML profiles. We now show how this UML extension can be used to represent the constituents of our model. An element type A can be represented by a UML class CA with the following conventions:
− the element_key of A becomes an attribute of class CA which serves as an external identifier; this identifier is tagged with {id} (abbreviation for {id=true}) in the graphical notation; a class which possesses an external identifier is called an identified class;
− each reference r of A to another element type B is represented by a many-to-one association between class CA of A and class CB of B, where class CB must possess the attribute r as its external identifier; it is called an aggregating association and is drawn with a double-line arrow from CA to CB;
− other attributes become normal attributes of the class.
Note that one should not confuse "aggregating" and "aggregation" associations; an aggregation association is a common UML relationship with different semantics. Using these conventions, the Manufacturing type of Example 2 is represented as indicated in Figure 2a. By default, the cardinalities of an aggregating association are 1 (side of the arrow) and 1..n (opposite side of the arrow). In such a case the association is said to be total. If the cardinality on the side of the arrow is 0..1, the association is said to be partial. In this case certain instances of the source class are not linked to an instance of the target class.
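The element type of Section 4.1 and the total/partial cardinalities just described can be illustrated with a small Python sketch (our own illustration, with invented sample values; it is not part of the authors' formal model). A reference left empty corresponds to a partial (0..1) aggregating association, and aggregation simply groups the source instances by the value of the referenced key:

from collections import defaultdict

# An element instance carries a key, references to other elements and
# specific attributes (structure and sample values are illustrative only).
manufacturings = [
    {"manu_number": "M100",
     "refs": {"product_number": "P1", "factory_name": "F1"},
     "attrs": {"quantity": 40}},
    {"manu_number": "M101",
     "refs": {"product_number": "P1", "factory_name": None},  # partial (0..1) link
     "attrs": {"quantity": 25}},
]

def roll_up(instances, reference, measure, agg=sum):
    """Aggregate a measure along one aggregating association: instances are
    partitioned by the referenced key; unlinked instances (partial association)
    fall outside every partition and are not propagated."""
    partitions = defaultdict(list)
    for inst in instances:
        target = inst["refs"].get(reference)
        if target is not None:
            partitions[target].append(inst["attrs"][measure])
    return {target: agg(values) for target, values in partitions.items()}

print(roll_up(manufacturings, "product_number", "quantity"))  # {'P1': 65}
print(roll_up(manufacturings, "factory_name", "quantity"))    # {'F1': 40}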
[Figure: (a) the Manufacturing class (manu_number{id}, manu_date, quantity) with aggregating associations to Product (product_number{id}) and Factory (factory_name{id}); (b) the same schema with Manufacturing tagged {f} and quantity tagged {m}.]
Fig. 2. Representation of the Manufacturing type with our extended UML model
An aggregating association has precise semantics: it indicates that the instances of the source class CA can be grouped into partitions. A partition is composed of all the instances of CA which are associated with the same instance of the target class CB. In other words, a partition is associated with a given value of the external identifier of class CB. The aggregation operation consists in computing an aggregate for each partition and associating this aggregate with the corresponding instance of CB. Representing this association by an arrow is important for different reasons. First, the arrow indicates the direction of the aggregation and of the ROLLUP operations. Second, it becomes possible to use graph theory for characterising certain properties of multidimensional structures. The representation of a complete structure results from the representation of each type and its references with the conventions proposed in this section. This leads to an extended UML class diagram. However such a diagram, in order to represent a multidimensional schema, must satisfy certain properties which are clarified in the following definition.

Definition 1 (multidimensional schema). A multidimensional schema (MDS) is a connected UML class diagram which involves only identified classes and aggregating associations. A fact multidimensional schema (FMDS) is an MDS where at least one class has been chosen to play the role of a fact and where all the other classes can be reached (through a path) from the fact classes.

In order to be general, we do not impose the acyclicity property. It thus becomes possible to analyze people, for example, depending on their relatives (i.e. grouping by
parents, grandparents, etc.). However, acyclicity, if required for an application, can be imposed through a constraint (cf. Section 6). These notions of MDS and FMDS formalize the intuition of a multidimensional space at the schema level where a fact represents a point in this space [3]. The interest in modelling a multidimensional space for a domain is to supply a complete outline of all the analyses which can be run for this domain, and to study the complementarity of these analyses and their possible connections. It constitutes a good reference to determine whether it is possible to satisfy all the decisional needs from existing data and thus to decide if evolutions in the data organisation are necessary. We will need conventions for representing facts and measures. For this purpose, we will continue to use the tagging principle of UML. So a fact class is an identified class whose name is tagged with {f}. A measure is an attribute of a fact class whose name is tagged with {m}. For this kind of attribute we will also use the tag {mode} to indicate if the measure is additive, semi-additive or non-additive. If this tag is omitted, the measure is considered to be additive.

Example 3. Figure 2a represents an MDS. If we tag the Manufacturing class as a fact (Figure 2b), it becomes an FMDS.

A fact class can possess several measures and several degenerated dimensions. The separation between these two kinds of attributes is not rigid; it depends on the analysis. An attribute can act as a measure for one analysis and a degenerated dimension for another one, and vice-versa. An FMDS can possess several fact classes. This means that these fact classes share a number of multidimensional elements. Later we will justify the fact that an FMDS is able to model complex multidimensional structures having several fact types, such as the galaxy structure [22]. Definition 1 does not impose a procedure to build a multidimensional schema. One can first extract the various multidimensional elements from the schema of a data store and then choose the fact for analysis. One can, on the contrary, first choose a fact and then extract the dimensions which can be reached from it. We can also connect a fact to predefined dimensions.

Definition 2 (dimension and sub-dimension). A dimension is an MDS where a unique root class exists such that we can reach all the other classes from it. The semantics of a dimension results from the semantics of its root. Let D be a dimension; a sub-dimension of D is a sub-graph of D which is itself a dimension.

Example 4. Figure 2a can represent a dimension the semantics of which is that of a manufacturing process. This means that a given fact class can point to this dimension for the purpose of analysis by using the different characteristics of a manufacturing process. The MDS which is constructed from the classes Manufacturing and Factory with the aggregating association between them is a sub-dimension of the previous one.

4.3 Properties of MDS

Property 1 (sub-graph). Any connected sub-graph of an MDS is also an MDS.
Proof. The sub-graph integrates only identified classes and aggregating associations. The sub-graph is also connected (hypothesis). So, it is an MDS.

This property means it is possible to delete the elements which are not of interest for the analysis. It can also be exploited to define a sub-dimension, for example in the perspective of sharing. This sub-dimension could then be specified as a package as in [36].

Property 2 (transitivity). Let (X,Y) and (Y,Z) be two aggregating associations in an MDS. One can derive an aggregating association (X,Z) which results from the transitive combination of (X,Y) and (Y,Z).

Proof. Each instance Xi of X can be connected to a unique instance Yj of Y. Yj can also be connected to a unique instance Zk of Z. Xi is then connected to Zk and cannot be connected to another instance of Z. So the association between X and Z is an aggregating association. Note that this association is total if the aggregating associations (X,Y) and (Y,Z) are total.

This property is well known for models which use the notion of categories for organizing the dimensions. We have reformulated it in the context of our model. This property means that an MDS can be simplified by deleting nodes that are of no use. If there are two aggregating associations (X,Y) and (Y,Z) in an MDS and we do not need node Y, we can delete Y, (X,Y), (Y,Z) and install a new aggregating association (X,Z) by deriving it from (X,Y) and (Y,Z).

Property 3 (fusion of two MDS). Let G1 and G2 be two different MDS. Suppose that G1 possesses a class C1 with an attribute r1 which is the external identifier r2 of a class C2 of G2, such that all possible values of r1 are also values of r2 and such that the semantics of r1 is similar to that of r2. One can derive an aggregating association between C1 and C2, and the graph which results from the connection of G1 and G2 with this association is an MDS.

Proof. The graph which results from the connection is connected and possesses only identified classes and aggregating associations. So it is an MDS.

This property can be used to complete a dimension by an existing sub-dimension. It thus becomes possible to define a multidimensional space from several sources or from pre-existing packages. For example, Figure 3 represents an MDS resulting from the connection of the MDS of Figure 2a with the identified class Time.

[Figure: the MDS of Figure 2a extended with an aggregating association from Manufacturing (manu_number{id}, quantity) to the identified class Time (date{id}, category).]
Fig. 3. Result of the connection of the Time class with the MDS of Figure 2a
4.4 Practical Uses of an MDS

An MDS is a very simple representation which can largely facilitate the design of a multidimensional system. By using well-known graph algorithms one can easily resolve problems such as: construct the largest MDS contained in the schema of a data store; construct the largest MDS having a given root; construct the smallest MDS which contains a list of elements supplied by a user. Such algorithms can thus facilitate the design of reusable packages [36]. Besides, as underlined previously, an MDS integrates various analysis needs. By implementing an MDS one can simultaneously satisfy several groups of users. The same physical structure can thus be shared between different kinds of users and be used for different viewpoints. Our model can produce MDS which ensure the traditional objective of a multidimensional system, where various ways to analyze a fact are sought out. But it can also be used to produce an MDS able to answer the question: what are the various analyses which can be made with existing data? Finally, an MDS can be used as a support for designing ergonomic graphical interfaces for manipulating multidimensional structures.
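As a rough illustration of the graph algorithms mentioned above, the sketch below (Python; an assumption about how such a tool could proceed, not the authors' implementation) builds the largest MDS having a given root by following the aggregating associations reachable from it:

# Many-to-one (aggregating) associations of a schema: source class -> target classes.
# The fragment below is hypothetical and used only for illustration.
aggregating = {
    "Manufacturing": ["Product", "Factory"],
    "Product": ["Category"],
    "Category": ["Family"],
    "Factory": ["Town"],
    "Town": ["Region"],
}

def largest_mds_from(root, assoc):
    """Return the classes and edges of the largest MDS having `root` as root
    (all classes reachable from `root` through aggregating associations)."""
    classes, edges, stack = {root}, [], [root]
    while stack:
        source = stack.pop()
        for target in assoc.get(source, []):
            edges.append((source, target))
            if target not in classes:
                classes.add(target)
                stack.append(target)
    return classes, edges

classes, edges = largest_mds_from("Manufacturing", aggregating)
print(sorted(classes))
# ['Category', 'Factory', 'Family', 'Manufacturing', 'Product', 'Region', 'Town']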
5 The Variety of Multidimensional Structures Which Can Be Handled

5.1 The Different Possible Configurations

In this sub-section we explain how our unified model can represent traditional multidimensional structures. First, a fact can directly reference any dimension member through an aggregating association. Usually a dimension is referenced through its root. But it is also interesting and useful to have references to members other than the roots. This means that a dimension can be used by different facts with different granularities. For example, a fact can directly reference town in a localisation dimension and another can directly reference region in the same dimension. This second reference corresponds to a coarser granule of analysis than the first. Moreover, a fact F1 can reference any other fact F2. So F2 also plays the role of a member. This type of reference is necessary to model certain situations (see the next sub-section). This means that a measure of F1 can be analyzed by using the identifier of F2 (acting as the grouping attribute of a normal member) and also by using the dimensions referenced by F2. Figure 4 illustrates the typical structures we can represent with our model. Case (a) corresponds to the simple case, also known as the star-snowflake structure, where there is a unique fact type F1 and there are several separate dimensions D1, D2, etc. The other cases correspond to situations where one dimension is shared. A dimension or a part of a dimension can be shared between two references of the same fact (cases (b) and (c)) or between two different facts (cases (d), (e), (f)). In all these cases, a fact can reference another fact directly (this means that this fact belongs to the corresponding dimension). This corresponds to the situation which is called facts of fact and which is illustrated in the next sub-section.
[Figure: typical configurations (a)–(f) built from fact classes F1{f}, F2{f} and dimensions D1, D2, D3: a single fact with separate dimensions (star-snowflake), a dimension or part of a dimension shared between two references of the same fact, and a dimension shared between two different facts.]
Fig. 4. Modelling typical multidimensional structures (Other configurations are possible where a fact references another fact)
5.2 Illustrating the Modelling of a Realistic Case: Facts of Fact

The case of "facts of fact" has been widely studied since it is frequently encountered in real situations. It corresponds to a many-to-many association between a fact and a dimension. It is also called a "degenerated fact" [11]. To illustrate the modelling of this case let us consider the class diagram of Figure 1. One can analyze the attribute quantity of the Manufacturing class by using the dimensions having the classes Product and Factory as roots, since the Manufacturing class is connected to the Product and Factory classes through many-to-one associations. But we cannot analyze the attribute duration (of an operation) a priori, since it is not attached to a fact class which is the source of many-to-one associations. The solution consists in introducing a new fact class called Manu_operation (with the attributes duration, serial_number, operation_name) which can now be connected to the classes Manufacturing and Machine through two many-to-one associations respectively. We thus have two fact classes, Manufacturing and Manu_operation, which are connected through an aggregating association (Figure 5). Manufacturing is called the primary fact class and Manu_operation the secondary fact class. For the secondary fact class, the primary fact class acts as a multi-dimension. Therefore, all the dimensions of the primary fact class can also be used as dimensions of the secondary fact class. This clearly appears in the FMDS of the global structure (Figure 5). The attribute duration can thus be analyzed using the Machine dimension but also the Manufacturing dimension. It should be noted that serial_number and operation_name can be considered as degenerated dimensions. Indeed, one can analyze the duration of an operation by using serial_number (for example, what is the average duration of an operation when it occurs in position 2?) or operation_name (for example, what is the average duration of the welding operations for manufacturings of the year 2007?).
[Figure: the primary fact class Manufacturing{f} (manu_number{id}, manu_date, quantity{m}) with two secondary fact classes, Manu_operation{f} (manu_operation_key{id}, serial_number, duration{m}) referencing Machine (machine_name{id}), and Component_part{f} (component_part_key{id}, quantity{m}) referencing Component (component_key{id}) and Supplier (supplier_key{id}), together with the Product dimension.]
Fig. 5. Modelling facts of fact
It is possible to associate a second secondary fact class with the same primary fact class. To illustrate this possibility, let us suppose that in the previous example we now wish to analyze the component parts which are needed for a manufacturing and which must be bought in advance from suppliers outside the company. For a given manufacturing, several component parts are involved. We thus introduce another secondary fact class Component_part which references the Manufacturing class (Figure 5). The measure attribute quantity of the Component_part class can thus be analyzed not only according to the Component dimension (for example: which supplier supplied the largest number of component parts?) but also according to the Manufacturing dimension (for example: how many component parts is a manufacturing likely to require on average?) and to the dimensions Product and Time (for example: what is the total number of component parts which were supplied by a given supplier for each month in year 2005?). This situation of "facts of fact" occurs when we want to analyze a fact attribute characterizing a many-to-many association where one of the involved classes is also a fact class. The corresponding MDS can be simply derived from the UML class diagram by transforming each many-to-many association between the classes X and Y into a class XY and connecting it to classes X and Y through two many-to-one associations. There is another situation referring to a many-to-many relationship between a fact and a dimension, when we want to analyze the fact along the dimension [46]. For example let us suppose that we want to analyze an attribute cost of the Manufacturing class along the dimension Machine in order to determine the contribution of each machine. Several solutions have been suggested to solve this situation [46]. These solutions are very similar to those suggested for dealing with non-strict hierarchies (cf. Section 7).
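A small Python sketch of such an analysis on the facts-of-fact structure, with invented sample data (the records below are purely illustrative): the duration measure of the secondary fact Manu_operation is aggregated along the Product dimension reached through the primary fact Manufacturing.

from collections import defaultdict

# Invented sample data mirroring the facts-of-fact pattern of Figure 5.
manufacturings = {"M100": {"product_number": "P1"},
                  "M101": {"product_number": "P2"}}
manu_operations = [
    {"manu_number": "M100", "operation_name": "welding", "duration": 12},
    {"manu_number": "M100", "operation_name": "drilling", "duration": 5},
    {"manu_number": "M101", "operation_name": "welding", "duration": 9},
]

# Average duration per product: the secondary fact reaches the Product
# dimension through the primary fact Manufacturing.
totals, counts = defaultdict(int), defaultdict(int)
for op in manu_operations:
    product = manufacturings[op["manu_number"]]["product_number"]
    totals[product] += op["duration"]
    counts[product] += 1

print({p: totals[p] / counts[p] for p in totals})  # {'P1': 8.5, 'P2': 9.0}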
6 Constraints in MDS/FMDS

Multidimensional models have to verify various types of constraints so as to guarantee the quality required for analysis. In the state of the art we have underlined that
constraints are not always taken into account in multidimensional models. We show in this section that our model can be enriched with different types of constraints that either come from the UML model or are specific to multidimensional structures.

6.1 Constraints on Aggregating Associations

These constraints are close to those introduced by [12]. But instead of defining these constraints on a whole hierarchy, we introduce them on aggregating associations. They can therefore be used at any level of a hierarchy.

6.1.1 Constraints on a Separate Aggregating Association

Total and partial constraints. For a separate aggregating association (X,Y), each instance of the source class X either participates in the association or not. The first case corresponds to a total constraint and the second to a partial constraint. These constraints are specified through the cardinalities for the participation of X in the association. The total constraint corresponds to a 1..1 cardinality (default value) and the partial constraint corresponds to a 0..1 cardinality. As an example, consider the hierarchy of Figure 6, which is adapted from [18]. It is a case of a non-covering hierarchy since some levels are skipped. This hierarchy is also incomplete since some instances do not participate in an aggregating association.

Onto constraint. A class Y of an MDS other than a root verifies the onto constraint for
the aggregating association (X,Y) coming from X if for each instance of Y there is an instance of (X,Y). In other words, all the instances of Y can be reached from an instance of X. An onto constraint can thus be specified through the cardinality 1..n for the participation of Y in the aggregating association (X,Y).
[Figure 6: instance-level example of a non-covering and incomplete hierarchy adapted from [18], with cities such as Washington and New York and some aggregation levels skipped.]

context Manu_operation inv: self.Operation->notEmpty() = self.Machine->notEmpty()

The constraint {i} in Figure 7 is expressed as follows:

context Manu_operation inv: self.Operation->notEmpty() implies self.Machine->notEmpty()

Specification of the alternative paths constraint. The constraint corresponding to the two alternative paths (Time_root, Week, Year) and (Time_root, Month, Year) of Figure 8 can be specified as:

context Time_root inv: self.Week.Year = self.Month.Year

It is not appropriate to specify constraints such as one_path() and acyclic() with OCL, since their argument is a whole schema or sub-schema. The best way consists in specifying and verifying them through the development platform during the design phase.
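At the instance level, intra-class constraints of this kind can be checked with straightforward code. The following Python sketch is our own illustration (the attribute names echo the Manu_operation example; the checking functions are assumptions, not part of the model):

def check_implication(instances, attr_a, attr_b):
    """Check the {i} constraint: whenever attr_a is filled, attr_b must be too."""
    return all(inst.get(attr_b) is not None
               for inst in instances
               if inst.get(attr_a) is not None)

def check_pairing(instances, attr_a, attr_b):
    """Check the stronger constraint: attr_a and attr_b are filled together or not at all."""
    return all((inst.get(attr_a) is None) == (inst.get(attr_b) is None)
               for inst in instances)

ops = [{"operation": "welding", "machine": "MA2"},
       {"operation": None,      "machine": None}]
print(check_implication(ops, "operation", "machine"))  # True
print(check_pairing(ops, "operation", "machine"))      # True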
7 Handling Different Kinds of Hierarchies with Our Model

Different kinds of hierarchies have been recognised as being useful for multidimensional systems [32, 34]. We discuss in this section how our model, with the help of the constraints introduced in the previous section, can handle these hierarchies. These hierarchies were initially considered on dimensions. For our model, the conditions that these hierarchies must respect can be considered on an MDS.

Symmetric hierarchy (or strict hierarchy). There is only one path and each instance of class X, where (X,Y) is an aggregating association, necessarily has a corresponding instance in class Y. With our model, we impose the {one_path} constraint on the schema and the cardinality 1 for the participation of X in (X,Y).

Asymmetric hierarchy. There is only one path and an instance of class X, where (X,Y) is an aggregating association, does not necessarily have a corresponding instance in class Y. With our model, we impose the {one_path} constraint on the schema and the cardinality 0..1 for the participation of X in (X,Y).

Multiple alternative hierarchies. There are several alternative paths as defined in Section 6. We control this kind of hierarchy by using the alternative paths constraint.

Onto hierarchy. This is a symmetric hierarchy where all the classes are onto for all possible incoming aggregating associations. We can force such a hierarchy with the cardinalities as we have seen in the previous section.
Non-covering hierarchy. We have illustrated with Figure 6 how our model handles this kind of hierarchy.

Generalized hierarchy (heterogeneous hierarchy). This kind of hierarchy includes sub-classes that correspond to a generalization/specialization association. An example is given in Figure 9. This association introduces heterogeneous sub-hierarchies since each subclass has its own attributes and aggregation levels. The specialization can be handled in a simple way if it is disjoint and total. There is then a 1-1 correspondence between every instance of the super-class (i.e. an instance Mi of Machine) and an instance of the sub-classes (i.e. an instance MAj of Manual or an instance Ak of Automatic). Aggregates coming from Mi can thus be propagated either to MAj or to Ak. One can install aggregating associations from the super-class to each of its specialized sub-classes and use the {d,t} constraints to characterize the situation. We thus use the same symbolism as in Section 6 to represent the different possibilities which one can encounter with the aggregating associations. This solution is used in [34] for the normalization phase of a multidimensional conceptual schema.

[Figure: the Machine class and its sub-classes Manual and Automatic, shown both with a {disjoint, complete} specialization and with aggregating associations carrying the {d,t} constraint.]
Fig. 9. Modelling a generalized hierarchy
[Figure: Order facts referencing Purchaser; Purchaser is specialized ({disjoint, complete}) into AdminStaff, TeachingStaff and AdminDivision, and the schema is shown both with the specialization and with the corresponding {d,t} aggregating associations, AdminDivision and TeachingDivision also appearing as targets of aggregating associations.]
Fig. 10. Modelling a mix-granularity hierarchy
If the specialization is not complete, a Machine instance may not be linked to an instance of a subclass, and so some aggregates cannot be transmitted to upper levels in order to guarantee summarizability. The solution consisting in introducing a "placeholder" [34] can be taken into account by our model. If the specialization is not disjoint, a Machine instance can be linked to several instances of the subclasses and it is therefore no longer possible to distribute the aggregates of the super-class to its sub-classes. Solutions comparable to those recommended for non-strict hierarchies can be used to solve this case and can be taken into account by our model.

Mix-granularity hierarchy. Figure 10 represents a mix-granularity hierarchy. This example is adapted from [34]. AdminDivision, AdminStaff and TeachingStaff are subclasses of Purchaser. The specialization is total and disjoint. The mixed granularity comes from the fact that the class AdminDivision appears at two different levels of granularity with two different roles: the first as target of an aggregating association and the other as target of a specialization. This situation can be modelled by representing the generalization/specialization association with aggregating associations and by using constraints on aggregating associations. A non-disjoint specialization poses the same problem as the one evoked previously.

Non-strict hierarchy. A hierarchy is non-strict if there is at least one many-to-many association between two different classes. An example is given in Figure 11a, where a product is manufactured in different versions (primary version, standard version, de luxe version), and a version refers to several products. This situation poses a specific problem: how can we aggregate on version the quantity (of products) available at the level of manufacturing, since a product has several targets at the upper level? Different solutions have been suggested to solve this problem [32]. One solution [21] consists in transforming a many-to-many association between X and Y into a fact which refers to X and Y as dimensions. For our example of Figure 11a we thus obtain a new fact class Product_version (Figure 11b) whose instances are all the possible combinations of a product and a version. In this case, the only attribute of this class is its identifier. It refers to the class Product and the class Version through two aggregating associations. The fact class Manufacturing now refers to the new fact class Product_version. This solution means it is necessary to modify the semantics of Manufacturing: there is now an instance of Manufacturing for each possible occurrence of Product_version. Consequently the measure quantity gives the quantity per product and version. Note that the new fact class cannot be used for analysis, and the schema can be simplified as indicated by Figure 11c by using the transitivity property of aggregating associations. The schema of Figure 11c makes it possible to perform an analysis on Product only, on Version only, or on Product + Version. Other solutions [32, 34] which have been recommended to deal with non-strict hierarchies can also be taken into account with our model.
[Figure: (a) Manufacturing{f} (manu_number{id}, manu_date, quantity{m}) linked to Product (product_number{id}) and Version (version_name{id}) through a many-to-many association; (b) the same structure with an intermediate fact class Product_version{f} (product_version_number{id}) referencing Product and Version; (c) the simplified schema where Manufacturing{f} (quantity_per_product_version{m}) references Product and Version directly.]
Fig. 11. Handling a non-strict hierarchy
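The transformation applied in Figure 11 can be sketched in Python as follows (invented sample data; the quantity values are illustrative only): the many-to-many links between products and versions become instances of an intermediate Product_version class, and the Manufacturing measure, redefined as a quantity per product and version, can then be rolled up on Version through the derived aggregating association.

from collections import defaultdict

# Invented many-to-many links between products and versions.
product_versions_m2m = [("P1", "standard"), ("P1", "de_luxe"), ("P2", "standard")]

# The many-to-many association becomes an intermediate class Product_version
# with two many-to-one references (Figure 11b).
product_version = {
    f"PV{i}": {"product_number": p, "version_name": v}
    for i, (p, v) in enumerate(product_versions_m2m, start=1)
}

# A Manufacturing fact now references a Product_version instance, so its
# measure is a quantity per product and version (Figure 11c after simplification).
manufacturings = [
    {"manu_number": "M200", "product_version": "PV1", "quantity_per_product_version": 30},
    {"manu_number": "M201", "product_version": "PV2", "quantity_per_product_version": 10},
]

# Roll up on Version only, through the derived aggregating association.
per_version = defaultdict(int)
for m in manufacturings:
    version = product_version[m["product_version"]]["version_name"]
    per_version[version] += m["quantity_per_product_version"]

print(dict(per_version))  # {'standard': 30, 'de_luxe': 10}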
8 Summarizability in the Presence of Constraints on Aggregating Associations

Summarizability is an important property which has been studied in certain works. In [45] and [8], global conditions which a hierarchy has to respect are expressed and justified. In [18], a formal frame is suggested to study non-covering and incomplete hierarchies, and an algorithm for checking summarizability is proposed. It is shown that taking the meta-data into account is not sufficient; data are also required. In this section we study summarizability only at the schema level. We first give a definition of summarizability which is adapted to our model. We then show how the constraints on aggregating associations can be used for characterizing properties of summarizability. This characterization is not as general as that of [18], but it is easier to use and it seems sufficient in the majority of real situations. We study only the summarizability of acyclic FMDS. So we suppose that the acyclic() constraint is always satisfied.

Definition 2 (summarizability). Let G be an acyclic FMDS; let m be a measure associated to a fact node R of G; let X, Y be two nodes of G which can be reached from R. Y is said to be summarizable from
X relative to m if all the aggregates or values of m at the level of X can be propagated (by using the available paths in the FMDS) to the level of Y with no omission and no double counting.

This definition concerns only the conservation of the values of a measure. It does not clarify how aggregates of level n are calculated from those of level n-1. For example, for the operator SUM, an aggregate of level n is obtained simply by adding the corresponding aggregates of level n-1 when the definition is respected. For other operators, it is clear that supplementary information will be necessary. For example, for the operator AVG, we also need the number of fact instances associated to each aggregate. In the following we consider a unique measure m, so we do not mention it.

Property 4 (transitivity). Let X, Y, Z be nodes of an acyclic FMDS G. If Y is summarizable from X and Z is summarizable from Y, then Z is summarizable from X.

Proof. This property results directly from the definition.

Example 5. In the case of Figure 12a, it appears intuitively that D is summarizable from R and X is summarizable from D, because of the total constraint which is associated to the aggregating associations (R,D) and (D,X). So X is summarizable from R (transitivity property). A is summarizable from R because of the total constraint which is associated to the aggregating association (R,A). X is summarizable from A because all the aggregates at the level of A are propagated to X through the different paths starting from A and ending in X. Indeed, the different constraints associated to these paths guarantee no omission and no double counting for this propagation.

[Figure: three versions (a)–(c) of an FMDS with fact node R{f} and nodes A, B, C, D, X connected by total and {d,t} aggregating associations; versions (b) and (c) show which nodes are marked from X.]
Fig. 12. Marking of an FMDS for determining summarizability
The combination of these two properties by transitivity provides another reason to conclude that X is summarizable from R.

Intuitively, summarizability results from the existence of one or several paths which are used to propagate aggregates without double counting and without omission. These paths therefore possess properties of conservation which we formalize by using constraints on aggregating associations.

Definition 3 (total path). A total path from X to Y is a path where each edge (aggregating association) is total.

Definition 4 (conservative junction). A conservative junction of X relative to Y is a dt junction of node X where each aggregating association of the junction points to Y, or points to a node U which is linked to Y through a total path, or points to a node which possesses a conservative junction relative to Y.

Property 5. Let X, Y be two nodes of an acyclic FMDS G. The following holds: i) Y is summarizable from X if there is a total path from X to Y; ii) Y is summarizable from X if X possesses a conservative junction relative to Y.

Proof. If there is a total path from X to Y, then all the aggregates possessed by the instances of X are propagated to the instances of Y. Furthermore, because every instance of X points only to a single instance of Y, there cannot be any double counting. So Y is summarizable from X according to the definition. If X possesses a conservative junction relative to Y, this means that each branch of the junction guarantees a propagation of the corresponding aggregates with no omission and no double counting. Since the junction is total and disjoint, the union of the flows of the different branches represents all the aggregates possessed by the instances of X, with no omission and no double counting. So Y is summarizable from X.

This property is based on the well-known disjointness and completeness conditions [27]. Adapting these conditions to our context makes it possible to reason upon dimension paths and to deal with the various situations which can occur in complex hierarchies.

Algorithm 1: marking an FMDS from a node X to test its summarizability
Inputs: an FMDS G, a node X of G
Step 1: Mark X (with * for example).
Step 2: Mark any node U such that (U,V) is a total edge and V is marked. Mark any node U such that H is a "dt" junction of U and each edge of H points to a marked node. If a new node has been marked, iterate Step 2; else continue with Step 3.
Step 3: Return G.
Output: G marked from node X

Example 6. The FMDS of Figure 12a can be marked from X as indicated in Figure 12b. C and D are marked first, then R and B, then A. If we change the constraint of the edge (C,X) into partial (Figure 12c), we can mark only D and R.
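Algorithm 1 can be implemented directly as a fixpoint computation. The sketch below (Python; our reading of the algorithm, not the authors' code) represents an FMDS by its total edges and its "dt" junctions and iterates Step 2 until no new node can be marked; the encoding of Figure 12a is an assumption reconstructed from Examples 5 and 6, since the figure itself is not reproduced here.

def mark_fmds(nodes, total_edges, dt_junctions, x):
    """Mark the FMDS from node x (Algorithm 1).

    total_edges: set of (u, v) pairs, each a total aggregating association.
    dt_junctions: dict u -> list of junctions, each junction being the set of
                  target nodes of its {d,t} aggregating associations.
    Returns the set of marked nodes; x is summarizable from every marked node.
    """
    marked = {x}                                    # Step 1
    changed = True
    while changed:                                  # Step 2, iterated to a fixpoint
        changed = False
        for u in nodes:
            if u in marked:
                continue
            totally_linked = any((u, v) in total_edges and v in marked for v in nodes)
            via_junction = any(all(t in marked for t in junction)
                               for junction in dt_junctions.get(u, []))
            if totally_linked or via_junction:
                marked.add(u)
                changed = True
    return marked                                   # Step 3

# Assumed encoding of Figure 12a.
nodes = {"R", "A", "B", "C", "D", "X"}
total_edges = {("R", "A"), ("R", "D"), ("D", "X"), ("B", "C"), ("C", "X")}
dt_junctions = {"A": [{"B", "D"}]}
print(sorted(mark_fmds(nodes, total_edges, dt_junctions, "X")))
# ['A', 'B', 'C', 'D', 'R', 'X'] — every node is marked, as in Figure 12b

Removing the edge (C, X) from the set of total edges (the partial case of Figure 12c) makes the same call mark only X, D and R, which matches Example 6.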
Property 6. Let G be an acyclic FMDS and X one of its nodes. Let S be the set of nodes marked from node X with Algorithm 1. X is summarizable from any node of S.

Proof. If from node X Algorithm 1 marks a node Z of the FMDS, then at least one of the following situations occurs: i) there is a total path with marked nodes from Z to X; ii) Z possesses a conservative junction relative to X and all the nodes of the paths from Z to X are marked; iii) there is a total path with marked nodes from Z to a node U, where U possesses a conservative junction relative to X and all the nodes of the paths from U to X are marked. The summarizability of X thus results from Property 5 for situations i) and ii), and from Property 5 and Property 4 for situation iii).

There is another situation where we can draw conclusions concerning summarizability: the case of a node which is the target of an aggregating association belonging to a "dt" junction.

Definition 5 (partial summarizability). Let G be an acyclic FMDS, X be a node of G with a "dt" junction, and Y be a node of G which is the target of one of the references of the "dt" junction. Y is said to be partially summarizable from X relative to a measure m if the instances of X possess all the aggregates or values of m with no double counting and no omission.

Justification. In the case of Definition 5, the aggregates possessed by X are partitioned among the references of the "dt" junction. So each node Y which is the target of one of these references receives the aggregates or values of m of the corresponding partition. In these aggregates there is no double counting and no omission (Y receives all the possible aggregates resulting from the "dt" constraint).

Property 7. If Y is partially summarizable from X, and Z is summarizable or partially summarizable from Y, then Z is partially summarizable from X.

Proof. The proof results from the definition. Z receives all the possible aggregates from Y (and thus from X) relative to the measure m, with no double counting and no omission.

Definition 6 (summarizability of an FMDS). An FMDS is said to be summarizable from a node X which possesses all the values of a measure m if each node is summarizable or partially summarizable from X relative to m.

Practical use of these results. By using these results we can test the summarizability of any node Y from the fact node which possesses the measure m. This fact node can be a root of the FMDS or any other node which is an antecedent of Y. Summarizability can be tested relative to each fact node which possesses a measure which is used in an analysis.
9 Designing and Implementing Multidimensional Systems

In this section we discuss the capacity of our model to support a systematic approach for the development of a multidimensional system which covers all the steps of the life cycle and which integrates the specification and checking of constraints. The main interest of this approach is that it works in a semi-automated manner. We show how an end user can control the design and implementation process through a Helping Development System (HDS), whose main characteristics we outline.

9.1 Overview of the Approach

A multidimensional system is often implemented over an existing data store. So, several works have shown an interest in the design of a multidimensional schema starting from the schema of a data store. Some works have recommended that the starting point could be a conceptual schema [44], and others a logical or a physical schema [20]. In fact these two recommendations are complementary. The conceptual schema provides the semantics of the data while the physical schema provides the exact names of the structures where the data is stored. Our approach can be used to extract useful information from both schemas.
[Figure: workflow with six steps and their inputs/outputs — (1) extract an initial MDS from an existing source (data store, …) (optional), using the conceptual schema of an existing source; (2) integrate existing MDS or FMDS and elaborate the FMDS; (3) consolidate or revise the FMDS, including constraints, producing the FMDS of the system; (4) test summarizability (optional), looping back if summarizability is not OK; (5) map into the relational model, generate tables and code constraints, producing the relational schema of the system; (6) load the data and verify the constraints, producing the relational database of the system.]
Fig. 13. Overall process for designing and implementing a multidimensional system
manu_operation(manu_operation_key, manu_number, operation_name, manual_machine_name, automatic_machine_name, serial_number, duration)
  pk: manu_operation_key
  fk: manu_number references manufacturing(manu_number) not null
  fk: manual_machine_name references manual_machine(manual_machine_name)
  fk: automatic_machine_name references automatic_machine(automatic_machine_name)
manufacturing(manu_number, manu_date, quantity, product_number)
  pk: manu_number
  fk: product_number references product(product_number) not null
manual_machine(machine_name, unit_cost, qualification)
  pk: machine_name
automatic_machine(machine_name, unit_cost, loading_duration)
  pk: machine_name
product(product_number, product_name, category_name)
  pk: product_number
Fig. 14. Schema of a relational data store
The overall process is given in Figure 13. The first four phases are related to the design of the multidimensional schema; the last two control the implementation. We suppose that the data store is supported by a relational system; indeed, most existing databases and data stores are implemented with relational technology. We also suppose that the primary key, foreign key and not null constraints are explicitly declared.

We illustrate the process with a simple example of a relational data store which is a partial implementation of the conceptual schema of Figure 1. A description of the different tables is given in Figure 14. For each table we have reported the constraints (primary key, foreign key, not null) which are declared in its SQL creation script. Note that the specialization of the machine class has been implemented by using two separate tables for the two specialized classes. With relational technology there is no declarative solution to impose the {disjoint, complete} constraint; we suppose that this constraint is controlled through a trigger.

Our approach does not suppose that all the useful data are stored in a single source. It allows for the integration, at the design stage, of data provided by existing MDS or FMDS. We propose an implementation of the multidimensional system with relational technology.

9.2 Details on the Six Steps

Step 1: Extract an initial MDS from an existing source (data store, …)

The HDS (Helping Development System) performs this stage automatically from user requirements. The user can stipulate what he requires: i) a complete MDS of the data store, ii) a partial MDS built from a given class (which will therefore be the root), or iii) the smallest MDS containing given classes and/or attributes supplied as parameters. The last two options, ii) and iii), are very useful when the data store includes a great number of classes and attributes. The HDS shows the useful part of the conceptual schema to allow the user to choose these parameters. Figure 15 shows the MDS which can be extracted from the relational data store given in Figure 14. The aggregating associations can be extracted from the conceptual schema and/or from the relational schema. It should be noted that declarations of foreign keys are
very convenient for extracting the aggregating associations, but they cannot generally provide all the cardinalities. The constraints deduced by the HDS are visualized by using the graphic representations of Section 6. Note that the {d,t} constraint for the aggregating associations to Manual_machine and Automatic_machine was deduced from the conceptual schema; indeed, it is difficult to induce its precise semantics from the code of the trigger. To study summarizability it is important that the HDS recovers all the possible constraints which have been declared or implemented on the data store.
[Figure 15 shows the extracted MDS as a class diagram with five classes: Manu_operation (manu_operation_key {id}, operation_name, serial_number, duration), Manufacturing (manu_number {id}, manu_date, quantity), Product (product_number {id}, product_name, category_name), Manual_machine (manual_machine_name {id}, unit_cost, qualification) and Automatic_machine (automatic_machine_name {id}, unit_cost, loading_duration), linked by aggregating associations, with the {d,t} constraint on the associations towards the two machine classes.]
Fig. 15. The MDS extracted from the relational source of Figure 14
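As an illustration of the kind of metadata the HDS can exploit in this step (the query below is ours, not part of the HDS), the declared foreign keys of a source managed by a DBMS that exposes the standard INFORMATION_SCHEMA views can be enumerated as candidate aggregating associations. As noted above, such declarations give the associations but not, in general, their cardinalities.

-- Hypothetical helper query (PostgreSQL-style INFORMATION_SCHEMA):
-- each declared foreign key is a candidate aggregating association
-- from the child (referencing) table to the parent (referenced) table.
SELECT kcu.table_name  AS child_table,
       kcu.column_name AS child_column,
       ccu.table_name  AS parent_table
FROM   information_schema.referential_constraints rc
JOIN   information_schema.key_column_usage kcu
       ON kcu.constraint_name = rc.constraint_name
JOIN   information_schema.constraint_column_usage ccu
       ON ccu.constraint_name = rc.unique_constraint_name;
-- (For brevity the schema columns are not matched in the joins; a real tool would do so.)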
The example which we considered is very simple. In reality sources are more complex and require more elaborate techniques to extract the elements of interest for the design of the MDS.

Step 2: Integrate existing MDS or FMDS and elaborate the FMDS

Existing multidimensional schemas can be integrated on the basis of Property 3. We illustrated this with the case of Figure 3. Elaborating the FMDS consists in choosing the fact classes and the measures.

Step 3: Consolidate or revise the FMDS (including constraints)

At this level the user can make a certain number of changes to the initial schema to take his needs into account more precisely. First, he can delete attributes or classes that are of no use for the analysis. In the case of a class deletion, the HDS must install the various aggregating associations which can be deduced between the remaining classes (by using the transitive property). The user can also change the names of attributes and element types. These various modifications require the HDS to store the correspondences between the elements of the current state of the multidimensional schema and the elements of the physical source, so as to allow the later loading of the data. Finally, the user can strengthen existing constraints or add new ones.
For example, in the case of Figure 15, suppose that the user does not plan to use the Manufacturing element type; he can therefore request its deletion. This results in the new schema of Figure 16, where a direct reference to Product has been installed from the Manu_operation class.

[Figure 16 shows the revised FMDS: the fact class Manu_operation {f} (manu_operation_key {id}, operation_name, serial_number, duration {m}) with aggregating associations to Product (product_number {id}, product_name, category_name), Manual_machine (manual_machine_name {id}, unit_cost, qualification) and Automatic_machine (automatic_machine_name {id}, unit_cost, loading_duration), the last two still constrained by {d,t}.]
Fig. 16. The FMDS after revision by the user
Meanwhile the HDS must update its mapping table in order to keep track of this change and to be able to recover the correct data from the source. In Figure 17 we have represented the part of the mapping table which takes the former deletion into account.

Element of the FMDS: manu_operation.product_number
Corresponding element of the source: manu_operation.manu_number → manufacturing.product_number
Fig. 17. Extract of the mapping table
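For concreteness, such a mapping table can itself be kept as an ordinary relational table. The following DDL is only a hypothetical sketch (the table and column names are ours, not the HDS implementation), populated with the single correspondence of Figure 17.

-- Hypothetical mapping table maintained by the HDS
create table hds_mapping (
  fmds_element   varchar(128) not null,  -- element of the FMDS, e.g. 'manu_operation.product_number'
  source_element varchar(256) not null,  -- corresponding element (possibly a join path) in the data store
  primary key (fmds_element)
);

insert into hds_mapping values
  ('manu_operation.product_number',
   'manu_operation.manu_number -> manufacturing.product_number');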
Step 4: Test summarizability

This step is optional. It consists in using the algorithms of Section 8 to test the summarizability of the FMDS obtained in the previous step. In the case of the FMDS of Figure 16, the three leaf nodes are summarizable or partially summarizable from the root relative to the measure "duration".

Step 5: Map into the relational model (generate tables and code for checking constraints)

We have said that there are tools for generating SQL code from OCL, which means the constraints can be verified. We have experimented with OCLtoSQL [10] to verify this possibility. Moreover, OCLtoSQL generates the tables of the database from the conceptual schema. The mapping proposed by OCLtoSQL is quite straightforward: it maps each class Ci into a table Ti. Aggregating associations between classes are implemented via foreign keys (recall that an aggregating association corresponds to a many-to-one association).
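Figure 18 (an image, not reproduced here) shows the schema actually generated by OCLtoSQL for this example. As a rough, hand-written approximation of such a class-per-table mapping for the FMDS of Figure 16, one might obtain tables of the following shape (the column types are illustrative assumptions, not the generated ones):

create table product (
  product_number varchar(32) primary key,
  product_name   varchar(64),
  category_name  varchar(64)
);
create table manual_machine (
  manual_machine_name varchar(64) primary key,
  unit_cost           numeric,
  qualification       varchar(64)
);
create table automatic_machine (
  automatic_machine_name varchar(64) primary key,
  unit_cost              numeric,
  loading_duration       numeric
);
create table manu_operation (
  manu_operation_key     varchar(32) primary key,
  operation_name         varchar(64),
  serial_number          varchar(32),
  duration               numeric,  -- the measure {m}
  product_number         varchar(32) references product(product_number),
  manual_machine_name    varchar(64) references manual_machine(manual_machine_name),
  automatic_machine_name varchar(64) references automatic_machine(automatic_machine_name)
);

Note that the {d,t} constraint does not appear declaratively in such a schema; it is checked by the query generated from the OCL constraint, as discussed below.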
The constraints specified in the FMDS must first be translated into OCL. For the moment we translate them manually, but it would not be hard to implement a module which does this work automatically. This double translation may seem heavy, but it has the great advantage of making the design and implementation phases independent and of strengthening an MDA approach. Indeed, starting with the same OCL code for constraints, we can envisage using another technology for the implementation.
Fig. 18. Relational schema generated by OCLtoSQL
To experiment with OCLtoSQL we used the example of Figure 16. The {d,t} constraint is expressed in OCL as follows:

context Manu_operation
inv: self.Manual_machine_name->notEmpty xor
     self.Automatic_machine_name->notEmpty
Figure 18 shows the relational database generated by OCLtoSQL and Figure 19 shows the SQL query produced for verifying the {d,t} constraint. This query returns all Manu_operation instances that do not satisfy the constraint; thus, if the query returns no data, the database complies with the constraint. Note that OCLtoSQL produces SQL queries that reference views. For instance, OV_MANU_OPERATION in Figure 19 is a view that gives access to the data in the table MANU_OPERATION. The declaration of these views is also automatically generated by OCLtoSQL. It should also be noted that we modified the configuration file of the parser of the OCLtoSQL generator so that the function notEmpty can be used to check NULL values.
Fig. 19. SQL query produced by OCLtoSQL to verify the {d,t} constraint of Figure 16
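Figure 19 itself is an image and is not reproduced here. The following hand-written query has the same flavour: it declares a view playing the role of OV_MANU_OPERATION and returns the Manu_operation rows that violate {d,t}. The exact SQL generated by OCLtoSQL differs in form; this is only a sketch.

-- Illustrative view, standing in for the OV_MANU_OPERATION view generated by OCLtoSQL
create view OV_MANU_OPERATION as
  select * from manu_operation;

-- Rows returned by this query violate the {d,t} constraint
-- (exactly one of the two machine references must be non-null).
select *
from   OV_MANU_OPERATION
where  not (   (manual_machine_name is not null and automatic_machine_name is null)
            or (manual_machine_name is null     and automatic_machine_name is not null) );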
The proposed mapping can be used to represent hierarchies explicitly, but it raises performance problems, notably for the joins between tables which are necessary in queries. These problems can be solved by using different techniques: indices, partitions, materialized views. Mappings are more efficient when the different classes of a hierarchy are regrouped in a single table, but the multidimensional representation is then lost. OCLtoSQL is open software which can easily be adapted; it would be possible to introduce the generation of other kinds of relational mappings.

Step 6: Load the data and verify the constraints

This step consists in loading the data into the different tables of the multidimensional relational schema. The mapping table is used to establish the correct correspondences with the tables and the attributes of the data store. In the case of our simple example there is a bijective correspondence for each table and attribute; the only exception is the attribute product_number of table manu_operation. The mapping table can be used to generate the SQL loading statement automatically. We show this statement below, where the tables of the data store are prefixed with ds:

insert into manu_operation
select ds_manu_operation.manu_operation_key, ds_manufacturing.product_number,
       ds_manu_operation.manual_machine_name, ds_manu_operation.automatic_machine_name,
       ds_manu_operation.manu_number, ds_manu_operation.operation_name,
       ds_manu_operation.serial_number, ds_manu_operation.duration
from   ds_manu_operation, ds_manufacturing
where  ds_manu_operation.manu_number = ds_manufacturing.manu_number;
This statement makes the appropriate join between the two data store tables manu_operation and manufacturing. Using the mapping table, we can design an algorithm that generates the SQL statements for loading the data. The two implementation steps of the multidimensional system can thus be made entirely automatic.
10 Conclusion and Perspectives

In this paper we propose a model which unifies the notions of fact and of dimension member. With this model a multidimensional structure is represented by an MDS (MultiDimensional Schema). An MDS can be viewed as a directed graph whose nodes are the classes and whose edges are the aggregating associations used for the analysis. Any node in the graph can be chosen as a fact; a fact is then analyzed through the paths which originate in this node. The notion of analysis dimension disappears and is replaced by the notion of analysis path. An MDS which possesses at least one fact becomes an FMDS (Fact MultiDimensional Schema). An FMDS can have several facts and can thus collect the analysis needs of several groups of users.

We presented our model by using an extension of UML with only five specializations: two specializations of classes (identified class, fact class), one specialization of associations (aggregating association) and two specializations of attributes (identifier, measure). We have shown that this model can represent traditional multidimensional structures: star structures, galaxy structures, facts of fact (which correspond to a many-to-many association between a fact and a dimension), and more complex structures where several roots share a common sub-graph.

Several classes of constraints have been defined in this model. Local constraints can be used to express conditions which the aggregating associations coming from a node have to satisfy, and we have proposed a very convenient way to represent them graphically. With global constraints it is possible to define conditions concerning several nodes. Using these constraints we have shown that our model can represent most of the hierarchies proposed in the literature. We have also shown that we were able to characterize new summarizability properties in a fairly simple way.

Finally, using this model as a basis, we studied a complete development process to design a multidimensional schema from the conceptual and physical schemas of a data store and to implement it with a relational DBMS. This process can be used to express constraints and to verify the summarizability properties. The design phase is assisted and the implementation phase can be made automatic. Thanks to such a process, the final user can himself control the development of his multidimensional system.

This model is very flexible since it can help to design multidimensional systems in different ways. Starting from a fact, it is possible to connect it with dimension hierarchies extracted from existing data. It is also possible to extract the whole multidimensional space (MDS) which is available from a source and then to choose the interesting facts and their analysis paths. We can also assemble different multidimensional spaces extracted from separate sources. This notion of multidimensional space
can be very useful since it constitutes a good reference to determine whether it is possible to satisfy all the decisional needs from existing data, and so to decide whether evolutions in the data organisation are necessary. From an MDS it is possible to derive several different FMDSs; an FMDS is a more specialized MDS adapted to the needs of an end user.

Several perspectives can be envisaged. The first is the design of a complete prototype for the development process. Different problems arise, but with the current state of the art it should be possible to find solutions for each of them. First of all, this process starts from two schemas, conceptual and physical, of a data store. It is indispensable that these two schemas be completely synchronized and that all the constraints expressed at the conceptual level be implemented at the physical level; the use of a development software platform based on an MDA approach can guarantee this property. It is also necessary to maintain the correspondences between the elements of the multidimensional schema and the elements of the physical schema of the data store, and then to exploit these correspondences to generate the statements for loading data. Solutions suggested for expressing mappings and rewriting queries in mediation systems can be exploited for this issue. Since our model is an extension of UML, we think that it is possible to adapt one of the numerous development platforms available in the UML environment.

An implementation using views would also be well worth studying, since data would no longer need to be loaded and refreshed. The main problem is the expression and the checking of the constraints. Another problem which can arise with views is performance, especially when the data volume is large; a variant with materialized views could therefore be more interesting [48].

It would be useful to characterize the expressive power of the framework we suggest for studying summarizability. Intuitively it is less expressive than that of [18] since it is based only on the schema, but we think it is sufficient to tackle the majority of real situations. In the future, our model could be extended to express the constraints of the framework of [18] and to permit the implementation of their algorithm for testing summarizability.

Another problem, which we have not studied in this paper, is the reuse of aggregate data between data cubes when an element plays different roles. With our model it is easy to add to every class of a dimension member an attribute which stores the aggregate corresponding to a given measure. This is a way to represent a cube at the conceptual level. Marking this type of attribute with the name of the measure would make it possible to represent in the same class several aggregates associated with different measures, and so to deal with different roles associated with different cubes. It would then be possible to make cross-analyses.

Finally, we have not tackled in this paper the problem of querying the multidimensional structure. We think that the FMDS with its UML notation can constitute the basis for an ergonomic interface for the end user: elements involved in a query can easily be located and marked by the user. It would be necessary to study how to specify the required results graphically.
References 1. Abelló, A., Samos, J., Saltor, F.: Understanding Analysis Dimensions in a Multidimensional Object-Oriented Model. In: Proc. of Intl. Workshop on Design and Management of Multidimensional structures, DMDW 2001, Interlaken, Switzerland, June 4 (2001) 2. Abelló, A., Samos, J., Saltor, F.: A framework for the classification and description of multidimensional data models. In: Mayr, H.C., Lazanský, J., Quirchmayr, G., Vogel, P. (eds.) DEXA 2001. LNCS, vol. 2113, pp. 668–677. Springer, Heidelberg (2001) 3. Abelló, A., Samos, J., Saltor, F.: Understanding facts in a multidimensional object-oriented model. In: Proceedings of the 4th ACM International Workshop on Data Warehousing and OLAP (DOLAP 2001), pp. 32–39 (2001) 4. Abelló, A., Samos, J., Saltor, F.: YAM2: A Multidimensional Conceptual Model Extending UML. Information Systems 31, 541–567 (2006) 5. Agrawal, R., Gupta, A., Sarawagi, S.: Modelling Multidimensional Databases. In: 13th International Conference on Data Engineering, ICDE 1997, Birmingham, UK, April 7-11, pp. 232–243 (1997) 6. Balsters, H.: Modelling Database Views with Derived Classes in the UML/OCLFramework. In: Stevens, P., Whittle, J., Booch, G. (eds.) UML 2003. LNCS, vol. 2863, pp. 295–309. Springer, Heidelberg (2003) 7. Blaschka, M., Sapia, C., Hofling, G., Dinter, B.: Finding your way through multidimensional data models. In: Quirchmayr, G., Bench-Capon, T.J.M., Schweighofer, E. (eds.) DEXA 1998. LNCS, vol. 1460, pp. 198–203. Springer, Heidelberg (1998) 8. Carpani, F., Ruggia, R.: An Integrity Constraints Language for a Conceptual Multidimensional Data Model. In: 13th International Conference on Software Engineering and Knowledge Engineering (SEKE 2001), pp. 220–227 (2001) 9. Datta, A., Thomas, H.: The Cube Data Model: A Conceptual Model and Algebra for online Analytical Processing in Multidimensional structures. Decision Support Systems 27(3), 289–301 (1999) 10. Demuth, B., Hußmann, H., Loecher, S.: OCL as a specification language for business rules in database applications. In: Gogolla, M., Kobryn, C. (eds.) UML 2001. LNCS, vol. 2185, pp. 104–117. Springer, Heidelberg (2001) 11. Giovinazzo, W.A.: Object-oriented data warehouse design: building a star schema. Prentice Hall PTR, Upper Saddle River (2000) 12. Ghozzi, F., Ravat, F., Teste, O., Zurfluh, G.: Constraints and Multidimensional Databases. In: ICEIS, pp. 104–111 (2003) 13. Golfarelli, M., Maio, D., Rizzi, S.: Conceptual Design of Multidimensional structures from E/R Schemes. In: Proc. of the 32th HICSS (1998) 14. Gyssens, M., Lakshmanan, V.S.: A Foundation for Multi-dimensional Databases. In: Proc. of the 23rd Conference on Very Large Databases, pp. 106–115 (1997) 15. Hahn, K., Sapia, C., Blaschka, M.: Automatically Generating OLAP Schemata from Conceptual Graphical Models. In: Proc. of the 3rd ACM International Workshop on Data Warehousing and OLAP, DOLAP 2000, McLean, Virginia, USA (2000) 16. Herden, O.: A Design Methodology for Data Warehouses. In: Proc. of the CAISE Doctoral Consortium, Stockholm (2000) 17. Hurtado, C.A., Mendelzon, A.O.: OLAP Dimension Constraints. In: PODS 2002, pp. 169– 179 (2002) 18. Hurtado, C.A., Gutiérrez, C., Mendelzon, A.O.: Capturing summarizability with integrity constraints in OLAP. ACM Trans. Database Syst. (TODS) 30(3), 854–886 (2005) 19. Hùsemann, B., Lechtenbörger, J., Vossen, G.: Conceptual Multidimensional Structures Design. In: Proc. of Intl. Workshop on Design and Management of Multidimensional structures (DMDW 2000), Stockholm, Sweden, June 5-6 (2000)
20. Jensen, M.R., Holmgren, T., Pedersen, T.B.: Discovering multidimensional structure in relational data. In: Kambayashi, Y., Mohania, M., Wöß, W. (eds.) DaWaK 2004. LNCS, vol. 3181, pp. 138–148. Springer, Heidelberg (2004) 21. Kimball, R.: The Data Warehouse Toolkit. John Wiley & Sons Inc., New York (1996) 22. Kimball, R., Reeves, L., Ross, M., Thornwaite, W.: The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses. John Wiley & Sons Inc., New York (1998) 23. Klasse Objecten: OCL tools Web site (2005), http://www.klasse.nl/ocl 24. Lechtenbôrger, J., Vossen, G.: Multidimensional Normal Forms for Multidimensional structure Design. Information Systems 28(5) (2003) 25. Lehner, W.: Modeling Large Scale OLAP Scenarios. In: Schek, H.-J., Saltor, F., Ramos, I., Alonso, G. (eds.) EDBT 1998. LNCS, vol. 1377, pp. 153–167. Springer, Heidelberg (1998) 26. Lehner, W., Albrecht, J., Wedekind, H.: Normal Forms for Multidimensional Data Bases. In: l0th Intl. Conference on Scientific and Statistical Data Management (SSDBM 1998), Capri, Italy, pp. 63–72 (1998) 27. Lenz, H.J., Shoshani, A.: Summarizability in OLAP and Statistical Data Bases. In: Proc. of the 9th SSDBM, pp. 132–143 (1997) 28. Levene, M., Loizou, G.: Why is the Star Schema a Good Multidimensional structure Design (1999), http://citeseer.ni.nec.com/457156.html 29. Levene, M., Loizou, G.: Why is the Snowflake Schema a Good Multidimensional structure Design. Information Systems 28(3), 225–240 (2003) 30. Li, C., Wang, X.S.: A Data Model for Supporting on-line Analytical Processing. In: Proc. of the Fifth International Conference on Information and Knowledge Management, pp. 81– 88 (1996) 31. Luján-Mora, S., Trujillo, J., Song, I.Y.: A UML Profile for Multidimensional Modeling in Data Warehouses. Data & Knowledge Engineering 59, 725–769 (2006) 32. Malinowski, E., Zimányi, E.: Hierarchies in a multidimensional model: From conceptual modeling to logical representation. Data Knowl. Eng. (DKE) 59(2), 348–377 (2006) 33. Malinowski, E., Zimanyi, E.: Advanced Data Warehouse Design: From Conventional to Spatial and Temporal Applications, 435 pages. Springer, Heidelberg (2008) 34. Mansmann, S., Scholl, M.H.: Empowering the OLAP Technology to Support Complex Dimension Hierarchies. International Journal of Data Warehousing and Mining 3(4), 31– 50 (2007) 35. Mansmann, S., Scholl, M.H.: Extending the Multidimensional Data Model to Handle Complex Data. Journal of Computing Science and Engineering 1(2), 125–160 (2007) 36. Mazon, J.N., Trujillo, J.: An MDA Approach for the Development of Data Warehouses. Decision Support Systems (45), 41–58 (2008) 37. Moody, L.D., Kortink, M.A.R.: From Enterprise Models to Dimensional Models: A Methodology for Multidimensional Structure and Data Mart Design. In: Proc. of the International Workshop on Design and Management of Multidimensional structures, DMDW 2000, Stockholm, Sweden (2000) 38. Niemi, T., Nummenmaa, J.: Logical Multidimensional Database Design for Ragged and Unbalanced Aggregation Hierarchies. In: Proc. of the International Workshop on Design and Management of Multidimensional structures, DMDW 2001, Interlaken, Switzerland (2001) 39. Nguyen, T.B., Tjoa, A.M., Wagner, R.: An Object Oriented Multidimensional Data Model for OLAP. In: Lu, H., Zhou, A. (eds.) WAIM 2000. LNCS, vol. 1846, pp. 69–82. Springer, Heidelberg (2000) 40. OMG: OCL 2.0 specification version 2.0. OMG specification, 185 pages (2005), http://www.omg.org
41. Pedersen, T.B., Jensen, C.S.: Multidimensional Data Modelling for Complex Data. In: Proc. of the Intl. Conference on Data Engineering, ICDE 1999, pp. 336–345 (1999) 42. Pedersen, T.B., Jensen, C.S., Dyreson, C.E.: A foundation for capturing and querying complex multidimensional data. Information Systems 26(5), 383–423 (2001) 43. Pourabbas, E., Rafanelli, M.: Characterization of Hierarchies and some Operators in OLAP Environment. In: DOLAP 1999, Kansas City, USA, pp. 54–59 (1999) 44. Prat, N., Akoka, A., Comyn-Wattiau, I.: A UML-based data warehouse design method. Decision Support Systems (DSS) 42(3), 1449–1473 (2006) 45. Rafanelli, M., Shoshani, A.: STORM: A Statistical Object Representation Model. In: Statistical and Scientific Data Base Management Conference (SSDBM), pp. 14–29 (1990) 46. Song, I.Y., Medsker, C., Rowen, W., Ewen, E.: An Analysis of Many-to-Many Relationships Between Fact and Dimension Tables in Dimension Modeling. In: Proceedings of the Int’l. Workshop on Design and Management of Data Warehouses, pp. 6-1–6-13 (2001) 47. Vassiliadis, P., Skiadopoulos, S.: Modelling and Optimisation Issues for Multidimensional Databases. In: Proc. of the 12th Intl. Conference CAISE, Stockholm, Sweden, pp. 482–497 (2000) 48. Theodoratos, D., Ligoudistianos, S., Sellis, T.: View selection for designing the global Multidimensional structure. Data & Knowledge Engineering 39, 219–240 (2001) 49. Trujillo, J., Palomar, M.: An object Oriented Approach to Multidimensional Database Conceptual Modeling (OOMD). In: Proc. of the 1st ACM international workshop on Data warehousing and OLAP, Washington, pp. 16–21 (1998) 50. Tsois, A., Karayannidis, N., Sellis, T.: MAC: Conceptual Data Modeling for OLAP. In: Proc. of the Intl Workshop on Design and Management of Multidimensional structures (DMDW 2001), Interlaken, Switzerland, June 4 (2001) 51. Warmer, J., Kleppe, A.: OCL: The constraint language of the UML. Journal of ObjectOriented Programming 12, 13–28 (1999)
Modeling Data Warehouse Schema Evolution over Extended Hierarchy Semantics

Sandipto Banerjee (1) and Karen C. Davis (2)

(1) MicroStrategy, Inc., 1861 International Drive, McLean, VA USA 22102
[email protected]
(2) Electrical & Computer Engineering Dept., University of Cincinnati, Cincinnati, OH USA 45221-0030
[email protected]
Abstract. Models for conceptual design of data warehouse schemas have been proposed, but few researchers have addressed schema evolution in a formal way and none have presented software tools for enforcing the correctness of multidimensional schema evolution operators. We generalize the core features typically found in data warehouse data models, along with modeling extended hierarchy semantics. The advanced features include multiple hierarchies, noncovering hierarchies, non-onto hierarchies, and non-strict hierarchies. We model the constructs in the Uni-level Description Language (ULD) as well as using a multilevel dictionary definition (MDD) approach. The ULD representation provides a formal foundation to specify transformation rules for the semantics of schema evolution operators. The MDD gives a basis for direct implementation in a relational database system; we define model constraints and then use the constraints to maintain integrity when schema evolution operators are applied. This paper contributes a formalism for representing data warehouse schemas and determining the validity of schema evolution operators applied to a schema. We describe a software tool that allows for visualization of the impact of schema evolution through the use of triggers and stored procedures. Keywords: Data warehouse conceptual modeling, data warehouse schema evolution.
1 Introduction

There are a number of conceptual models proposed in the literature for data warehouse design. A few examples include Star [11], Snowflake [11], ME/R [42], Cube [45], StarER [43], DFM [18], and EVER [4]. All of these models represent facts containing measures that are aggregated over dimensions. A dimension consists of levels and a hierarchy defines a relationship between the levels. Thus a multidimensional model consists of facts, dimensions, measures, levels, and hierarchies that typically conform to the following constraints:
1. A many-to-one relationship between a fact and a dimension (a fact instance relates to one dimension instance for each dimension, whereas a dimension instance may relate to many fact instances).
2. A many-to-one (roll-up) relationship between the levels of a dimension.
3. Hierarchies in a dimension have a single path to roll-up or drill-down.
We refer to these features as the core features of a data warehouse conceptual model. Pedersen and Jensen [35, 36], Hümmer et al. [22], and Tsois et al. [44] discuss shortcomings of traditional models for adequately capturing real-world scenarios. Each paper presents a different set of features; we adopt four of the most common to extend the traditional semantics of hierarchies in data warehouses. In our work, we consider the following:
• multiple hierarchies: a dimension can have multiple paths to roll-up or drill-down information.
• non-covering hierarchies: a parent level in a non-covering hierarchy is an ancestor (not immediate parent) of the child level (our graphical convention is illustrated in Figure 1(a)).
• non-onto hierarchies: an instance of a parent level in a dimension can exist without a corresponding data instance in the child level to drill-down to (Figure 1(b)).
• non-strict hierarchies: two levels in a dimension can have a many-to-many relationship (Figure 1(c)).
Non-covering, non-onto, and non-strict may be arbitrarily combined; notation for a non-covering and non-strict relationship (the case of a many-to-many relationship between an ancestor and a child level) is shown as one example in Figure 1(d)).
(a) non-covering    (b) non-onto    (c) non-strict    (d) non-covering and non-strict
Fig. 1. Notation for Extended Hierarchies in Dimensions
In order to illustrate the extended hierarchy semantics and schema evolution (later in the paper), we combine three examples from the literature. The schema shown in Figure 2 uses a modified DFM notation; the dimension name is added as a first level in order to use a different name for the dimension rather than the default value of the root level. As in DFM, the levels are shown with the finer granularity rolling up to the coarser granularity, e.g., Co-ordinate is the finest granularity in the Location dimension and Country is the most general. Golfarelli et al. present an example of a Sale fact schema with measures Qty Sold, Revenue, and No. of Customers [19]. Hurtado et al. present an example of a Product dimension [25]. The Product dimension models multiple hierarchies because the level Corporation drills-down to Company or Category and the level Item rolls-up to Brand or Category. The paths of the dimension that describe the roll-up from the lowest to the highest hierarchy level are Item, Category, Corporation and Item, Brand, Company, Corporation.
Jensen et al. present an example of Time and Location dimensions [26]. The Location hierarchies are defined such that Country drills-down to Province, and Province drills-down to City, for example. A non-onto hierarchy is defined for City and IP add because some cities do not have an internet provider address. Three non-strict hierarchies are defined: (1) between District and Street because a street might belong to many districts, (2) between City and Cell because a cell can be shared by many cities, and (3) between Province and Cell because a cell can be shared by many provinces. Three non-covering hierarchies are defined: (1) between District and Co-ordinate because some co-ordinates may not be in streets, (2) between Province and District because some districts may not belong to a city, and (3) between City and Co-ordinate because some co-ordinates may not be in a street or a district.
Fig. 2. Sale Schema
The diversity of models and schemas in practice motivates use of schema/model management tools such as a multilevel dictionary approach [1] to allow interoperability as well as generic reporting and analysis tools. In this paper, we define a formal metamodel for data warehouse core features using ULD [5, 6, 7] along with a multilevel dictionary definition (MDD). The ULD definition provides formal semantics while the MDD structures can be directly implemented in a relational database system. In addition to the basic structural constructs of a data warehouse model, we also define schema evolution operators and the constraints that must be satisfied to ensure schema correctness. The correctness of these operators is enforced by stored procedures and triggers. It is important to investigate and specify schema evolution for several reasons. The data warehouse environment is dynamic; a few examples include changes due to changing user needs (e.g., requests for additional information or faster responses) and changing sources. Often data warehouses are initially designed as data marts. Growth, extension, and modification are expected to occur with a data mart.
Previously authors have discussed operators supporting schema evolution over the core data warehouse features. The schema evolution operators defined by Quix [39] and Bouzeghoub and Kedad [8] are primarily concerned with view modification and maintenance and are not further considered here. Our approach focuses on schema correctness rather than synchronization of data with source schemas. Blaschka et al. [9] propose a formal model that is conceptually similar to ours and use an algebra to define schema evolution. Hurtado et al. [25] focus on changes to dimensions only. Chen et al. [10] and Golfarelli et al. [17] give models that rely on general schema evolution operators. Chen et al. have operators for adding, deleting, and renaming tables and attributes, while Golfarelli et al. introduce a graph-based formalism that allows adding and deleting nodes and edges. The nodes represent entities (facts, measures, and dimensions) while the edges represent functional dependencies. None of these data warehouse schema evolution papers introduce a tool for enforcing changes as we do here. The tool allows for rapid creation of a data warehouse schema and immediate checking of the validity of schema evolution operations. Although authors have discussed operators supporting data warehouse schema evolution [8, 9, 10, 17, 25, 28, 39], none of them consider advanced data warehouse features such as non-strict hierarchies, non-covering hierarchies, and non-onto hierarchies. Extending the semantics of hierarchies increases the expressive power of the model and thus creates new challenges for enforcing correctness when schemas change back and forth between traditional and extended semantics. For example, if a non-covering hierarchy is deleted, such as the relationship between Province and District (where a district may not belong to a city, but it may belong to a province), then a valid hierarchy must be in place between the affected levels. Since an alternate path is defined such that a district can roll-up to a city, and a city rolls-up to a province, the deletion of the non-covering hierarchy is valid. Additional operators to adjust the data instances would be required but are not the focus of this paper. In Sections 2 and 3, we introduce a formal model and constraints for extended hierarchy semantics. In Section 4, we describe a multilevel dictionary implementation of the model and use the Sale schema in Figure 2 to illustrate schema evolution operations. We summarize our contributions and discuss directions for future work in Section 5.
2 Model

The Uni-level Description Language (ULD) [5, 6, 7] provides a metadata description technique that can be used to represent a broad range of data models. ULD has been used to represent a variety of models such as ER, RDF and XML as well as provide the basis for a universal browsing tool for information originating in diverse data models. We use a ULD as well as a Multilevel Dictionary Definition (MDD) approach [1] to define our model and schema evolution operators. ULD facilitates modeling as it provides a uniform representation of data, schema, model, and metamodel layers, and their interrelationships. The ULD representation also provides a formal foundation to specify transformation rules for the semantics of schema evolution operators. The MDD supports description of target models and schemas in a tabular format. The MDD gives a basis for direct implementation of the constructs of data models/schemas in a relational database system. Triggers and stored procedures can be
used to implement the semantics of schema creation and evolution that we formally define in the ULD. Essentially, we first define the constructs and their semantics formally in ULD, then we represent the constructs in MDD for ease of understanding and implementation, and finally we implement the MDD constructs in a relational database system using triggers and stored procedures to enforce semantics.

The ULD metamodel is used to define the core features of a data warehouse. Figure 3 summarizes the formal definition and highlights the extended features in italics. The ULD construct types we use are setct and structct, where the former represents a set of objects and the latter represents a structured object with subcomponents. The notation "construct c := [x => y]" represents the construct type structct, where the expression "x => y" represents a component of the construct c, x is called the component selector, and y is the type of the component. The setct construct type is represented as set-of. For example, a FactSet is defined as a set-of Fact, where a Fact is a structct with subcomponents for name, measures, attributes, and primary key fields. Constructs can be related to each other by the conformance (::) construct. A conformance construct can be further specified using first-order logic constraints. We use the notation "[n:1]" to represent the cardinality between related constructs. For example, the cardinality between levels in a Hierarchy is "[1:n]" because a parent can have n children and a child can have 1 parent. A Fact instance relates to one instance of a Dimension ("Dimension cardinality [n:1]") while a Dimension has "[1:n]" cardinality to a Fact. A non-strict hierarchy (NShierarchy) has a many-to-many ("[m:n]") Level cardinality.

construct Schema := [Sname => uld-string, Sfact => FactSet, Sdimension => DimensionSet]
construct FactSet := set-of Fact
construct DimensionSet := set-of Dimension
construct Fact := [Fname => uld-string, Fattribute => AttributeSet, Fmeasure => MeasureSet, FPkey => FactPKeySet] :: Dimension cardinality [n:1]
construct MeasureSet := set-of Measure
construct Measure := [Mname => uld-string, Mdomain => number]
construct FactPKeySet := set-of FactPkey
construct FactPkey := [SubkeyName => uld-string]
construct AttributeSet := set-of Attribute
construct Attribute := [AttributeName => uld-string]
construct LevelSet := set-of Level
construct HierarchySet := set-of Hierarchy
construct Level := [Lname => uld-string]
construct Hierarchy := [Hparent => Level, Hchild => Level] :: Level cardinality [1:n]
construct Cube := [Cname => uld-string, Cfact => Fact, Cdimension => DimensionSet]
construct Dimension := [Dname => uld-string, Dlevel => LevelSet, DPkey => uld-string, Dhierarchy => HierarchySet, Dpath => PathSet] :: Fact cardinality [1:n]
construct PathSet := set-of Path
construct Path := set-of PathHierarchy
construct PathHierarchy := [PHierarchy => Hierarchy]
construct NChierarchy := [NCparent => Level, NCchild => Level] :: PathHierarchy
construct NOhierarchy := [NOparent => Level, NOchild => Level] :: Hierarchy
construct NShierarchy := [NSparent => Level, NSchild => Level] :: Hierarchy :: Level cardinality [n:m]

Fig. 3. Data Warehouse Constructs in ULD
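As a purely illustrative sketch of what such a relational dictionary could look like (the table and column names below are our own, not the MDD layout of [1] or of Section 4), each core construct of Figure 3 can be stored in a dictionary table, with foreign keys reflecting the component relationships:

-- Hypothetical fragment of an MDD-style dictionary (illustrative names only)
create table dw_schema    (sname   varchar(64) primary key);
create table dw_fact      (fname   varchar(64) primary key,
                           sname   varchar(64) references dw_schema(sname));
create table dw_dimension (dname   varchar(64) primary key,
                           dpkey   varchar(64),
                           sname   varchar(64) references dw_schema(sname));
create table dw_level     (lname   varchar(64),
                           dname   varchar(64) references dw_dimension(dname),
                           primary key (lname, dname));
create table dw_hierarchy (dname   varchar(64),
                           hparent varchar(64),
                           hchild  varchar(64),
                           primary key (dname, hparent, hchild),
                           foreign key (hparent, dname) references dw_level(lname, dname),
                           foreign key (hchild,  dname) references dw_level(lname, dname));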
A data warehouse schema consists of facts and dimensions. A Schema construct is defined by Sname, a unique name, Sfact, a set of all facts in a schema, and Sdimension, a set of all dimensions in a schema. A fact consists of measures, attributes and a primary key. A Fact construct is defined by Fname, Fmeasure, Fattribute and FPkey. Fname is a unique name, and Fmeasure is of the type MeasureSet that represents a set of measures in a fact. The Measure construct has a unique name and a domain number. Fattribute is of the type AttributeSet that represents a set of attributes in a fact. Attributes are descriptive (non-aggregable) while measures can be aggregated. FPkey is of the type FactPKeySet that represents a set of subkeys that form the concatenated primary key for a fact.

A dimension in the core model consists of levels, hierarchies and a primary key. A Dimension construct is defined by Dname, Dlevel, DPkey, and Dhierarchy levels. A Dname indicates a unique name for a dimension. A Dlevel is of the type LevelSet that represents a set of levels in a dimension. DPkey is the primary key for a dimension. It can be extended to be represented as a concatenated key similar to the primary key of the Fact construct. Dhierarchy is of the type HierarchySet that represents a set of hierarchies among the levels of a dimension. A Level construct consists of a unique name and represents the dimension level over which the measures are aggregated. The roll-up or drill-down relationship between two levels of a dimension is captured by the construct Hierarchy. The construct Hierarchy consists of Hparent and Hchild. Hchild and Hparent convey the information that a child level rolls-up to the parent level. To represent a dimension with multiple hierarchies the construct definition of a dimension is modified to include Dpath that represents a hierarchy path in a dimension. A Dpath construct is of the type PathSet, a set of paths in a dimension. A unique name, a set of levels, a set of hierarchies, a set of paths over which the data is aggregated, and a primary key define an extended dimension construct. A hierarchy instance consists of an instance of a parent level, an instance of a child level, and a conformance relationship to a hierarchy. The conformance relationship maps the hierarchy instance to a hierarchy construct.

A Cube construct represents a fact and the dimensions connected to a fact. Some authors also define the cube as a view [8, 39]. In our research we define a cube by Cname, Cfact, and Cdimension. Cname is a unique name identifying a cube. Cfact is a fact and Cdimension is a set of dimensions that are connected to that fact. When a cube is created, the measures of the fact are aggregated over the levels of the dimensions. Consider a scenario where a cube called SalesCube is created when a Sales fact with measures Cost Price and Sales Price is connected to the dimensions Time and Product. The levels of the Time dimension are Year, Month and Day and the levels of the Product dimension are Product Type and Product Color. When a cube is created the measures Cost Price and Sales Price are aggregated over the levels Year, Month, Day, Product Type, and Product Color.

We represent a non-covering hierarchy by mapping the instances of the levels that exhibit non-covering hierarchy behavior [31]. We define a non-covering hierarchy between two levels as:

construct NChierarchy := [NCparent => Level, NCchild => Level] :: Path
The NChierarchy construct represents a non-covering hierarchy between two levels along a path in the dimension. It consists of a parent level, a child level and a conformance relation to a path. For example, a non-covering hierarchy is defined for the levels Country, City and a conformance relation to the path p1 (Country, State, City). The path represents a set of hierarchies in a dimension. The parent and child level in the non-covering hierarchy have a conformance relationship to a dimension path such that the parent level is an ancestor of the child level and not a direct parent. The construct NCInstance is called a non-covering hierarchy instance and represents the instances of the levels that exhibit non-covering hierarchy semantics. The construct NCInstance consists of a parent level instance, a child level instance and a conformance relationship to a non-covering hierarchy. The difference between a non-covering hierarchy instance and a direct hierarchy instance is the conformance relationship.

A non-onto hierarchy occurs when an instance of a parent level does not have an instance of the child level to drill-down to. We define a non-onto hierarchy between two levels as:

construct NOhierarchy := [NOparent => Level] :: Hierarchy

The NOhierarchy construct represents a non-onto hierarchy and consists of a parent level and a conformance relation to an existing hierarchy in a dimension. The parent level represents the level that has instances that cannot drill-down to instances of the child level in a hierarchy.

A non-strict hierarchy occurs when there is a many-to-many relationship between the parent and child level in a hierarchy. A non-strict hierarchy between two levels of a dimension is defined as:

construct NShierarchy := [NSparent => Level, NSchild => Level] :: Hierarchy :: Level cardinality [m:n]

The NShierarchy construct represents a non-strict hierarchy and consists of a parent level, a child level and a conformance relation to an existing hierarchy in a dimension. The parent level represents the level where an instance drills-down to more than one instance of the child level in a hierarchy. The correctness and consistency of the constructs in our model are maintained by introducing constraints discussed in the next section.
3 Constraints

Addition/deletion constraints, represented in first-order logic, are used to maintain the correctness of the model when a fact, dimension, level, hierarchy or cube is added/deleted. In this section, the core constraints for addition and deletion are introduced first, followed by a discussion of the constraints for the extended hierarchy semantics. Addition constraints AC1-AC4 in Table 1 enforce the semantics that a fact has to be added to an existing schema, a measure must belong to a fact, a dimension belongs to a schema, and a level can only be added to an existing dimension, respectively, for
the core semantics. As an example, AC1 states that for any x that is a construct instance of a fact (c-inst(x, Fact)), there must be a schema z (c-inst(z, Schema)) that has a set of facts as a component (member-of(Sfact, z)) to which x is an instance (c-inst(x, Sfact)) if x is to be added as a fact in schema z.

Table 1. Addition Constraints (Core Features)

AC1 (Fact): ∀(x) ∃(z) c-inst(x, Fact) /\ c-inst(z, Schema) /\ member-of(Sfact, z) /\ addFact(x, z) → c-inst(x, Sfact)
AC2 (Measure): ∀(x) ∃(y) c-inst(x, Measure) /\ c-inst(y, Fact) /\ member-of(Fmeasure, y) /\ addMeasure(x, y) → c-inst(x, Fmeasure)
AC3 (Dimension): ∀(x) ∃(z) c-inst(x, Dimension) /\ c-inst(z, Schema) /\ member-of(SDimension, z) /\ addDimension(x, z) → c-inst(x, SDimension)
AC4 (Level): ∀(x) ∃(y) c-inst(x, Level) /\ c-inst(y, Dimension) /\ member-of(Dlevel, y) /\ addLevel(x, y) → c-inst(x, Dlevel)
AC5 (Hierarchy):
(a) ∀(x) ∃(y) c-inst(x, Hierarchy) /\ c-inst(y, Dimension) /\ member-of(Dhierarchy, y) /\ addHierarchy(x, y) → c-inst(x, Dhierarchy)
(b) ∀(x, y) ∃(z) c-inst(x, Hierarchy) /\ member-of(Hparent, x) /\ c-inst(y, Hparent) /\ c-inst(z, Dimension) /\ member-of(Dlevel, z) /\ addHierarchy(x, z) → c-inst(y, Dlevel)
(c) ∀(x, y) ∃(z) c-inst(x, Hierarchy) /\ member-of(Hchild, x) /\ c-inst(y, Hchild) /\ c-inst(z, Dimension) /\ member-of(Dlevel, z) /\ addHierarchy(x, z) → c-inst(y, Dlevel)
(d) ∀(x, a, b, c) c-inst(x, Hierarchy) /\ member-of(Hparent, x) /\ member-of(Hchild, x) /\ c-inst(a, Hparent) /\ c-inst(b, Hparent) /\ c-inst(c, Hchild) /\ isParent(a, c) /\ isParent(b, c) → equivalent(a, b)
(e) ∀(x, a, b, c) c-inst(x, Hierarchy) /\ member-of(Hparent, x) /\ member-of(Hchild, x) /\ c-inst(a, Hchild) /\ c-inst(b, Hchild) /\ c-inst(c, Hparent) /\ isParent(c, a) /\ isParent(c, b) → equivalent(a, b)
AC6 (Cube):
(a) ∀(x) ∃(z) c-inst(x, Cube) /\ member-of(Cfact, x) /\ addFact(z, x) → c-inst(z, Fact)
(b) ∀(x) ∃(z) c-inst(x, Cube) /\ member-of(Cdimension, x) /\ addDimension(z, x) → c-inst(z, Dimension)
(c) ∀(x, y) ∃(a, b) c-inst(x, Fact) /\ member-of(SubkeyName, x) /\ c-inst(y, SubkeyName) /\ c-inst(a, Dimension) /\ member-of(DPkey, a) /\ c-inst(b, DPkey) /\ ConnectDimensionToFact(a, x) → c-inst(b, SubkeyName)
The addition constraints for the Hierarchy construct are:
AC5a. A hierarchy can only be added to an existing dimension.
AC5b. A parent level in a hierarchy is a level in the dimension.
AC5c. A child level in a hierarchy is a level in the dimension.
AC5d. A level in a hierarchy has at most one parent level.
AC5e. A level in a hierarchy has at most one child level.
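As one concrete illustration of how such a constraint could be wired into a relational dictionary, the following PostgreSQL-flavoured sketch checks AC5d over the hypothetical dictionary tables introduced after Figure 3 (the triggers and stored procedures actually used by our tool, described in Section 4, may be organized differently):

-- Enforce AC5d: a child level may have at most one parent level in a dimension.
-- (This check is relaxed later when multiple hierarchies are supported.)
create or replace function check_single_parent() returns trigger as $$
begin
  if exists (select 1
             from   dw_hierarchy h
             where  h.dname   = new.dname
             and    h.hchild  = new.hchild
             and    h.hparent <> new.hparent) then
    raise exception 'AC5d violated: level % already has a parent in dimension %',
                    new.hchild, new.dname;
  end if;
  return new;
end;
$$ language plpgsql;

create trigger trg_add_hierarchy
  before insert on dw_hierarchy
  for each row execute function check_single_parent();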
Constraints AC5(d) and (e) enforce roll-up/drill-down along a single path in a dimension; this constraint is relaxed when we discuss support for multiple hierarchies. The addition constraints for the Cube construct are:
AC6a. To define a valid cube construct, there must be a corresponding fact construct in the schema.
AC6b. A cube dimension can be created if the dimensions connected to the fact exist in the schema.
AC6c. A dimension can be connected to a fact in a cube if the subkey of the concatenated primary key of a fact is a foreign key reference to the primary key of the dimension.
We use familiar relational model terminology in Constraint AC6(c) to simplify the discussion. To be more general, set and functional dependency notation could be used here.

As a concrete example to illustrate an addition constraint, consider a Time dimension. If we wish to create a child level called Month that rolls-up to Quarter, both AC5(b) and AC5(c) would have to be satisfied. We use AC5(c) to illustrate the components of the rule. Let z be the Dimension with name "Time" and set of levels, Dlevel, with values {"Second," "Minute," "Hour," "Day," "Week," "Month," "Quarter," "Year"}. In order to create a hierarchy x (a pair of levels) in z, where the child component of x is y, y must exist as a level in the schema. Since "Month" is an element of Dlevel, creating part of the hierarchy called x with child y ("Month") is valid for the Dimension "Time."

Valid deletions must satisfy the following conditions (DC1-DC4 in Table 2). DC1 allows a level in a dimension to be deleted if it is neither a parent nor a child level in a hierarchy. DC2 specifies that a dimension can be deleted from the schema if there are no hierarchies or levels of the dimension, and the dimension is not part of a cube. DC3 states that a fact can be deleted from the schema if there are no corresponding cubes based on that fact. DC4 regards deletion of a cube; a cube can be deleted if there are no fact and dimensions in the cube. Schema evolution operators build on these constraints to delete constructs that depend on other constructs. The deletion constraint for a Level requires that levels participating in hierarchies cannot be deleted; hierarchies must be removed first using the schema evolution operator DeleteHierarchyInDimension (described in Section 4.2). As an example, deleting a dimension from a schema would first require deleting the hierarchies, then the levels, followed by removing the dimension from any cubes in which it participates, then deleting the dimension by removing it from the schema.

The addition constraints for the extended hierarchy semantics are summarized in Table 3 (deletion constraints are in Table 4) and described below. We show the modified hierarchy constraints (AC5d′ and AC5e′) and AC7 to support multiple hierarchies. Constraints AC8-AC10 support non-covering hierarchies, non-onto hierarchies, and non-strict hierarchies. Previous authors have defined similar constraints in terms of cardinality (e.g., Tryfona et al. [43] and Malinowski and Zimányi [31]); we use predicates as a higher level of abstraction here to define constraints and schema evolution operators.
Table 2. Deletion Constraints (Core Features)

DC1 (Level):
(a) ∀(x, y) ∃(z) c-inst(x, Dimension) /\ member-of(DLevel, x) /\ c-inst(y, DLevel) /\ IsHierarchy(z, x) /\ member-of(ParentLevel, z) /\ ¬c-inst(y, ParentLevel) → delete(y, DLevel)
(b) ∀(x, y) ∃(z) c-inst(x, Dimension) /\ member-of(DLevel, x) /\ c-inst(y, DLevel) /\ IsHierarchy(z, x) /\ member-of(ChildLevel, z) /\ ¬c-inst(y, ChildLevel) → delete(y, DLevel)
DC2 (Dimension):
(a) ∀(x, y) c-inst(x, Dimension) /\ member-of(DHierarchy, x) /\ ¬c-inst(y, DHierarchy) → delete(x, Dimension)
(b) ∀(x, y) c-inst(x, Dimension) /\ member-of(DLevel, x) /\ ¬c-inst(y, DLevel) → delete(x, Dimension)
(c) ∀(x, y) c-inst(x, Dimension) /\ c-inst(y, Cube) /\ member-of(CDimension, y) /\ ¬c-inst(x, CDimension) → delete(x, Dimension)
DC3 (Fact): ∀(x, y) c-inst(x, Fact) /\ c-inst(y, Cube) /\ member-of(Cfact, y) /\ ¬c-inst(x, Cfact) → delete(x, Fact)
DC4 (Cube):
(a) ∀(x, y, z) c-inst(x, Cube) /\ member-of(Cfact, x) /\ c-inst(y, Cfact) /\ member-of(Cdimension, x) /\ c-inst(z, Cdimension) /\ ¬c-inst(y, Cfact) /\ ¬c-inst(z, Cdimension) → delete(x, Cube)
(b) ∀(x, z) c-inst(x, Cube) /\ member-of(Cfact, x) /\ member-of(Cdimension, x) /\ ¬c-inst(z, Cdimension) → delete(x, Cube)
In a traditional data warehouse model, the levels of a dimension share a direct relationship and the roll-up/drill-down is along a straight path. However, in real world situations, it is possible that a level has more than one child or parent level. Data models have been proposed to support multiple hierarchies in a dimension [2, 3, 14, 15, 16, 19, 23, 24, 27, 29, 30, 31, 35, 37, 42, 43, 44]. To represent a dimension with multiple hierarchies, the ULD construct Dimension is modified to include Dpath. A hierarchy path of a dimension, Dpath, represents all possible combinations of roll-up relationships from the bottom-most child level to the top-most parent level. The addition constraints are modified to support multiple hierarchies in a dimension by ensuring that a parent level in a hierarchy has at least one child level instead of exactly one (AC5(d′)), and a child level has at least one parent (AC5(e′)). We introduce addition and deletion constraints for the path construct (AC7 and DC5, respectively.) The hierarchies of a path in a dimension must be hierarchies of the dimension. A path of a dimension is modified when a hierarchy is added, deleted or modified. As an example, consider AC10(a) for adding a non-strict hierarchy. There must be a dimension hierarchy in the schema that the NShierarchy conforms to; the NShierarchy is created from an existing one-to-many hierarchy (parent, child) that becomes many-to-many. This is specified by the constraint and enforced by the schema evolution operator AddNonStrictHierarchy (described in Section 4.2). A non-covering hierarchy or ragged hierarchy occurs when an instance of a level rolls-up to a level that is not a direct parent of that level [26, 31, 36]. We model the
Table 3. Addition Constraints (Extended Hierarchy)

AC5(d′), AC5(e′) (Hierarchy, multiple hierarchies):
∀(x, a) ∃(b) c-inst(x, Hierarchy) /\ member-of(Hparent, x) /\ c-inst(a, Hparent) → c-inst(b, Hchild)
∀(x, a) ∃(b) c-inst(x, Hierarchy) /\ member-of(Hchild, x) /\ c-inst(a, Hchild) → c-inst(b, Hparent)
AC7 (Path):
∀(a, b, c, d) ∃(e) c-inst(a, Dimension) /\ member-of(Dpath, a) /\ c-inst(b, Dpath) /\ member-of(PathHierarchy, b) /\ c-inst(c, PathHierarchy) /\ member-of(Dhierarchy, a) /\ c-inst(e, Dhierarchy) → equivalent(c, e)
AC8 (Non-covering hierarchy):
∀(x, y) ∃(z) c-inst(x, NChierarchy) /\ c-inst(y, Dimension) /\ member-of(Dhierarchy, y) /\ c-inst(z, Dhierarchy) → conformance(x, z)
∀(x, y) ∃(z) c-inst(x, NChierarchy) /\ member-of(NCparent, x) /\ c-inst(y, path) /\ conformance(x, y) /\ hierarchy(z, y) /\ member-of(Hparent, z) /\ c-inst(x, Hparent) → non-coveringparent(y, x)
∀(x, y) ∃(z) c-inst(x, NChierarchy) /\ member-of(NCchild, x) /\ c-inst(y, path) /\ conformance(x, y) /\ hierarchy(z, y) /\ member-of(Hchild, z) /\ c-inst(x, Hchild) → non-coveringchild(y, x)
∀(a, b, c, x) non-coveringparent(b, a) /\ non-coveringchild(c, a) /\ c-inst(x, Hierarchy) /\ member-of(Hparent, x) /\ member-of(Hchild, x) /\ c-inst(b, Hparent) /\ c-inst(c, Hchild) → c-inst(a, NChierarchy)
AC9 (Non-onto hierarchy):
(a) ∀(x, y) ∃(z) c-inst(x, NOhierarchy) /\ c-inst(y, Dimension) /\ member-of(Dhierarchy, y) /\ c-inst(z, Dhierarchy) → conformance(x, z)
(b) ∀(x, y) ∃(z) c-inst(x, NOhierarchy) /\ member-of(NOparent, x) /\ c-inst(y, NOparent) /\ ConformanceHierarchy(z, x) /\ member-of(Hparent, z) /\ c-inst(y, Hparent) → non-onto parent(y, x)
(c) ∀(x, y) ∃(z) c-inst(x, NOhierarchy) /\ member-of(NOchild, x) /\ c-inst(y, NOchild) /\ ConformanceHierarchy(z, x) /\ member-of(Hchild, z) /\ c-inst(y, Hchild) → non-onto child(y, x)
AC10 (Non-strict hierarchy):
(a) ∀(x, y) ∃(z) c-inst(x, NShierarchy) /\ c-inst(y, Dimension) /\ member-of(Dhierarchy, y) /\ c-inst(z, Dhierarchy) → conformance(x, z)
(b) ∀(x, y) ∃(z) c-inst(x, NShierarchy) /\ member-of(NSparent, x) /\ c-inst(y, NSparent) /\ ConformanceHierarchy(z, x) /\ member-of(Hparent, z) /\ c-inst(y, Hparent) → non-strict parent(y, x)
(c) ∀(x, y) ∃(z) c-inst(x, NShierarchy) /\ member-of(NSchild, x) /\ c-inst(y, NSchild) /\ ConformanceHierarchy(z, x) /\ member-of(Hchild, z) /\ c-inst(y, Hchild) → non-strict child(y, x)
mapping of instances for the dimension levels in a non-covering hierarchy because we assume that the non-covering hierarchies between the levels are for a limited number of instances; an alternative approach of creating a new hierarchy between the levels is redundant and complicates the schema at the logical level. A hierarchy instance consists of an instance of a parent level, an instance of child level, and a conformance relationship to a hierarchy (AC8e). The conformance relationship maps the hierarchy
instance to a hierarchy construct. The NChierarchy construct represents a non-covering hierarchy between two levels along a path in the dimension. It consists of a parent level, a child level, and a conformance relation to a path. The path represents a set of hierarchies in a dimension. The parent and child level in the non-covering hierarchy have a conformance relationship to a dimension path such that the parent level is an ancestor of the child level and not a direct parent (AC8a).
A non-onto hierarchy occurs when an instance of a parent level does not have an instance of the child level to drill down to. Non-onto hierarchies are typically supported by allowing the child level in a hierarchy to have null values [31, 35, 36]. We allow a direct hierarchy to change to a non-onto hierarchy. As with a non-covering hierarchy, a non-onto hierarchy has a conformance relationship (AC9) and an instance mapping between parent and child (not shown here).
In a strict hierarchy, there is a one-to-many cardinality relationship between a parent and child level. A non-strict hierarchy occurs when there is a many-to-many relationship between the parent and child level in a hierarchy. Pedersen et al. survey data models for non-strict hierarchy support and conclude that none of the models proposed at that time supported non-strict hierarchies [37]. Some researchers mention the problems caused by many-to-many cardinality, such as aggregating the measure twice [2, 14, 40], but do not provide a solution to support it. Malinowski and Zimányi propose modeling non-strict hierarchies [31] in two different ways: (1) a bridge table [27], and (2) mapping non-strict hierarchies to strict hierarchies [37]. In our model we implement a bridge table to support aggregation over a non-strict hierarchy. A detailed model for maintaining correct aggregation during schema evolution has been developed but is beyond the scope of this paper.
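To illustrate the bridge-table approach [27] referred to here, the sketch below shows how a many-to-many Province–Cell relationship could be bridged and aggregated without double counting. This is only an illustration of the general technique under assumed names: the tables ProvinceCellBridge and CellSalesFact, their columns, and the Weight allocation scheme are not part of the paper's schema, whose detailed aggregation model is stated to be out of scope.

```sql
-- Illustrative bridge table for a many-to-many Province <-> Cell relationship.
CREATE TABLE ProvinceCellBridge (
    ProvinceKey INT          NOT NULL,  -- parent-level member
    CellKey     INT          NOT NULL,  -- child-level member
    Weight      DECIMAL(5,4) NOT NULL,  -- share of the cell attributed to the province
    PRIMARY KEY (ProvinceKey, CellKey)
)

-- Rolling a measure up from a (hypothetical) cell-grained fact table to Province:
-- each cell's contribution is split across its provinces according to Weight,
-- so the measure is not aggregated twice.
SELECT b.ProvinceKey, SUM(f.Revenue * b.Weight) AS Revenue
FROM CellSalesFact f
JOIN ProvinceCellBridge b ON b.CellKey = f.CellKey
GROUP BY b.ProvinceKey
```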
Table 4. Deletion Constraints (Extended Hierarchy)

ID | Construct | Constraint
DC5 | Path | ∀(a, b) c-inst(a, Dimension) /\ member-of(Dhierarchy, a) /\ member-of(Dpath, a) /\ ¬c-inst(b, Dhierarchy) → delete(b, Dpath)
DC6 | Non-covering hierarchy | ∀(a, b, c) ∃(x) c-inst(a, NChierarchy) /\ member-of(NCchild, a) /\ member-of(NCparent, a) /\ c-inst(b, NCparent) /\ c-inst(c, NCchild) /\ c-inst(x, Hierarchy) /\ member-of(Hparent, x) /\ member-of(Hchild, x) /\ c-inst(b, Hparent) /\ c-inst(c, Hchild) → delete(a, NChierarchy)
DC7 | Non-onto hierarchy | ∀(a, hi) c-inst(a, NOhierarchy) /\ conformance-of(hi, a) /\ ¬c-inst(hi, DHierarchy) → delete(a, NOhierarchy)
DC8 | Non-strict hierarchy | ∀(a, hi) c-inst(a, NShierarchy) /\ conformance-of(hi, a) /\ ¬c-inst(hi, DHierarchy) → delete(a, NShierarchy)
In the next section, a multilevel dictionary definition is given to illustrate the metaconstructs and constructs of our core model along with an example schema.
4 Multilevel Dictionary Implementation

Atzeni, Cappellari, and Bernstein [1] describe an implementation of a multilevel dictionary to manage schemas and describe models of interest. Their approach allows
model-independent schema translation because all models are described with metaconstructs in a supermodel that generalizes all other models. They produce an XML schema representation of an Entity-Relationship schema as an example of the kind of model-independent reporting supported by their approach. They contribute a visible description of models and specification of translations. We utilize the same approach here for representing the core features of a data warehouse, although we use the metaconstructs of ULD rather than the exact supermodel given by Atzeni et al. [1]. Note that both the ULD and the multilevel dictionary approaches model transformation rules outside of their structural formalisms; they use Datalog rules that are independent of the engine used to interpret them. This section details our structural description. 4.1 An Example Schema An example of a data warehouse schema is shown in Figure 2 using DFM notation [18] and our extensions (Figure 1). Our metamodel, corresponding to the ULD constructs in Section 2, is shown in Table 5. The model is represented in a tabular format similar to the multilevel dictionary method. As an example, the construct c4 has a setct construct type and represents the information that FactSet is a set-of Fact. A multilevel dictionary representation of Schema is given in Table 6. The representation includes some set-valued attributes to ease understanding of the definition in a top-down fashion; for implementation in a relational DBMS (Section 4.2), the tables are normalized. The schema has a unique name, SalesSchema. The FactSet includes the fact Sales and the DimensionSet is a set of all the dimensions in the schema. The records in the column FactSet and DimensionSet have a foreign key reference to the Fact and Dimension construct tables. The multilevel dictionary representations of Fact, names of primary key fields, and measures are listed in Tables 7, 8, and 9, respectively. For example, f1 is a fact with a unique name Sales, primary key fields Product#, Time#, and Location#, and measures m1, m2, and m3 that represent Qty Sold, Revenue, and No. Customer, respectively. Table 10 gives a multilevel dictionary representation of a Dimension construct. A Dimension is described by a unique name, a set of levels, a primary key, and sets of hierarchies and paths over the levels. For example, the Location dimension has the levels L19, L20, and L21 representing City, Province, and Country. The primary key for the dimension is Location#, and h22 and h23 are the hierarchies of the dimension. An empty record in the HierarchySet indicates the lack of a hierarchy among the levels of a dimension. Table 11 lists the multilevel dictionary representation for Level construct. A Hierarchy construct is defined by a parent level and a child level. The parent level drills-down to a child level. For example, levels L8 and L7 are parent and child level, respectively, and represent the information Year drills-down to Quarter. Table 12 lists the multi-level dictionary representation for the Hierarchy construct. Table 13 lists multiple paths that occur in dimensions. For example, p4 is the path Item, Category, Corporation in the dimension Product. Non-covering hierarchies, such as that between District and Province in the Location dimension (nc3) are given in Table 14, along with the path the hierarchy conforms to (p9). 
Table 15 records the non-onto hierarchy between City and IP Add, while Table 16 records the non-strict relationships between levels, such as Province and Cell.
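Section 4.2 explains that these dictionary entries are stored in normalized relational tables. The sketch below shows what two of those tables and a few of the rows discussed above might look like; the paper gives the table contents (Tables 10–12) but not the DDL, so every column name here is an assumption.

```sql
-- Hypothetical normalized tables behind the multilevel dictionary entries.
CREATE TABLE DLevelSet (
    LevelOID      VARCHAR(10) PRIMARY KEY,  -- e.g. 'L19'
    LevelName     VARCHAR(50) NOT NULL,     -- e.g. 'City'
    ConstructName VARCHAR(10) NOT NULL,     -- e.g. 'c9'
    DimensionOID  VARCHAR(10) NOT NULL      -- e.g. 'd3' (Location)
)

CREATE TABLE DHierarchySet (
    HierarchyOID VARCHAR(10) PRIMARY KEY,   -- e.g. 'h22'
    DimensionOID VARCHAR(10) NOT NULL,
    ParentLevel  VARCHAR(10) NOT NULL REFERENCES DLevelSet(LevelOID),
    ChildLevel   VARCHAR(10) NOT NULL REFERENCES DLevelSet(LevelOID)
)

-- Example rows corresponding to Tables 10-12: Province (L20) drills down to City (L19).
INSERT INTO DLevelSet VALUES ('L19', 'City', 'c9', 'd3')
INSERT INTO DLevelSet VALUES ('L20', 'Province', 'c9', 'd3')
INSERT INTO DHierarchySet VALUES ('h22', 'd3', 'L20', 'L19')
```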
Table 5. Metamodel: Constructs and Construct Types

Schema:
OID | Name
S1 | DW Schema

Construct Types:
OID | Name
ct1 | struct-ct
ct2 | set-ct
ct3 | union-ct
ct4 | atomic-ct
ct5 | predefined

Constructs (all belonging to schema S1):
OID | Name | Construct Type | Value
c1 | Schema | ct1 | c2, c6
c2 | FactSet | ct2 | c3
c3 | Fact | ct1 | c4, c10, c17
c4 | FactPKeySet | ct2 | c5
c5 | FactPkey | ct1 | c18
c6 | DimensionSet | ct2 | c7
c7 | Dimension | ct1 | c8, c12, c15
c8 | LevelSet | ct2 | c9
c9 | Level | ct1 | c18
c10 | MeasureSet | ct2 | c11
c11 | Measure | ct1 | c18
c12 | HierarchySet | ct2 | c13
c13 | Hierarchy | ct1 | c9, c9
c15 | DimensionPkey | ct1 | c18
c16 | Cube | ct1 | c3, c6
c17 | AttributeSet | ct2 | c18
c18 | Attribute | ct5 |
c19 | PathSet | ct2 | c20
c20 | Path | ct2 | c21
c21 | PathHierarchy | ct1 | c13
c22 | NChierarchy | ct1 | c9, c9, c21
c23 | NOhierarcy | ct1 | c9, c9, c13
c24 | NShierarcy | ct1 | c9, c9, c13

Table 6. Model: Schema Definition
OID | Name | Construct Name | FactSet | DimensionSet
s1 | SalesSchema | c1 | f1 | d1, d2, d3

Table 7. Fact Construct
OID | FName | Construct Name | Pkey | MeasureSet
f1 | Sale | c3 | pk1, pk2, pk3 | m1, m2, m3

Table 8. Primary Keys of the Fact
OID | Name | Construct Name
pk1 | Time# | c5
pk2 | Product# | c5
pk3 | Location# | c5

Table 9. Measure Construct
OID | Name | Construct Name
m1 | Qty Sold | c11
m2 | Revenue | c11
m3 | No. Customer | c11
Table 10. Dimension Construct
OID | DName | Construct Name | DLevel | DPkey | DHierarchy | Dpath
d1 | Time | c7 | L1, L2, L3, L4, L5, L6, L7, L8 | pk1 | h1, h2, h3, h4, h5, h6, h7 | p1, p2
d2 | Product | c7 | L9, L10, L11, L12, L13 | pk2 | h8, h9, h10, h11, h12 | p3, p4
d3 | Location | c7 | L14, L15, L16, L17, L18, L19, L20, L21 | pk3 | h13, h14, h15, h16, h17, h18, h19, h20, h21, h22, h23 | p5, p6, p7, p8, p9

Table 11. Level Construct (Construct Name c9 for every level)
L1 Second, L2 Minute, L3 Hour, L4 Day, L5 Week, L6 Month, L7 Quarter, L8 Year, L9 Item, L10 Brand, L11 Company, L12 Category, L13 Corporation, L14 Co-ordinate, L15 Street, L16 District, L17 IP add, L18 Cell, L19 City, L20 Province, L21 Country

Table 12. Hierarchy Construct (each entry lists the parent level and the child level it drills down to)
h1: parent L2, child L1; h2: parent L3, child L2; h3: parent L4, child L3; h4: parent L5, child L4;
h5: parent L6, child L4; h6: parent L7, child L6; h7: parent L8, child L7; h8: parent L10, child L9;
h9: parent L11, child L10; h10: parent L12, child L9; h11: parent L13, child L11; h12: parent L13, child L12;
h13: parent L17, child L14; h14: parent L18, child L14; h15: parent L15, child L14; h16: parent L16, child L15;
h17: parent L19, child L17; h18: parent L19, child L18; h19: parent L19, child L16; h20: parent L20, child L17;
h21: parent L20, child L18; h22: parent L20, child L19; h23: parent L21, child L20
4.2 Schema Evolution Implementation

In this section, an overview of the MDD implementation of the formal model is given along with the algorithm for a schema evolution operator. An example of schema evolution is given that uses the schema in Figure 2. All user interaction is conducted within the Microsoft SQL Server 2000 environment. A user may either use a simple GUI to execute a stored procedure where he or she will be prompted to supply the
arguments or may use the SQL command line. The procedure then updates records in the tables (such as those presented in Section 4.1) that may in turn cause triggers to execute on the tables. Example screenshots of a trigger and a stored procedure are given in Figures 5 and 6, respectively.

Table 13. Path Construct
OID | Phierarchy | Construct Name
p1 | h1, h2, h3, h4 | c20
p2 | h1, h2, h3, h5, h6, h7 | c20
p3 | h8, h9, h11 | c20
p4 | h10, h12 | c20
p5 | h13, h17, h22, h23 | c20
p6 | h13, h20, h23 | c20
p7 | h14, h18, h22, h23 | c20
p8 | h14, h21, h23 | c20
p9 | h15, h16, h19, h22, h23 | c20

Table 14. Non-Covering Construct
OID | NCparent | NCchild | Conformance
Nc1 | L16 | L14 | p9
Nc2 | L19 | L14 | p9
Nc3 | L20 | L16 | p9

Table 15. Non-Onto Construct
OID | NOparent | Conformance
No1 | L19 | h17

Table 16. Non-Strict Construct
OID | NSparent | NSchild | Conformance
ns1 | L19 | L18 | h18
ns2 | L20 | L18 | h21
The MDD definitions for the ULD constructs such as facts, dimensions, levels, and hierarchies are implemented using tables in Microsoft SQL Server 2000 as shown in Figure 4. Enforcement of the semantics is accomplished by triggers over the tables that implement addition and deletion constraints over the features. A database trigger is procedural code that is automatically executed in response to certain events on a particular table in a database. The events that cause a trigger to execute are insert, update, and delete. For example, when a hierarchy is added to the DHierarchySet table, the child level of the hierarchy has to be an existing level in the dimension. The
trigger ensures that when a hierarchy is added to DHierarchySet the child level in the hierarchy is a level in DLevelSet. A trigger is created over the DHierarchySet and is executed when a record is added or updated in the table. If a new record is inserted in this table and if the child level is not present in the table DLevelSet, the transaction is rolled back. We implement 23 addition constraints (Tables 1 and 3) and 12 deletion constraints (Tables 2 and 4) as triggers for enforcing correct semantics for schema evolution. A sample trigger is shown in the screenshot in Figure 5 for deletion of a non-onto hierarchy.
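The paper shows its trigger only as a screenshot (Figure 5). The fragment below is therefore a minimal T-SQL sketch of a trigger of this kind, reusing the hypothetical column names from the sketch in Section 4.1 above; it is not the paper's actual code.

```sql
-- Sketch: reject an inserted/updated hierarchy whose child level is not an
-- existing level of the same dimension (rolls back the transaction otherwise).
CREATE TRIGGER trg_DHierarchySet_ChildLevel
ON DHierarchySet
FOR INSERT, UPDATE
AS
BEGIN
    IF EXISTS (SELECT 1
               FROM inserted i
               WHERE NOT EXISTS (SELECT 1
                                 FROM DLevelSet l
                                 WHERE l.LevelOID = i.ChildLevel
                                   AND l.DimensionOID = i.DimensionOID))
    BEGIN
        RAISERROR ('Child level of the hierarchy must be an existing level of the dimension.', 16, 1)
        ROLLBACK TRANSACTION
    END
END
```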
Fig. 4. MDD Implementation
Fig. 5. Screenshot of a Trigger
The records in the tables represent a schema, and the addition, deletion, and update of records represent changes to a schema (i.e., schema evolution). In our tool, the evolution operators have been implemented as stored procedures that accept arguments from the user and accordingly make changes to a schema. The changes rely on triggers to enforce semantics. A stored procedure is a subroutine written in T-SQL code to access the SQL Server relational database system. The tool supports typical evolution operators over the core features of a data warehouse [8, 9, 10, 12, 13, 17, 25, 28, 39], such as adding a new level to a dimension, adding a hierarchy between two levels of a dimension, deleting a level from a dimension, and adding a dimension to a fact, as well as adding/deleting multiple, non-strict, non-onto, and non-covering hierarchies [22, 31, 35, 36, 44]. Our operators subsume those proposed in the literature and add capabilities for the addition and deletion of extended hierarchies. We implement these operators as 23 stored procedures, described in Table 17 along with their arguments (i.e., tables). An example of a stored procedure is shown in the screenshot given in Figure 6 for adding a non-covering hierarchy.
Fig. 6. Screenshot of a Stored Procedure Excerpt
An example of an evolution operator is AddNonCoveringHierarchy, which creates a non-covering hierarchy relationship between two levels of a dimension. The arguments passed are schema name, dimension, child level, and parent level. The arguments passed by the evolution operator are added to the tables only if they are consistent with the addition constraints of the NChierarchy construct. This ensures the correctness of schema evolution. The stored procedure that implements the AddNonCoveringHierarchy operator is shown in Figure 6, and an algorithm for it is given in Figure 7. The evolution operator to delete a non-covering hierarchy checks that the participating levels conform to some path that remains valid after the non-covering, ancestor-child relationship is removed.
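Since only a screenshot of the stored procedure is shown (Figure 6), the following is a hedged T-SQL sketch of what such a procedure might look like. It follows the argument list quoted above (schema name, dimension, child level, parent level); for readability it keys everything by name rather than by the dictionary OIDs, all table and column names are assumptions, and the ancestor test of Step 4 is reduced to a direct-parent check, with the full path-conformance constraint (AC8) left to the trigger on the NCHierarchy table.

```sql
CREATE PROCEDURE AddNonCoveringHierarchy
    @SchemaName  VARCHAR(50),
    @Dimension   VARCHAR(50),
    @ChildLevel  VARCHAR(50),
    @ParentLevel VARCHAR(50)
AS
BEGIN
    -- Steps 1-3: the schema, the dimension and both levels must already exist.
    IF NOT EXISTS (SELECT 1 FROM SchemaSet WHERE SchemaName = @SchemaName)
    BEGIN RAISERROR ('Unknown schema.', 16, 1) RETURN END

    IF NOT EXISTS (SELECT 1 FROM DimensionSet
                   WHERE SchemaName = @SchemaName AND DName = @Dimension)
    BEGIN RAISERROR ('Unknown dimension.', 16, 1) RETURN END

    IF NOT EXISTS (SELECT 1 FROM DLevelSet WHERE DName = @Dimension AND LevelName = @ParentLevel)
       OR NOT EXISTS (SELECT 1 FROM DLevelSet WHERE DName = @Dimension AND LevelName = @ChildLevel)
    BEGIN RAISERROR ('Both levels must exist in the dimension.', 16, 1) RETURN END

    -- Step 4 (simplified): a direct parent-child pair cannot form a non-covering hierarchy.
    IF EXISTS (SELECT 1 FROM DHierarchySet
               WHERE DName = @Dimension
                 AND ParentLevel = @ParentLevel AND ChildLevel = @ChildLevel)
    BEGIN RAISERROR ('Levels form a direct hierarchy; non-covering hierarchy rejected.', 16, 1) RETURN END

    -- Step 5: record the non-covering hierarchy; the addition-constraint trigger
    -- on NCHierarchy performs the remaining conformance checks.
    INSERT INTO NCHierarchy (DName, NCparent, NCchild)
    VALUES (@Dimension, @ParentLevel, @ChildLevel)
END
```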
Table 17. Evolution Operators as Stored Procedures

Evolution Operator | Tables Affected | Description
AddSchema | schema | Creates a schema to which facts and dimensions can be added
AddFactToSchema | schema, fact, measures, primary key | Adds a fact to an existing schema
AddDimensionToSchema | schema, dimension, primary key | Adds a dimension to an existing schema
AddLevelToDimension | schema, dimension, level | Adds a level in an existing dimension
AddHierarchyToDimension | schema, dimension, parent level, child level | Adds a hierarchy between two existing levels of a dimension
AddCubeToSchema | schema, cube, fact, set of dimensions | Creates a cube by joining a set of existing dimensions to a fact
AddDimensionToFactInCube | schema, cube, fact, dimension | Connects a dimension to a fact in an existing cube
DeleteDimensionFactInCube | schema, cube, fact, dimension | Deletes a dimension from a fact in an existing cube
DeleteCubeFromSchema | schema, cube | Deletes a cube in an existing schema
DeleteFactFromSchema | schema, fact | Deletes a fact from an existing schema
DeleteHierarchyInDimension | schema, dimension, parent level, child level | Deletes a hierarchy in a dimension
DeleteLevelInDimension | schema, dimension, level | Deletes a level in an existing dimension
DeleteDimensionInSchema | schema, dimension | Deletes a dimension from an existing schema
DeleteSchema | schema | Deletes a schema
RenameFactInSchema | schema, fact | Renames a fact in an existing schema
RenameDimensionInSchema | schema, dimension | Renames a dimension in an existing schema
RenameLevelInDimension | schema, dimension, level | Renames a level in a dimension
AddNonStrictHierarchy | schema, dimension, parent level, child level | Adds a non-strict hierarchy in a dimension
DeleteNonStrictHierarchy | schema, dimension, parent level, child level | Deletes a non-strict hierarchy in a dimension
AddNonOntoHierarchy | schema, dimension, parent level, child level | Adds a non-onto hierarchy in a dimension
DeleteNonOntoHierarchy | schema, dimension, parent level, child level | Deletes a non-onto hierarchy in a dimension
AddNonCoveringHierarchy | schema, dimension, parent level, child level | Adds a non-covering hierarchy in a dimension
DeleteNonCoveringHierarchy | schema, dimension, parent level, child level | Deletes a non-covering hierarchy in a dimension
AddMultipleHierarchyToDimension | schema, dimension, parent level, child level | Adds a multiple hierarchy in a dimension
ReExaminePath | schema, dimension | Modifies the path as the hierarchy changes
Consider a scenario where there is an existing schema with a dimension Location that has the levels Country, Province, and City, and hierarchies such that Country drills down to Province and Province drills down to City. The AddNonCoveringHierarchy operator is used to create a non-covering hierarchy between the levels Country and City. The hierarchy is successfully created since it satisfies Steps 1 to 5 of the algorithm shown in Figure 7. Consider a second scenario in which the operation is to add a non-covering hierarchy between Province and City. In this case, the schema evolution operation fails to create the hierarchy because Province is a direct parent of City, violating constraint AC8d, which requires that the parent and child level of a non-covering hierarchy do not form a direct hierarchy. In other words, the child must roll up to an ancestor, not a direct parent.

Algorithm AddNonCoveringHierarchy
Input: Schema Name, Dimension Name, Child Level, Parent Level
Output: A non-covering hierarchy is created between the parent and child levels of a dimension
Step 1: Check if Schema Name is valid.
Step 2: Check if Dimension Name is a dimension in the schema.
Step 3: Check if Parent and Child Level are existing levels of the dimension.
Step 4: Check if Parent Level is an ancestor (not a direct parent) of Child Level along a path in the dimension.
Step 5: If Steps 1 to 4 are satisfied, add Parent and Child Levels in the NCHierarchy table.

Fig. 7. Evolution Operator Example

Table 18. Schema Evolution on Location Dimension [26]
Operator | Arguments | Constraints
AddMultipleHierarchy | SalesSchema, Location, Province, IP add | AC5, AC7
AddMultipleHierarchy | SalesSchema, Location, City, IP add | AC5, AC7
AddMultipleHierarchy | SalesSchema, Location, IP add, Co-ordinate | AC5, AC7
AddMultipleHierarchy | SalesSchema, Location, Province, Cell | AC5, AC7
AddMultipleHierarchy | SalesSchema, Location, City, Cell | AC5, AC7
AddMultipleHierarchy | SalesSchema, Location, Cell, Co-ordinate | AC5, AC7
AddNonCoveringHierarchy | SalesSchema, Location, Province, District | AC8
AddNonCoveringHierarchy | SalesSchema, Location, District, Co-ordinate | AC8
AddNonCoveringHierarchy | SalesSchema, Location, City, Co-ordinate | AC8
AddNonOntoHierarchy | SalesSchema, Location, City, IP add | AC9
AddNonStrictHierarchy | SalesSchema, Location, Province, Cell | AC10
AddNonStrictHierarchy | SalesSchema, Location, City, Cell | AC10
AddNonStrictHierarchy | SalesSchema, Location, District, Street | AC10
PathFinder | SalesSchema, Location | AC7
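For illustration, such operations would be issued from the SQL command line as stored procedure calls roughly as follows. This is a hypothetical sketch: the argument values simply mirror two rows of Table 18, and the exact parameter order of each procedure should be checked against Table 17 and Section 4.2.

```sql
-- Hypothetical invocations of two operations from Table 18.
EXEC AddNonCoveringHierarchy 'SalesSchema', 'Location', 'Province', 'District'
EXEC AddNonStrictHierarchy   'SalesSchema', 'Location', 'Province', 'Cell'
```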
In order to show schema evolution via adding constructs to a schema, we assume that a schema called SalesSchema with a fact called Sale and measures shown in Figure 2 has been created. We assume that the direct hierarchies in the Location dimension have already been created. The operations needed to construct the extended
hierarchies of the Location dimension (as in the final result shown in Figure 2) are given in Table 18. In this example of a complex evolution task, at least one example of every kind of extended hierarchy semantics is illustrated. Our tool provides a methodology to define a data warehouse schema and explore evolution over core and additional features. Users can use the tool to create semantically correct data warehouse schemas and populate them. The evolution operators can be used to make changes to a schema by modifying the tables representing the metadata about the schema. The triggers ensure the correctness and consistency of a schema and help a user to visualize the process of schema design and evolution. Most importantly, the semantics of advanced features for design and evolution of schemas can be explored experientially and there is no other tool that supports this kind of activity.
5 Conclusions and Future Work

We introduce a formal model of core features and specify semantics of schema evolution operators for a data warehouse. We contribute a tool that implements schema evolution operators as stored procedures that invoke triggers to enforce schema correctness. Our tool allows a user to create, populate, and query over a basic data warehouse schema without requiring any knowledge more specialized than the ability to use a relational database. In addition, no other system supports investigating the impact of schema evolution over both core and advanced data warehouse schema modeling constructs.
Our research provides a basis for further investigation of generic tools for schema and model management such as browsing [5, 6, 7] and reporting and visualization [1]. Because MDD definitions exist for ER and OODB models and ULD definitions exist for XML, RDF, and relational models, information interchange can be achieved by defining mappings between these models and ours. Automating the generation of triggers and stored procedures from rules specified in Datalog or first-order logic also remains a topic for future investigation.
In our work in progress, we have defined multidimensional instance lattices in order to enforce correct semantics of schema evolution over data instances. We have experimented with our tool using schema evolution modeled by others [17] as well as a variety of test cases, since there are no schema evolution examples in the literature using extended hierarchy semantics. A complete implementation for data migration between evolved schemas remains future work.
We are currently implementing model management capabilities with our tool. We have added a front-end tool that utilizes StarER [43] as a data warehouse conceptual model, represents it in our MDD logical model, and produces a Microsoft SQL Server Analysis Services implementation. We are currently incorporating other conceptual data warehouse models (ME/R [42] and DFM [18]) at the front-end, along with algorithms to support schema merging over the metamodel defined here. Tools for translating conceptual schemas have been proposed [20, 21, 29, 38], but none use a multilevel data dictionary (model gen) approach as we do here; our approach is intended to allow interoperability among heterogeneous models rather than to support a specific data model. We focus on the relational model as our implementation model
vehicle here since our structures and rules translate naturally to triggers, and our operators translate naturally to stored procedures. The target model (a ROLAP implementation) could be replaced with a MOLAP implementation in the future. Alternative platform-independent approaches to data warehouse schema specification could be explored as the basis for schema evolution; Mazón et al. [32, 34] propose a model-driven architecture approach to specify conceptual schemas and cube metadata for supporting OLAP applications. Query-view-transformation (QVT) mappings could potentially be used for specifying evolution operators. Future work also includes investigating how the model can be integrated with other related research such as versioning and cross-version querying [17, 41, 47] and ETL process modeling [46]. Using ETL conceptual modeling to specify data migration paths between versions to facilitate on-demand query processing for other versions appears to be a promising research direction. Another interesting approach proposed for maintaining an information system containing a relational database considers the correctness of queries when database evolution occurs [33]; investigating whether or how to extend the paradigm to accommodate multidimensional modeling is an open topic that could leverage the foundation we provide here. Similarly, Kaas et al. [28] investigate the impact of schema evolution on star and snowflake schemas for browse and aggregation queries. It would be interesting to explore the impact of additional evolution operators over extended semantics and create a framework for rewriting queries based on the operators.
References [1] Atzeni, P., Cappellari, P., Bernstein, P.: A multilevel dictionary for model management. In: Delcambre, L.M.L., Kop, C., Mayr, H.C., Mylopoulos, J., Pastor, Ó. (eds.) ER 2005. LNCS, vol. 3716, pp. 160–175. Springer, Heidelberg (2005) [2] Agrawal, R., Gupta, A., Sarawagi, S.: Modeling Multidimensional Databases. In: Proceedings of the 13th International Conference on Data Engineering (ICDE), Birmingham, U.K, April 7-11, 1997, pp. 232–243 (1997) [3] Abelló, A., Samos, J., Saltor, F.: YAM2 (Yet Another Multidimensional Model). In: Proceedings of the International Database Engineering & Applications Symposium (IDEAS), Edmonton, Canada, July 17-19, 2002, pp. 172–181 (2002) [4] Bækgaard, L.: Event-Entity-Relationship Modeling in Data Warehouse Environments. In: Proceedings of 2nd ACM Second International Workshop on Data Warehousing and OLAP (DOLAP), Kansas City, Missouri, USA, November 6, 1999, pp. 9–14 (1999) [5] Bowers, S., Delcambre, L.: On Modeling Conformance for Flexible Transformation over Data Models. Knowledge Transformation for the Semantic Web 95, 34–48 (2003) [6] Bowers, S., Delcambre, L.: The uni-level description: A uniform framework for representing information in multiple data models. In: Song, I.-Y., Liddle, S.W., Ling, T.-W., Scheuermann, P. (eds.) ER 2003. LNCS, vol. 2813, pp. 45–58. Springer, Heidelberg (2003) [7] Bowers, S., Delcambre, L.: Using the Uni-Level Description (ULD) to Support DataModel Interoperability. Data and Knowledge Engineering 59(3), 511–533 (2006) [8] Bouzeghoub, M., Kedad, Z.: A logical model for data warehouse design and evolution. In: Kambayashi, Y., Mohania, M., Tjoa, A.M. (eds.) DaWaK 2000. LNCS, vol. 1874, pp. 178–188. Springer, Heidelberg (2000)
[9] Blaschka, M., Sapia, C., Höfling, G.: On schema evolution in multidimensional databases. In: Mohania, M., Tjoa, A.M. (eds.) DaWaK 1999. LNCS, vol. 1676, pp. 153–164. Springer, Heidelberg (1999) [10] Chen, J., Chen, S., Rundensteiner, E.: A transactional model for data warehouse maintenance. In: Spaccapietra, S., March, S.T., Kambayashi, Y. (eds.) ER 2002. LNCS, vol. 2503, pp. 247–262. Springer, Heidelberg (2002) [11] Chaudhuri, S., Dayal, U.: An Overview of Data Warehousing and OLAP Technology. SIGMOD Record 26(1), 65–74 (1997) [12] Claypool, K., Natarajan, C., Rundensteiner, E., Rundensteiner, E.: Optimizing Performance of Schema Evolution Sequences. In: Dittrich, K.R., Guerrini, G., Merlo, I., Oliva, M., Rodriguez, M.E. (eds.) ECOOP-WS 2000. LNCS, vol. 1944, pp. 114–127. Springer, Heidelberg (2001) [13] Claypool, K., Rundensteiner, E., Heineman, G.: Evolving the Software of a Schema Evolution System. In: Balsters, H., De Brock, B., Conrad, S. (eds.) FoMLaDO 2000 and DEMM 2000. LNCS, vol. 2065, pp. 68–84. Springer, Heidelberg (2001) [14] Datta, A., Thomas, H.: A Conceptual Model and Algebra for On-line Analytical Processing in Data Warehouses. In: Proceedings of the 7th Workshop for Information Technology and Systems (WITS), Atlanta, Georgia, USA, December 13-14, 1997, pp. 91–100 (1997) [15] Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., Pirahesh, H.: Datacube: A Relational Aggregation Operator Generalizing Group-by, Cross-tab, and Sub-totals. Journal of Data Mining and Knowledge Discovery, ch. 1, 29– 53 (1997) [16] Gyssens, M., Lakshmanan, L.: A Foundation for Multi-Dimensional Databases. In: Proceedings of the 23rd International Conference on Very Large Databases (VLDB), Athens, Greece, August 25-29, 1997, pp. 106–115 (1997) [17] Golfarelli, M., Lechtenbörger, J., Rizzi, S., Vossen, G.: Schema Versioning in Data Warehouses. In: Wang, S., Tanaka, K., Zhou, S., Ling, T.-W., Guan, J., Yang, D.-q., Grandi, F., Mangina, E.E., Song, I.-Y., Mayr, H.C. (eds.) ER Workshops 2004. LNCS, vol. 3289, pp. 415–428. Springer, Heidelberg (2004) [18] Golfarelli, M., Maio, D., Rizzi, S.: The Dimensional Fact Model: A Conceptual Model for Data Warehouses. International Journal of Cooperative Information Systems (IJCIS) 7(2-3), 215–247 (1998) [19] Golfarelli, M., Rizzi, S.: A Methodological Framework for Data Warehousing Design. In: Proceedings of the 1st International Workshop on Data Warehousing and OLAP (DOLAP), Washington, DC, USA, November 2-7, pp. 3–9 (1998) [20] Golfarelli, M., Rizzi, S.: WAND: A CASE Tool for Data Warehouse Design. In: Proceedings of the 17th International Conference on Data Engineering (ICDE), Heidelberg, Germany, April 2-6, pp. 7–9 (2001) [21] Hahn, K., Sapia, C., Blaschka, M.: Automatically Generating OLAP Schemata from Conceptual Graphical Models. In: Proceedings of the 3rd ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 9–16 (2000) [22] Hümmer, W., Lehner, W., Bauer, A., Schlesinger, L.: A decathlon in multidimensional modeling: Open issues and some solutions. In: Kambayashi, Y., Winiwarter, W., Arikawa, M. (eds.) DaWaK 2002. LNCS, vol. 2454, pp. 275–285. Springer, Heidelberg (2002) [23] Hüsemann, B., Lechtenbörger, J., Vossen, G.: Conceptual Data Warehouse Design. In: Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW), Stockholm, Sweden, June 5-6, pp. 6:1–6:11(2000)
[24] Hurtado, C., Mendelzon, A.: Reasoning about summarizability in heterogeneous multidimensional schemas. In: Van den Bussche, J., Vianu, V. (eds.) ICDT 2001. LNCS, vol. 1973, pp. 375–389. Springer, Heidelberg (2000) [25] Hurtado, C., Mendelzon, A., Vaisman, A.: Maintaining Data Cubes under Dimension Updates. In: Proceedings of 15th International Conference of Data Engineering (ICDE), Sydney, Australia, March 23-26, pp. 346–355 (1999) [26] Jensen, C., Kligys, A., Pedersen, T., Timko, I.: Multidimensional Data Modeling for Location Based Services. The VLDB Journal 13(1), 1–21 (2004) [27] Kimball, R.: The Data Warehouse Toolkit. John Wiley & Sons, Inc., New York (1996) [28] Kaas, C., Pedersen, T.B., Rasmussen, B.: Schema Evolution for Stars and Snowflakes. In: Proceedings of the 6th International Conference on Enterprise Information Systems, Porto, Portugal, April 14-17, pp. 425–433 (2004) [29] Luján-Mora, S., Trujillo, J., Song, I.: A UML Profile for Multidimensional Modeling in Data Warehouses. Data and Knowledge Engineering 59(3), 725–769 (2006) [30] Li, C., Wang, X.: A Data Model for Supporting On-Line Analytical Processing. In: Proceedings of the 5th International Conference on Information and Knowledge Management (CIKM), Rockville, Maryland, November 12-16, pp. 81–88 (1996) [31] Malinowski, E., Zimányi, E.: OLAP hierarchies: A conceptual perspective. In: Persson, A., Stirna, J. (eds.) CAiSE 2004. LNCS, vol. 3084, pp. 477–491. Springer, Heidelberg (2004) [32] Mazón, J.-N., Trujillo, J.: An MDA Approach for the Development of Data Warehouses. Decision Support Systems 45(1), 41–58 (2008) [33] Papastefanatos, G., Vassiliadis, P., Vassiliou, Y.: Adaptive Query Formulation to Handle Database Evolution. In: Proceedings of the Conference on Advanced Information Systems Engineering: CAiSE Forum, Luxembourg (2006) [34] Pardillo, J., Mazón, J.-N., Trujillo, J.: Model-driven Metadata for OLAP Cubes from the Conceptual Modeling of Data Warehouses. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 13–22. Springer, Heidelberg (2008) [35] Pedersen, T., Jensen, C.: Research Issues in Clinical Data Warehousing. In: Proceedings of the 10th International Conference on Scientific and Statistical Database Management, Capri, Italy, July 1-3, pp. 43–52 (1998) [36] Pedersen, T., Jensen, C.: Multidimensional Data Modeling for Complex Data. In: Proceedings of 15th International Conference on Data Engineering (ICDE), Sydney, Australia, March 23-26, pp. 336–345 (1999) [37] Pedersen, T., Jensen, C., Dyreson, C.: A Foundation for Capturing and Querying Complex Multidimensional Data. Information Systems 26(5), 383–423 (2001) [38] Prat, N., Akoka, J., Comyn-Wattiau, I.: A UML-based Data Warehouse Design Method. Decision Support Systems 42, 1449–1473 (2006) [39] Quix, C.: Repository Support for Data Warehouse Evolution. In: Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW 1999), Heidelberg, Germany, June 14-15, p. 4 (1999) [40] Rafanelli, M., Shoshani, A.: STORM: A Statistical Object Representation Model. In: Michalewicz, Z. (ed.) SSDBM 1990. LNCS, vol. 420, pp. 14–29. Springer, Heidelberg (1990) [41] Rizzi, S., Golfarelli, M.: X-Time: Schema Versioning and Cross-Version Querying in Data Warehouses. In: Proceedings of the 23rd International Conference on Data Engineering (ICDE), Istanbul, Turkey, April 15-20, pp. 1471–1472 (2007) [42] Sapia, C., Blaschka, M., Höfling, G., Dinter, B.: Extending the E/R Model for the Multidimensional Paradigm. 
In: Kambayashi, Y., Lee, D.-L., Lim, E.-p., Mohania, M., Masunaga, Y. (eds.) ER Workshops 1998. LNCS, vol. 1552, pp. 105–116. Springer, Heidelberg (1999)
[43] Tryfona, N., Busborg, F., Christiansen, J.: StarER: A Conceptual Model for Data Warehouse Design. In: Proceedings of the 2nd ACM International Workshop on Data Warehousing and OLAP, Kansas City, Missouri, USA, November 6, pp. 3–8 (1999) [44] Tsois, A., Karayannidis, N., Sellis, T.: MAC: Conceptual data modeling for OLAP. In: Proceedings of the 3rd International Workshop on Design and Management of Data Warehouses (DMDW), Interlaken, Switzerland, June 4, p. 5 (2001) [45] Vassiliadis, P.: Modeling Multidimensional Databases, Cubes and Cube Operations. In: Proceedings of 10th International Conference on Scientific and Statistical Database Management (SSDBM), Capri, Italy, July 1-3, pp. 53–62 (1998) [46] Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M., Skiadopoulos, S.: A Generic and Customizable Framework for the Design of ETL Scenarios. Information Systems 30(7), 492–525 (2005) [47] Wrembel, R., Morzy, T.: Managing and querying versions of multiversion data warehouse. In: Ioannidis, Y., Scholl, M.H., Schmidt, J.W., Matthes, F., Hatzopoulos, M., Böhm, K., Kemper, A., Grust, T., Böhm, C. (eds.) EDBT 2006. LNCS, vol. 3896, pp. 1121–1124. Springer, Heidelberg (2006)
An ETL Process for OLAP Using RDF/OWL Ontologies

Marko Niinimäki and Tapio Niemi

Helsinki Institute of Physics, Technology Programme, CERN, CH-1211 Geneva 23
Tel.: +41 22 7676179; Fax: +41 22 7673600
[email protected],
[email protected]
Abstract. In this paper, we present an advanced method for on-demand construction of OLAP cubes for ROLAP systems. The method contains the steps from cube design to ETL but focuses on ETL. Actual data analysis can then be done using the tools and methods of the OLAP software at hand. The method is based on RDF/OWL ontologies and design tools. The ontology serves as a basis for designing and creating the OLAP schema, its corresponding database tables, and finally populating the database. Our starting point is heterogeneous and distributed data sources that are eventually used to populate the OLAP cubes. Mapping between the source data and its OLAP form is done by converting the data first to RDF using ontology maps. Then the data are extracted from its RDF form by queries that are generated using the ontology of the OLAP schema. Finally, the extracted data are stored in the database tables and analysed using an OLAP software. Algorithms and examples are provided for all these steps. In our tests, we have used an open source OLAP implementation and a database server. The performance of the system is found satisfactory when testing with a data source of 450 000 RDF statements. We also propose an ontology based tool that will work as a user interface to the system, from design to actual analysis. Keywords: OLAP, ontology, ETL.
1 Introduction

The amount of available digital data is ever increasing, mostly due to the World Wide Web and the intrawebs of organizations. In corporate environments, taking maximal benefit of these data is essential for competitiveness and the success of a company. This is challenging since the data are scattered over various locations and stored in different formats. It is obvious that only a very limited part of the data can be found and analysed manually. Automated and sufficiently powerful methods to fetch the right raw data and process it into a form usable for analysis are thus needed. To make this possible, the data must be made available in a machine readable format. The Semantic Web technologies can be seen as a working solution, since they give a standard method (Resource Description Framework, RDF) to store data items and their relationships in a machine readable form. Data extraction can be done by using RDF query languages. It is also possible (though not always trivial) to map XML or relational data into the RDF form.
For actual data analysis there exist several techniques, from statistical methods to spreadsheet tools. For analysing large amounts of data and trying to find cause-and-effect relationships, On-Line Analytic Processing (OLAP) has become popular during the past years. Briefly, OLAP is a decision making tool for analysts. It is designed to analyse large data warehouses in a multidimensional fashion and at different levels of detail. The multidimensional data structure of OLAP is called an OLAP cube (for details, see e.g. [10]). In theory, one can populate an OLAP cube instance using all the data in the data sources. In many cases, this can lead to a massively sparse and inefficient OLAP cube. In some cases, such as using the World Wide Web as an information source, this approach would be totally impossible. Practically, the cube should be created on demand, so that only the required parts of the data sources are extracted and imported into the OLAP cube. To do so, we have to locate the data sources, convert them into a format that makes the semantics of the data explicit, and create an OLAP cube using parts of the data.
In this paper, we combine these two powerful technologies, the Semantic Web and OLAP, resulting in a method of creating a fully functional tool for analysis. The method uses an ontology and mapping files that connect the data sources with the ontology. Extracting, transforming and loading the data (the so-called ETL process) into an OLAP system is done according to the ontology. We demonstrate our method by implementing it using open source tools. Our starting point is a general, “meta-model” OLAP ontology that defines that there are dimensions and measures, and that the measure is a function of the dimensions. From this general OLAP ontology we form application specific sub-ontologies. These will be the starting point for the actual analysis, and the data will be formatted to conform with them. The Resource Description Framework (RDF) and the Web Ontology Language (OWL) are used to express these ontologies in a manner that is formal enough for analysis. We show that this model is analogous to a relational OLAP model. This gives us a clear theoretical background and the possibility to use relational OLAP servers in the implementation.
As an example in this paper, we use a Trade ontology containing information about trade of different products between countries. We demonstrate our method by combining this ontology with another one, describing country characteristics from the old editions of the World Fact Book (https://www.cia.gov/library/publications/the-world-factbook). The resulting OLAP cube can be used to analyse, for example, whether countries with many telephones per 100 inhabitants actually export a large value of electronics products. Since our use case is based on the idea of distributed, heterogeneous data, the data must be (i) converted into a form that conforms with our specific ontologies. This warrants that we use uniform names of dimensions and measures, as well as identifiers (like country names) that are declared by our ontology. Then (ii) parts of the data must be chosen and (iii) loaded into an OLAP system. Finally, the user can query and analyse the data using the methods offered by the OLAP system. Phases (i) and (ii) have been discussed in our previous papers ([27], [29], [31]). Our paper about combining OLAP and RDF ontologies [29] was mainly related to data integration from distributed, heterogeneous sources, while a Grid implementation of our
system was introduced in [31]. In the current paper we develop these ideas further and present a method for an automated construction of the OLAP server schema. Moreover, we give a set of rules how the OLAP schema can be manipulated by modifying the dimensions. This manipulation enables the users to combine their heterogeneous data (like trade and World Fact Book data) so that the model of the data still remains a legitimate OLAP cube. Finally we give a proof-of-concept implementation and show its feasibility by running a set of tests. The rest of the paper is organised as follows: Related work and background are discussed in Section 2, whereas Section 3 explains our research questions and methodology. In Section 4 we explain the architecture and basic functionality of our system. Section 5 explains the implementation in more detail. Tests and performance results are also shown there. Finally, Section 6 contains a summary and discussion of future work.
2 Related Work and Background 2.1 Related Work There are lots of studies focusing on different parts of our method but as far as we know, none of these studies combine these technologies in the similar way as we do. That is, using OWL as the meta model, RDF for presenting OLAP data, and an RDF query language for extracting the OLAP data. Integrating data from different sources has been the subject of many studies, for a summary, see e.g. [18]. In the context of the so-called Semantic Web, ontology-based data integration using RDF was possibly first mentioned by Bray [8], whereas Lehti and Fankhauser discuss OWL for data integration in their conference paper [22]. On-Line Analytic Processing (OLAP) as a term was introduced by Codd et al. [11]. Priebe and Pernul [33] have studied the possibility of combining OLAP and ontologies to add context information in users’ queries. Lawrence and Rau-Chaplin [21] have studied computing OLAP aggregations (see Section 2.2) in the Grid, whereas OGSADAI (see e.g. [14]) generally focuses on data access in the Grid. Another distributed approach, the GridVine platform combines peer-to-peer methods with an RDF presentation layer [5]. There are several studies using XML data as a source for the OLAP cube, such as [40,20,26]. These studies focus on helping the designer to define the OLAP cube schema based on the structure and meta data of XML sources. For example, Vrdoljak et al. [40] present a method designing Web warehouses based on XML Schemata. The method focuses on detecting shared hierarchies, convergence of dependencies and modelling many-to-many relationships. While the XML Schema is a powerful method to describe the meta data of an XML document, all the information needed cannot be extracted directly from it. Therefore, the method uses XQuery to query XML documents directly, and asks the designer’s help if a decision cannot be made automatically. Finally, the method constructs a star schema representing the data warehouse schema but the method does not populate the actual OLAP database. Jensen et al. [20] study how an OLAP cube can be specified from XML data on the Web. They propose a UML (Unified Modelling Language) based multidimensional model for describing and visualising the structure of XML documents. Finally, they
study how a multidimensional database can be designed based on XML and relational data sources. The aims of Jensen et al. are quite different from ours, since they concentrate on finding the multidimensional structure of XML data sources. N¨appil¨a et al. [26] present a tool for construction OLAP cubes from structurally heterogeneous XML documents. They propose a high-level query primitive that has been designed to work with structurally heterogeneous XML documents. In this way, the designer is freed from knowing the exact structure of the different documents used as source data. A process of extracting data from its sources, transforming it, and uploading it into an OLAP system is generally known as ETL (Extract, Transform, Load). Skoutas and Simitsis [36,37] present an ontology based ETL system. It has a detailed ontology, but omits details of its database implementation. Finally, Romero and Abello have studied automating ontology-based multidimensional database design [34]. They present a semi-automatic method to find multidimensional concepts from heterogeneous data sources sharing a common ontology. Further, the authors give a set of rules how facts and dimensions can be identified in the application domain. The method itself contains three steps: 1) identifying facts, 2) identifying dimension keys (called a base in their model), and 3) defining dimension hierarchies. The authors note that the method could use an OWL ontology as its input but this is not implemented. From this it follows that RDF/OWL query languages are not used in the method. Generally we can say that their method is a tool for a designer to define the structure or the OLAP cube by identifying the measures and dimensions, while the method we present in this paper focus on automating the cube construction when the designer already have the structure of the cube in his/her mind. 2.2 Background OLAP. On-line analytic processing (OLAP) is a method for supporting decision making in cases in which data on measures, such as sales or profit, can be presented as multidimensional OLAP cubes. The OLAP cube can be seen as a set of coordinates that determine a measure value. The coordinates, often called dimensions, can have a hierarchical structure making it possible to analyse data at different levels of statistical aggregation. For example, the dimensions in an OLAP cube can be Time, Customer, Product and the measures Sales and Profit. Further, the product dimension can have a structure Product → Subgroup → Main group. It can be said that items on lower levels ’roll-up’ to items on higher levels in the hierarchy. In this way the user can analyse data at different levels of details. Dimension hierarchies have different properties related to summarizability, i.e. how aggregation operations can be applied. Briefly, summarizability includes three conditions: 1) disjointness of attribute groups, 2) completeness of grouping, and 3) the type of the attribute and the aggregation function [23]. The first two conditions are easy to fulfil while the third one is more complicated. For example, daily sales can be summarised according to the time dimension by addition, but daily stock values cannot. However, the daily stock values can be summarised according to the product dimension. Our implementation is based on the relational OLAP model (ROLAP), that is, the relational database system is used to store OLAP data. The OLAP database is usually
stored in the relational database system using the so-called star schema (see e.g. [10]). The star schema consists of a fact table and several dimension tables. Each measure value is stored in the fact table with dimension keys, while dimension tables contain the dimension data. In the star schema, each dimension has one non-normalised dimension table. If the dimension tables are normalised, the schema is called a snowflake schema [24].

RDF, OWL and Ontologies. RDF (Resource Description Framework) was developed for describing resources on the Web. This is done by making statements about Web resources (pages) and things that can be identified on the Web, like products in on-line shops [3]. RDF is based on the idea of identifying things using Web identifiers (called Uniform Resource Identifiers, or URIs), and describing resources by giving statements in terms of simple properties and property values. An RDF statement is a triple of subject, predicate, and object. The statement asserts that some relationship, indicated by the predicate, holds between the things denoted by the subject and the object of the triple. As an example of a resource on the Web, we can have the following statement: the web page whose URI is “http://www.example.org/xyz.html” (subject) has a creator (predicate) that is N.N. (object). As an example of a thing outside of the Web, but referred to by a URI, we can consider the following: a country, Andorra, referred to by the URI “http://wiki.hip.fi/xml/ontology/Countries.owl#AD” (subject) has a population (predicate) of 470 000 (object). When describing types of resources, structures like classes are typically needed. This is possible by using the RDF schema language that enables us to construct classes with their properties (see [4]). OWL further extends the schema language by adding restrictions, like cardinality constraints (see [2]); cardinality constraints between dimensions, measures, and OLAP cubes are discussed in the next section. Generally, an ontology is a description (like a formal specification of a program) of concepts and relationships that can exist for an agent or a community of agents [38] (an agent here means a rational actor, not necessarily a computer program). Here, we assume that an ontology is presented in the form of (consistent) statements. In the context of this paper, the language in which these statements are written will be RDF Schema or its extension OWL-DL, where “DL” stands for Description Logics (see [6] and [2]).

OLAP ontology. Based on the ideas of OLAP, we have defined a generic OLAP ontology [29], accessible as http://wiki.hip.fi/xml/ontology/olapcore.owl. The OLAP ontology consists of dimensions (members of DimensionSet) and measures (members of MeasureSet) that are connected by FactRows. Each FactRow object has one DimensionSet and at least one MeasureSet. An OLAP cube (OLAPCube) has at least one FactRow. A formal specification can be found in the appendix. Any application specific OLAP ontology uses olapcore.owl as its starting point – technically it includes olapcore.owl as a namespace. In a specific OLAP ontology, actual OLAP dimensions (like Country → Continent) are declared among classes
(Country and Continent) as properties (Country having the property hasContinent). Respectively, application specific dimension sets are declared as subclasses of DimensionSet. For instance, in our Trade ontology called olaptrade, TradeDimensionSet, representing trade of specific products between two countries in a given year, is defined as having a hasExportCountry property whose range is Country, a hasImportCountry property, a hasProduct property, and a hasYear property. The only member of TradeMeasureSet, a subclass of MeasureSet, is an attribute representing the trade value. TradeFactRow, naturally, consists of one TradeDimensionSet and one TradeMeasureSet. TradeOLAPCube, a subclass of OLAPCube, has at least one TradeFactRow. Similarly, we have defined an ontology olaptradewfb that includes the Trade ontology and Country attributes from the CIA World Factbook. These include the population, land area, literacy rate, and the number of telephone lines per 100 inhabitants for each country in a given year. Since these attributes are not required to determine the value of the trade (it is still determined by the export country, import country, product, and year), they are called dimension property attributes.
An ontology like olaptrade (http://wiki.hip.fi/xml/ontology/olaptrade.owl) or olaptradewfb (http://wiki.hip.fi/xml/ontology/olaptradewfb.owl) will serve as a model according to which raw data needs to be structured. We assume that – apart from naming and structural representation – we do have raw data that is semantically unambiguous and factually correct. We shall use RDF to represent this data in a form that can be verified to conform with our ontology.

RDF queries. Our motivation to use RDF comes from the fact that it provides an interim format for the main steps of raw data processing, where (i) the data are converted into a form that conforms with our specific ontologies and (ii) RDF queries are posed to the converted data and the results are loaded into an OLAP system. For Step (ii), we use an RDF query language. Since we know that the RDF data in the data set S conforms to our ontology, i.e. it has dimension and fact row data, we can automatically generate queries for the instances of these Dimensions and FactRows in S. It should be noticed that by default the queries generated by the system are “large”; for instance, a query of Countries will query all the attributes (land area, population, ...) of each country as well.
As Broekstra et al. [9] explain, the evaluation of an RDF query is carried out either by (eval-i) computing and storing the closure of the given graph as a basis or (eval-ii) letting the query processor infer new statements as needed. Bearing in mind that RDF data sets can be expressed as a set of subject-predicate-object triples, we can see a query evaluation as a process where some triples “match” the query and can be included in its result set, whereas others are rejected. Following (eval-i), a simplified algorithm for evaluating a query Q with a data source S would be given as follows:
1: Let CS be the closure of S.
2: Let L be an empty list.
3: Compute a query model of Q such that it contains (i) restrictions of values that are allowed to appear in each match and (ii) unbound variables that are instantiated for each match.
4: for all RDF triples T in CS do
5: If T satisfies the restrictions, bind the values of T to their respective variables in Q as pairs of variable names and values. Append the result to L.
6: end for
7: Return L.
The algorithm omits many details, including logical connectives, and its efficiency can be improved for the evaluation of certain types of queries. For instance, Perez et al. [32] have stated that an unlimited RDF query has PSPACE complexity, whereas a more limited class of queries can be linear in complexity. For the purposes of estimating the complexity of our typical queries, let us consider a query Q that is “?x hasPredicate ?y” and a data set D of RDF triples {<a1, b1, c1>, <a2, b2, c2>, ..., <an, bn, cn>}. The simplest possibility to evaluate Q on D is to include only those triples whose bi equals hasPredicate. This is possible (without any overhead) of course only in a situation where hasPredicate can be matched directly, without calculating its subclasses or superclasses. However, this is the case in our automatically generated queries (see Section 4.2). It is obvious that, using this method, the complexity of evaluating these queries is linear w.r.t. the size of D: even though our queries have conjunctions, D can be traversed separately for each variable in the evaluation.
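As an illustration only (not the implementation described later in this paper), if the closure CS were materialised as a single three-column relation of triples, the direct predicate match just described would correspond to a plain relational selection, which explains the linear behaviour:

```sql
-- Illustration: matching the pattern "?x hasPredicate ?y" over a triple relation
-- is a single linear scan with a selection on the predicate column.
CREATE TABLE Triples (
    subject   VARCHAR(200) NOT NULL,
    predicate VARCHAR(200) NOT NULL,
    object    VARCHAR(200) NOT NULL
)

SELECT subject AS x, object AS y
FROM Triples
WHERE predicate = 'hasPredicate'
```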
3 Methodology and Formalism

In this section we present the theoretical background of our method. We assume that there exists a data source of measures from some phenomenon and that these measures are identified by a set of attributes, called dimension keys. The dimension keys can be classified based on additional data. This data can come from different data sources assuming that the ontology bases of these are similar enough. By similar enough, we mean that it must be possible to build a mapping between the data sources. If the data sources share a common name space, this is usually straightforward. In our world trade example, the trade figures and the external country data are joined using the standard country identifiers. As a whole, the system works as follows:
1. There are raw data with ontology descriptions on the Web. Pairs of <raw data source, ontology map> are assumed to be available to the system.
2. The raw data sources are converted into RDF files that conform to ontologies.
3. An OLAP cube is constructed by posing queries against the RDF files. The queries are generated automatically using the OLAP cube design (ontology) as input.
4. The OLAP server schema and its corresponding relational database are constructed for the OLAP cube.
5. The user can analyse data using a normal OLAP interface and, if needed, easily change the whole OLAP cube by repeating the process.

3.1 Formal OLAP Model

We use the relational database model [12] as our theoretical background. The application area with all data sources can be seen as a universal relation (see e.g. [25,35]) as
Table 1. Application area as a universal relation

Product      | Group       | Exp. country | Exp. population | Imp. country | Imp. population | Year | Trade value
Office paper | Paper       | FI           | 5.2             | UK           | 50              | 1995 | 130
Mobile phone | Electronics | FI           | 5.2             | UK           | 50              | 1995 | 230
:            | :           | :            | :               | :            | :               | :    | :
A universal relation instance can be seen as an unnormalised OLAP cube if it contains some "dimensions" and some "measures" such that the combination of the dimensions determines the measure values. In our example, the measure can be 'Trade value', while the other attributes can be seen as dimensions. However, we can see that not all attributes (e.g., the population of the exporting country) are needed to determine the Trade value attribute. Actually, Product, Exporter, Importer, and Year are enough. Therefore, we can conclude that those attributes are "dimension keys", while the other attributes, so-called dimension attributes, describe these keys. Dimension attributes can be either dimension level attributes, forming the dimension hierarchy, or dimension property attributes, giving additional information on the dimension keys or levels.

Based on the relational database model and the universal relation idea, we can formalise the OLAP model [28] as follows:

Definition 1. The OLAP cube schema and instance.
1. A dimension schema D and a measure set M are sets of attributes.
2. An OLAP cube schema C = D1 ∪ D2 ∪ ... ∪ Dn ∪ M, where D1, ..., Dn are called dimension schemata, M is the measure set, D1 ∪ D2 ∪ ... ∪ Dn is a superkey for C, and there exists a single-attribute key Kk for each dimension schema Dk.
3. The attribute Kk is called the dimension key attribute. Other attributes A ∈ Dk are called dimension attributes.
4. An OLAP cube instance c is a relation over the OLAP cube schema C, and a relation d over D is called a dimension instance.
5. It is assumed that the measure set M is not empty and that the cube has at least one dimension.

The requirement that the measure set M be non-empty refers only to the schema: every OLAP cube schema must have at least one measure attribute. This does not prevent some actual measure values from being missing. As we indicate by using the concepts "key" and "superkey", there are some functional dependencies [13] involved in our model: the set of dimension keys K1, ..., Kn → M, and each dimension key Kk → Dk.

Although dimension hierarchies are not an explicit part of our general OLAP model, we can still define dimension level attributes and dimension property attributes as follows: an attribute A ∈ D is called a dimension level attribute if A is used as a level in the dimension hierarchy; otherwise A is called a dimension property attribute. Dimension level attributes form the actual dimension hierarchy, for example town → country → continent, while dimension property attributes give additional information about the dimension items, for example its population, area, etc.
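To make Definition 1 concrete, one possible instantiation for the trade example of Table 1 (our own illustration, not part of the original formalisation) is

D_Prod = {Product, Group},  D_Exp = {Exp. country, Exp. population},
D_Imp = {Imp. country, Imp. population},  D_Year = {Year},  M = {Trade value},
C = D_Prod ∪ D_Exp ∪ D_Imp ∪ D_Year ∪ M,

with the dimension keys Product, Exp. country, Imp. country and Year, and the functional dependencies

{Product, Exp. country, Imp. country, Year} → M,
Product → D_Prod,  Exp. country → D_Exp,  Imp. country → D_Imp,  Year → D_Year.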
Our OLAP model is equivalent to the so-called star schema that is used in many relational OLAP implementations: the dimension key attributes K1, ..., Kn together with the measure attributes M1, ..., Mm form the fact table, and each dimension D forms its own dimension table. As in our formal OLAP model, in the star schema model dimension hierarchies are not stored explicitly on the relational level. They can still be extracted, for example, by using information on functional dependencies among the attributes of a dimension. In our method, the dimension hierarchy information is represented in the RDF/OWL model, described in Section 2.2.

3.2 Modifying the OLAP Cube Schema

Our starting point is a "virtual" OLAP cube schema representing the whole application area. Based on this, the user defines a sub-schema by giving a new RDF/OWL sub-ontology. In theoretical reasoning, we can think of the original cube schema as being transformed into the user-defined one by manipulating the dimensions in a proper way, preserving the fact that the instances before and after each operation fulfil Definition 1. The cube manipulation operations can be:

1. modifying the internal structure of a dimension,
2. removing dimensions or dimension keys,
3. changing dimension keys, or
4. adding dimensions.
These operations enable us to construct different OLAP cubes from the same source data or from pre-existing OLAP cubes. For example, if we have daily sales data containing Product ID and Date as dimensions, we can have the cubes (Product, Day, Month, Year, Sales) or (Product, Weekday, Week, Year, Sales), since both Day, Month, Year → Date and Weekday, Week, Year → Date. However, some issues should be taken into account. For example, summarizability properties can change when modifying dimensions. Next we give more detailed descriptions of these operations. Because of space limitations and for better readability, we omit the formal proofs.

Modifying the internal structure of a dimension. Modifying the internal structure of a dimension, i.e., changing non-key attributes, affects only the dimension itself. In these internal changes, one must take care that the dimension key really determines the other attributes in the dimension after the operation. The population of a country, for example, may depend on the time dimension. Thus the population cannot be added into a dimension, but it could be an additional measure attribute. On the other hand, the land area of a country can most often be seen as a static fact, and it can be used to describe countries in a dimension.

Removing dimensions or dimension keys. A dimension key K, or the whole dimension D, of an OLAP cube schema C can be removed if there exists an aggregation function f (by an aggregation function, e.g., sum or average, we mean a function summarizing fact rows having the same dimension keys) such that an instance c′ over C − K is a valid OLAP cube after applying f to c′. An aggregation function is needed since, when a dimension or the key of a dimension is removed, some measurements can have identical dimension values, i.e., we do
not have a valid OLAP cube anymore. These duplicate coordinates can be eliminated by applying some aggregation operation to the measure values having the same dimension attributes. For example, if we remove the product dimension, i.e., the Product and Product group attributes, the total trade of all different products must be summarized. The following example illustrates this. An original OLAP cube presented in a tabular form:

Product | Product group | Exporter | Value
paper   | forest        | FI       | 200
wood    | forest        | FI       | 250
wood    | forest        | SE       | 300
phones  | electronics   | FI       | 400
After removing the product dimension, the resulting cube is not a valid OLAP cube anymore, since Exporter does not determine Value:

Exporter | Value
FI       | 200
FI       | 250
SE       | 300
FI       | 400
The problem can be solved by eliminating the duplicate dimension values using an aggregate function:

Exporter | Value
FI       | 850
SE       | 300
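A minimal code sketch of this duplicate elimination, assuming SUM as the aggregation function and representing the cube rows as simple records (our own illustration, not the prototype's code):

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RemoveDimension {

    record Row(String product, String productGroup, String exporter, int value) {}

    // Drops the product dimension and aggregates Value per remaining key (Exporter).
    static Map<String, Integer> dropProductDimension(List<Row> cube) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Row r : cube) {
            result.merge(r.exporter(), r.value(), Integer::sum); // SUM aggregation
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> cube = List.of(
            new Row("paper",  "forest",      "FI", 200),
            new Row("wood",   "forest",      "FI", 250),
            new Row("wood",   "forest",      "SE", 300),
            new Row("phones", "electronics", "FI", 400));
        System.out.println(dropProductDimension(cube)); // {FI=850, SE=300}
    }
}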
While removing dimension keys is always technically possible, it can lead to semantically vague results if a semantically correct aggregation function is not used, or if none exists. Therefore, our method does not try to find the aggregation function automatically.

Changing dimension keys. A dimension key attribute can be replaced by another attribute if the new attribute still satisfies Definition 1. For example, in theory it would be possible to change the country ID to the population of the country, if there are no two countries with exactly the same population in our example data. Of course, semantically this would be misleading. On the other hand, if duplicate values exist, so that the resulting structure is no longer a valid OLAP cube, the aggregation functions can be applied. This would mean grouping dimension values in a new way. A dimension key can also be replaced by several new dimension keys. For example, assume that Date, the key of the time dimension (having the format month-day-year), is replaced with two new dimension keys, WeekDay (weekday-year) and WeekNumber (week-year). This is a valid change, since it results in a valid OLAP cube. In practice this means adding a dimension to the OLAP cube, and thus it easily increases sparsity (see the sketch at the end of this subsection).

Adding dimensions. The fourth possibility is to add dimensions. This is possible since Definition 1 requires only the superkey property from the set of dimensions. This means
that we can add additional information as dimensions: for example, the area of the country in addition to the country name. However, this is not a good practice, since it usually increases the sparsity of the cube.
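As a sketch of the dimension-key change discussed above – replacing the Date key by WeekDay and WeekNumber – the following illustration derives the two new keys from a date value; it is our own example and assumes ISO week numbering as one possible convention:

import java.time.LocalDate;
import java.time.temporal.WeekFields;

public class DateKeySplit {

    record WeekdayKey(String weekday, int year) {}   // (weekday-year)
    record WeekKey(int week, int weekBasedYear) {}   // (week-year)

    static WeekdayKey toWeekdayKey(LocalDate d) {
        return new WeekdayKey(d.getDayOfWeek().toString(), d.getYear());
    }

    static WeekKey toWeekKey(LocalDate d) {
        WeekFields wf = WeekFields.ISO;
        return new WeekKey(d.get(wf.weekOfWeekBasedYear()), d.get(wf.weekBasedYear()));
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.parse("1995-03-17"); // the original Date key value
        System.out.println(toWeekdayKey(date));         // e.g. WeekdayKey[weekday=FRIDAY, year=1995]
        System.out.println(toWeekKey(date));            // e.g. WeekKey[week=11, weekBasedYear=1995]
    }
}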
4 Architecture and Implementation

To summarise, the aim of our design is to collect data from different data sources, and to generate and populate OLAP cubes. Ontology maps are used to create a uniform (RDF) presentation from each of the data sources, while RDF queries are used to extract data from the RDF presentations. The main components of our system are (i) the data sources and the ontology maps related to each of them, (ii) an RDF query processor, and (iii) an OLAP software and its database. We use the Trade/WorldFactbook ontology, describing countries, their attributes, and the import and export of products between the countries, as our running example. The conceptual presentation of our example ontology is shown in Figure 1 and the system functionality is illustrated in Figure 2.
(Figure 1 depicts the concepts Product, Sub group, Main group, Import country, Export country, Year, Country and Continent, the country attributes area, telephonesPer100, population and literacyPercent, and the Trade value measure.)
Fig. 1. Conceptual schema of our example OLAP cube. The arrows represent functional dependencies.
Fig. 2. Functional flow of our approach
Our process of creating OLAP cubes has two main phases:

1. Defining and converting RDF data for the OLAP cube. This is done by converting raw data into RDF using ontology maps.
2. Creating the OLAP cube schema and database for the ROLAP server, and populating the database.

The phases are described in more detail in the following sub-sections.
4.1 Defining and Converting Data

In our system, an ontology map is created for each data source. The purpose of an ontology map is to describe the processing steps for the data from the data source to an RDF file that conforms to an OLAP ontology. A data source can be a relational database, a set of files with field = value rows, or an eXtensible Markup Language (XML) file. Intuitively, the mapping from the raw data to the RDF presentation is similar to database transformations (see e.g. [15]), and we assume that the raw data can be seen as data in at least First Normal Form (see [13]). The grammar for ontology maps is given in http://wiki.hip.fi/xml/ontology/ontomap.xsd, and a more detailed discussion can be found in [30]. The following example illustrates the idea; an example of a resulting RDF file is shown in the appendix.

Let ontomap-TradeDimensionSet.xml be an ontology map and year1983withproductsandcountries.xml a raw XML data source (extracts of both are shown in Figure 3). Our program ProcessOntoMap follows the instructions of the ontology map file and produces an RDF file. The RDF file conforms to the ontology (olaptradewfb.owl) and contains data generated from the raw data. In our example, the raw data is in XML format, and ProcessOntoMap populates the RDF file by

– simply copying a data item from an XML element of the raw data into the corresponding XML element of the RDF representation, or
– performing a "lookup" or a function. A lookup means replacing a value with another by using an internal map (e.g., a table continent that contains the names of continents and their IDs) in order to unify the names from different data sources. A function is an algebraic expression, used for instance to convert a value in Euros to US dollars.

The functionality of RDF generation by ontology maps can be described as follows:

– If the data source consists of field-value text files or SQL database tables, create a corresponding XML document D.
– In the ontology map O, evaluate each "tuple/column" expression T as an XPath expression for D. (XPath is a language for addressing parts of an XML document; for details see [1].)
– For the result set of T, populate the RDF document. If there are value transformations, such as replacing strings by lookups or performing numeric conversions, evaluate them.

It is assumed that the data source provider knows the semantics of the data. For instance, a data source in the French language may have a field called lettré for literacy, and its value may be the number of literate people in the country instead of the literacy percent. In this case, for this field the ontology map would contain the value transformation (literate people / population × 100), using the result as the contents of the field "literacy". We assume, too, that there is no semantic ambiguity; in this case we assume
that the words "lettré" and "literacy" have the same meaning in this context, and that the data concerning literacy has been collected in a compatible way in different data sources.

Fig. 3. Parts of a raw data XML file and its corresponding ontology map
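The two kinds of value transformation performed by ProcessOntoMap, a lookup and an algebraic function, can be sketched as follows; the map contents and method names are illustrative assumptions, not taken from the actual ontology map grammar:

import java.util.Map;

public class ValueTransformations {

    // Lookup: unify continent names coming from different sources to internal IDs.
    static final Map<String, String> CONTINENT_LOOKUP = Map.of(
        "Europe", "EU",
        "Europa", "EU",   // two source spellings mapped to one ID
        "Asia",   "AS");

    static String lookupContinent(String rawName) {
        return CONTINENT_LOOKUP.getOrDefault(rawName, "UNKNOWN");
    }

    // Function: e.g. turn an absolute count of literate people into a percentage,
    // as in the "lettré" example, or convert Euros to US dollars at a given rate.
    static double literacyPercent(double literatePeople, double population) {
        return literatePeople / population * 100.0;
    }

    static double euroToUsd(double euros, double rate) {
        return euros * rate;
    }
}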
4.2 Creating OLAP Cube Schema and Populating It

In our prototype, we use the Mondrian open source ROLAP system (http://mondrian.pentaho.org) as our OLAP implementation. Naturally, Mondrian uses an SQL database as its data storage. An XML configuration file (the Mondrian OLAP schema) is used to map the database tables and fields into OLAP dimensions and measures. The OLAP platform and database creation consists of the following components:

– Creating a Mondrian OLAP schema from the RDF/OWL ontology.
– Creating a database schema ("create table" statements) from the ontology.
– Populating the database using the RDF data.

The Mondrian schema gives the structure of the OLAP cube and links it to the relational database storing the data. The Mondrian schema can be constructed from the OWL OLAP schema using Algorithm 1. An example schema is shown in the appendix.

Algorithm 1. Create a Mondrian schema from an OWL OLAP schema.
1: Find the subclass C of the OLAPCube class.
2: Set Schema name = C and Cube name = C.
3: Find the dimension set CD related to C.
4: Set Table name = CD fact.
5: for all dimensions Dn ∈ CD do
6:   Create a <Dimension> section.
7:   Set Dimension name = Dname, foreignKey = Dname.
8:   Set Hierarchy AllmemberName = "Every " + Dn, primaryKey = Dn.
9:   Set Table name = Dname.
10:  for all levels L in the hierarchy do
11:    Set name = Lname, column = Lname.
12:  end for
13: end for
14: Find the measure set CM related to C.
15: Set name = CMname.
A dimension hierarchy can be formed by following the OWL data type properties that connect dimension classes to each other. For instance, the product → subgroup → maingroup hierarchy can be formed by using Product as a starting point, identifying hasSubgroup as the Product property linking it to Subgroup, and then identifying hasMainGroup as the Subgroup property linking it to MainGroup.

The relational database schema for the Mondrian schema can now be constructed using Algorithm 2. The process is relatively trivial, given that we know the dimensions and measures based on the ontology. Obviously, unique compound keys for the DimensionSet-MeasureSet table (table D in Algorithm 2) can be created by using the columns of the DimensionSet. This is a direct consequence of the fact that in each row, the value of the measure is dependent on its dimensions. The database is populated by using the data in its RDF form. As explained in Section 4.1, ontology maps enable data sources to have an RDF representation that conforms to an ontology.
Algorithm 2. Create a relational database schema from a Mondrian OLAP schema.
1: for all subclass S of FactRow do
2:   Identify DimensionSet D and MeasureSet M included in S.
3:   Create table D whose columns are the names of properties in D and M.
4:   for all class C in D do
5:     Form a dimension hierarchy CH for C as explained above.
6:     Create table C whose columns are the names of classes in CH.
7:   end for
8: end for
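A rough sketch of how Algorithm 2 could be realised as plain SQL string generation; this is our own simplification with assumed table, column and type names, not the prototype's createdimensiontables program:

import java.util.List;

public class StarSchemaDdl {

    // Builds the fact table for a DimensionSet/MeasureSet, cf. step 3 of Algorithm 2.
    static String factTable(String name, List<String> dimensionKeys, List<String> measures) {
        StringBuilder sb = new StringBuilder("CREATE TABLE " + name + " (");
        dimensionKeys.forEach(k -> sb.append(k).append(" VARCHAR(64), "));
        measures.forEach(m -> sb.append(m).append(" DOUBLE, "));
        // The dimension keys together form the compound primary key.
        sb.append("PRIMARY KEY (").append(String.join(", ", dimensionKeys)).append("))");
        return sb.toString();
    }

    // Builds one dimension table whose columns are the hierarchy levels, cf. steps 4-6.
    static String dimensionTable(String name, List<String> hierarchyLevels) {
        return "CREATE TABLE " + name + " ("
             + String.join(" VARCHAR(64), ", hierarchyLevels) + " VARCHAR(64))";
    }

    public static void main(String[] args) {
        System.out.println(factTable("TradeDimensionSet_fact",
            List.of("hasProduct", "hasExportCountry", "hasImportCountry", "hasYear"),
            List.of("hasValue")));
        System.out.println(dimensionTable("Product",
            List.of("Product", "SubGroup", "MainGroup")));
    }
}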
Fact Table query:
select ?hasProduct ?hasExportCountry ?hasImportCountry ?hasYear ?hasValue
where (?TradeDimensionSet ?hasProduct)
      (?TradeDimensionSet ?hasExportCountry)
      (?TradeDimensionSet ?hasImportCountry)
      (?TradeDimensionSet ?hasYear)
      (?TradeDimensionSet ?hasValue)

Dimension Table query:
select ?Product ?SubGroup ?MainGroup
where (?Product ?SubGroup)
      (?SubGroup ?MainGroup)

Fig. 4. Two RDF queries
Once the raw data sources have been converted into RDF format and we know the classes and properties of the ontology, we can use RDF queries to retrieve parts of the data. Specifically, we construct queries that retrieve the data that will correspond to our fact table and to each of the dimension tables. These queries can be created automatically based on the ontology: the RDF source corresponds to an RDF/OWL ontology, and therefore the terms in the ontology have their extensions in the data. We simply use the terms as variable names in our queries, as shown in Figure 4. Finally, the output of the queries can be used to build SQL insert/update statements for the database.

The method for creating the queries based on the ontology at hand is almost identical to Algorithm 2: it creates queries for the FactRows and dimensions. Examples
of such queries for the Trade ontology are shown in Figure 4: one for the fact table and one for a dimension table. A part of the Trade ontology can be seen in the appendix. After processing the query, our query tool produces an XML stream containing the results. This, in turn, can be converted into SQL insert/update statements.
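As a sketch of that last conversion step (again our own illustration, not the createinserts program), one variable-binding row of the fact-table query could be turned into an SQL statement as follows:

import java.util.Map;
import java.util.stream.Collectors;

public class InsertBuilder {

    // Turns one result row, e.g. {hasProduct=..., hasYear=..., hasValue=...},
    // into a REPLACE statement for the fact table.
    static String toReplace(String table, Map<String, String> row) {
        String columns = String.join(", ", row.keySet());
        String values = row.values().stream()
                .map(v -> "'" + v.replace("'", "''") + "'") // naive SQL escaping
                .collect(Collectors.joining(", "));
        return "REPLACE INTO " + table + " (" + columns + ") VALUES (" + values + ")";
    }
}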
5 Tests and Results

We implemented a proof-of-concept prototype of our method using Sun's Java SDK, the Protege RDF/OWL tools [16], the Sesame query libraries [9], and the Mondrian ROLAP software with a MySQL database back-end. A detailed description of the work flow is shown in Figure 5, and the processing times of our tests are shown in Table 2.

The programs were run in a virtual Linux machine comparable to a PC with a 32-bit 1800 MHz AMD processor and 3 GB of memory. The operating system was Linux Debian 4.0 with its standard kernel. Sun's SDK 1.5 was used for Java, and MySQL 4.1 as Mondrian 2.4.1's database.

In more detail, an ontology map ("Ontomap" in the work flow diagram) indicates the data source (raw data) and contains instructions on how it can be converted into an ontology-conforming RDF file. ProcessOntoMap is a Java program that performs the conversion. Ideally, conversions can be executed in parallel and close to the data, to eliminate the need for data transfer over the network. In our test case, each raw XML data source contained around 450 000 statements about countries (their population, land area, literacy, telephones per 100 inhabitants, and the continent where the country lies), about products with their Standard International Trade codes, their subgroups and main groups based on the same standard, and about the trade values (in USD) of each type of product from the exporting country to the importing country. Each raw XML data source represented one year of trade with the country attributes. The size of the resulting RDF file was about 146 MB.

Queries based on our ontology are created by the createqueries program. The queries, again, could be run close to the RDF data source, and only their results transferred over the network.
Fig. 5. Process work flow
Table 2. Execution times of the steps of Fig. 5

Step                                                      | Program               | Time
Converting a data source to RDF                           | ProcessOntoMap        | 16 m 34 s
Creating Mondrian OLAP cube schema from ontology          | createmondrian        | 1 s
Creating a database star schema for Mondrian              | createdimensiontables | 1 s
Creating queries needed for database inserts              | createqueries         | 1 s
Running queries (for dimensions and FactRow)              | sesamequery           | ca. 1 s for each dim + 1 m 49 s
Converting query results to SQL insert/replace statements | createinserts         | 3-4 s for each dim, 13 m 20 s (FactRow, 444600 lines)
Loading data into the database                            | mysql                 | 53 s
In our test case, the combined size of the query results was about 100 MB. These results are trivially converted into SQL insert/update statements. Indexes for the database tables can be created automatically, too.

A schema for Mondrian OLAP, and its corresponding database schema (star schema), is created by createmondrian, using our ontology as its input. Finally, the data generated by running the queries is loaded into Mondrian's database, as shown at the bottom of Figure 5. After that, Mondrian OLAP with its web server is started. The starting point of the analysis shows years, products, and exporting and importing countries with their country attributes as dimensions, and the value of the trade as a measure. The first aggregation of trade values is therefore done by Mondrian before this analysis page appears. However, generating the page for the first time takes only 8.2 seconds, and much less after the data has been cached by Mondrian.
6 Conclusions, Discussion and Future Work

In this paper, we have presented an ontology-based methodology for constructing OLAP cubes, mapping raw data to the cubes, extracting data, and populating an OLAP application with the data. We have demonstrated the feasibility of our approach by using actual raw data consisting of almost 450 000 lines.

The potentially most complex parts of the process are (i) converting the raw data using an ontology map and (ii) running RDQL queries against the RDF data. We have observed that with our example data, both processes – ProcessOntoMap for (i) and sesamequery for (ii) – are nearly linear with respect to the size of the data. For completeness, we now briefly discuss what the complexity would be in a more general case for these processes. In the case of ProcessOntoMap, we have described the conversion algorithm in Section 4.1. We note that each "column" line of the ontology map (in the appendix) represents an XPath expression that will be evaluated against
XML document D of the raw data. Since the XPath expressions at hand are quite simple – i.e., they always have a root and only the leaf node is a query term – they belong to what Gottlob et al. [17] call Core XPath, and for them the complexity of evaluation is linear with respect to the size (variables) of the XPath expression and the size (terms) of the XML document. For RDF queries, we stated in Section 2.2 that they can be evaluated, in theory, in linear time.

However efficient the ontology map based conversions are, they are not usable if creating the ontology maps is too difficult for the people who want to provide their raw data for analysis. We estimate that for each level of the dimension hierarchy, creating the corresponding ontology map expression takes only a few minutes. Making an XSLT stylesheet [39], for instance, to transform the raw data to RDF would require much more time.

Based on the previous sections, we can now summarise our design for a semi-automated OLAP analysis system as a work flow:

– Using a GUI design tool that resembles an OLAP analysis application, the user selects the dimensions and measures of his/her liking. In our prototype we have used the Protege ontology editor [16]. The starting point is an ontology derived from olapcore.owl. The design tool of course only allows the user to select dimensions and measures in such a way that the resulting OLAP cube remains valid (as in Section 3).
– It is possible that the OLAP database has already been populated, but here we assume this is not the case. Thus, we need to perform an ontology map based ETL process from data sources d1, ..., dn. First, queries are generated based on the ontology. For each data set di, the data set is transformed into an RDF form using an ontology map as in Section 4.1, and the queries are executed for di. It should be noted that the ETL process can inspect the ontology map for each data source. If the ontology map reveals that the data source does not contain data about the dimensions of our interest, transforming and executing the query is not needed. Similarly, the query generation should support a "relevant data only" option; by using the GUI, the users may have pointed out that they are not interested in country populations in their analysis of importing and exporting countries. Thus, the queries should not request the population attribute, even if it is available in the raw data.
– The results of the query execution are used to populate the OLAP database and the user can start his/her analysis.

An ambitious design would combine the analysis tool with the design tool. Effectively, the design tool would enable the users to define their own ontologies (based on olapcore) and link them to raw data using ontology maps. The analysis tool would then generate the OLAP cube and let the users analyse it, as discussed above. An interesting topic for further research would be an automated search of ontologies and ontology maps. Here, for simplicity, we assume that the location of the data is known to the application – in [31] we designed a World Wide Web based repository that contains "collections", i.e., sets of pairs. This approach may not be practical if there are dozens of ontologies and hundreds of data sources; an approach using, e.g., Semantic Web data discovery tools (see [7]) would be more usable.
Taking data confidentiality into account, we have configured a system where ontologies and ontology maps are distributed without restrictions, but raw data files can only be accessed using authentication and authorization by X.509 certificates (see [19]).
References

1. XML Path Language (XPath). Technical report, W3C (1999)
2. OWL Web Ontology Language Overview. Technical report, W3C (2004)
3. RDF Primer, W3C Recommendation 10 February 2004. Technical report, W3C (2004)
4. RDF Vocabulary Description Language 1.0: RDF Schema. Technical report, W3C (2004)
5. Aberer, K., Cudré-Mauroux, P., Hauswirth, M., Van Pelt, T.: GridVine: Building internet-scale semantic overlay networks. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 107–121. Springer, Heidelberg (2004)
6. Antoniu, G., van Harmelen, F.: Web Ontology Language: OWL, ch. 4. Springer, Heidelberg (2004)
7. Bannon, M., Kontogiannis, K.: Semantic Web data description and discovery. In: STEP 2003: Eleventh Annual International Workshop on Software Technology and Engineering Practice. IEEE, Los Alamitos (2003)
8. Bray, T.: RDF and metadata. XML.com (1998)
9. Broekstra, J., Kampman, A., van Harmelen, F.: Sesame: A generic architecture for storing and querying RDF and RDF Schema. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS, vol. 2342, p. 54. Springer, Heidelberg (2002)
10. Chaudhuri, S., Dayal, U.: An overview of data warehousing and OLAP technology. SIGMOD Rec. 26(1), 65–74 (1997)
11. Codd, E., Codd, S., Salley, C.: Providing OLAP to user-analysts: An IT mandate. Technical report, Hyperion (1993)
12. Codd, E.F.: A relational model of data for large shared data banks. Communications of the ACM (1970)
13. Codd, E.F.: Further normalization of the data base relational model. In: Data Base Systems, Courant Computer Science Symposia Series 6 (1972)
14. Comito, C., Talia, D.: XML data integration in OGSA Grids. In: Pierson, J.-M. (ed.) VLDB DMG 2005. LNCS, vol. 3836, pp. 4–15. Springer, Heidelberg (2006)
15. Davidson, S., Buneman, P., Kosky, A.: Semantics of database transformations. LNCS, vol. 1358, pp. 55–91. Springer, Heidelberg (1998)
16. Gennari, J., et al.: The evolution of Protege – an environment for knowledge-based systems development. Int. J. Hum.-Comput. Stud. 58(1) (2003)
17. Gottlob, G., Koch, C., Pichler, R.: The complexity of XPath query evaluation. In: PODS 2003: Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 179–190. ACM, New York (2003)
18. Hull, R.: Managing semantic heterogeneity in databases: a theoretical perspective. In: Proc. ACM Symposium on Principles of Databases (1997)
19. ITU-T: ITU-T Recommendation X.509. Technical Report ISO/IEC 9594-8:1997, International Telecommunication Union. Information technology - Open Systems Interconnection - The Directory: Authentication framework (1997)
20. Jensen, M.R., Moller, T.H., Bach Pedersen, T.: Specifying OLAP cubes on XML data. J. Intell. Inf. Syst. 17(2-3), 255–280 (2001)
21. Lawrence, M., Rau-Chaplin, A.: The OLAP-enabled Grid: Model and query processing algorithms. In: HPCS (2006)
22. Lehti, P., Fankhauser, P.: XML data integration with OWL: experiences and challenges. In: Proc. 2004 Intl. Symposium on Applications and the Internet. IEEE, Los Alamitos (2004)
23. Lenz, H., Shoshani, A.: Summarizability in OLAP and statistical data bases. In: Ioannidis, Y., Hansen, D. (eds.) Ninth International Conference on Scientific and Statistical Database Management, Proceedings, Olympia, Washington, USA, pp. 132–143. IEEE Computer Society, Los Alamitos (1997)
24. Levene, M., Loizou, G.: Why is the snowflake schema a good data warehouse design? Inf. Syst. 28(3), 225–240 (2003)
25. Maier, D., Ullman, J.D., Vardi, M.Y.: On the foundations of the universal relation model. ACM Trans. Database Syst. 9(2), 283–308 (1984)
26. Näppilä, T., Järvelin, K., Niemi, T.: A tool for data cube construction from structurally heterogeneous XML documents. J. Am. Soc. Inf. Sci. Technol. 59(3), 435–449 (2008)
27. Niemi, T., Nummenmaa, J., Thanisch, P.: Constructing OLAP cubes based on queries. In: Hammer, J. (ed.) DOLAP 2001, ACM Fourth International Workshop on Data Warehousing and OLAP, pp. 9–11. ACM, New York (2001)
28. Niemi, T., Nummenmaa, J., Thanisch, P.: Normalising OLAP cubes for controlling sparsity. Data and Knowledge Engineering 46(1), 317–343 (2003)
29. Niemi, T., Toivonen, S., Niinimäki, M., Nummenmaa, J.: Ontologies with Semantic Web/grid in data integration for OLAP. International Journal on Semantic Web and Information Systems, Special Issue on Semantic Web and Data Warehousing 3(4) (2007)
30. Niinimäki, M.: Grid resources, services and data – towards a semantic grid system. Technical report, University of Tampere, Department of Computer Science (2006)
31. Niinimäki, M., Niemi, T.: Processing Semantic Web queries in Grid. Intl. Transactions on Systems Science and Application 3(4) (2008)
32. Perez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. In: Cruz, I., et al. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 30–43. Springer, Heidelberg (2006)
33. Priebe, T., Pernul, G.: Ontology-based integration of OLAP and information retrieval. In: Mařík, V., Štěpánková, O., Retschitzegger, W. (eds.) DEXA 2003. LNCS, vol. 2736. Springer, Heidelberg (2003)
34. Romero, O., Abelló, A.: Automating multidimensional design from ontologies. In: DOLAP 2007: Proceedings of the ACM Tenth International Workshop on Data Warehousing and OLAP, pp. 1–8. ACM, New York (2007)
35. Sagiv, Y.: Can we use the universal instance assumption without using nulls? In: SIGMOD 1981: Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data, pp. 108–120. ACM, New York (1981)
36. Skoutas, D., Simitsis, A.: Designing ETL processes using Semantic Web technologies. In: DOLAP 2006: Proceedings of the 9th ACM International Workshop on Data Warehousing and OLAP, pp. 67–74. ACM Press, New York (2006)
37. Skoutas, D., Simitsis, A.: Ontology-based conceptual design of ETL processes for both structured and semi-structured data. International Journal on Semantic Web and Information Systems, Special Issue on Semantic Web and Data Warehousing 3(4) (2007)
38. Staab, S. (ed.): Handbook on Ontologies. Springer, Heidelberg (2004)
39. The World Wide Web Consortium: XSL Transformations (XSLT), Version 1.0, W3C Recommendation (November 16, 1999), http://www.w3.org/TR/xslt
40. Vrdoljak, B., Banek, M., Rizzi, S.: Designing web warehouses from XML schemas. In: Kambayashi, Y., Mohania, M., Wöß, W. (eds.) DaWaK 2003. LNCS, vol. 2737, pp. 89–98. Springer, Heidelberg (2003)
Appendix

The olapcore ontology.
A part of the Olaptrade ontology.
Example of executing the ontology map based conversion and its results:
java ProcessOntoMap ontomap-TradeDimensionSet.xml
A part of Mondrian schema for our example OLAP cube
Ontology-Driven Conceptual Design of ETL Processes Using Graph Transformations

Dimitrios Skoutas 1,2, Alkis Simitsis 3, and Timos Sellis 2

1 National Technical University of Athens, Athens, Greece
[email protected]
2 Institute for the Management of Information Systems, R.C. "Athena", Athens, Greece
[email protected]
3 HP Labs and Stanford University, Palo Alto, California, USA
[email protected]
Abstract. One of the main tasks during the early steps of a data warehouse project is the identification of the appropriate transformations and the specification of inter-schema mappings from the source to the target data stores. This is a challenging task, requiring firstly the semantic and secondly the structural reconciliation of the information provided by the available sources. This task is a part of the Extract-Transform-Load (ETL) process, which is responsible for the population of the data warehouse. In this paper, we propose a customizable and extensible ontology-driven approach for the conceptual design of ETL processes. A graph-based representation is used as a conceptual model for the source and target data stores. We then present a method for devising flows of ETL operations by means of graph transformations. In particular, the operations comprising the ETL process are derived through graph transformation rules, the choice and applicability of which are determined by the semantics of the data with respect to an attached domain ontology. Finally, we present our experimental findings that demonstrate the applicability of our approach.
1 Introduction
Successful planning and decision making in large enterprises requires the ability of efficiently processing and analyzing the organization’s informational assets, such as data regarding products, sales, customers, and so on. Such data are typically distributed in several heterogeneous sources, ranging from legacy systems and spreadsheets to relational databases, XML documents and Web pages, and are stored under different structures and formats. For this purpose, as well as for performance issues, data warehouses are employed to integrate the operational data and provide an appropriate infrastructure for querying, reporting, mining, and for other advanced analysis techniques. On the other hand, the explosion of the information available in Web repositories, further accelerated by the new trends and technologies referred to as Web 2.0 and combined with the S. Spaccapietra et al. (Eds.): Journal on Data Semantics XIII, LNCS 5530, pp. 120–146, 2009. c Springer-Verlag Berlin Heidelberg 2009
ever-increasing information needs, necessitates that modern applications often draw from multiple, heterogeneous data sources to provide added-value services to the end users. Such environments raise new challenges for the problem of data integration, since naming conventions or custom-defined metadata, which may be sufficient for integration within a single organization, are of little use when integrating inter-organization information sources or Web data sources. The key challenge in all such situations is how to reconcile, both semantically and structurally, the data between the source and the target specifications.

Traditionally, the integration of the operational data into the central data warehouse is performed by specialized processes, known as Extract-Transform-Load (ETL) processes. The ETL processes are responsible for the extraction of data from distributed and heterogeneous operational data sources, their appropriate cleansing and transformation, and finally, their loading into the target data warehouse. In general, the ETL processes constitute a costly – both in time and resources – and complex part of the data warehouse design.

As a motivating example, consider the following real-world case adapted from a project of the Greek public sector. The goal of that project was the modernization of the personnel management system and its transition to a modern data warehouse environment. The operational data were stored in a combination of 1152 tables and flat files. The project resulted in a set of ETL processes consisting in total of more than 500 scripts in a procedural language, where each script contained more than one transformation performing a single operation. The whole process of (a) identifying the relevant and useful source tables out of the 1152 tables (a flat file can be viewed as an external table), (b) determining the inter-attribute mappings and the appropriate transformations needed, and (c) creating the ETL workflow (at a conceptual level) took approximately 7.5 man-months (3 designers × 2.5 months). The basic setback in the whole process was the vastness of the schema and the lack of supporting documents and system descriptions for the original implementation.

Urged by this scenario and the problems that occurred during that project, we envision a novel approach that would facilitate the early stages of a data warehouse project. In a previous work, we proposed an easy-to-use, yet powerful, visual language to represent this task [1]. However, in that work, as in other similar works [2,3,1] towards the conceptual design of the backstage of a data warehouse architecture (see Section 6), the design was performed manually by the designer. The same holds for the plethora of commercial solutions currently on the market, such as IBM's Data Warehouse Manager [4], Informatica's PowerCenter [5], Microsoft's Data Transformation Services [6], and Oracle's Warehouse Builder [7]. All these approaches, at the conceptual level, focus on the graphical design and representation of the ETL process, whereas the identification of the required mappings and transformations needs to be done manually.

The lack of precise metadata hinders the automation of this task. The required information regarding the semantics of the data sources, as well as the constraints and requirements of the data warehouse application, tends to be
missing. Usually, such information is incomplete or even inconsistent, often being hard-coded within the schemata of the sources or provided in natural language format (e.g., after oral communication with the involved parties, including both business managers and administrators/designers of the enterprise data warehouse) [8]. Consequently, the first stage of designing an ETL process involves gathering the available knowledge and requirements regarding the involved data stores. Given that ETL processes are often quite complex, and that significant operational problems can occur with improperly designed ETL systems, following a formal approach at this stage can allow a high degree of automation of the ETL design. Such automation can reduce the effort required for the specification of the ETL process, as well as the errors introduced by the manual process. Thus, in the context of a data warehouse application, and in particular of the ETL process design phase, an ontology, which constitutes a formal and explicit specification of a shared conceptualization [9], can play a key role in establishing a common conceptual agreement and in guiding the extraction and transformation of the data from the sources to the target.

We build on top of this idea and, more specifically, we envision a method for the task of ETL design that comprises two main phases. First, we consider an ontology that captures the knowledge and the requirements regarding the domain at hand and is used to semantically annotate the data stores. The ontology may already exist, since in many real-world applications the domain of the ETL environment is the same, e.g., enterprise or medical data. In such a case, the ontology can be re-used or adapted appropriately. (A similar discussion on the applicability of this claim can be found in the experimental section.) If such an ontology does not exist, then during the first phase of the design a new ontology should be created. Clearly, the details of this phase largely depend on the particular needs and characteristics of each project. For example, there may exist different ways and sources to gather requirements, different methods to create an ontology, annotations may be specified manually or semi-automatically, and so on.

In this work, we focus on the second phase of the design. Having the ontology available, we investigate how the ontology and the annotations can be used to drive, in a semi-automatic manner, the specification of the ETL process. A first attempt towards this direction has been recently presented [10]. In this paper, we build upon the idea of using an ontology for the conceptual design of ETL processes and, more specifically, we elaborate on it by proposing a formal way for deriving a conceptual ETL design, based on the well-established graph transformation theory. We exploit the graph-based nature of the data store schemata and of the ETL processes to provide an appropriate formulation of the problem, and we present a customizable and extensible set of graph transformation rules that drive the construction of the ETL process.

Notice that the burden of using an ontology is reduced mainly to annotating the source and target schemata with it. Several approaches toward facilitating automatic schema matching have already been proposed [11,12]. Nevertheless, we argue that even if the designer has to do the whole task manually, it will still be easier to map individual attributes (one at a time) to a
domain ontology rather than to try to fill in the puzzle with all the pieces around at the same time. Additionally, the existence of an ontology that carries the mapping of the source and target tables can be used in other applications as well. We mention two prominent examples: (a) such an ontology can be used to produce reports in natural language [13,14]; and (b) it can serve as a convenient means for warehousing Web data, as an individual may easily plug his/her data source into the ontology, after which the ETL can be automated using our approach.

Contributions. More specifically, the main contributions of our paper are as follows.

• We present a framework for the conceptual design of ETL scenaria, based on the use of an ontology and semantic annotations of the data stores.
• We develop a customizable and extensible set of graph transformation rules that determine the choice and the order of operations comprising the ETL scenario, in conjunction with the semantic information conveyed by the associated ontology.
• We evaluate our approach using a set of ETL scenaria, artificially created based on the TPC-H schema. Our findings show that the proposed approach can be used with success even for large ETL scenaria.

Outline. The rest of the paper is structured as follows. Section 2 presents the general framework of our approach for ontology-based design of ETL scenaria. Section 3 presents the use of the ontology as a common conceptual model to drive the selection and composition of ETL operations, based on a set of appropriately defined graph transformation rules. Section 4 presents an application example that resembles representative real-world settings. Section 5 demonstrates the applicability of our approach through an experimental study. Finally, Section 6 discusses related work, and Section 7 concludes the paper.
2 General Framework
In this section, we present the general framework of our approach towards the ontology-based design of ETL processes. First, we describe the representation model used for the source and target data stores, as well as for the domain ontology and the ETL process. Then, we state the problem of deriving the design of an ETL process at the conceptual level, via a series of graph transformations, based on the semantic knowledge conveyed by the domain ontology attached to the source and target schemata. In particular, our approach is based on appropriate manipulation of a graph that contains all the involved information, namely the data store schemata, the domain ontology, the semantic annotations, and the ETL operations. These modules are described in the following.

Data store subgraph. Traditional ETL design tools employ a relational model as an interface to the data repositories. The relational model has widespread adoption and an RDBMS constitutes the typical solution for storing an organization's
operational data. Nevertheless, the increasingly important role of the Web in e-commerce, and in business transactions in general, has led to semi-structured data playing a progressively more important role in this context. The adoption of XML as a standard for allowing interoperability strongly suggests that data crossing the borders of the organization is structured in XML format. For instance, Web services, which enable enterprises to cooperate by forming dynamic coalitions, often referred to as Virtual Organizations, are described by documents in XML format, and they exchange information in XML format, too. These facts significantly increase the amount of heterogeneity among the data sources, and hence, the complexity of the ETL design task. To abstract from a particular data model, we employ a generic, graph-based representation that can effectively capture both structured and semi-structured data. In particular, we model a data store as a directed graph, i.e., G = (V, E), where V is a set of nodes and E ⊆ V × V is a set of edges (i.e., ordered pairs of nodes). Graph nodes represent schema elements, whereas graph edges represent containment or reference relationships between those elements. Note that the same model is used for both source and target data stores. Given that the ETL process may involve multiple source data stores, nodes belonging to different sources are distinguished by using different prefixes in their identifiers.

Ontology subgraph. Our approach is based on the use of an ontology to formally and explicitly specify the semantics of the data contained in the involved data stores. Leveraging the advances in Semantic Web technology, we can use RDF Schema [15,16] or OWL [17] as the language for the domain ontology. Hence, the knowledge for the domain associated with the application under consideration can be represented by a set of classes and properties, structured in an appropriate hierarchy. These classes and properties correspond to the concepts of the domain, and the relationships and attributes of these concepts. In addition, for the purpose of ETL design, it is commonly required to express some specific types of relationships, such as different representation formats (e.g., different currencies or different date formats) or different levels of granularity when structuring the information (e.g., representing a particular piece of information either as a single attribute or as a set of attributes). Therefore, apart from the provided isa relationship that can be specified among classes (i.e., rdfs:subClassOf), we assume in addition a set of pre-defined properties, comprising the properties typeOf and partOf. This set of pre-defined properties can be further extended to accommodate application-specific or domain-specific needs. In the employed representation, classes are represented by nodes, whereas properties are represented by edges.

Data store annotations. Using the ontology to semantically annotate the data stores is achieved by establishing edges directed from nodes of the data store subgraph towards corresponding nodes of the ontology subgraph.

ETL process subgraph. An ETL process comprises a series of operations that are applied to the source data and transform it appropriately, so that it meets the target specifications. Given the previously described graph-based representation of the source and target data stores, we represent the specification of the ETL
process as a set of paths directed from source data store nodes towards target data store nodes. The nodes along these paths denote ETL operations; there are also intermediate nodes, as we discuss in Section 3.2. The edges connecting the nodes indicate the data flow.

In general, it is not straightforward to come up with a closed set of well-defined primitive ETL operations. Normally, such an effort would result in the set of relational operators extended by a generic function operator. However, this would not be too useful in real-world applications, which usually comprise a large variety of built-in or user-defined functions. Hence, it is essential to provide a generic and extensible solution that covers the frequent cases and that can be enriched by additional transformations when needed. Building upon previous work [18], we consider the following set of operations: Load, Filter, Convert, Extract, Split, Construct, and Merge. These correspond to common operations frequently encountered in ETL processes. A detailed discussion of these operations, as well as their applicability in a given context, is presented in Section 3.3.

Problem statement. We consider the problem of ontology-based conceptual design of ETL processes as follows: starting from an initial graph comprising the source and target data store subgraphs, the ontology subgraph, and the semantic annotations, produce a final graph that also contains the ETL process subgraph. In this paper, we tackle this problem based on graph transformations. The solution is essentially based on the definition of a set of transformation rules that, given the initial graph, build the ETL subgraph in a step-by-step manner. We elaborate on these issues in the next section.
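To fix notation for the following sections, a minimal sketch of the graph model and of the operation set, written as Java type definitions of our own (the names are illustrative, not part of the paper's formalism), could look like this:

import java.util.Set;

public class EtlGraphModel {

    enum NodeKind { SOURCE, TARGET, ONTOLOGY, INTERMEDIATE, OPERATION }

    // The seven ETL operations considered in the paper.
    enum EtlOperation { LOAD, FILTER, CONVERT, EXTRACT, SPLIT, CONSTRUCT, MERGE }

    record Node(String id, NodeKind kind) {}
    record Edge(Node from, Node to, String label) {}  // e.g. "isa", an annotation, or a data flow

    // G = (V, E): the initial graph holds the data store subgraphs, the ontology
    // subgraph and the annotation edges; the final graph additionally contains
    // OPERATION nodes forming the ETL process subgraph.
    record Graph(Set<Node> nodes, Set<Edge> edges) {}
}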
3 ETL Design by Graph Transformations
We address the design of an ETL scenario as a semi-automatic task that proceeds interactively, driven on the one hand by formal metadata and on the other hand by appropriate guidance from the human designer, who verifies and completes the former process. To this end, we present in this section an approach drawing on the theory of graph transformation, which provides a rigorous formalism combined with the ability to visually represent and control the specification of the ontology, the source and target graphs, and the derivation of the ETL process.

3.1 Preliminaries
Graph transformations were first introduced as a means to address the limitations in the expressiveness of classical approaches to rewriting, especially dealing with non-linear structures [19], and they are widely used in software engineering. The basic idea is to generate a new graph, H, starting from an initial given graph, G, by means of applying a set of transformation rules. The graphs G and H, which are also called instance graphs, may be typed over a type graph TG. A type graph specifies the types of nodes and edges, and how they are connected.
Then, the structure of the instance graphs should conform to the type graph in order for them to be valid. That is, the relationship between an instance graph and a corresponding type graph is similar to that between an XML document and its associated XML Schema. Additionally, the graphs may be attributed, i.e., graph nodes and edges may have attributes. An attribute has a name and a type, specifying the values that can be assigned to it. Graph objects of the same type share their attribute declarations.

Transformations of the original graph to a new graph are specified by transformation rules. A graph transformation rule, denoted by p : L → R, consists of a name p and two instance graphs L and R, which are also typed over TG and represent, respectively, the pre-conditions and the post-conditions of the rule. This means that (a) the rule is triggered whenever a structure matching L is found, and (b) the execution of the rule results in replacing the occurrence of the left-hand side (LHS) of the rule, L, with the right-hand side (RHS), R. Therefore, a graph transformation from a given graph G to a new graph H via rule p at occurrence o is denoted by G =p(o)=> H, and it is performed in three steps:

i. Find an occurrence o of the left-hand side L in the given graph G.
ii. Delete from G all the nodes and edges matched by L \ R (making sure that the remaining structure is a graph, i.e., no edges are left dangling).
iii. Glue to the remaining part a copy of R \ L.

Apart from pre-conditions, i.e., patterns whose occurrence triggers the execution of the rule, a rule may also have negative application conditions (NACs), i.e., patterns whose occurrence prevents its execution. A graph transformation sequence consists of zero or more graph transformations. Notice that two kinds of non-determinism may occur. First, several rules may be applicable. Second, given a certain rule, several matches may be possible. This issue can be addressed with different techniques, such as organizing rules in layers, setting rule priorities, and/or assuming human intervention in choosing the rule to apply or the match to consider.

3.2 The Type Graph
In the following, we describe an approach for designing an ETL process through graph transformations based on the constructed ontology. One of the main advantages of this approach is that it allows us to visualize the involved schemata, the domain knowledge, and the ETL operations, and to proceed with the design task in either an automated or an interactive manner. As discussed in Section 2, the design of the ETL process is built in a step-by-step manner through a series of graph transformations. Essential to this is the role of the ontology, which determines the context (i.e., the semantics) at each transformation step, thus determining which ETL operations are applicable and in what order. The selected ETL operations are represented as additional nodes and edges forming paths (flows) that lead from the nodes of the source subgraph to the nodes of the target subgraph. The process of addressing this problem by means of graph transformations is outlined in the following. We consider as starting point a graph comprising three
subgraphs, namely the source, the target, and the ontology subgraphs. The main goal is then to define an appropriate set of rules, determining where, when, and how a flow of operations from a source to a target node can be created. Essentially, each rule is responsible for inserting an operator in the ETL flow. (Additionally, as we discuss at a later point, some rules aim at replacing or removing operators from the flow.) The finally obtained graph is a supergraph of the initial graph, depicting the choice and order of the aforementioned required operations.

In the generated graph, ETL operations are represented by nodes, with incoming and outgoing edges corresponding, respectively, to the inputs and outputs of the operation. These form flows between source nodes and target nodes. Since populating a target element with data from a source element often requires more than one transformation to be performed on the data, in the general case these flows will have length higher than 1. To allow for such functionality, we use the notion of intermediate nodes. These refer to intermediate results produced by an ETL operation and consumed by a following one. Consequently, the incoming edges of a node representing an ETL operation may originate either from source nodes or from intermediate nodes, while outgoing edges may be directed either to target nodes or to intermediate nodes. To formally capture such relationships, we introduce the type graph illustrated in Figure 1 and explained in detail below.

Fig. 1. (a) The type graph and (b) a sample instance graph

The type graph specifies the types of nodes and edges that the instance graphs (i.e., those constructed to model data store schemata, annotations, and ETL flows) may contain, as well as how they are structured. The type graph is depicted in Figure 1(a) and distinguishes the following types of nodes and edges:

– Ontology nodes (OntNode): they represent concepts of the considered application domain. An ontology node may connect to other ontology nodes by means of isa, partOf, typeOf or connects edges. The connects edges correspond to generic relationships between concepts of the domain, and they are represented in Figure 1(a) by continuous, unlabeled arrows; the isa, partOf, and typeOf edges are represented by continuous arrows with a corresponding label to distinguish the type of the relationship. Each ontology node has an associated URI that uniquely identifies it.
– Source nodes (SrcNode): they correspond to elements of the source data store schemata (e.g., tables or attributes in the case of relational schemata, or XML tree nodes in the case of XML documents). Each source node has a unique ID (i.e., a URI), prefixed accordingly to indicate the data store it belongs to. Source nodes may relate to each other by connects edges (corresponding, for example, to foreign keys in the case of relational sources or to containment relationships in the case of XML). Source nodes are annotated by ontology nodes, as shown by the dotted edge in Figure 1(a), to make explicit the semantics of the enclosed data.
– Target nodes (TrgNode): they are similar to source nodes, except that they refer to elements of the target data stores instead.
– Intermediate nodes (IntmNode): they are nodes containing temporary data that are generated during ETL operations. They are also annotated by ontology nodes. This is necessary for continuing the flow of operations once an intermediate node has been created. Note, however, the difference: source and target nodes are annotated manually (or perhaps semi-automatically), and these annotations need to be in place a priori, i.e., at the beginning of the ETL design process. In fact, these annotations constitute the main driving force for deriving the ETL scenario. On the contrary, the annotations of the intermediate nodes are produced automatically, when the intermediate node is created, and are a function of the type of ETL operation that created this node, as well as of the (annotation of the) input used for that operation.
– Operation nodes (Operation): they represent ETL operations. The attribute type identifies the type of the operation (e.g., filter or convert). The inputs and outputs of an operation are denoted by dashed edges in Figure 1(a). In particular, the input of an operation is either a source node or an intermediate node, whereas the output of an operation is either an intermediate node or a target node. Each ETL operation must have at least one incoming and one outgoing edge.

Example. A sample instance of the considered type graph is illustrated in Figure 1(b). It depicts a typical scenario where an ETL operation converts the values of a source element containing salaries expressed in U.S. Dollars to populate a target element with the corresponding values in Euros.
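To make the type graph more concrete, the following minimal Python sketch (our own illustration, not part of the paper's formalization; class and edge-label names are chosen freely) encodes the instance graph of Figure 1(b): a source element s:Salary annotated with ex:USD, a CONVERT operation, and a target element t:Salary annotated with ex:EUR, where both ex:USD and ex:EUR are typeOf ex:Salary.

```python
from dataclasses import dataclass, field

# Node kinds of the type graph (Figure 1(a)).
ONT, SRC, TRG, INTM, OP = "OntNode", "SrcNode", "TrgNode", "IntmNode", "Operation"

@dataclass(frozen=True)
class Node:
    kind: str    # one of ONT, SRC, TRG, INTM, OP
    label: str   # URI for ontology nodes, ID for schema nodes, type for operations

@dataclass
class Graph:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)   # (source, target, edge_type) triples

    def add_edge(self, a, b, etype):
        self.nodes.update({a, b})
        self.edges.add((a, b, etype))

# Instance graph of Figure 1(b): convert salaries from USD to EUR.
g = Graph()
salary = Node(ONT, "ex:Salary")
usd, eur = Node(ONT, "ex:USD"), Node(ONT, "ex:EUR")
src, trg = Node(SRC, "s:Salary"), Node(TRG, "t:Salary")
convert = Node(OP, "CONVERT")

g.add_edge(usd, salary, "typeOf")       # ex:USD is a type of ex:Salary
g.add_edge(eur, salary, "typeOf")       # ex:EUR is a type of ex:Salary
g.add_edge(src, usd, "annotatedBy")     # dotted annotation edges of Figure 1
g.add_edge(trg, eur, "annotatedBy")
g.add_edge(src, convert, "input")       # dashed edges: operation input/output
g.add_edge(convert, trg, "output")

print(len(g.nodes), "nodes,", len(g.edges), "edges")
```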
3.3 The Transformation Rules
Having introduced the type graph in the previous section, we can now create instances of this graph to represent specific instances of the ETL design problem, i.e., to model a given source graph, a given target graph, and their annotations with respect to an associated domain ontology. The initial graph does not contain any Operation nodes. Instead, the goal of the transformation process is exactly to add such nodes in a step-by-step manner, by applying a set of corresponding transformation rules. Recall from Section 3.1 that each such rule comprises two basic parts: a) the left-hand side (LHS), specifying the pattern that triggers the execution of the rule, and b) the right-hand side (RHS), specifying how the LHS
is transformed by the application of the rule. Optionally, a rule may have a third part, specifying one or more negative application conditions (NACs). These are patterns preventing the triggering of the rule. A common usage of NACs is as stop conditions, i.e., to prevent the same rule from firing multiple times for the same instance. This occurs when the RHS of the rule also contains the LHS. In the following, we introduce a set of rules used to construct ETL flows based on the operations (and their conditions) described in Section 2, and describe each rule in detail. Essentially, these rules are divided into groups, each one responsible for the addition of a certain type of ETL operation. We consider two kinds of rules, referring, respectively, to simple and composite ETL operations.

Rules for Simple Operations. This set of rules handles the LOAD, FILTER, CONVERT, EXTRACT, and CONSTRUCT operations.

LOAD. This is the simplest operation: it simply loads data records from a source to a target element. For such a direct data flow to be valid, one of the following conditions must apply: either a) the source element must correspond to a concept that is the same as that of the target element, or b) the source element must correspond to a concept that is subsumed (i.e., has an isa link) by that of the target element. In the former case, the rule pattern searches for a pair of source and target nodes that point to the same OntNode, as shown in Figure 2. If a match is found, the rule is triggered and a LOAD operation is inserted.

Fig. 2. Rules for inserting LOAD operations in the presence of a direct relationship
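As a rough illustration of how such a rule could be interpreted outside a graph transformation engine, the sketch below (a simplification of ours, not the AGG encoding; the elements s:Emp, t:Emp, and ex:Employee are hypothetical) scans an instance graph for source/target pairs annotated by the same ontology node and inserts a LOAD operation, using the NAC to avoid inserting the same operation twice.

```python
# Each node is a dict keyed by id; edges are (from, to, label) triples.
nodes = {
    "ex:Employee": {"kind": "OntNode"},
    "s:Emp":       {"kind": "SrcNode"},
    "t:Emp":       {"kind": "TrgNode"},
}
edges = [("s:Emp", "ex:Employee", "annotatedBy"),
         ("t:Emp", "ex:Employee", "annotatedBy")]

def annotation(n):
    return next((b for a, b, l in edges if a == n and l == "annotatedBy"), None)

def load_exists(src, trg):
    # NAC: a LOAD between the same pair must not already be present.
    ops_from_src = {b for a, b, l in edges if a == src and l == "input"}
    return any((op, trg, "output") in set(edges) for op in ops_from_src
               if nodes[op].get("type") == "LOAD")

def apply_load_rule():
    """One pass of the LOAD rule for the 'same concept' case."""
    fired = False
    sources = [n for n, d in nodes.items() if d["kind"] == "SrcNode"]
    targets = [n for n, d in nodes.items() if d["kind"] == "TrgNode"]
    for s in sources:
        for t in targets:
            if annotation(s) == annotation(t) and not load_exists(s, t):
                op = f"load_{s}_{t}"                      # fresh Operation node
                nodes[op] = {"kind": "Operation", "type": "LOAD"}
                edges.append((s, op, "input"))
                edges.append((op, t, "output"))
                fired = True
    return fired

apply_load_rule()
print([n for n, d in nodes.items() if d["kind"] == "Operation"])
```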
In the latter case, the pattern searches for a SrcNode that is annotated by an OntNode which has an isa relationship to another OntNode annotating a TrgNode (Figure 3). Again, the transformation performed by the rule is to insert an Operation node of type LOAD, connecting the source and target nodes. Additionally, in this second case, it is also useful to have data flow to (or from) an intermediate node, which will then be further transformed to meet the target node specifications (or, respectively, which has resulted from previous transformations). Thus, for this latter case we have four individual rules, corresponding to the pairs source-to-target, source-to-intermediate, intermediate-to-intermediate, and intermediate-to-target (rules i-iv in Figure 3).

Fig. 3. Rules for inserting LOAD operations via isa link (i. source-to-target; ii. source-to-intermediate; iii. intermediate-to-intermediate; iv. intermediate-to-target)

Finally, NACs are used accordingly, to prevent the same rule from firing repeatedly for the same pattern, as mentioned previously. Hence, NACs replicating the RHS of the corresponding rule are inserted.
An exception to this can be observed in rule iii of Figure 3, which handles the intermediate-to-intermediate case. Here, in addition to the NAC replicating the RHS of the rule, two other NACs are used to ensure that a LOAD operation will not be inserted to an intermediate node if this node was produced as a result of a previous FILTER operation from another intermediate or source node (this will become clearer in the description of the FILTER operation below).

FILTER. This operation applies a filtering condition, such as an arithmetic comparison or a regular expression on strings, on the data records flowing from the source to the target data store. The LHS of this rule searches for a target node pointing to a concept that is a subconcept (i.e., more restricted) of a concept corresponding to a source node. Whenever a match is found, the rule is triggered and it inserts a FILTER operation between the source and target nodes. Analogously to the previous case, three other "versions" of this rule are also considered, dealing with the cases of intermediate nodes. The rules for the cases source-to-target and intermediate-to-intermediate are illustrated in Figure 4.

Fig. 4. Rules for inserting FILTER operations (i. source-to-target; ii. intermediate-to-intermediate)

Notice the additional NACs used again in the latter case. The necessity of these NACs (and of those used previously in the corresponding rule for LOAD operations) becomes evident if we consider the following situation. Assume two ontology concepts C and D related via an isa link, isa(C,D), and an intermediate node V pointing at (i.e., annotated by) C. Then, rule iii of Figure 3 will fire, inserting a LOAD operation leading to a new intermediate node U. Subsequently, in the absence of the aforementioned NACs, rule ii of Figure 4 would fire, inserting a FILTER operation leading back to node V.
CONVERT. This operation conceptually represents the application of arbitrary functions used to transform data records, such as arithmetic operations or operations for string manipulation. It can be thought of as transforming the data between different representation formats. In the ontology, this knowledge is captured by means of concepts related to a common concept via typeOf links. Thus, the LHS for this rule is as shown in Figure 5, while the RHS inserts, as expected, a CONVERT operation between the matched nodes. Due to space considerations, only the transition between intermediate nodes is shown; the derivation of the corresponding rules involving source or target nodes is straightforward. Notice the additional NACs used here. Their purpose is to prevent loops converting repeatedly among the same types. For instance, consider the case of three concepts C1, C2, and C3, which are all "type of" C. In the absence of these NACs, this would lead to a series of conversions starting, e.g., from C1, going to C2, then to C3, and then back to either C1 or C2, and so on. Instead, this is prevented by the two NACs checking whether the considered intermediate node is itself the product of another CONVERT operation.

EXTRACT. This operation corresponds to the case of extracting a piece of information from a data record (e.g., a substring from a string). In this case, we search for a pair of source and target nodes, where the latter corresponds to an ontology concept that is related via a partOf link to that of the former. When a match is found, the RHS of the rule inserts an EXTRACT operation. Three similar rules are constructed again to handle intermediate nodes. Figure 6 depicts the rule for the case of transition between intermediate nodes. As described before for the LOAD and FILTER rules, appropriate NACs are introduced to prevent loops that may occur in combination with CONSTRUCT operations (see below).
Fig. 5. Rules for inserting CONVERT operations

Fig. 6. Rules for inserting EXTRACT operations
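The loop-prevention role of these NACs can be mimicked in a few lines of code; the sketch below (ours, with made-up node identifiers) blocks a CONVERT operation out of an intermediate node that was itself produced by a CONVERT, which is exactly the situation the two additional NACs of Figure 5 guard against.

```python
# Minimal flow state: operations record their type, input node, and output node.
operations = [
    {"type": "CONVERT", "input": "s:Date", "output": "intm1"},  # e.g., a first format conversion
]

def produced_by(node, op_type):
    """NAC helper: was `node` generated as the output of an operation of `op_type`?"""
    return any(op["output"] == node and op["type"] == op_type for op in operations)

def try_insert_convert(intm_node, new_output):
    # NAC: do not convert again out of a node that a CONVERT already produced,
    # otherwise the engine could oscillate among sibling typeOf concepts forever.
    if produced_by(intm_node, "CONVERT"):
        return False
    operations.append({"type": "CONVERT", "input": intm_node, "output": new_output})
    return True

print(try_insert_convert("intm1", "intm2"))    # False: blocked by the NAC
print(try_insert_convert("s:Price", "intm3"))  # True: not a product of a CONVERT
```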
CONSTRUCT. This operation corresponds to the case where a larger piece of information needs to be constructed given a data record (typically by filling in the missing part(s) with default values). This is represented by a pair of source and target nodes, where the corresponding source OntNode is partOf the corresponding target OntNode. When triggered, the rule inserts a CONSTRUCT operation. Rules for dealing with intermediate nodes operate similarly. In this case, care needs to be taken to avoid loops created by transitions back and forth between a pair of OntNodes linked with a partOf edge, i.e., interchanging EXTRACT and CONSTRUCT operations. The rule referring to a pair of intermediate nodes is depicted in Figure 7.

Fig. 7. Rules for inserting CONSTRUCT operations
Rules for Composite Operations. Our approach is generic and extensible; it is possible to combine simple operations in order to construct composite ones. We present two transformation rules dealing with such operations, namely the SPLIT and MERGE operations; this allows us to demonstrate the extensibility of the proposed framework.

SPLIT. This operation can be used in the place of multiple EXTRACT operations, when multiple pieces of information need to be extracted from a data record in order to populate different elements in the target data store. However, since the number of resulting elements is not fixed, it is not possible to construct a rule that directly inserts SPLIT operations in the ETL flow (unless some appropriate pre-processing on the domain ontology and the data store schemata is performed). Therefore, we insert such operations indirectly, by first applying temporary EXTRACT operations, and then replacing multiple EXTRACT operations originating from the same node with a SPLIT operation. Notice that having in these cases a single SPLIT operation instead of multiple related EXTRACT operations, apart from reflecting more closely the human perception of the intended transformation, also has the benefit of resulting in more compact ETL flows. Hence, the LHS of the rule for inserting SPLIT operations searches for two EXTRACT operations originating from the same source node and replaces them with a SPLIT operation. Observe, however, that if more than two EXTRACT operations exist, this rule only merges two of them; the others also need to be merged with the substituting SPLIT operation. For this purpose, an additional rule is required that merges an EXTRACT operation into a SPLIT operation. This rule is executed iteratively, until all EXTRACT operations have been "absorbed" by the SPLIT operation. However, since the execution order of rules is non-deterministic, if more than three EXTRACT operations originating from the same node exist, it is possible to end up with multiple SPLIT operations. Thus, a third rule that combines two SPLIT operations into a single one is employed. The aforementioned rules are presented in Figure 8. Similar rules are devised to apply this process for cases involving intermediate nodes.

Fig. 8. Rules for inserting SPLIT operations
MERGE. As mentioned earlier, in CONSTRUCT operations some external information needs to be provided to construct, from a given data item, the required data to populate a target element. In this case, the missing data is provided by other source elements. That is, two or more source elements complement each other in producing the data records for populating a given target element. As with the case of SPLIT mentioned above, since the number of cooperating source nodes is not fixed, this operation is also handled indirectly, in a similar manner. In particular, the rules for MERGE search for two CONSTRUCT operations, for a MERGE and a CONSTRUCT operation, or for two previously inserted MERGE operations, and incorporate them into a single MERGE operation. As previously, multiple CONSTRUCT operations are iteratively absorbed into a single MERGE operation by consecutive executions of the corresponding rules. The corresponding rules are shown in Figure 9.

Fig. 9. Rules for inserting MERGE operations
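The indirect construction of composite operations can be read as a fixpoint rewriting; the sketch below is our own rendering of the SPLIT rules (a SPLIT is represented here simply as a set of same-typed outgoing edges, and t:Initials is an invented extra target), not the actual rule encoding of Figure 8.

```python
# Operations as (type, source, target) triples; several EXTRACTs share a source.
ops = [("EXTRACT", "s:Name", "t:FirstName"),
       ("EXTRACT", "s:Name", "t:LastName"),
       ("EXTRACT", "s:Name", "t:Initials")]   # t:Initials: hypothetical extra target

def rewrite_split(ops):
    """Apply the SPLIT rules until no rule matches (cf. Figure 8)."""
    ops = list(ops)
    changed = True
    while changed:
        changed = False
        by_src = {}
        for op in ops:
            by_src.setdefault(op[1], []).append(op)
        for src, group in by_src.items():
            extracts = [o for o in group if o[0] == "EXTRACT"]
            splits = [o for o in group if o[0] == "SPLIT"]
            if len(extracts) >= 2:              # rule 1: two EXTRACTs become a SPLIT
                a, b = extracts[:2]
                ops.remove(a)
                ops.remove(b)
                ops += [("SPLIT", src, a[2]), ("SPLIT", src, b[2])]
                changed = True
            elif extracts and splits:           # rule 2: absorb an EXTRACT into the SPLIT
                a = extracts[0]
                ops.remove(a)
                ops.append(("SPLIT", src, a[2]))
                changed = True
        # Rule 3 (combining two SPLIT operations) is implicit in this encoding,
        # since a SPLIT is represented simply by the set of its outgoing edges.
    return ops

print(rewrite_split(ops))
```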
Additional Rules. As may have become clear from the description of the transformation rules presented above, when there is no one-step transformation between a source and a target node, the graph transformation engine essentially performs a (random) search, creating paths of operations that may lead from the source node to the target. (The randomness is due to the two kinds of non-determinism mentioned in Section 3.1.) It is likely that for most (or even all) of these paths, after a few transformation steps no more rules can be applied without a target node having been reached. To avoid overloading the resulting graph, and consequently the ETL designer, with these "false positives", we employ an additional rule that aims at "cleaning up" the final ETL flow. This rule, illustrated in Figure 10, essentially removes intermediate nodes, together with the operations producing them, that do not have a next step in the ETL flow (i.e., that fail to reach a target node).

Fig. 10. "Clean-up" rule
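A possible procedural reading of the clean-up rule (ours, not the AGG rule itself) is sketched below: intermediate nodes without a consumer are removed together with the operations that produced them, iterating because each removal may expose a new dead end upstream.

```python
def clean_up(nodes, edges):
    """Remove dead-end intermediate nodes and the operations producing them.

    nodes: dict id -> kind ('IntmNode', 'Operation', 'SrcNode', 'TrgNode', ...)
    edges: set of (from, to) pairs for operation inputs/outputs.
    """
    changed = True
    while changed:                      # a removal may expose a new dead end upstream
        changed = False
        for n, kind in list(nodes.items()):
            if kind != "IntmNode":
                continue
            has_consumer = any(a == n for a, b in edges)
            if has_consumer:
                continue
            producers = [a for a, b in edges if b == n]   # the operation(s) that output n
            for p in producers:
                edges -= {(a, b) for a, b in edges if a == p or b == p}
                nodes.pop(p, None)
            edges -= {(a, b) for a, b in edges if b == n}
            nodes.pop(n, None)
            changed = True
    return nodes, edges

# A stranded path: s:Amount -> op1 -> intm1, where intm1 never reaches a target node.
nodes = {"s:Amount": "SrcNode", "op1": "Operation", "intm1": "IntmNode", "t:Amount": "TrgNode"}
edges = {("s:Amount", "op1"), ("op1", "intm1")}
print(clean_up(nodes, edges))           # only the source and target nodes survive
```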
3.4 Creation of the ETL Design
In this section, we discuss ordering issues in the execution of the transformation rules. It is evident from the description of the functionality of the introduced rules that some of the rules should be considered before or after other ones have been applied. In particular, we consider the following requirements:
– Rules referring to one-step transformations, i.e., involving source and target nodes, should be considered before rules involving intermediate nodes.
– The "clean-up" rule should be applied only after the examination of any rules adding new operations has been completed.
– Rules regarding composite operations (e.g., SPLIT and MERGE) should be considered after all the rules for the corresponding simple operations (e.g., EXTRACT and CONSTRUCT) have been triggered.
Ensuring that this ordering is respected is both a necessary condition for the method to produce the desired results and a matter of improving performance. For instance, allowing clean-up operations to be performed before the rules inserting new operations have completed may result in an infinite loop, i.e., in repeatedly adding and removing the same operation(s). On the other hand, checking for the applicability of rules regarding SPLIT or MERGE operations before all EXTRACT or CONSTRUCT operations have been identified leads to redundant matching tests. Consequently, we organize the rules described above into four layers, as follows:
– The first layer comprises the rules inserting ETL operations that directly connect a source node to a target node.
– The second layer comprises the rules inserting operations from or to intermediate nodes.
– The third layer contains the clean-up rule.
– Finally, the last, fourth, layer comprises the rules for composite operations (i.e., SPLIT and MERGE).
These layers are executed in the above order, starting from the first layer. The execution of rules from a layer i starts only if no more rules from layer i − 1 can be applied. The whole process terminates when no rules from the last layer can be applied. Within the same layer, the order in which the rules are triggered is non-deterministic. Hence, given the presented set of rules, organized appropriately in the aforementioned layers, and the problem instance, comprising the source graph, the target graph, the ontology, and the annotations, the creation of the ETL design proceeds as follows:
– Step 1: Identify single operations that can connect a source node to a target node. This is accomplished by the graph transformation engine applying the rules of the first layer.
– Step 2: This step applies the rules of the second layer and comprises two tasks, which may be executed in an interleaved fashion:
• Starting from source nodes, introduce ETL operations that transform data leading to an intermediate node.
• Starting from the created intermediate nodes, continue introducing additional transformations, until either the target nodes are reached or no more rules can be applied.
– Step 3: Remove paths of ETL operations and intermediate nodes that have not reached a target node. This is performed by the rule in layer 3.
– Step 4: Search for groups of EXTRACT or CONSTRUCT operations that can be substituted by SPLIT or MERGE operations, respectively.

Correctness of the Produced Flow. Within a flow of ETL operations, the execution order of the operations is significant, as different orderings may produce semantically very different results. In [20], the issue of the correctness of the execution order of operations in an ETL workflow has been introduced, and formal rules have been presented that ensure such correctness. We work in the same spirit in the approach presented here. For instance, assume two pairs of operations. The first one involves a function that converts Euro values to Dollar values for a hypothetical attribute Cost, and a filter that allows only cost values over 100 Dollars; i.e., c: € → $ and f: $ > 100. In that case, it is necessary to have the function c, represented as a CONVERT operation in our approach, before the filter operation f. The second pair involves, say, the same function c: € → $ and another function that transforms dates from the European to the American format; i.e., c′: EDate → ADate. In that case, both orderings, {c, c′} and {c′, c}, are correct, since the two operations are applied to different attributes (see [20] for more details). Our method captures both cases, as the desired ordering is determined by the (relative) position in the ontology graph of the ontology nodes annotating the transformed data records.
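The layered execution can be summarized as a simple control loop; the following skeleton (illustrative only, with placeholder rule functions instead of real graph transformation rules) applies each layer to a fixpoint before moving to the next, mirroring the four steps above.

```python
def run_layers(graph, layers):
    """Apply each layer of rules to a fixpoint before moving to the next layer.

    `layers` is a list of lists of rule functions; a rule takes the graph and
    returns True if it changed it (i.e., its LHS matched and the RHS was applied).
    """
    for layer in layers:
        changed = True
        while changed:                       # stay in this layer while any rule still fires
            changed = any(rule(graph) for rule in layer)
    return graph

# Placeholder rules, standing in for the actual transformation rules of each layer.
def direct_rules(g):        return False    # layer 1: source-to-target operations
def intermediate_rules(g):  return False    # layer 2: operations via intermediate nodes
def clean_up_rule(g):       return False    # layer 3: remove dead-end flows
def composite_rules(g):     return False    # layer 4: SPLIT / MERGE rewriting

etl_graph = {}   # stands for the annotated source/target/ontology graph
run_layers(etl_graph, [[direct_rules], [intermediate_rules], [clean_up_rule], [composite_rules]])
```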
4 An Illustrative Example
In this section, we demonstrate the presented method by means of an example. The source and target schemata used for this example have been chosen appropriately from the TPC-H schema (http://www.tpc.org/tpch/) to resemble typical real-world scenaria. We keep the example concise, tailoring the source and target graphs so that a small number of schema elements suffices for demonstrating the main aspects of our framework. We assume two main entities, namely customers and orders, while the whole setting is represented in Table 1.
Table 1. Source and target schemata for the example

sources   s  customers { cid, name, country, city, street }
          s  orders    { oid, cid, date, amount, price }
targets   t  customers { cid, firstName, lastName, address }
          t  orders    { oid, cid, date, amount, price }
Fig. 11. Example

Fig. 12. Output of the graph transformation process: layer 1
A customer has a name, comprising his/her first and last name, and an address, which consists of his/her country, city, and street. An order is placed on a particular date, which can be recorded in either the "DD/MM/YY" or the "MM/DD/YY" format. It also refers to an amount of items. This amount can be categorized as "retail" or "wholesale", according to whether it exceeds a specific threshold. Finally, the price of the order can be recorded in either USD or EUR. We also assume the existence of special offers and discounts, and suppose that the currency for the former is EUR, while for the latter it is USD. This information is reflected in the sample ontology shown in Figure 11, where ontology concepts are represented by round rectangles. The figure also illustrates a source and a target schema (nodes prefixed with "s" and "t", respectively), with their elements being annotated by elements of the ontology (dotted lines). Notice, for example, the structural differences in representing the customer's name and address, as well as the different formats and currencies used in the two data stores for an order's date and price.
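For concreteness, the annotations of Figure 11 amount to a mapping from schema elements to ontology URIs plus the isa/partOf/typeOf edges of the ontology; the fragment below is an approximate, partial rendering in which the assignment of date formats, currencies, and amount categories to source versus target elements is chosen arbitrarily for illustration rather than read off the figure.

```python
# Partial, illustrative annotation of the example schemata with ontology URIs.
annotations = {
    # source elements
    "s:Customer": "ex:Customer", "s:Name": "ex:Name",
    "s:Country": "ex:Country", "s:City": "ex:City", "s:Street": "ex:Street",
    "s:Order": "ex:Order", "s:Date": "ex:DDMMYY",
    "s:Amount": "ex:Amount", "s:Price": "ex:EUR",
    # target elements
    "t:Customer": "ex:Customer", "t:FirstName": "ex:FirstName",
    "t:LastName": "ex:LastName", "t:Address": "ex:Address",
    "t:Order": "ex:Order", "t:Date": "ex:MMDDYY",
    "t:Amount": "ex:Retail", "t:Price": "ex:USD",
}

# Ontology edges consulted by the rules (isa / partOf / typeOf), again partial.
ontology = {
    ("ex:FirstName", "ex:Name"): "partOf",
    ("ex:LastName", "ex:Name"): "partOf",
    ("ex:Country", "ex:Address"): "partOf",
    ("ex:City", "ex:Address"): "partOf",
    ("ex:Street", "ex:Address"): "partOf",
    ("ex:DDMMYY", "ex:Date"): "typeOf",
    ("ex:MMDDYY", "ex:Date"): "typeOf",
    ("ex:USD", "ex:Price"): "typeOf",
    ("ex:EUR", "ex:Price"): "typeOf",
    ("ex:Retail", "ex:Amount"): "isa",
    ("ex:WholeSale", "ex:Amount"): "isa",
}
print(len(annotations), "annotated elements;", len(ontology), "ontology edges")
```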
Fig. 13. Output of the graph transformation process: layers 2, 3, and 4 ((a) layer 2; (b) layer 3; (c) layer 4)
This graph constitutes the starting point for the graph transformation process. The annotations make explicit the semantics of the data in the corresponding elements. They are obtained by the administrator either manually or semi-automatically (e.g., through automatic schema matching techniques [11,12]), through processes ranging from oral communication with the administrators of the corresponding data stores to the study of the elements' comments and/or accompanying documentation. Note that, due to the size of the involved schemata, such graphs can be quite large. This is not a disadvantage of our proposed approach, but an inherent difficulty of the ETL design task. Nevertheless, we can tackle this issue either by exploiting existing advanced techniques for the visualization of large graphs (e.g., [21]) or by using simple zoom-in/out techniques for exploring certain interesting parts of the graph [22].
Next, the ETL flow is computed by the graph transformation engine, starting from the above input graph and applying the rules presented in Section 3.3. To better illustrate the process, we separately display the result produced by the execution of each layer of rules. The result of the first layer is depicted in Figure 12. For brevity, we omit the ontology nodes. Recall that the first layer is responsible for one-step transformations. Hence, no data flow between the elements s:Price and t:Price has been determined, as no single operation is sufficient to meet the required target specification. Afterward, the rules involving intermediate nodes are executed. The corresponding output is shown in Figure 13(a). Notice the data flow that has now been created between the elements s:Price and t:Price, comprising one CONVERT and one FILTER operation. Along the way, some intermediate nodes not leading to a target node have been introduced. These are removed after the execution of layer 3 (Figure 13(b)). Finally, the EXTRACT and CONSTRUCT operations are incorporated into SPLIT and MERGE operations, respectively, during the execution of layer 4. The final result is presented in Figure 13(c).
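Once a flow has been produced, it can also be inspected programmatically; the sketch below (our own utility over a hand-coded fragment of the s:Price flow, assuming the CONVERT precedes the FILTER as in the Euro-to-Dollar discussion of Section 3.4) walks the graph from a source node and lists the operations on a path to a given target.

```python
# A hand-coded fragment of the kind of flow produced for s:Price:
# s:Price -> CONVERT -> intermediate -> FILTER -> t:Price
edges = {
    "s:Price": ["convert1"],
    "convert1": ["intm1"],
    "intm1": ["filter1"],
    "filter1": ["t:Price"],
}
op_type = {"convert1": "CONVERT", "filter1": "FILTER"}

def operation_chain(source, target):
    """Depth-first walk returning the operation types on a path from source to target."""
    stack = [(source, [])]
    while stack:
        node, ops = stack.pop()
        if node == target:
            return ops
        for nxt in edges.get(node, []):
            stack.append((nxt, ops + [op_type[nxt]] if nxt in op_type else ops))
    return None

print(operation_chain("s:Price", "t:Price"))   # ['CONVERT', 'FILTER']
```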
5 Evaluation
In this section, we study the applicability of our approach in different ETL scenaria. Our experimental method involves a set of artificially created ETL scenaria containing a varying number of source and target data stores, along with a varying number of respective attributes per data store, and a varying number of ETL operations of different types. The scenaria are built on top of the TPC-H schema. In our implementation, for the creation of the ETL design and the interpretation of the transformation rules for each scenario, we used the Attributed Graph Grammar (AGG) software system [23]. AGG follows an algebraic approach to graph transformation and is implemented in Java. Our goal is not to focus on the exact values of the experimental findings with respect to efficiency, since by using a different interpretation engine we may get different results; rather, we aim at a proof of concept that demonstrates the applicability of our proposal in real-world settings.

In our experiments, we have considered three categories of ETL scenaria containing a varying average number of ETL nodes (data stores and transformations): small scenaria (less than 50 nodes), medium scenaria (between 50 and 140 nodes), and large scenaria (more than 140 nodes). In addition, for each category we varied two measures: the number of source and target data stores and the number of transformations. (We have assumed a uniform distribution of the transformation types presented in Section 2.) An overview of our experimental setting is depicted in Table 2.
Table 2. Details of ETL scenaria used in the experiments (in average values)

          source nodes   target nodes   operation nodes
small     3-5            3-5            10-25
medium    10-20          10-20          25-90
large     20-30          20-30          100-200

Fig. 14. Execution times for the production of ETL designs
Our experimental findings with respect to the execution time are depicted in Figure 14. The x, y, and z axes represent the number of data store nodes (both source and target), the execution time in seconds, and the number of transformation nodes, respectively. Observe that as the number of ETL nodes increases, i.e., as the ETL scenario becomes more complex, the time needed for the production of ETL designs increases as well. However, the rate of increase differs depending on the type of ETL nodes whose number changes. More specifically, the increase of the production time is relatively small when we keep the number of transformations stable and modify only the number of data stores, and thus the size of the input graph. On the other hand, when we keep the number of data stores stable and modify only the number of transformations, the time needed for the creation of an ETL design grows considerably faster. (Again, we stress that the trend is what is important here, since the actual numbers may differ if we use a different interpretation engine.) One would expect that a larger input graph would simply require more processing time. However, the latter case, involving the addition of further transformations, proves much more complicated, especially due to the fact that several ETL operations cannot be produced before other ETL operations have been created. Recall the example discussed in Section 3.4, where the operation f: $ > 100 can be produced only after c: € → $ has been created.
Fig. 15. Variation of the ontology nodes used (number of ontology nodes for small, medium, and large scenaria)
An additional finding is that it does not make any significant difference whether the modified (increased/decreased) number of data stores involves the source or the target data stores. Also, observe that for the case containing 60 data store nodes and 212 ETL operations, the creation of the respective ETL design lasted less than 10 minutes, which we consider a very reasonable time for data warehouse settings. Clearly, our approach closely resembles an exhaustive method that tests all possible applications of rules during the construction of the ETL design. We consider the use of heuristic techniques for reducing the search space a very challenging and interesting direction for future work.

Finally, it is worthwhile to mention that the number of ontology nodes used is rather stable, independently of the type of ETL scenario considered, since all the scenaria that have been tested belong to the same application domain. (Recall that they are based on the TPC-H schema.) Figure 15 depicts the variation of the ontology nodes used (min, max, and avg values) for the different types of ETL scenaria. This is an additional benefit obtained by the use of an ontology: the reuse of the same ontology nodes significantly decreases the burden of the mapping; from a designer's point of view, it is more productive and less error-prone to handle a relatively stable subset of the ontology for the majority of ETL scenaria.

After the production of the ETL designs, we manually checked their correctness. That is, we checked whether any ETL operation violates the properties of the type graph presented in Section 3.2 and whether the placement of the ETL operations is correct with respect to the concepts presented in Section 3.4. In all cases, the ETL designs produced were correct, as expected due to the way the ontology is used, explained in Section 3.4.
6 Related Work
In this section, we discuss the related work on the different aspects of our approach, such as ETL processes, semantic web technology and data warehouses, mashup applications, service composition, publish/subscribe systems, and schema matching.
Even though the design and maintenance of ETL processes constitutes a key factor for the success of a DW project, there is a relatively small amount of research literature concerning techniques for facilitating the specification of such processes. The conceptual modeling of ETL scenaria has been studied in [1]. ETL processes are modeled as graphs composed of transformations, treating attributes as first-class modeling elements and capturing the data flow between the sources and the targets. In another effort, ETL processes are modeled by means of UML class diagrams [3]. A UML note can be attached to each ETL operation to indicate its functionality at a higher level of detail. The main advantage of this approach is its use of UML, which is a widespread, standard modeling language. In a subsequent work, the above approaches have converged, so as to provide a framework that combines both their advantages, namely the ability to model relationships between sources and targets at a sufficiently high level of granularity (i.e., at the attribute level), as well as a widely accepted modeling formalism (i.e., UML) [2]. However, even though these approaches provide a formalism for representing an ETL scenario, they do not deal with the issue of how to exploit the available domain knowledge in order to derive such a design (semi-)automatically, which is the focus of our work. Instead, this task is performed completely manually by the ETL designer. More recent work has focused on the optimization of ETL processes, providing algorithms to minimize the execution cost of an ETL workflow with respect to a provided cost model [20]. Still, it is assumed that an initial ETL workflow is given, and the problem of how to derive it is not addressed. Nevertheless, in this work we adopt the correctness analysis presented in [20] and build upon it. In another line of research, the problem of ETL evolution has been studied [24]. Although a novel graph model for representing ETL processes is presented in that work, it cannot be used in our framework, since (a) it is not suitable for incorporating our ontology-based design, and (b) it deals more with physical aspects of ETL, rather than with the conceptual entities we consider in this work.

The use of semantic web technology in data warehouses and related areas has already produced some first research results, such as the use of ontologies in data warehouse conceptual design [25] and in On-Line Analytical Processing [26]. In addition, a semi-automated method exploiting ontologies for the design of multidimensional data warehouses is presented in [27]. Our work is complementary to these efforts, as in our case the ontology is used specifically for the design of the ETL process. Ontologies have also been used for data cleaning purposes, e.g., in [28], and in web data extraction [29]. Another similar line of research concerns the application of Model-Driven Architecture (MDA) to the automatic generation of logical schemata from conceptual schemata for data warehouse development; e.g., [30]. However, to the best of our knowledge, the approaches presented in [10,18] were the first to explore the application of ontologies to the conceptual design of ETL processes. In these works, the schemata of the source and target data stores are annotated by complex expressions built from concepts of a corresponding
domain ontology. Then, a reasoner is used to infer the semantic relationships between these expressions, and the ETL design is derived based on the results of the reasoner. In this paper, we follow a different direction to the problem, based on the theory and tools for graph transformations. This allows for two main advantages. First, in the former approaches, customization and extensibility, although possible, are not very easy to accomplish, as the process for deriving the ETL transformations is tightly coupled to the ontology reasoner. Instead, in the current approach there is a clear separation between the graph transformation rules, which are responsible for creating the ETL design, and the graph transformation engine. Second, in the former approaches, the whole ETL flow between a given pair of source and target node sets is produced in a single run. On the contrary, the current approach provides, in addition to that, the ability to proceed with the ETL design in an interactive, step-by-step mode, e.g., the designer can select a set of source nodes to begin with, select a set of rules and execute them for a number of steps, observe and possibly modify the result, and continue the execution; this exploratory behavior is very often required, since the design of an ETL process is a semi-automatic task. Notice that even though we have adopted in this work the algebraic graph transformation approach supported by AGG [23], other related approaches for graph transformation, as well as the Query/View/Transformation (QVT) language, which is an OMG standard for model transformations [31], can be used. For a detailed comparison and description of the correspondences between such techniques we refer to [32]. In a similar sense, the graph edit operations that are widely used in several applications, such as Pattern Recognition [33,34,35], are not appropriate for the problem at hand. These operations typically include node/edge insertion, node/edge deletion, and node/edge relabeling. Consequently, they are generic, powerful, low-level operations that can transform any given source graph to any given target graph. Instead, for our purposes, (a) we need a set of operations suitable for the ETL case, and (b) we need to control the applicability of each operation; that is, the applicability of each operation should be dependent on (and driven by) the “context”, where context here refers to the semantic annotations of the nodes involved in these operations. Mashups constitute a new paradigm of data integration becoming popular in the emerging trend of Web 2.0. Their different characteristics and requirements compared to ETL processes are mainly the facts that the latter are typically offline procedures, designed and maintained by database experts, while the former are online processes, targeted largely for end users. However, both activities share a common goal: to extract data from heterogeneous sources, and to transform and combine them to provide added value services. Recently deployed mashups editors, such as Yahoo! Pipes [36], Microsoft Popfly [37], and the Google Mashup Editor [38], as well as recent research efforts [39], aim at providing an intuitive and friendly graphical user interface for combining and manipulating content from different Web sources, based mainly on the “dragging and dropping” and parametrization of pre-defined template operations. The process is procedural rather than declarative and does not support the use of metadata to facilitate
and automate the task. Hence, our proposed formalism and method can be beneficial in this direction. In fact, our approach is likely even more readily applicable in such context, in the sense that often the semantic annotation of the sources may already be in place. An approach for automatically composing data processing workflows is presented in [40]. Data and services are described using a common ontology to resolve the semantic heterogeneity. The workflow components are described as semantic web services, using relational descriptions for their inputs and outputs. Then a planner uses relational subsumption to connect the output of a service with the input of another. To bridge the differences between the inputs and outputs of services, the planner can introduce adaptor services, which may be either pre-defined, domain-independent relational operations (i.e., selection, projection, join, and union) or domain-dependent operations. In contrast to this work, which assumes a relational model and addresses the problem as a planning problem, focusing on the specificities of the planning algorithm, we follow a generic framework, based on graph transformations, and derive the ETL design by the application of appropriate graph transformation rules. The publish/subscribe system described in [41] considers publications and subscriptions represented by RDF graphs, which is close to our graph-based modeling of the source and target data stores. However, the problem in that case is one of graph matching, namely checking whether a given subscription, represented as a subgraph of the graph containing all subscriptions, matches a subgraph of a given publication graph. Instead, in our case the problem is one of transforming a subgraph of the source graph to a subgraph of the target graph. Our approach has some commonalities with approaches for semantic schema matching [12,42], which take as input two graph-like structures and produce a mapping between the nodes of these graphs that correspond semantically to each other. First, in a pre-processing phase, the labels at the graph nodes, which are initially written in natural language, are translated into propositional formulas to explicitly and formally codify the label’s intended meaning. Then, the matching problem is treated as a propositional unsatisfiability problem, which can be solved using existing SAT solvers. Due to this formalism, the following semantic relations between source and target concepts can be discovered: equivalence, more general, less general, and disjointness. Instead, our approach can handle a larger variety of correspondences, such as convert, extract or merge, which can be further extended to support application-specific needs.
7 Conclusions
In this paper, we have explored the use of an ontology in the conceptual design of ETL processes. We have exploited the graph-based nature of ETL processes and have considered their design as a series of conditional graph transformations. In doing so, we have proposed a formal means for deriving a conceptual ETL design by creating a customizable and extensible set of graph transformation rules, which drive the construction of the ETL process in conjunction with the semantic information conveyed by the associated ontology. Finally, we have evaluated our approach and demonstrated its applicability in real-world settings.
Our future plans include the optimization of the approach and, especially, of the transformation rule interpretation engine. Of great interest is the study of the problem under the prism of other related approaches for graph transformation, e.g., the QVT language.
References
1. Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Conceptual Modeling for ETL Processes. In: DOLAP, pp. 14–21 (2002)
2. Luján-Mora, S., Vassiliadis, P., Trujillo, J.: Data Mapping Diagrams for Data Warehouse Design with UML. In: ER, pp. 191–204 (2004)
3. Trujillo, J., Luján-Mora, S.: A UML Based Approach for Modeling ETL Processes in Data Warehouses. In: ER, pp. 307–320 (2003)
4. IBM: IBM Data Warehouse Manager (2006), http://www.ibm.com/software/data/db2/datawarehouse/
5. Informatica: Informatica PowerCenter (2007), http://www.informatica.com/products/powercenter/
6. Microsoft: Microsoft Data Transformation Services (2007), http://www.microsoft.com/sql/prodinfo/features/
7. Oracle: Oracle Warehouse Builder (2007), http://www.oracle.com/technology/products/warehouse/
8. Hüsemann, B., Lechtenbörger, J., Vossen, G.: Conceptual Data Warehouse Modeling. In: DMDW, p. 6 (2000)
9. Borst, W.N.: Construction of Engineering Ontologies for Knowledge Sharing and Reuse. PhD thesis, University of Enschede (1997)
10. Skoutas, D., Simitsis, A.: Ontology-Based Conceptual Design of ETL Processes for Both Structured and Semi-Structured Data. Int. J. Semantic Web Inf. Syst. 3(4), 1–24 (2007)
11. Rahm, E., Bernstein, P.A.: A Survey of Approaches to Automatic Schema Matching. VLDB J. 10(4), 334–350 (2001)
12. Shvaiko, P., Euzenat, J.: A Survey of Schema-Based Matching Approaches. In: Spaccapietra, S. (ed.) Journal on Data Semantics IV. LNCS, vol. 3730, pp. 146–171. Springer, Heidelberg (2005)
13. Simitsis, A., Skoutas, D., Castellanos, M.: Natural Language Reporting for ETL Processes. In: DOLAP, pp. 65–72 (2008)
14. Skoutas, D., Simitsis, A.: Flexible and Customizable NL Representation of Requirements for ETL processes. In: NLDB, pp. 433–439 (2007)
15. Manola, F., Miller, E.: RDF Primer. W3C Recommendation, W3C (February 2004)
16. Brickley, D., Guha, R.: RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, W3C (February 2004)
17. McGuinness, D.L., van Harmelen, F.: OWL Web Ontology Language Overview. W3C Recommendation, W3C (February 2004)
18. Skoutas, D., Simitsis, A.: Designing ETL Processes Using Semantic Web Technologies. In: DOLAP, pp. 67–74 (2006)
19. Rozenberg, G. (ed.): Handbook of Graph Grammars and Computing by Graph Transformations. Foundations, vol. 1. World Scientific, Singapore (1997)
20. Simitsis, A., Vassiliadis, P., Sellis, T.K.: State-Space Optimization of ETL Workflows. IEEE Trans. Knowl. Data Eng. 17(10), 1404–1419 (2005)
21. Tzitzikas, Y., Hainaut, J.L.: How to Tame a Very Large ER Diagram (Using Link Analysis and Force-Directed Drawing Algorithms). In: Delcambre, L.M.L., Kop, C., Mayr, H.C., Mylopoulos, J., Pastor, Ó. (eds.) ER 2005. LNCS, vol. 3716, pp. 144–159. Springer, Heidelberg (2005)
22. Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M., Skiadopoulos, S.: A Generic and Customizable Framework for the Design of ETL Scenarios. Inf. Syst. 30(7), 492–525 (2005)
23. AGG: AGG Homepage (2007), http://tfs.cs.tu-berlin.de/agg
24. Papastefanatos, G., Vassiliadis, P., Simitsis, A., Vassiliou, Y.: Policy-regulated Management of ETL Evolution. J. Data Semantics (to appear)
25. Mazón, J.N., Trujillo, J.: Enriching data warehouse dimension hierarchies by using semantic relations. In: Bell, D.A., Hong, J. (eds.) BNCOD 2006. LNCS, vol. 4042, pp. 278–281. Springer, Heidelberg (2006)
26. Niemi, T., Toivonen, S., Niinimäki, M., Nummenmaa, J.: Ontologies with Semantic Web/Grid in Data Integration for OLAP. Int. J. Semantic Web Inf. Syst. 3(4), 25–49 (2007)
27. Romero, O., Abelló, A.: Automating Multidimensional Design from Ontologies. In: DOLAP, pp. 1–8 (2007)
28. Kedad, Z., Métais, E.: Ontology-based data cleaning. In: Andersson, B., Bergholtz, M., Johannesson, P. (eds.) NLDB 2002. LNCS, vol. 2553, pp. 137–149. Springer, Heidelberg (2002)
29. Gottlob, G.: Web Data Extraction for Business Intelligence: The Lixto Approach. In: BTW, pp. 30–47 (2005)
30. Mazón, J.N., Trujillo, J., Serrano, M., Piattini, M.: Applying MDA to the development of data warehouses. In: DOLAP, pp. 57–66 (2005)
31. QVT: QVT (2007), http://www.omg.org/docs/ptc/07-07-07.pdf
32. Ehrig, K., Guerra, E., de Lara, J., Lengyel, L., Levendovszky, T., Prange, U., Taentzer, G., Varró, D., Gyapay, S.V.: Model transformation by graph transformation: A comparative study. In: MTiP (2005)
33. Sanfeliu, A., Fu, K.: A distance measure between attributed relational graphs for pattern recognition. IEEE Trans. SMC 13(3), 353–362 (1983)
34. Messmer, B.T., Bunke, H.: A New Algorithm for Error-Tolerant Subgraph Isomorphism Detection. IEEE Trans. Pattern Anal. Mach. Intell. 20(5), 493–504 (1998)
35. Myers, R., Wilson, R.C., Hancock, E.R.: Bayesian Graph Edit Distance. IEEE Trans. Pattern Anal. Mach. Intell. 22(6), 628–635 (2000)
36. Yahoo!: Pipes (2007), http://pipes.yahoo.com/
37. Microsoft: Popfly (2007), http://www.popfly.com/
38. Google: Mashup Editor (2007), http://www.googlemashups.com/
39. Huynh, D.F., Miller, R.C., Karger, D.R.: Potluck: Semi-ontology alignment for casual users. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 903–910. Springer, Heidelberg (2007)
40. Ambite, J.L., Kapoor, D.: Automatically Composing Data Workflows with Relational Descriptions and Shim Services. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 15–29. Springer, Heidelberg (2007)
41. Petrovic, M., Liu, H., Jacobsen, H.A.: G-ToPSS: Fast Filtering of Graph-based Metadata. In: WWW, pp. 539–547 (2005)
42. Giunchiglia, F., Yatskevich, M., Shvaiko, P.: Semantic Matching: Algorithms and Implementation. In: Spaccapietra, S., Atzeni, P., Fages, F., Hacid, M.-S., Kifer, M., Mylopoulos, J., Pernici, B., Shvaiko, P., Trujillo, J., Zaihrayeu, I. (eds.) Journal on Data Semantics IX. LNCS, vol. 4601, pp. 1–38. Springer, Heidelberg (2007)
Policy-Regulated Management of ETL Evolution

George Papastefanatos1, Panos Vassiliadis2, Alkis Simitsis3, and Yannis Vassiliou1

1 National Technical University of Athens, Greece
{gpapas,yv}@dbnet.ece.ntua.gr
2 University of Ioannina, Greece
[email protected]
3 Stanford University, USA
[email protected]
Abstract. In this paper, we discuss the problem of performing impact prediction for changes that occur in the schema/structure of the data warehouse sources. We abstract Extract-Transform-Load (ETL) activities as queries and sequences of views. ETL activities and their sources are uniformly modeled as a graph that is annotated with policies for the management of evolution events. Given a change at an element of the graph, our method detects the parts of the graph that are affected by this change and highlights the way they are tuned to respond to it. For many cases of ETL source evolution, we present rules so that both the syntactical and the semantic correctness of activities is retained. Finally, we evaluate our approach experimentally over real-world ETL workflows used in the Greek public sector.

Keywords: Data Warehouses, ETL, Evolution, Impact of changes.
1 Introduction

Data warehouses are complicated software environments that are used in large organizations for decision support based on OLAP-style (On-Line Analytical Processing) analysis of their operational data. Currently, the data warehouse market is of increasing importance; e.g., a recent report from the OLAP Report (http://www.olapreport.com) mentions that this market grew from $1 Billion in 1996 to $5.7 Billion in 2006 and showed an estimated growth of 16.4 percent in 2006. In a high-level description of a general data warehouse architecture, data stemming from operational sources are extracted, transformed, cleansed, and eventually stored in fact or dimension tables in the data warehouse. Once this task has been successfully completed, further aggregations of the loaded data are also computed and subsequently stored in data marts, reports, spreadsheets, and several other formats that can simply be thought of as materialized views. The task of designing and populating a data warehouse can be described as a workflow, also known as an Extract–Transform–Load (ETL) workflow, which comprises a synthesis of software modules representing extraction, cleansing, transformation, and loading routines. The whole environment is a very complicated architecture, where each module depends upon its data providers to fulfill its task. This strong flavor of inter-module dependency makes the problem of evolution very important in data warehouses, and especially for their back-stage ETL processes.
Fig. 1. An example ETL workflow
Figure 1 depicts an example ETL workflow. Data are extracted from two sources, S1 and S2, and they are transferred to the Data Staging Area (DSA), where their contents and structure are modified; example transformations include filters, joins, projection of attributes, addition of new attributes based on lookup tables and produced via functions, aggregations, and so on. Finally, the results are stored in the data warehouse (DW), either in fact or dimension tables and materialized views.

During the lifecycle of the warehouse, it is possible that several counterparts of the ETL process evolve. For instance, assume that a source relation's attribute is deleted or renamed. Such a change affects the entire workflow, possibly all the way to the warehouse (tables T1 and T2), along with any reports over the warehouse tables (abstracted as queries over a view V1). Similarly, assume that the warehouse designer wishes to add an attribute to the source relation S2. Should this change be propagated to the view or the query? Although related research can handle the deletion of attributes, due to the obvious fact that queries become syntactically incorrect, the addition of information is deferred to a decision of the designer. Similar considerations arise when the WHERE clause of a view is modified. Assume that the view definition is modified by incorporating an extra selection condition. Can we still use the view in order to answer existing queries (e.g., reports) that were already defined over the previous version of the view? The answer is not obvious, since it depends on whether the query uses the view simply as a macro (in order to avoid the extra coding effort) or, on the other hand, the query is supposed to work on the view independently of what the view definition is [23]. In other words, whenever a query is defined over a view, there exist two possible ways to interpret its semantics: (a) the query is defined with respect to the semantics of the view at the time of the query definition; if the view's definition changes in the future, the query's semantics are affected and the view should probably be re-adjusted; (b) the query's author uses the view as an API, ignoring the semantics of the view; if these semantics change in the future, the query should not be affected. The problem lies in the fact that there is no semantic difference in the way one defines the query over the view; i.e., we define the query in the same manner on both occasions.

Research has extensively dealt with the problem of schema evolution in object-oriented databases [1, 18, 25], ER diagrams [11], data warehouses [5, 6, 9, 10], and materialized views [2, 6, 10, 13]. Although the study of evolution has had a big impact in the above areas, it is only just beginning to be taken seriously in data warehouse settings.
warehouse settings. A recent effort has provided a general mechanism for performing impact prediction for potential changes of data source configurations [15].

In this paper, we build on [15] (see the related work for a comparison) and present an extended treatment of the management of evolution events for ETL environments. Our method is fundamentally based on a graph model that uniformly models relations, queries, views, ETL activities, and their significant properties (e.g., conditions). This graph representation has several roles, apart from the simple task of capturing the semantics of a database system, and one of them is the facilitation of impact prediction for a hypothetical change over the system. In this paper, we present in detail the mechanism for performing impact prediction for the adaptation of workflows to evolution events occurring at their sources. The ETL graph is annotated with policies that regulate the impact of evolution events on the system. According to these policies, rules are provided that dictate the proper actions when additions, deletions, or updates are performed on relations, attributes, and conditions (all treated as first-class citizens of the model), enabling the automatic readjustment of the graph. Affected constructs are assigned statuses (e.g., to-delete) designating the transformations that must be performed on the graph. Moreover, we introduce two mechanisms for resolving contradictory or absent policies defined on the graph, either during the runtime of the impact analysis algorithm (on-demand) or before the impact analysis algorithm executes (a-priori). Finally, we present the basic architecture of the proposed framework and experimentally assess our approach with respect to its effectiveness and efficiency over real-world ETL workflows.

Outline. Section 2 presents a graph-based model for ETL processes. Section 3 formulates the problem of evolving ETL processes and proposes an automated way to respond to potential changes, expressed by the Propagate Changes algorithm. Section 4 discusses the tuning of the Propagate Changes algorithm. Section 5 presents the system architecture of our prototype. Section 6 discusses our experimental findings and Section 7 concludes our work.
2 Graph-Based Modeling for ETL Processes

In this section, we propose a graph modeling technique that uniformly covers relational tables, views, ETL activities, database constraints, and SQL queries as first-class citizens. The proposed technique provides an overall picture not only of the actual source database schema but also of the ETL workflow, since the queries that represent the functionality of the ETL activities are incorporated in the model.

The proposed modeling technique represents all the aforementioned database parts as a directed graph G(V,E). The nodes of the graph represent the entities of our model, while the edges represent the relationships among these entities. Preliminary versions of this model appear in [14, 15, 16]. The elements of our graph are listed in Table 1. The constructs that we consider are classified as elementary, including relations, conditions, queries, and views, and composite, including ETL activities and ETL processes. Composite elements are combinations of elementary ones.
Table 1. Elements of our graph model

  Nodes                        Edges
  Relations         R          Schema relationships       ES
  Attributes        A          Operand relationships      EO
  Conditions        C          Map-select relationships   EM
  Queries           Q          From relationships         EF
  Views             V          Where relationships        EW
  Group-By          GB         Having relationships       EH
  Order-By          OB         Group-By relationships     EGB
  Parameter         P          Order-By relationships     EOB
  Function          F
  ETL activities    A
  ETL summary       S
Relations, R. Each relation R(A1,A2,…,An) in the database schema, either a table or a file (which can be considered as an external table), is represented as a directed graph that comprises: (a) a relation node, R, representing the relation schema; (b) n attribute nodes, Ai∈A, i=1..n, one for each of the attributes; and (c) n schema relationships, ES, directing from the relation node towards the attribute nodes, indicating that the attributes belong to the relation.

Conditions, C. Conditions refer both to the selection conditions of queries and views and to the constraints of the database schema. We consider two classes of atomic conditions that are composed through the appropriate usage of an operator op belonging to the set of classic binary operators, Op (e.g., <, >, =, ≤, ≥, !=): (a) A op constant; (b) A op A' (A, A' are attributes of the underlying relations). We also consider the classes A IN Q and EXISTS Q, with Q being a subquery. A condition node is used for the representation of the condition. Graphically, the node is tagged with the respective operator and connected to the operand nodes of the conjunct clause through the respective operand relationships, EO. These edges are indexed according to the precedence of each operand (i.e., op1 for the left-side operand and op2 for the right-side one) in the condition clause. Composite conditions are easily constructed by tagging the condition node with a Boolean operator (e.g., AND or OR) and connecting the respective edges to the conditions composing the composite condition. Well-known constraints of database relations – i.e., primary/foreign key, unique, not null, and check constraints – are easily captured by this modeling technique with the use of a separate condition node. Foreign key constraints are subset conditions between the source and the target attributes of the foreign key. Check constraints are simple value-based conditions. Primary key, not null, and unique constraints, which are unique-value constraints, are explicitly represented through a dedicated node tagged with their names and connected with operand edges to the respective attribute nodes.

Queries, Q. The graph representation of a Select-Project-Join-Group-By (SPJG) query involves a new node representing the query, named query node, and attribute nodes corresponding to the schema of the query. The query graph is therefore a directed graph connecting the query node with all its schema attributes via schema relationships. In order to represent the relationship between the query graph and the underlying relations, we resolve the query into its essential parts: SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY, each of which is eventually mapped to a subgraph.
Select part. Each query is assumed to own a schema that comprises the attributes, either with their original or alias names, appearing in the SELECT clause. In this context, the SELECT part of the query maps the respective attributes of the involved relations to the attributes of the query schema through map-select relationships, EM, directing from the query attributes towards the relation attributes.

From part. The FROM clause of a query can be regarded as the relationship between the query and the relations involved in it. Thus, the relations included in the FROM part are combined with the query node through from relationships, EF, directing from the query node towards the relation nodes.

Where and Having parts. We assume that the WHERE and/or the HAVING clauses of a query involve composite conditions. Thus, we introduce two directed edges, namely where relationships, EW, and having relationships, EH, both starting from the query node towards the operator node corresponding to the condition of the highest level.

Nested Queries. Concerning nested queries, we extend the WHERE subgraph of the outer query by (a) constructing the respective graph for the subquery, (b) employing a separate operator node for the respective nesting operator (e.g., the IN operator), and (c) employing two operand edges directing from the operator node towards the two operand nodes (the attribute of the outer query and the respective attribute of the inner query), in the same way that conditions are represented in simple SPJ queries.

Group and Order By part. For the representation of aggregate queries, we employ two special-purpose nodes: (a) a new node denoted as GB∈GB, to capture the set of attributes acting as the aggregators; and (b) one node per aggregate function, labeled with the name of the employed aggregate function, e.g., COUNT, SUM, MIN. For the aggregators, we use edges directing from the query node towards the GB node that are labeled as group-by relationships, EGB. Then, the GB node is connected with each of the aggregators through an edge, also tagged as group-by, directing from the GB node towards the respective attributes. These edges are additionally tagged according to the order of the aggregators; we use an identifier i to represent the i-th aggregator. Moreover, for every aggregated attribute in the query schema, there exists an edge directing from this attribute towards the aggregate function node as well as an edge from the function node towards the respective relation attribute. Both edges are labeled as map-select and belong to EM, as these relationships indicate the mapping of the query attribute to the corresponding relation attribute through the aggregate function node. The representation of the ORDER BY clause of the query is performed similarly.

Self-Join Queries. For capturing the set of self-join queries, we stress that each reference via an alias to a relation in the FROM clause of the query is semantically equivalent to an inline view projecting all attributes of the referenced relation (i.e., SELECT *) and named with the respective alias. The self-join query subgraph is connected with the corresponding views' subgraphs.

Functions, F. Functions used in queries are integrated in our model through a special-purpose node Fi∈F, denoted with the name of the function. Each function has an input parameter list comprising attributes, constants, expressions, and nested functions, and one (or more) output parameter(s). The function node is connected with each input
parameter graph construct – nodes for attributes and constants, or a subgraph for expressions and nested functions – through an operand relationship directing from the function node towards the parameter graph construct. This edge is additionally tagged with an appropriate identifier i that represents the position of the parameter in the input parameter list. An output parameter node is connected with the function node through a directed edge E∈EO∪EM∪EGB∪EOB from the output parameter towards the function node. This edge is tagged based on the context in which the function participates. For instance, a map-select relationship is used when the function participates in the SELECT clause, and an operand relationship for the case of the WHERE clause.

Views, V. Views are considered either as queries or as relations (materialized views). Thus, in the rest of the paper, whatever refers to a relation R also refers to a view (R/V), and, respectively, whatever refers to a query Q also refers to a view (Q/V). Thus, V ⊆ R∪Q.

ETL activities, A. An ETL activity is modeled as a sequence of SQL views. An ETL activity necessarily comprises: (a) one (or more) input view(s), populating the input of the activity with data coming from another activity or a relation; (b) an output view, over which the following activity will be defined; and (c) a sequence of views defined over the input and/or previous, internal activity views.

ETL summary, S. An ETL summary is a directed acyclic graph Gs=(Vs,Es) which corresponds to an ETL process of the data warehouse [22]. Vs comprises the activities, relations, and views that participate in an ETL process. Es comprises the edges that connect the providers and consumers. In contrast to the overall graph, where edges denote dependency, edges in the ETL summary denote data provision. The graph of the ETL summary can be topologically sorted and, therefore, execution priorities can be assigned to activities. ETL summaries act as zoomed-out descriptions of the detailed ETL processes and comprise only relations and activities without their internals; this also allows the visualization of the ETL process without overloading the screen with too many details (see, for example, Figure 9).

Modules. A module is a subgraph of the graph in one of the following patterns: (a) a relation with its attributes and all its constraints, (b) a view with its attributes, functions, and operands, or (c) a query with all its attributes, functions, and operands. Modules are disjoint and they are connected through edges concerning foreign keys, map-select, where, and so on. Within a module, we distinguish top-level nodes, comprising the query, relation, or view nodes, and low-level nodes, comprising the remaining subgraph nodes. Additionally, edges are classified into provider and part-of relationships. Provider edges are inter-module relationships (e.g., EM, EF), whereas part-of edges are intra-module relationships (e.g., ES, EW).

Fig. 2 depicts the proposed graph representation for the following aggregate query:
Q:  SELECT   EMP.Emp# AS Emp#, SUM(WORKS.Hours) AS T_Hours
    FROM     EMP, WORKS
    WHERE    EMP.Emp# = WORKS.Emp# AND EMP.STD_SAL > 5000
    GROUP BY EMP.Emp#
Fig. 2. Graph representation of aggregate query
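To make the representation concrete, the following minimal sketch shows how a fragment of such a graph could be assembled programmatically. It is an illustrative Python rendering with a data layout of our own choosing (it is not the HECATAEUS implementation), and it encodes only the EMP side of query Q, omitting the WORKS relation, the join condition, and the aggregation subgraph.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    name: str    # e.g. "EMP", "EMP.Emp#", "Q"
    kind: str    # node type: R, A, C, Q, V, GB, OB, P, F, ...

@dataclass
class Graph:
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)   # (source, target, label) triples

    def add(self, src: Node, dst: Node, label: str) -> None:
        self.nodes.update({src, dst})
        self.edges.append((src, dst, label))

# A fragment of the graph of query Q above.
g = Graph()
emp      = Node("EMP", "R")
emp_id   = Node("EMP.Emp#", "A")
emp_sal  = Node("EMP.STD_SAL", "A")
q        = Node("Q", "Q")
q_id     = Node("Q.Emp#", "A")
cond     = Node(">", "C")                # selection condition EMP.STD_SAL > 5000
constant = Node("5000", "constant")

g.add(emp, emp_id,    "S")               # schema edges: relation -> its attributes
g.add(emp, emp_sal,   "S")
g.add(q,   q_id,      "S")               # query schema
g.add(q,   emp,       "from")            # FROM relationship
g.add(q_id, emp_id,   "map-select")      # SELECT maps query attribute to source attribute
g.add(q,   cond,      "where")           # WHERE edge to a condition node
g.add(cond, emp_sal,  "op1")             # operand edges, indexed by operand position
g.add(cond, constant, "op2")

print(len(g.nodes), len(g.edges))        # -> 7 8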
DML Statements. As far as DML statements are concerned, there is a straightforward way to incorporate them in the graph, since their behavior with respect to adaptation to changes in the database schema can be captured by a graph representation that follows the one of SELECT queries and captures the intended semantics of the DML statement. In our discussions, we will use the term graph equivalence to refer to the fact that evolution changes (e.g., an attribute addition) can be handled in the same way we handle the equivalent SELECT query, whether these changes occur in the underlying relation of the INSERT statement or in the sources of the provider subquery Q.

(a) The general syntax of INSERT statements can be expressed as:

    INSERT INTO table_name (attribute_set) [VALUES (value_set)] | [Q]

where Q is the provider subquery for the values to be inserted. The graph-equivalent SELECT query, which corresponds to the INSERT statement, comprises a SELECT and a FROM clause, projecting the same attribute set as the attribute set of the INSERT statement, and a WHERE clause correlating the attribute set with the inserted values (the value set or the projected attribute set of the subquery), i.e.:

    SELECT (attribute_set) FROM table_name
    WHERE [(attribute_set) IN (value_set)] | [(attribute_set) IN Q]

(b) Similarly, DELETE statements can be treated as SELECT * queries comprising a WHERE clause. The general syntax of a DELETE statement can be expressed as:

    DELETE FROM table_name WHERE condition_set
Again, the equivalent SELECT query, which corresponds to the above DELETE statement, comprises a SELECT clause projecting all the attributes (i.e., *) of the table, as well as a WHERE clause containing the same set of conditions as that of the DELETE statement, i.e.:

    SELECT * FROM table_name WHERE condition_set

(c) Finally, UPDATE statements can be treated as SELECT queries comprising a WHERE clause. The general syntax of an UPDATE statement can be expressed as:

    UPDATE table_name
    SET [(attribute_set) = (value_set)] | [(attribute_set) = Q]
    WHERE condition_set

The equivalent SELECT query, which corresponds to the above UPDATE statement, comprises a SELECT clause projecting the attribute set included in the SET clause of the UPDATE statement, as well as a WHERE clause containing the same set of conditions as that of the UPDATE statement, i.e.:

    SELECT attribute_set FROM table_name
    WHERE condition_set AND [(attribute_set) IN (value_set)] | [(attribute_set) IN Q]
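The following sketch spells out this DML-to-SELECT mapping for the DELETE and UPDATE templates above. It is a string-based illustration with function names of our own choosing; an actual implementation would operate on parsed statements rather than strings.

def delete_equivalent(table, condition):
    # DELETE FROM table WHERE cond  ->  SELECT * FROM table WHERE cond
    return f"SELECT * FROM {table} WHERE {condition}"

def update_equivalent(table, attrs, values, condition):
    # UPDATE table SET (attrs) = (values) WHERE cond
    #   ->  SELECT attrs FROM table WHERE cond AND (attrs) IN ((values))
    attr_list = ", ".join(attrs)
    value_list = ", ".join(values)
    return (f"SELECT {attr_list} FROM {table} "
            f"WHERE {condition} AND ({attr_list}) IN (({value_list}))")

print(delete_equivalent("EMP", "STD_SAL > 5000"))
print(update_equivalent("EMP", ["STD_SAL"], ["6000"], "Emp# = 42"))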
3 Evolution of ETL Workflows

In this section, we formulate a set of rules that allow the identification of the impact of evolution changes on an ETL workflow, and we propose an automated way to respond to these changes. The impact of the changes affects the software used in an ETL workflow – mainly queries, stored procedures, triggers, etc. – in two ways: (a) syntactically, a change may cause a compilation or execution failure during the execution of a piece of code; and (b) semantically, a change may have an effect on the semantics of the software used.

In Section 3.1, we detail how the graph representing the ETL workflow is annotated with actions that should be taken when a change event occurs. The combination of events and annotations determines the policy to be followed for the handling of a potential change. The annotated graph is stored in a metadata repository and is accessed from an impact prediction module. This module notifies the designer or the administrator of the effect of a potential change and the extent to which the modification of the existing code can be fully automated in order to adapt to the change. The algorithm presented in Subsection 3.2 explains the internals of this impact prediction module.

3.1 The General Framework for Handling Schema Evolution

The main mechanism towards handling schema evolution is the annotation of the constructs of the graph (i.e., nodes and edges) with elements that facilitate impact prediction. Each such construct is enriched with policies that allow the designer to specify the behavior of the annotated construct whenever events that alter the database graph occur. The combination of an event with a policy determined by the designer/administrator triggers the execution of the appropriate action that either blocks the event or reshapes the graph to adapt to the proposed change.
The space of potential events comprises the Cartesian product of two subspaces; specifically, (a) the space of hypothetical actions (addition/deletion/modification) and (b) the space of the graph constructs sustaining evolution changes (relations, attributes, and conditions).

For each of the above events, the administrator annotates the graph constructs affected by the event with policies that dictate the way they will regulate the change. Three kinds of policies are defined: (a) propagate the change, meaning that the graph must be reshaped to adjust to the new semantics incurred by the event; (b) block the change, meaning that we want to retain the old semantics of the graph and the hypothetical event must be blocked or, at least, constrained through some rewriting that preserves the old semantics [13, 22, 7]; and (c) prompt the administrator to interactively decide what will eventually happen. For the case of blocking, the specific rewriting method used is orthogonal to our approach; any available method can be employed [13, 22, 7].

Our framework prescribes the reaction of the parts of the system affected by a hypothetical schema change based on their annotation with policies. The correspondence between the examined schema changes and the parts of the system affected by each change is shown in Table 2. We indicate the parts of the system that can be affected by each kind of event. For instance, for the case of an attribute addition, the affected parts of the system comprise the relation or view on which the new attribute was added, as well as any view or query defined on this relation/view.

Table 2. Parts of the system affected by each event and annotation of graph constructs with policies for each event
[For each event on the database schema – the addition, deletion, or modification/renaming of a relation/view, an attribute, or a condition – Table 2 marks the parts of the system that can be affected (relations/views, queries/views, attributes, conditions) and the graph constructs that may be annotated with policies for that event (R, A, V/Q, C/F/GB/OB/P). A = Attribute, C = Constraint, R = Relation, V = View, Q = Query, F = Function, GB = Group-By, OB = Order-By, P = Parameter.]
The definition of policies on each part of the system involves the annotation of the respective construct (i.e., node) in our graph framework. Table 2 presents the allowed annotations of graph constructs for each kind of event.

Example. Consider the simple example query SELECT * FROM EMP as part of an ETL activity. Assume that the provider relation EMP is extended with a new attribute PHONE. There are two possibilities:
− The * notation signifies the request for any attribute present in the schema of relation EMP. In this case, the * shortcut can be treated as "return all the attributes that EMP has, independently of which these attributes are". Then, the query must also retrieve the new attribute PHONE.
− The * notation acts as a macro for the particular attributes that the relation EMP originally had. In this case, the addition to relation EMP should not be further propagated to the query.

A naïve solution to a modification of the sources (e.g., the addition of an attribute) would be for an impact prediction system to trace all queries and views that are potentially affected and ask the designer to decide which of them must be modified to incorporate the extra attribute. We can do better by extending the current modeling. For each element affected by the addition, we annotate its respective graph construct with the policies mentioned before. According to the policy defined on each construct, the respective action is taken to adjust the query.
Fig. 3. Propagating addition of attribute PHONE
Therefore, for the example event of an attribute addition, the policies defined on the query and the actions taken according to each policy are:
− Propagate attribute addition. When an attribute is added to a relation appearing in the FROM clause of the query, this addition should be reflected in the SELECT clause of the query.
− Block attribute addition. The query is immune to the change: an addition to the relation is ignored. In our example, the second interpretation of * is assumed, i.e., the SELECT * clause must be rewritten to SELECT A1,…,An without the newly added attribute.
− Prompt. In this case, the designer or the administrator must handle the impact of the change manually, similarly to what currently happens in database systems.

The graph of the query SELECT * FROM EMP is shown in Figure 3. The annotation of the Q node with propagate addition indicates that the addition of the PHONE node to relation EMP will be propagated to the query, and the new attribute will be included in the SELECT clause of the query.
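A minimal sketch of how these three policies could translate into concrete handling of the example query follows; the attribute list of EMP and the helper function are our own illustrative choices, not HECATAEUS code.

# Illustrative only: handling of SELECT * FROM EMP when attribute PHONE is added
# to EMP, under each of the three policies discussed above.
ORIGINAL_ATTRS = ["Emp#", "Name", "STD_SAL"]   # EMP's attributes at query-definition time

def react_to_attribute_addition(policy, new_attr):
    if policy == "propagate":
        # the addition is reflected in the SELECT clause: the query now returns PHONE too
        return f"SELECT {', '.join(ORIGINAL_ATTRS + [new_attr])} FROM EMP"
    if policy == "block":
        # the query stays immune: * is frozen to the original attribute list
        return f"SELECT {', '.join(ORIGINAL_ATTRS)} FROM EMP"
    # prompt: the designer or administrator must decide manually
    raise RuntimeError(f"Designer must decide how to handle the addition of {new_attr}")

print(react_to_attribute_addition("block", "PHONE"))
# -> SELECT Emp#, Name, STD_SAL FROM EMP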
3.2 Adapting ETL Workflows to Evolution of Sources

The mechanism determining the reaction to a change is formally described by the algorithm Propagate Changes, shown in Figure 4. Given a graph G annotated with policies and an event e, Propagate Changes assigns a status to each affected node of the graph, dictating the action that must be performed on the node to handle the event. Specifically, given an event e over a node n0 altering the source database schema, Propagate Changes determines the nodes that are directly connected to the altered node, and an appropriate message is constructed for each of them and added to the queue. For each processed node nR, its prevailing policy pR for the processed event e is determined. According to the prevailing policy, the status of each construct is set (see more on statuses in Section 4.2). Subsequently, both the initial changes and the readjustments caused by the respective actions are recursively propagated as new events to the consumers of the activity graph.

In Figure 3, the statuses assigned to the affected nodes by the addition of an attribute to relation EMP are depicted. First, the algorithm sends a message to relation EMP for the addition of attribute PHONE to its schema, with a default propagate policy. It assigns the status ADD CHILD to relation EMP and propagates the event by sending a new message to the query. Since an appropriate policy capturing this event exists on the query, the query is also assigned an ADD CHILD status. In the following sections, we discuss in more detail the main components of the proposed algorithm.
Algorithm Propagate Changes
Input:      (a) a session id SID
            (b) a graph G(V,E)
            (c) an event e over a node n0
            (d) a set of policies P defined over nodes of G
            (e) an optional default policy p0 defined by the user for the event e
Output:     a graph G(V,E) with a Status value for each n ∈ V' ⊆ V
Parameters: (a) a global queue of messages Emsg
            (b) each message m is of the form m = [SID, nS, nR, e, pS], where
                SID : the unique identifier of the session regarding the evolution event e
                nS  : the node that sends the message
                nR  : the node that receives the message
                e   : the event that occurs on nS
                pS  : the policy of nS for the event e {Propagate, Block, Prompt}
Begin
1.  Emsg.enqueue([SID, user, n0, e, p0]);
2.  while (Emsg != ∅) {
3.      m  = Emsg.dequeue();
4.      pR = determinePolicy(m);
5.      nR.Status = set_status(m, pR);
6.      decide_next_to_signal(m, Emsg, G); }   // enqueue m
End
Fig. 4. Propagate Changes Algorithm
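A simplified Python rendering of this propagation loop is given below. The graph and the policy lookup are reduced to plain dictionaries, the prevailing-policy and status rules are collapsed to their simplest form, and all names are our own; it is meant only to illustrate the control flow of Figure 4 on the example of Figure 3, not to reproduce the actual implementation.

from collections import deque

def propagate_changes(consumers, policies, event, start_node, default_policy="propagate"):
    # consumers: node -> list of nodes defined over it (its data consumers)
    # policies : (node, event) -> 'propagate' | 'block' | 'prompt'
    statuses = {}
    queue = deque([("user", start_node, default_policy)])
    while queue:
        sender, receiver, sender_policy = queue.popleft()
        # prevailing policy: the receiver's own annotation, otherwise the sender's policy
        policy = policies.get((receiver, event), sender_policy)
        statuses[receiver] = {"propagate": "ADD CHILD",   # status for an attribute addition
                              "block": "BLOCK",
                              "prompt": "PROMPT"}[policy]
        if policy != "block":                  # blocked events are not forwarded further
            for consumer in consumers.get(receiver, []):
                queue.append((receiver, consumer, policy))
    return statuses

# The example of Figure 3: PHONE is added to EMP, and query Q is defined over EMP.
print(propagate_changes({"EMP": ["Q"]},
                        {("Q", "add attribute"): "propagate"},
                        "add attribute", "EMP"))
# -> {'EMP': 'ADD CHILD', 'Q': 'ADD CHILD'}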
4 Tuning the Propagation of Changes

In this section, we detail the internals of the algorithm Propagate Changes. Given an event arriving at a node of the graph, the algorithm involves three tasks: (a) the determination of the appropriate policy for each node, (b) the determination of the node's status (on the basis of this policy), and (c) the further propagation of the event to the rest of the graph. The first two issues are detailed in Sections 4.1 and 4.2, respectively. The third issue is straightforward, since the processing order of the affected graph elements is determined by a BFS traversal of the graph. Therefore, after the status determination at each node, a message is inserted into the queue for all adjacent nodes connected with incoming edges towards this node.

4.1 Determining the Prevailing Policy

It is possible that the policies defined over the different elements of the graph do not always align towards the same goal. Two problems might exist: (a) over-specification refers to the existence of more than one policy specified for a node of the graph for the same event, and (b) under-specification refers to the absence of any policy directly assigned to a node.

Consider, for example, the case of Figure 5, where a simplified subset of the graph for a certain environment is depicted. A relation R with one attribute A populates a view V, also with an attribute A. A query Q, again with an attribute A, is defined over V. Here, for reasons of simplicity, we omit all the parts of the graph that are irrelevant to the discussion of policy determination. As one can see, there are only two policies defined in this graph, both concerning the deletion of attributes of view V. The first policy is defined on view V and says: 'Block all deletions for attributes of view V', whereas the second policy is defined specifically for attribute V.A and says: 'If V.A must be deleted, then allow it'.
Fig. 5. Example of over-specification and under-specification of policies
The first problem one can easily see is the over-specification for the treatment of the deletion of attribute V.A. In this case, one of the two policies must override the other. A second problem has to do with the fact that neither R.A nor Q.A has a policy for handling the possibility of a deletion. In case the designer initiates such an event, how will this under-specified graph react? To give a preview, under-specification can be either prevented offline, by specifying default policies for all attributes, or compensated online, by following the policy of surrounding nodes. In the rest of this section, we will refer to any such problems as policy misspecifications.
We provide two ways for resolving policy misspecifications on a graph construct: on-demand and a-priori policy misspecification resolution. Whenever a node is not explicitly annotated with a policy for a certain event, on-demand resolution determines the prevailing policy during the algorithm execution, based on policies defined on other constructs. A-priori resolution prescribes the prevailing policy for each construct potentially affected by an event with the use of default policies.

Both a-priori and on-demand resolution can be equivalently used for determining the prevailing policy of an affected node. A-priori annotation requires the investment of effort for the determination of policies before hypothetical events are tested over the warehouse. The policy overriding is tuned in such a way, though, that general annotations for nodes and edges need to be further specialized only where this is necessary. Our experiments, later, demonstrate that a-priori annotation can provide significant savings in effort for the warehouse administrator. On the other hand, one can completely avoid the default policy specification and annotate only specific nodes. This is the basic idea behind on-demand resolution: less effort is required up front, at the expense of runtime delays whenever a hypothetical event is posed on the system.

4.1.1 On-Demand Resolution

The algorithm for handling policy misspecifications on demand is shown in Figure 6. Intuitively, the main idea is that if a node has a policy defined specifically for it, it knows how to respond to an event. If an appropriate policy is not present, the node looks for a policy (a) at its container top-level node, or (b) at its providers.

Algorithm Determine Policy
Input:  a message m of the form m = [SID, nS, nR, e, pS]
Output: a prevailing policy pR
Begin
1.  if (edge(nS,nR) isPartOf)               // if m came from a part-of edge
2.      return pS;                          // the child node's policy prevails
3.  else                                    // m came from a provider
4.      if exists policy(nR,e)              // check if nR has a policy for this event
5.          return policy(nR);              // return this policy
6.      else if exists policy(nR.parent,e)
7.          return policy(nR.parent);       // return nR's parent's policy
8.      else return pS;                     // else return the provider's policy
End
Fig. 6. Determine Policy Algorithm
Algorithm Determine Policy implements the following basic principles for the management of an incoming event to a node:
− If the policy is over-specified, then the higher and further left a module is in the hierarchy of Figure 7, the stronger its policy is.
− If the policy is under-specified, then the adopted policy is the one coming from lower and further right.
Fig. 7. On Demand Policy Resolution
The algorithm assumes that a message is sent from a sender node nS to a receiver node nR. Due to its complexity, we present the actual decisions taken in a different order than the one of the code:
− Check 1 (lines 6-7): this concerns child nodes: if they do not have a policy of their own, they inherit their parent's policy. If they do have a policy, this is covered by lines 4-5.
− Check 2 (lines 1-5): if the event arrives at a parent node (e.g., a relation) and it concerns a child node (e.g., an attribute), the algorithm assigns the policy of the parent (lines 4-5), unless the child has a policy of its own that overrides the parent's policy (lines 1-2). A subtle point here is that if the child did not have a policy, it has already obtained one from its parent in lines 6-7.
− Check 3 (line 8): Similarly, if an event arrives from a provider to a consumer node via a map-select edge, the receiver will make all the above tests, and if they all fail, it will simply adopt the provider's policy. For example, in the case of Figure 5, Q.A will adopt the policy of V.A if all else fails.
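To make the checks concrete, the following sketch traces on-demand resolution on the scenario of Figure 5 (R.A feeds V.A, which feeds Q.A; view V blocks attribute deletions, while V.A explicitly allows its own deletion, rendered here as a propagate policy). The data layout and the function are our own simplification of the Figure 6 pseudocode, not the actual implementation.

# Policies of Figure 5: V blocks deletion of its attributes; V.A explicitly allows it;
# R.A and Q.A (and their parents R and Q) carry no policy at all.
PARENT   = {"R.A": "R", "V.A": "V", "Q.A": "Q"}                    # part-of hierarchy
POLICIES = {("V", "delete attribute"): "block",
            ("V.A", "delete attribute"): "propagate"}

def determine_policy(receiver, event, sender_policy, via_part_of):
    if via_part_of:                                # lines 1-2: the child's policy prevails
        return sender_policy
    if (receiver, event) in POLICIES:              # lines 4-5: the node's own policy
        return POLICIES[(receiver, event)]
    parent = PARENT.get(receiver)
    if (parent, event) in POLICIES:                # lines 6-7 (Check 1): inherit from parent
        return POLICIES[(parent, event)]
    return sender_policy                           # line 8 (Check 3): adopt the provider's policy

# Deleting V.A: its own policy overrides the block defined on its parent V ...
p_va = determine_policy("V.A", "delete attribute", "propagate", via_part_of=False)
# ... and Q.A, having no policy of its own (nor on its parent Q), adopts its provider's policy.
p_qa = determine_policy("Q.A", "delete attribute", p_va, via_part_of=False)
print(p_va, p_qa)   # -> propagate propagate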
4.1.2 A-Priori Resolution

A-priori resolution of policy misspecifications enables the annotation of all nodes of the graph with policies before the execution of the algorithm. A-priori resolution guarantees that every node is annotated with a policy for handling an occurring event, and thus no further resolution effort is required at runtime. That is, the receiver node of a message will always have a policy handling the event of the message. A-priori resolution is accomplished by defining default policies at three different scopes [18].

System-wide scope. First, we prescribe the default policies for all kinds of constructs, in a system-wide context. For instance, we impose a default policy on all nodes of the graph that blocks the deletion of the constructs per se.

Top-level scope. Next, we prescribe default policies for top-level nodes, namely the relations, queries, and views of the system, with respect to any combination of the following: the deletion of the construct per se, as well as the addition, deletion, or modification of a construct's descendants. The descendants can be appropriately specified by their type, as applicable (i.e., attributes, constraints, or conditions).

Low-level scope. Lastly, we annotate specific low-granularity constructs, i.e., attributes, constraints, or conditions, with policies for their deletion or modification.
The above arrangement is order dependent and exploits the fact that there is a partial order of policy overriding. The order is straightforward: defaults are overridden by specific annotations, and high-level construct annotations concerning their descendants are overridden by any annotation of such a descendant:

    System-wide Scope ≤ Top-Level Scope ≤ Low-Level Scope

Furthermore, certain nodes or modules that violate the above default behaviors and must react in an opposite way to a potential event are explicitly annotated. For example, if a specific attribute of an activity must always block its own deletion, whereas the default activity policy is to propagate attribute deletions, then this attribute node is explicitly annotated with a block policy, overriding the default behavior.

4.1.3 Completeness

The completeness problem refers to the possibility that a node is unable to determine its policy for a given event. It is easy to see that, for on-demand resolution, annotating all the source relations is sufficient to guarantee that all nodes can determine an appropriate policy. For the case of a-priori annotation, it is also easy to see that a top-level, system-wide annotation at the level of nodes is sufficient to provide a policy for all nodes. In both cases, additional annotations with extra semantics for specific nodes, or classes of nodes, that override the aforementioned (default) policies are gracefully incorporated in the policy determination mechanism.

4.2 Determination of a Node's Status

In the context of our framework, the action applied on an affected graph construct is expressed as a status that is assigned to this construct. The status of each graph construct visited by the Propagate Changes algorithm is determined locally by the prevailing policy defined on this construct and the event transmitted by the adjacent nodes. The status of a construct with respect to an event designates the way this construct is affected by and reacts to this event, i.e., the kind of evolution action that will be applied to the construct.

A visited node is initially assigned a null status. If the prevailing policy is block or prompt, then the status of the node is block or prompt, respectively, independently of the occurred event. Recall that blocking the propagation of an event implies that the affected node is annotated for retaining the old semantics despite the change that occurred at its sources. The same holds for the prompt policy, with the difference that the user (e.g., the administrator or the developer) must decide upon the status of the node.

For determining the status of a node when a propagate policy prevails, we take into account the event action (e.g., attribute addition, relation deletion, etc.) transmitted to the node, the type of node accepting the event, and, lastly, the scope of the event action. An event raises actions whose impact may concern the node itself, a descendant node within the same module, or a node in a provider module. Thus, we classify the scope of evolution impacts with respect to an event that arrives at a node as:
− SELF: The impact of the event concerns the node itself, e.g., a 'delete attribute' event occurs on an attribute node.
− CHILD: The impact of the event concerns a descending node belonging to the same module, e.g., a view is notified with a 'delete attribute' event for the deletion of one of its attributes.
− PROVIDER: The impact of the event concerns a node belonging to a provider module, e.g., a view is notified of the addition of an attribute to the schema of one of its source relations (and, in return, it must notify any other views or queries that are defined over it).

In that manner, combinations of the event type and the event scope provide a set of statuses, such as DELETE SELF, DELETE CHILD, ADD CHILD, RENAME SELF, MODIFY PROVIDER, and so on. It is easy to see that the above mechanism is extensible both with respect to event types and statuses. Lastly, the status assignment to nodes induces new events on the graph, which are further propagated by the Propagate Changes algorithm to all adjacent constructs. In the Appendix of this paper, the statuses assigned to visited nodes are shown for combinations of events and types of nodes, when the propagate policy prevails on the visited node. For each status, the new event induced by the status assignment, which is further propagated through the graph, is also shown.
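As a small illustration, the sketch below derives a status from the prevailing policy, the event action, and the scope at which the event hits a node; the values follow the examples above, while the function itself is our own simplification rather than the actual status-assignment rules of the Appendix.

def set_status(policy, event_action, scope):
    # event_action: e.g. 'ADD', 'DELETE', 'RENAME', 'MODIFY'
    # scope       : 'SELF', 'CHILD', or 'PROVIDER'
    if policy == "block":
        return "BLOCK"                 # retain the old semantics
    if policy == "prompt":
        return "PROMPT"                # defer the decision to the user
    # propagate policy: the status combines the event action with its scope
    return f"{event_action} {scope}"

print(set_status("propagate", "ADD", "CHILD"))      # -> ADD CHILD
print(set_status("propagate", "DELETE", "SELF"))    # -> DELETE SELF
print(set_status("block", "ADD", "PROVIDER"))       # -> BLOCK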
5 System Architecture

For the representation of the database graph and its annotation with policies regarding evolution semantics, we have implemented a tool, namely HECATAEUS [16, 17]. The system architecture is shown in Figure 8.
Fig. 8. System Architecture of HECATAEUS
HECATAEUS enables the user to transform SQL source code into database graphs, to explicitly define policies and evolution events on the graph, and to determine the affected and adjusted graph constructs according to the proposed algorithm. As mentioned in the
introduction, the graph modeling of the environment has versatile utilizations: apart from impact prediction, we can also assess several graph-theoretic metrics that highlight sensitive regions of the graph (e.g., a large node degree denotes strong coupling with the rest of the graph). This metrics management is not part of this paper's investigations; still, we find it worth mentioning.

The tool architecture (see Figure 8) consists of the coordination of HECATAEUS' five main components: the Parser, the Evolution Manager, the Graph Viewer, the Metric Manager, and the Dictionary.

The Parser is responsible for parsing the input files (i.e., DDL and workload definitions) and for sending each command to the database catalog and then to the Evolution Manager. The functionality of the Dictionary is to maintain the schema of the relations as well as to validate the syntax of the parsed workload (i.e., activity definitions, queries, views) before it is modeled by the Evolution Manager.

The Evolution Manager is the component responsible for representing the underlying database schema and the parsed queries in the proposed graph model. The Evolution Manager holds all the semantics of the nodes and edges of the aforementioned graph model, assigning nodes and edges to their respective classes. It communicates with the catalog and the parser and constructs the node and edge objects for each class of nodes and edges (i.e., relation nodes, query nodes, etc.). It retains all evolution semantics for each graph construct (i.e., events, policies) and provides methods for performing evolution scenarios and executing the Propagate Changes algorithm. It also contains methods for transforming the database graph from/to an XML format.

The Metric Manager is responsible for the application of graph metrics on the graph. It measures and provides results regarding several properties of the graph, such as node degrees, graph size, etc.

Finally, the Graph Viewer is responsible for the visualization of the graph and the interaction with the user. It communicates with the Evolution Manager, which holds all evolution semantics and methods. The Graph Viewer offers distinct colorization of nodes and edges according to their types and the way they are affected by evolution events, as well as editing of the graph, such as the addition, deletion, and modification of nodes, edges, and policies. It enables the user to raise evolution events, detect the nodes affected by each event, and highlight the appropriate transformations of the graph.
6 Experiments

We have evaluated the proposed framework and the capabilities of the presented approach via the reverse engineering of seven real-world ETL scenarios extracted from an application of the Greek public sector. The data warehouse examined maintains information regarding farming and agricultural statistics. Our goal was to evaluate the framework with respect to its effectiveness for adapting ETL workflows to evolution changes occurring at the ETL sources and its efficiency for minimizing the human effort required for defining and setting the evolution metadata on the system.

The aforementioned ETL scenarios extract information out of a set of 7 source tables, namely S1 to S7, and 3 lookup tables, namely L1 to L3, and load it into 9 tables, namely T1 to T9, stored in the data warehouse. The 7 scenarios comprise a total
number of 59 activities. Our approach has been built on top of the Oracle DBMS. All ETL scenarios were coded as PL/SQL stored procedures in the data warehouse. First, we extracted the embedded SQL code (e.g., cursor definitions, DML statements, SQL queries) from the activity stored procedures. Table definitions (i.e., DDL statements) were extracted from the source and data warehouse dictionaries. Each activity was represented in our graph model as a view defined over the previous activities, and table definitions were represented as relation graphs. In Figure 9, we depict the graph representation of the first ETL scenario as modeled by our framework. For simplicity, only top-level nodes are shown. Activities are depicted as triangles; source, lookup, and target relations as dark-colored circles.
Fig. 9. Graph representation for the first ETL scenario

Fig. 10. Distribution of occurrence per kind of evolution event (Attribute Rename: 236, Attribute Add: 92, Attribute Drop: 32, Relation Rename: 7, Attribute Modify: 6, Constraint Add: 1)
Afterward, we monitored the schema changes that occurred at the source tables due to changes of requirements over a period of 6 months. The set of evolution events that occurred in the source schema included the renaming of relations and attributes, the deletion of attributes, the modification of their domain, and, lastly, the addition of primary key constraints. We counted a total number of 374 evolution events; the distribution of occurrence per kind of event is shown in Figure 10. In Table 3, we provide the basic properties of each examined ETL scenario, specifically its size in terms of the number of activities and the number of nodes comprising its respective graph, its evolved source and lookup tables, and, lastly, the number of events that occurred on these tables.
Table 3. Characteristics of the ETL scenarios

  Scenario   # Activ.   # Nodes   Sources                # Events
  1             16        1428    S1, S4, L1, L2, L3        142
  2              6         830    S2, L1                    143
  3              6         513    S3, L1                     83
  4             16         939    S4, L1                    115
  5              5         242    S5                          3
  6              5         187    S6                          1
  7              5         173    S7                          6
The intent of the experiments is to present the impact of these changes on the ETL workflows and, specifically, to evaluate our proposed framework with respect to its effectiveness and efficiency.

6.1 Effectiveness of Workflow Adaptation to Evolution Changes

For evaluating the extent to which affected activities are effectively adapted to source events, we imposed policies on them for each separately occurring event. Our first goal was to examine whether our algorithm determines the correct status of activities in accordance with the expected transformations, i.e., the transformations that the administrators/developers would have manually enforced on the ETL activities to handle schema changes at the sources by inspecting and rewriting the source code of every activity.

Hypothesis H1. Algorithm "Propagate Changes" effectively determines the correct status of activities for various kinds of evolution events.

Methodology:
1. We first examined each event and its impact on the graph, by finding all affected activities.
2. Since all evolution events and their impact on activities were a-priori known, each activity was annotated with an appropriate policy for each event. An appropriate policy for an event is the policy (either propagate or block) that adjusts the activity according to the desired manual transformation when this event occurs on the activity's source. In that manner, each event at the source schema of the ETL workflows was processed separately, by imposing a different policy set on the activities. We employed both propagate and block policies for all view and query subgraphs comprising the ETL activities. Policies were defined both at the query and at the attribute level, i.e., query, view, and attribute nodes were annotated.
3. We invoked each event and examined the extent to which the automated readjustment of the affected activities (i.e., the STATUS assigned to each activity) adheres to the desired transformation.
4. We finally evaluated the effectiveness of our framework by measuring the number of activities affected by each event, i.e., those that obtained a STATUS, with respect to the number of successfully readjusted activities (or, in other words, those activities whose STATUS was consistent with the desired transformation).
In Table 4, we summarize our results for the different kinds of events. First, we note that most of the activities were affected by attribute additions and renamings, since these kinds of events were the most common in our scenarios. Most importantly, we can conclude that our framework can effectively adapt activities to the examined kinds of events. Exceptions regarding attribute and constraint additions are due to the fact that specific events induced ad hoc changes in the functionality of the affected activities, which prompts the user to decide upon the proper readjustments. These exceptions are mainly owed to events that occurred on the lookup tables of the scenarios. Additions of attributes to these tables (especially when these attributes were involved in primary key constraints) incurred rewriting of the WHERE clause of the queries contained in the affected activities. Finally, whereas the above concern the precision of the method (i.e., the percentage of correct status determination for affected activities), we should also report on the recall of our method. Our experimental findings demonstrate that the number of activities that were not affected by the event propagation, although they should have been affected, is zero.
Table 4. Affected and adjusted activities per event kind

  Event Type         Activities with Status   Activities with Correct Status
  Attribute Add              1094                        1090
  Attribute Delete            426                         426
  Attribute Modify             59                          59
  Attribute Rename           1255                        1255
  Constraint Add               13                           5
  Table Rename                  8                           8
  Total                      2855                        2843
6.2 Effectiveness of Workflow Annotation

Our second goal was to examine the extent to which different annotations of the graph with policies affect the effectiveness of our framework. This addresses the realistic case in which the administrator/developer does not know the number and kind of potential events that will occur on the sources and, consequently, cannot decide a priori upon a specific policy set for the graph.
Hypothesis H2. Different annotations affect the effectiveness of the algorithm.

Methodology:
1. We first imposed a policy set on the graph.
2. We then invoked each event in sequence, retaining the same policy set on the graph.
3. We again examined the extent to which the automated readjustment of the affected activities (i.e., their obtained status) adheres to the desired transformation and evaluated the effectiveness of our framework for several annotation plans.

We experimented with three different policy sets:
− Mixture annotation. A mixture annotation plan for a given set of events comprises the set of policies imposed on the graph that maximizes the number of successfully adjusted activities. To find the appropriate policy for each activity of the ETL scenarios, we examined its most common reaction to each different kind of event. For instance, the appropriate policy of an activity for attribute addition is propagate if this activity propagates 70% of the new attributes added at its source and blocks the remaining 30%. In the mixture annotation, propagate policies were applied on most activities for all kinds of events, whereas block policies were applied on some activities regarding only attribute addition events.
− Worst-case annotation. As opposed to the mixture annotation plan, the worst-case plan comprises the set of policies imposed on the graph that minimizes the number of successfully adjusted activities. The least common reaction to an event type was used for determining the prevailing policy of each activity.
− Optimistic annotation. Lastly, an optimistic annotation plan implies that all activities are annotated with a propagate policy for all potential events occurring at their sources.
Again, we measured the number of affected activities that obtained a specific status with respect to the number of correctly adapted activities. In Figures 11, 12, and 13, we present the results for the different kinds of events and annotations. As stated in the hypothesis, different annotations on the graph have a different impact on the overall effectiveness of our framework, as they vary both the number of affected activities (i.e., candidates for readjustment) and the number of adjusted activities (i.e., those successfully readjusted) on the graph.

The mixture annotation most effectively detects the activities that should be affected by an event and adjusts them properly. In the mixture annotation, the policies imposed on the graph manage to propagate event messages towards activities that should be readjusted, while blocking messages from activities that should retain their old functionality. On the contrary, the worst-case annotation fails both to detect all affected activities on the graph and to adjust them properly, as it blocks event messages at the early activities of each ETL workflow. Since events are blocked at the beginning of the workflow, subsequent activities cannot be notified to handle these events. Lastly, the optimistic annotation provides both good and bad results. On the good side, the optimistic annotation is close to the mixture annotation in several categories. On the other hand, the optimistic annotation propagates event messages even towards activities which should retain their old
semantics. In that manner, the optimistic annotation increases the number of affected activities (actually, all the activities of the workflow are affected) without, however, handling their status determination properly.
Mixture
CORRECT STATUS 12551255
1094 1036
426 426
59 59 Attribute Add
Attribute Drop
Attribute Modify
13 Attribute Rename
5
Constraint Add
8
8
Table Rename
Fig. 11. Mixture Annotation
Fig. 12. Worst Case Annotation
Fig. 13. Optimistic Annotation
Overall, a reasonable tactic for the administrator would be either to choose the mixture method, in case there is some a-priori knowledge of the desired behavior of the constructs in an environment, or to progressively refine an originally assigned optimistic annotation whenever nodes that should remain immune to changes are unnecessarily affected.

6.3 Efficiently Adapting ETL Workflows to Evolution Changes

For measuring the efficiency of our framework, we examined the cost of manual adaptation of the ETL activities by the administrator/developer with respect to the cost of setting the evolution metadata on the graph (i.e., annotating it with policies) and transforming the graph properly with the use of our framework. The developers' effort comprises the detection, inspection, and, where necessary, the rewriting of the activities affected by an event. For instance, given an attribute addition in a source relation of an ETL workflow, the developer must detect all activities affected by the addition, decide how and whether this addition must be propagated to each SQL statement of the activity, and, lastly, rewrite the source code properly, if necessary. The effort required for the above operations depends highly on the developers' experience, but also on the characteristics of the ETL workflow (e.g., the complexity of the activity source code, the workflow size, etc.). Therefore, the cost in terms of human effort for the manual handling of source evolution, MC, can be quantified as the sum of (a) the number of SQL statements per activity that are affected by an event and must be manually detected, AS, plus (b) the number of SQL statements that must be manually rewritten to adapt to the event, RS. Thus, the human effort for the manual adaptation of an activity a to an event e can be expressed as:
MC_{a,e} = AS_{a,e} + RS_{a,e}        (1)
For a given set of evolution events E, and a set of manually adapted activities A in an ETL workflow, the overall cost, OMC, is expressed as:
OMC = Σ_{e∈E} Σ_{a∈A} MC_{a,e}        (2)
For calculating OMC, we recorded the affected and rewritten statements for all activities and events. If HECATAEUS had been used instead of manually adapting all the activities, the human effort can be quantified as the sum of two factors: (a) the number of annotations (i.e., policies per event) imposed on the graph, AG, and (b) the cost of manually discovering and adjusting the set of activities AR that escape the automatic status annotation of the tool, e.g., because no annotations have been set on these activities or a prompt policy is assumed for them. The latter cost is expressed as:
RMC = Σ_{e∈E} Σ_{a∈AR} MC_{a,e}        (3)
Therefore, the overall cost for automated adaptation, OAC, is expressed as:
OAC = AG + RMC        (4)
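As a small numeric illustration of equations (1)–(4), the sketch below computes OMC and OAC from a handful of made-up per-activity counts; the figures are invented for the example and do not reproduce the measurements of the experiments.

def manual_cost(affected_stmts, rewritten_stmts):
    return affected_stmts + rewritten_stmts              # MC_{a,e} = AS_{a,e} + RS_{a,e}  (1)

# (activity, event) -> (AS, RS): hypothetical counts of affected / rewritten statements
records = {("a1", "add attr"):  (3, 1),
           ("a2", "add attr"):  (2, 2),
           ("a2", "drop attr"): (4, 4)}

OMC = sum(manual_cost(AS, RS) for AS, RS in records.values())        # eq. (2)

AG = 5                                  # number of policy annotations placed on the graph
escaped = [("a2", "drop attr")]         # activities/events the tool could not adjust (AR)
RMC = sum(manual_cost(*records[k]) for k in escaped)                 # eq. (3)
OAC = AG + RMC                                                       # eq. (4)

print(OMC, OAC)   # -> 16 13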
Hypothesis H3. The cost of the semi-automatic adaptation, OAC, is equal to or less than the cost of manually handling evolution, OMC.

For calculating OAC, we followed the mixture plan for annotating each attribute and query node potentially affected by an event that occurred at the source schema, and we measured the number of explicit annotations, AG. We then applied our algorithm and measured the cost of manual adaptation for the activities that were not properly adjusted. Figure 14 compares OMC with OAC for the 7 evolving ETL scenarios. It shows that the cost of manual adaptation is much higher than the cost of semi-automating the evolution process. The divergence becomes higher especially for large scenarios, such as scenarios 1 and 4, or scenarios with many events, such as scenario 2, in which the administrator must manually detect a large number of affected activities or handle a large number of events.
Fig. 14. Manual (OMC) and Semi-automatic (OAC) Adaptation Cost per ETL Scenario
Fig. 15. Cost of Adaptation with and without use of Default Policies
Furthermore, to decrease the annotation cost AG, we applied system-wide default policies on the graph. With the use of default policies, the annotation cost AG decreases to the number of explicit annotations of nodes that violate the default behavior. We again measured the number of explicit annotations as well as the remaining RMC. As shown in Figure 15, the cost of adaptation with the use of our framework is further
decreased when default policies are used. With the use of default policies, the overall adaptation cost depends neither on the scenario size (e.g., number of nodes) nor on the number of evolution events, but rather on the number of policies deviating from the default behavior that are imposed on the graph. Scenarios 1 and 4 comprised more cases for which the administrator had to override the default system policies and, thus, the overall cost is relatively high. On the contrary, in scenarios 2 and 3, the adaptation is achieved better by a default policy annotation, since the majority of the affected activities react in a uniform (i.e., default) way to evolution events.
7 Related Work

Evolution. A number of research works are related to the problem of database schema evolution. A survey on schema versioning and evolution is presented in [21], whereas a categorization of the overall issues regarding evolution and change in data management is presented in [20]. The problem of view adaptation after redefinition is mainly investigated in [2, 8, 12], where changes in view definitions are invoked by the user and rewriting is used to keep the view consistent with the data sources. In [9] the authors discuss versioning of star schemata, where histories of the schema are retained and queries are chronologically adjusted to query the correct schema version. [2] also deals with warehouse adaptation, but only for SPJ views. [13] deals with the view synchronization problem, which considers that views become invalid after schema changes in their definition. The authors extend SQL, enabling the user to define evolution parameters that characterize the tolerance of a view towards changes and how these changes are dealt with during the evolution process. The authors also propose an algorithm for rewriting views based on interrelationships between different data sources. In this context, our work can be compared with that of [13] in the sense that policies act as regulators for the propagation of schema evolution on the graph, similarly to the evolution parameters introduced in [13]. We furthermore extend this approach to incorporate attribute additions and the treatment of conditions.

Note that all algorithms for rewriting views when the schema of their source data changes (e.g., [2, 8]) are orthogonal to our approach. This is due to the fact that our algorithm stops at status determination and does not perform any rewritings. A designer can apply any rewriting algorithm, provided that he pays the annotation effort that each of the methods of the literature requires (e.g., LAV/GAV/GLAV or any other kind of metadata expressions). For example, such an expression could state that two select-project fragments of two relations are semantically equivalent. Due to this generality, our approach can be extended in the presence of new results on such algorithms.

A short, first version of this paper appears in [15], where (a) the graph model is presented and (b) the general framework is informally presented. [15] sketches the basic concepts of a framework for annotating the database graph with policies concerning the behavior of nodes in the presence of hypothetical changes. In this paper, we extend the above work in the following ways. First, we elaborate an enriched version of the graph model by incorporating DML statements. ETL activities utilize DML statements for storing temporary data or filtering out redundant data; thus, the incorporation of such statements complements the representation of ETL workflows. Second, we present the mechanism for impact prediction in much more detail, both in terms of the algorithmic internals and in terms of system architecture. In this context, we also give a more elaborate version
of the algorithm for the propagation of changes. Third, the discussions on the management of incomplete or overriding policies are novel in this paper. Finally, we present a detailed experimental study for the above that is not present in [15].

Model mappings. Model management [3, 4] provides a generic framework for managing model relationships, comprising three fundamental operators: match, diff and merge. Our proposal assigns semantics to the match operator for the case of model evolution, where the source model of the mapping is the original database graph and the target model is the resulting database graph after evolution management has taken place. Velegrakis et al. have proposed a similar framework, namely ToMAS, for the management of evolution. Still, the model of [24] is more restrictive, in the sense that it is geared towards retaining the original semantics of the queries by keeping mappings consistent when changes occur. Our work is a larger framework that allows the restructuring of the database graph (i.e., model) either towards keeping the original semantics or towards its readjustment to the new semantics. Lastly, in [7] AutoMed, a framework for managing schema evolution in data warehouse environments, is presented. The authors introduce a schema-transformation-based approach to handle evolution of the source and the warehouse schema. Complex evolution events are expressed as simple transformations comprising addition, deletion, renaming, expansion and contraction of a schema construct. They also deal with the evolution of materialized data with the use of IQL, a functional query language supporting several primitive operators for manipulating lists. Both [24] and [7] can be used orthogonally to our approach, for the case that affected constructs must preserve the old semantics (i.e., block policy in our framework) or for the case that complex evolution events must be decomposed into a set of elementary transformations, respectively.
8 Conclusions

In this paper, we have dealt with the problem of impact prediction of schema changes in data warehouse environments. The strong flavor of inter-module dependency in the back stage of a data warehouse makes the problem of evolution very important in these settings. We have presented a graph model that suitably models the different constructs of an ETL workflow and captures DML statements. We have provided a formal method for performing impact prediction for the adaptation of workflows to evolution events occurring at their sources. Appropriate policies allow the automatic readjustment of the graph in the presence of potential changes. Finally, we have presented our prototype based on the proposed framework, and we have experimentally assessed our approach with respect to its effectiveness and efficiency over real-world ETL workflows.
Acknowledgments

We would like to thank the anonymous reviewers for their suggestions on the structure and presentation of the paper, which have greatly improved it.
References

1. Banerjee, J., Kim, W., Kim, H.J., Korth, H.F.: Semantics and implementation of schema evolution in object-oriented databases. In: Proc. ACM Special Interest Group on Management of Data, pp. 311–322 (1987)
2. Bellahsene, Z.: Schema evolution in data warehouses. Knowledge and Information Systems 4(3), 283–304 (2002)
3. Bernstein, P., Levy, A., Pottinger, R.: A Vision for Management of Complex Models. SIGMOD Record 29(4), 55–63 (2000)
4. Bernstein, P., Rahm, E.: Data warehouse scenarios for model management. In: Laender, A.H.F., Liddle, S.W., Storey, V.C. (eds.) ER 2000. LNCS, vol. 1920, pp. 1–15. Springer, Heidelberg (2000)
5. Blaschka, M., Sapia, C., Höfling, G.: On Schema Evolution in Multidimensional Databases. In: Mohania, M., Tjoa, A.M. (eds.) DaWaK 1999. LNCS, vol. 1676, pp. 153–164. Springer, Heidelberg (1999)
6. Bouzeghoub, M., Kedad, Z.: A logical model for data warehouse design and evolution. In: Kambayashi, Y., Mohania, M., Tjoa, A.M. (eds.) DaWaK 2000. LNCS, vol. 1874, pp. 178–188. Springer, Heidelberg (2000)
7. Fan, H., Poulovassilis, A.: Schema Evolution in Data Warehousing Environments – A Schema Transformation-Based Approach. In: Atzeni, P., Chu, W., Lu, H., Zhou, S., Ling, T.-W. (eds.) ER 2004. LNCS, vol. 3288, pp. 639–653. Springer, Heidelberg (2004)
8. Gupta, A., Mumick, I.S., Rao, J., Ross, K.A.: Adapting materialized views after redefinitions: Techniques and a performance study. Information Systems J. 26(5), 323–362 (2001)
9. Golfarelli, M., Lechtenbörger, J., Rizzi, S., Vossen, G.: Schema Versioning in Data Warehouses. In: Atzeni, P., Chu, W., Lu, H., Zhou, S., Ling, T.-W. (eds.) ER 2004. LNCS, vol. 3288, pp. 415–428. Springer, Heidelberg (2004)
10. Kaas, C., Pedersen, T.B., Rasmussen, B.: Schema Evolution for Stars and Snowflakes. In: Sixth Int'l Conference on Enterprise Information Systems (ICEIS 2004), pp. 425–433 (2004)
11. Liu, C.T., Chrysanthis, P.K., Chang, S.K.: Database schema evolution through the specification and maintenance of changes on entities and relationships. In: Loucopoulos, P. (ed.) ER 1994. LNCS, vol. 881, pp. 132–151. Springer, Heidelberg (1994)
12. Mohania, M., Dong, D.: Algorithms for adapting materialized views in data warehouses. In: Proc. International Symposium on Cooperative Database Systems for Advanced Applications (CODAS 1996), pp. 309–316 (1996)
13. Nica, A., Lee, A.J., Rundensteiner, E.A.: The CSV algorithm for view synchronization in evolvable large-scale information systems. In: Schek, H.-J., Saltor, F., Ramos, I., Alonso, G. (eds.) EDBT 1998. LNCS, vol. 1377, pp. 359–373. Springer, Heidelberg (1998)
14. Papastefanatos, G., Vassiliadis, P., Vassiliou, Y.: Adaptive Query Formulation to Handle Database Evolution. In: Proc. Forum of the Eighteenth Conference on Advanced Information Systems Engineering (CAISE 2006) (2006)
15. Papastefanatos, G., Vassiliadis, P., Simitsis, A., Vassiliou, Y.: What-if analysis for data warehouse evolution. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2007. LNCS, vol. 4654, pp. 23–33. Springer, Heidelberg (2007)
16. Papastefanatos, G., Kyzirakos, K., Vassiliadis, P., Vassiliou, Y.: Hecataeus: A Framework for Representing SQL Constructs as Graphs. In: Proc. Tenth International Workshop on Exploring Modeling Methods in Systems Analysis and Design (held with CAISE) (2005)
17. Papastefanatos, G., Anagnostou, F., Vassiliadis, P., Vassiliou, Y.: Hecataeus: A What-If Analysis Tool for Database Schema Evolution. In: Proc. Twelfth European Conference on Software Maintenance and Reengineering (CSMR 2008) (2008)
18. Papastefanatos, G., Vassiliadis, P., Simitsis, A., Aggistalis, K., Pechlivani, F., Vassiliou, Y.: Language Extensions for the Automation of Database Schema Evolution. In: Proc. Tenth International Conference on Enterprise Information Systems (ICEIS 2008) (2008)
19. Ra, Y.G., Rundensteiner, E.A.: A transparent object-oriented schema change approach using view evolution. In: Proc. Eleventh International Conference on Data Engineering (ICDE 1995), pp. 165–172 (1995)
20. Roddick, J.F., et al.: Evolution and Change in Data Management – Issues and Directions. SIGMOD Record 29(1), 21–25 (2000)
21. Roddick, J.F.: A survey of schema versioning issues for database systems. Information and Software Technology J. 37(7) (1995)
22. Simitsis, A., Vassiliadis, P., Terrovitis, M., Skiadopoulos, S.: Graph-based modeling of ETL activities with multi-level transformations and updates. In: Tjoa, A.M., Trujillo, J. (eds.) DaWaK 2005. LNCS, vol. 3589, pp. 43–52. Springer, Heidelberg (2005)
23. Tsichritzis, D., Klug, A.C.: The ANSI/X3/SPARC DBMS Framework: Report of the Study Group on Database Management Systems. Information Systems 3(3), 173–191 (1978)
24. Velegrakis, Y., Miller, R.J., Popa, L.: Preserving mapping consistency under schema changes. VLDB J. 13(3), 274–293 (2004)
25. Zicari, R.: A framework for schema update in an object-oriented database system. In: Proc. Seventh International Conference on Data Engineering (ICDE 1991), pp. 2–13 (1991)
Appendix

In Table A.1, we show the statuses, i.e., the actions dictated at the detailed level of nodes, that the Propagate Changes algorithm assigns to visited nodes for combinations of events and node types, when the propagate policy prevails on the visited node. For each status, the new event induced by the assignment, which is further propagated over the graph, is also shown.

Table A.1. Statuses assigned to nodes when the propagate policy prevails

Event on the graph | On node | Scope¹ | Status | Raised Event
Add R/V/Q | None affected | N/A | N/A | N/A
Add A | R | S | Add Child | Add Attribute
Add A | V | S | Add Child | Add Attribute
Add A | Q | S | Add Child | Add Attribute
Add C | R | C | Add Child | Add Condition
Add C | V | S, P | Add Child, Modify Provider | Add Condition, Modify Condition
Add C | Q | S, P | Add Child, Modify Provider | Add Condition, Modify Condition
Add C | A | S | Add Child | Add Condition
Add GB | V | S, P | Add Child, Modify Provider | Add GB, Modify GB
Add GB | Q | S, P | Add Child, Modify Provider | Add GB, Modify GB
Add OB | V | S, P | Add Child, Modify Provider | Add OB, Modify OB
Add OB | Q | S, P | Add Child, Modify Provider | Add OB, Modify OB
Delete R | R | S | Delete Self | Delete Relation²
Delete V | V | S | Delete Self | Delete View²
Delete Q | Q | S | Delete Self | Delete Query²
Delete A | R | C | Delete Child | None
Delete A | V | C | Delete Child | None
Delete A | Q | C | Delete Child | None
Delete A | A | S | Delete Self | Delete Attribute
Delete A | C | P | Delete Self | Delete Condition
Delete A | F | P | Delete Self | Delete Function
Delete A | GB | P | Delete Self, Modify Self³ | Delete GB, Modify GB
Delete A | OB | P | Delete Self, Modify Self³ | Delete OB, Modify OB
Delete C | R | C | Delete Child | Delete Condition
Delete C | V | C, P | Delete Child, Modify Provider | Delete Condition, Modify Condition
Delete C | Q | C, P | Delete Child, Modify Provider | Delete Condition, Modify Condition
Delete C | A | C | Delete Child | Delete Condition
Delete C | C | S, C | Delete Self, Delete Child | Delete Condition, Modify Condition
Delete F | A | C | Delete Self | Delete Attribute
Delete F | C | C | Delete Self | Delete Condition
Delete F | F | C | Delete Self | Delete Function
Delete F | GB | C | Delete Self, Modify Self³ | Delete GB, Modify GB
Delete F | OB | C | Delete Self, Modify Self³ | Delete OB, Modify OB
Delete GB | V | C, P | Delete Child, Modify Provider | Delete GB, Modify GB
Delete GB | Q | C, P | Delete Child, Modify Provider | Delete GB, Modify GB
Delete GB | GB | S | Delete Self | Delete GB
Delete OB | V | C, P | Delete Child, Modify Provider | Delete OB, Modify OB
Delete OB | Q | C, P | Delete Child, Modify Provider | Delete OB, Modify OB
Delete OB | OB | S | Delete Self | Delete OB
Rename R | R | S | Rename Self | Rename Relation
Rename R | V | P | Rename Provider | None
Rename R | Q | P | Rename Provider | None
Rename V | V | S | Rename Self | Rename View
Rename V | Q | P | Rename Provider | None
Rename A | R | C | Rename Child | None
Rename A | V | C | Rename Child | None
Rename A | Q | C | Rename Child | None
Rename A | A | S | Rename Self | Rename Attribute
Rename A | C | P | Rename Provider | None
Rename A | F | P | Rename Provider | None
Rename A | GB | P | Rename Provider | None
Rename A | OB | P | Rename Provider | None
Modify Domain A | R | C | Modify Child | None
Modify Domain A | V | C | Modify Child | None
Modify Domain A | Q | C | Modify Child | None
Modify Domain A | A | S | Modify Self | Modify Attribute
Modify Domain A | C | P | Modify Provider | Modify Condition
Modify Domain A | F | P | Modify Provider | Modify Function
Modify Domain A | GB | P | Modify Provider | Modify GB
Modify Domain A | OB | P | Modify Provider | Modify OB
Modify C | R | C | Modify Child | Modify Condition
Modify C | V | C, P | Modify Child, Modify Provider | Modify Condition
Modify C | Q | C, P | Modify Child, Modify Provider | Modify Condition
Modify C | A | C | Modify Child | Modify Condition
Modify C | C | S, C | Modify Self, Modify Child | Modify Condition
Modify F | A | C | Modify Self | Modify Attribute
Modify F | C | C | Modify Self | Modify Condition
Modify F | GB | C | Modify Self | Modify GB
Modify F | OB | C | Modify Self | Modify OB
Modify GB | V | C, P | Modify Child, Modify Provider | Modify GB
Modify GB | Q | C, P | Modify Child, Modify Provider | Modify GB
Modify OB | V | C, P | Modify Child, Modify Provider | Modify OB
Modify OB | Q | C, P | Modify Child, Modify Provider | Modify OB
Modify P | P | S | Modify Self | Modify Parameter
Modify P | C | C | Modify Self | Modify Condition

¹ Scope: S (SELF), C (CHILD), P (PROVIDER).
² All attributes in the schema are first deleted before the Delete Relation, Delete View and Delete Query events occur.
³ The value of the status depends on whether the GB/OB node has other children: if no other children exist, Delete GB/OB is assigned; otherwise, Modify GB/OB.
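To clarify how a propagation engine might consume Table A.1, here is a minimal sketch (not the actual HECATAEUS implementation; names and the selection of rows are illustrative) that encodes a few of the rows above as a rule table keyed by the event and the kind of visited node, and looks up the status to assign and the events to raise.

```python
# Minimal sketch in the spirit of Table A.1; only a few rows are encoded.

# (event, visited-node kind) -> (scope, status, raised events)
PROPAGATE_RULES = {
    ("ADD_ATTRIBUTE",    "RELATION"):  ("S", "ADD_CHILD",       ["ADD_ATTRIBUTE"]),
    ("ADD_ATTRIBUTE",    "QUERY"):     ("S", "ADD_CHILD",       ["ADD_ATTRIBUTE"]),
    ("DELETE_ATTRIBUTE", "ATTRIBUTE"): ("S", "DELETE_SELF",     ["DELETE_ATTRIBUTE"]),
    ("DELETE_ATTRIBUTE", "CONDITION"): ("P", "DELETE_SELF",     ["DELETE_CONDITION"]),
    ("RENAME_RELATION",  "QUERY"):     ("P", "RENAME_PROVIDER", []),
}

def apply_propagate_policy(event, node_kind):
    """Return (status, raised events) for a visited node whose prevailing
    policy is 'propagate'; unknown combinations leave the node untouched."""
    scope, status, raised = PROPAGATE_RULES.get((event, node_kind), (None, None, []))
    return status, raised

# Example: an attribute deletion reaching a condition node that uses the attribute.
status, raised = apply_propagate_policy("DELETE_ATTRIBUTE", "CONDITION")
print(status, raised)   # DELETE_SELF ['DELETE_CONDITION']
```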
Author Index

Aramburu, María José 1
Banerjee, Sandipto 72
Berlanga, Rafael 1
Davis, Karen C. 72
Nebot, Victoria 1
Niemi, Tapio 97
Niinimäki, Marko 97
Papastefanatos, George 147
Pedersen, Torben Bach 1
Pérez, Juan Manuel 1
Pinet, François 37
Schneider, Michel 37
Sellis, Timos 120
Simitsis, Alkis 120, 147
Skoutas, Dimitrios 120
Vassiliadis, Panos 147
Vassiliou, Yannis 147