Model-Based Software and Data Integration: First International Workshop, MBSDI 2008, Berlin, Germany, April 1-3, 2008, Proceedings (Communications in Computer and Information Science)
The First International Workshop on Model-Based Software and Data Integration (MBSDI 2008) was the first event of its kind in a forthcoming series of activities at TU Berlin, providing a scientific discussion and exchange forum for both academic and industrial researchers. We aimed at researchers, engineers and practitioners who focus on advanced, model-based solutions in the area of software and information integration and interoperability. As with every beginning, the resonance to our calls amid today's flood of workshops was somewhat unpredictable, and we did not really know how many paper submissions to expect. We were pleasantly surprised, considering the rather short lead time for organizing the meeting and the very specialized and focused topic. After a rigorous review process, in which each paper received at least four reviews, we were able to accept nine regular papers; additionally, we asked our invited speakers for extended abstracts.

The selected papers mirror the main aspect and the mission of the workshop: to promote research in the field of model-based software engineering, essentially focusing on methodologies for data and software (component) integration. Integrating data from heterogeneous distributed sources, and at the same time integrating software components and systems, in order to achieve full interoperability, is one of the major challenges and research areas in the software industry today. It is also the major IT cost-driving factor. During the past few years, the relevance of "model-based" approaches to these extremely time- and money-consuming integration tasks has come into the special focus of software engineering methods. OMG's keyword model-driven architecture (MDA) has brought model-based approaches to the broad attention of the software industry and science.
On the other hand, a strong community centered around the service-oriented architecture (SOA) paradigm and service science in general has covered significant ground in defining interoperability standards in the areas of communication, discovery, description and binding, as well as business process modeling and processing. The two communities have remained largely isolated, resulting in MDA concepts not being broadly applied to system integration. Our workshop addressed this issue. The selection of papers tried to introduce strict model-based design, verification, development and evolution methodologies to system integration concepts such as SOA. Through our three thematic paper sessions – Data Integration; Software Architectures, Services and Migration; and Model-Based and Semantic Approaches – we tried to offer a roadmap and the vision of a new methodology.

We had a distinguished keynote speaker, Bran Selic from IBM Rational, whose work has extensively contributed to the very definition of model-driven development methods and tools, as well as to the definition of the Unified Modeling Language (UML). He presented a talk, "Key Technical and Cultural Challenges for Model-Based Software Engineering," in which he identified short- and long-term research problems that have to be resolved to facilitate faster adoption of model-based software engineering methods. We also had three prominent invited speakers: Miroslaw Malek (Humboldt University Berlin), who shared his views on the art of creating and integrating models; Stefan Tai (University of Karlsruhe), who proposed service science as an interdisciplinary approach for service modeling; and Volker Markl (IBM Almaden Research Center), who presented the data mashup project.

The workshop was part of the Berlin Software Integration Week 2008, in the context of a German regional initiative for the collaborative development of methodologies and tools for "Model-Based Software and Data Integration," a joint effort of software SMEs and science under the acronym BIZYCLE. Besides MBSDI 2008, the week featured a one-day industrial forum, where problems were discussed and solutions proposed and demonstrated in the area of software interoperability and integration. This forum addressed several industrial sectors, such as production and logistics, health, facility management and publishing. The Berlin Software Integration Week 2008 presented a unique opportunity for knowledge and technology transfer between industrial practitioners and academic researchers.

We would wholeheartedly like to thank all the people who made MBSDI 2008 possible: first of all, our Program Committee members, for their guidance and for the diligent review process, which enabled us to select an exciting program. We would also like to thank our industrial partners from the BIZYCLE consortium for their support in understanding integration problems in different industrial contexts.
The event would not have been possible without the support and the grant given by the Federal Ministry of Education and Research (BMBF), and its subordinated project management agency PTJ. Finally, our thanks go to the local organizers, members of the CIS group at the Technical University of Berlin (especially Mario Cartsburg and Timo Baum) and Katja Baumheier of Baumheier Eventmanagement GbR. We hope that the attendees enjoyed the final scientific program and the industry symposium, got interesting insights from the presentations, got involved in the discussions, struck up new friendships and got inspired to contribute to MBSDI 2009! April 2008
Ralf-Detlef Kutsche
Nikola Milanovic
Organization
MBSDI 2008 was organized by the Berlin University of Technology, Institute for Software Engineering and Theoretical Computer Science, research group Computation and Information Structures (CIS), in cooperation with the BIZYCLE consortium industrial partners and the German Federal Ministry of Education and Research. Our Program Committee was formed by 26 members, many of them from Germany, but many also from universities and research institutions all over the world. Thanks to all of them for their engagement and their critical reviewing work.
Program Committee

Co-chairs:
Ralf-Detlef Kutsche (TU Berlin, Germany)
Nikola Milanovic (TU Berlin, Germany)
Referees
Roberto Baldoni (University of Rome, Italy)
Andreas Billig (University of Joenkoeping, Sweden)
Susanne Busse (TU Berlin, Germany)
Tru Hoang Cao (HCMUT, Vietnam)
Fabio Casati (University of Trento, Italy)
Stefan Conrad (University of Duesseldorf, Germany)
Bich-Thuy T. Dong (HCMUNS, Vietnam)
Michael Goedicke (University of Duisburg-Essen, Germany)
Martin Grosse-Rhode (Fraunhofer ISST, Germany)
Oliver Guenther (HU Berlin, Germany)
Willi Hasselbring (University of Oldenburg, Germany)
Maritta Heisel (University of Duisburg-Essen, Germany)
Arno Jacobsen (University of Toronto, Canada)
Andreas Leicher (Carmeq GmbH, Germany)
Michael Loewe (FHDW, Germany)
Aad van Moorsel (University of Newcastle, UK)
Felix Naumann (HPI, Germany)
Andreas Polze (HPI, Germany)
Ralf Reussner (University of Karlsruhe, Germany)
Kurt Sandkuhl (University of Joenkoeping, Sweden)
Alexander Smirnov (SPIIRAS, Russia)
Stefan Tai (IBM Yorktown Heights, USA)
Gerrit Tamm (HU Berlin, Germany)
Bernhard Thalheim (University of Kiel, Germany)
Gregor Wolf (Klopotek AG, Germany)
Katinka Wolter (HU Berlin, Germany)
Uwe Zdun (TU Wien, Austria)
BIZYCLE "Entrepreneurial Regions" Context of MBSDI 2008

The workshop series MBSDI, in its first edition in 2008, was born out of the project context of the Innovation Initiative "Entrepreneurial Regions," set up by the German Federal Ministry of Education and Research (BMBF). This initiative, particularly its part program "Innovative Regional Growth Cores," stands for innovation-oriented regional alliances which develop a region's identified core competencies into clusters, on a high level and with strict market orientation. BMBF has systematically developed a series of such programs for the New German Länder since 1999.

BIZYCLE, the "Evolution-Oriented Technology Platform for the Integration of Enterprise Management Software," is one of the "Innovative Regional Growth Cores"; it was started in February 2007 as a joint activity of six industrial partners, SMEs in Berlin, and CIS / TU Berlin as the academic partner of this consortium. After the first project year, it seemed appropriate to establish a long-term academic and industrial collaboration initiative in the form of a scientific conference combined with an industrial forum in the context of our focus area: Model-Based Software and Data Integration. We created the "Berlin Software Integration Week 2008" as our umbrella for "Model-Based Software and Data Integration 2008" and the "BIZYCLE Industrial Forum 2008." Looking forward to establishing a long-term perspective in this challenging area of research and industrial engineering, we had – besides the regularly reviewed program of MBSDI 2008 – prominent invited international speakers from academia for this part of the software integration week, as well as engaged and active entrepreneurs from our region Berlin-Brandenburg and from outside, for the BIZYCLE Industrial Forum 2008.
Sponsoring Institutions MBSDI 2008 and the BIZYCLE Industrial Forum were partially supported under grant number 03WKBB1B by the German Federal Ministry of Education and Research (BMBF).
Table of Contents
Invited Papers

The Art of Creating Models and Models Integration ..................... 1
    Miroslaw Malek

Modeling Services – An Inter-disciplinary Perspective .................. 8
    Stefan Tai and Steffen Lamparter

Data Mashups for Situational Applications .............................
    Volker Markl, Mehmet Altinel, David Simmen, and Ashutosh Singh
The Art of Creating Models and Models Integration

Miroslaw Malek
Humboldt-Universität zu Berlin
Unter den Linden 6, 10099 Berlin
The sciences do not try to explain, they hardly even try to interpret, they mainly make models. By a model is meant a mathematical construct which, with the addition of certain verbal interpretations, describes observed phenomena. The justification of such a mathematical construct is solely and precisely that it is expected to work. John von Neumann (1903 - 1957)
1 Introduction

The art of abstracting from physical or virtual objects and behaviors for the development of system models is critical to a variety of applications, ranging from business processes to computers and communication. As computer and communication systems begin to pervade all walks of life, the question of design, performance and dependability evaluation of such systems proves to be increasingly important. Today's challenge is to develop models that can not only give a qualitative understanding of ever more complex and diverse phenomena and systems, but can also aid in a quantitative assessment of functional and non-functional properties. System modeling should help us understand the functionality and properties of the system, and models are used for development and for communication with other developers and users [1]. Different models present the system from different perspectives:

• Structural perspective, describing the system organization or data structure
• Behavioral perspective, showing the behavior of the system
• Hierarchical perspective, depicting the system's hierarchical organization to cope with complexity
• External perspective, reflecting the system's context or environment
Hardware is typically designed by professionals, while software is written by a broad spectrum of the population, ranging from experts to dilettantes. This state of affairs poses a number of challenges. In this paper we first give a historical perspective on how we got to the present state, and then point out major problems and challenges with state-of-the-art models in general, as well as with their integration, using the specific example of a model for failure prediction.
2 Historical Perspective

Modeling has been around since the beginning of time. The first traceable abstractions of reality were numbers, and this process dates back to the beginnings of mankind [2]. Astronomy and architecture were the next areas where models were used, from about 4000 BC. Mathematical models spread over Babylon, Egypt and India over 4,000 years ago. With Thales of Miletus (circa 600 BC), geometry became a useful tool in reflecting and analyzing reality, and since then many branches of mathematics and corresponding models have flourished and proved their use. But it was not until Euler in the 18th century that graph theory was formalized, giving rise to a plethora of mathematical models reflecting communication patterns and networks, which gave impetus to research in a large number of areas ranging from psychology to electrical engineering.

The breakthrough in using models in computers came with Boolean algebra, which led to modeling computer hardware by logic gates such as AND, OR and NOT. The logic gate model of hardware has enjoyed incredible popularity ever since and still forms the basis of computer design. The beauty and power of the gate model are simply astounding: it abstracts complicated circuits into gates and allows one to form any Boolean function, which can then be tested and verified for logical correctness. This is model-driven architecture par excellence. But logical correctness is only part of the story. Such a model does not reflect time and the hazards and races associated with it, nor temperature, quality, reliability and many other non-functional properties of integrated circuits. At the beginning of the 20th century, Andrey Markov developed a theory of stochastic processes that enjoys enormous popularity to date in modeling stochastic events in computer and communication systems.
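The compose-and-verify workflow that makes the gate model so powerful can be illustrated with a short sketch (not from the paper): gates as pure Boolean functions, a composed circuit, and exhaustive truth-table verification of its logical correctness.

```python
from itertools import product

# Elementary gate models: each gate is a pure Boolean function.
AND = lambda a, b: a and b
OR = lambda a, b: a or b
NOT = lambda a: not a

def xor_from_gates(a, b):
    # XOR composed only from AND, OR and NOT gates.
    return OR(AND(a, NOT(b)), AND(NOT(a), b))

def verify(circuit, spec, n_inputs):
    """Exhaustively check a gate-level circuit against its specification."""
    return all(circuit(*bits) == spec(*bits)
               for bits in product([False, True], repeat=n_inputs))

print(verify(xor_from_gates, lambda a, b: a != b, 2))  # True
```

Exhaustive checking is feasible here because the truth table of an n-input circuit has only 2^n rows; as the paper notes, this verifies logical correctness but says nothing about timing, hazards or other non-functional properties.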
The next breakthrough in models of computing came with flow process charts, which were originally used to model business processes (Frank Gilbreth, 1921) and were then used by programmers in the late fifties to capture the logic flow of a computer program in the form of a flowchart. A flowchart depicts the realization of an underlying algorithm; depending on its level, the model may just sketch the way to achieve a given goal or give a detailed implementation (execution) of a program realizing a given algorithm. It mainly focuses on basic functionality, while non-functional properties are largely neglected. A refined form of the flowchart is the Nassi–Shneiderman diagram [3], which supports Dijkstra's structured programming concept. Since its inception in the sixties, Carl Adam Petri's proposal for system modeling (called Petri net theory today) has enjoyed tremendous interest and has resulted in a number of applications.
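The core of the Petri net formalism mentioned above is a small firing rule, which can be sketched in a few lines (an illustrative toy, with an invented producer/consumer net, not an example from the paper): places hold tokens, and a transition fires when every input place holds a token, consuming inputs and producing outputs.

```python
# Minimal Petri net sketch: a transition is (input places, output places).
def enabled(marking, transition):
    """A transition is enabled when every input place has at least one token."""
    inputs, _ = transition
    return all(marking.get(p, 0) >= 1 for p in inputs)

def fire(marking, transition):
    """Fire an enabled transition: consume input tokens, produce output tokens."""
    inputs, outputs = transition
    m = dict(marking)
    for p in inputs:
        m[p] -= 1
    for p in outputs:
        m[p] = m.get(p, 0) + 1
    return m

# Hypothetical producer/consumer net: producing moves a token to "buffer".
t_produce = (["idle"], ["buffer"])
t_consume = (["buffer"], ["idle"])
m0 = {"idle": 1, "buffer": 0}
m1 = fire(m0, t_produce)
print(enabled(m1, t_consume))  # True
```

From this firing rule one obtains the reachability graph of markings, which is what makes Petri nets attractive for analyzing concurrency and synchronization.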
In the eighties the dominant development was design patterns [4], which promoted the reuse of proven solutions in the software development process; with objects [5] and components [6], software reuse became a reality. Also, a representation of finite state automata called statecharts was proposed by David Harel [7]. With the further development of computer languages, this formed the basis for textual models of algorithms, leading to the Unified Modeling Language (UML) in the 1990s, which allowed models to be semi-formally synthesized and encoded. It also gave rise to the so-called Model-Driven Architecture (MDA), which promotes models that can be automated in part (unlike CASE – Computer-Aided Software Engineering – where full automation is the goal) to accelerate and improve the software development process. MDA strives for separation of concerns and distinguishes four levels of models:

• Computation Independent Model (CIM) – description and specification
• Platform Independent Model (PIM) – a model of the business process or service
• Platform Specific Model (PSM) – a platform-dependent model of the architecture or service
• Code Model – platform implementation
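The PIM-to-PSM step can be made concrete with a toy sketch (all names, the abstract type vocabulary, and the choice of SQL as the target platform are invented for illustration): a platform-independent entity model is mechanically mapped onto one platform-specific representation.

```python
from dataclasses import dataclass

# Platform Independent Model: a business entity with no platform detail.
@dataclass
class PimEntity:
    name: str
    attributes: dict  # attribute name -> abstract type ("string", "int", ...)

# One possible Platform Specific Model mapping: abstract types to SQL types.
SQL_TYPES = {"string": "VARCHAR(255)", "int": "INTEGER", "bool": "BOOLEAN"}

def pim_to_sql_psm(entity: PimEntity) -> str:
    """Transform a PIM entity into a platform-specific SQL table definition."""
    cols = ",\n  ".join(f"{a} {SQL_TYPES[t]}"
                        for a, t in entity.attributes.items())
    return f"CREATE TABLE {entity.name} (\n  {cols}\n);"

customer = PimEntity("Customer", {"name": "string", "age": "int"})
print(pim_to_sql_psm(customer))
```

The same PIM could equally be mapped to a different PSM (an XML schema, a class skeleton), which is exactly the separation of concerns MDA aims for.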
This is already remarkable progress. Unfortunately, due to the complexity of the systems we would like to model, a comprehensive description of most systems is not feasible; frequently it is also not necessary, as in MDA we want to focus on the goals and properties that we, the developers and users, are interested in, and not on the entire system. We have to be realistic and recognize that reflecting reality fully may not be feasible. When Niels Bohr was searching for a complete description of the world (nature) around us, believing that everything can be deduced by logic, he concluded after many years of research that there is a mutually exclusive but at the same time mutually complementary world of the illogical. He called this phenomenon complementarity, and he even designed a coat of arms with the inscription contraria sunt complementa and the yin-yang symbol, reflecting his principle of complementarity on which the fundamental laws of physics are based.
3 Problems and Challenges

An important part of the work of a software or hardware engineer is the ability to translate a real-world phenomenon into his or her own language. This art, as it evolves into a science, results in the success, partial success or failure of the undertaking at hand. In our activities, we should always remember that a model is usually a simplified version of reality, so stating that "the system works" should only apply to the model, as reality may turn out to be different. Good models reflect reality very closely; bad ones behave differently from, or even contrary to, the goals of the real system. It is highly desirable to be able to measure the distance between a model and reality in order to evaluate the model's goodness. There are two problems with this task: first, we can do it only with respect to a limited number of variables (limited by the model), and second, we must be able to measure reality (a real system) without influencing its behavior. The ability to determine this distance would help us to assess a model's relevance in achieving the desired goal. Models can vary in
their level of formalism, complexity (level of detail) and quality in meeting the intended goals. These characteristics largely depend on the basic function of the model and the complexity of the modeling goal. The blessing and the curse with artifacts such as software is that the so-called "reality" does not even exist during system development. Only once the software is developed can we run it on a particular platform and measure it. The fundamental problems with modeling of computer systems are:

1. Creation of artifacts, systems that do not exist in nature, so that it is difficult to assess their quality quantitatively even once they are created, as there is no reference point. The only way out is comparative analysis (e.g., who has developed the bigger or faster engine, algorithm, etc.).
2. Unconstrained design space and unconstrained objectives (software engineers can promise to do the impossible, e.g., to "exceed the speed of light"; hardware engineers typically cannot, as they are constrained by physics).
3. Complexity of systems and their behavior, which is frequently prohibitive and, despite methods like abstraction (top-down design, hierarchical design), partitioning and sequencing, poses an ever-growing challenge.
4. Demand for an ever-growing number of features (e.g., scalability, adaptivity or real-time behavior).
5. Conflicting requirements (e.g., "a system should be fault tolerant and secure" could be interpreted as file replication and distribution for fault tolerance versus keeping a single copy at one location for security, an apparent contradiction).
6. Dynamicity of systems, caused by varying configurations, patches, updates and upgrades/downgrades, which requires highly flexible and dynamic models.
7. Composability and integration (e.g., the ability to combine various service models into a business process; integration of models of structure, behavior and both functional and non-functional properties; integration of software, hardware, interoperability/infrastructure and personnel with respect to a given property).
Our knowledge of reality is structured by our model and by the way we have abstracted reality. In the case of non-existing objects, we have to assume what reality should look like and concentrate on the goals and scope we want to achieve with a particular model, while escaping the question of realizability. This can be accomplished relatively easily at the CIM level. The model has to become increasingly concrete and realistic as we approach the implementation phase, and this process of refinement can be painstakingly difficult. Models may serve different purposes, such as explaining phenomena (frequent in physics), knowledge transfer, prediction (e.g., reliability or failure prediction), decision making and, finally, support of the specification, design and implementation process.
4 Example – Models for Failure Prediction

Non-functional properties such as availability or security play an ever-increasing role in system development, in addition to the functionality of a system or service. The purpose of this example is to show how system properties can be modeled, and to pose the challenge of incorporating such models into a service-oriented architecture (SOA) framework. We have developed a best practice guide, backed by methodology and models [8], [9], [10], for availability enhancement using failure prediction and recovery methods.
This best practice guide [8] is based on the experience we have gained when investigating these topics:

a. complexity reduction, showing that selecting the most predictive subset of variables contributes more to model quality than selecting a particular linear or nonlinear modeling technique;
b. information gain of using numerical vs. categorical data, finding that including log file data in the modeling process may have a negative impact on model quality due to increased processing requirements;
c. data-based empirical modeling of complex software systems, cross-benchmarking linear and nonlinear modeling techniques, and finding that nonlinear approaches seem to be consistently superior to linear approaches, although not always significantly.
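The variable-selection idea in topic (a) can be sketched with a generic greedy forward-selection loop (this is an illustrative sketch, not the authors' exact procedure; the candidate variables and their quality scores are invented): variables are added one at a time as long as each addition improves the model score.

```python
def forward_selection(candidates, score, max_vars=3):
    """Greedy forward selection: repeatedly add the variable that most
    improves the model score, stopping when no variable helps."""
    selected, best = [], float("-inf")
    while len(selected) < max_vars:
        gains = {v: score(selected + [v]) for v in candidates if v not in selected}
        if not gains:
            break
        v, s = max(gains.items(), key=lambda kv: kv[1])
        if s <= best:
            break
        selected.append(v)
        best = s
    return selected

# Toy score: a hypothetical predictive-quality contribution per variable.
quality = {"kernel_mem": 0.6, "semaphores": 0.3, "log_lines": -0.1}
score = lambda subset: sum(quality[v] for v in subset)
print(forward_selection(list(quality), score))  # ['kernel_mem', 'semaphores']
```

In practice the score would be a cross-validated prediction-quality estimate of a model fitted on the subset; note how the negatively contributing variable is never added, mirroring finding (b) that some data can hurt model quality.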
A typical way to analyze the impact of faults and the fault tolerance of a system is to develop fault models and failure modes and then to evaluate them. In order to model and estimate the dependability of an SOA, it is important to know which faults and errors the system has to tolerate to assure correct operation [11]. We then develop methods which can detect these faults, or the system's "misbehavior," and are thus able to predict failures. In combination with recovery schemes, the system's dependability can be enhanced. A number of modeling techniques have been applied to failure prediction in software systems: probability models, linear and nonlinear statistical models, expert-system-based models and Hidden Markov Models.

[Figure 1 depicts a control loop of building blocks: (a) system observation; (b) variable selection / complexity reduction (e.g., forward selection, system experts); (c) model estimation (e.g., ARMA/AR time series on numerical data, support vector machines); (d) model application (forecasting, sensitivity analysis); (e) reaction / actuation (online reaction schemes, offline system adaptation); and (f) closing the control loop.]
Fig. 1. Building blocks for modeling and forecasting performance variables as well as critical events in complex software systems, either during runtime or during off-line testing. System observations (a) include numerical time series data and/or categorical log files. The variable selection process (b) is frequently handled implicitly by system experts' ad-hoc theories or gut feeling; rigorous procedures are applied infrequently. In recent studies, attention has focused on the model estimation process (c). Univariate and multivariate linear regression techniques have been at the center of attention. Some nonlinear regression techniques, such as universal basis functions or support vector machines, have been applied as well. While forecasting has received a substantial amount of attention, sensitivity analysis (d) of system models has been largely marginalized. Closing the control loop (f) is still in its infancy. Choosing the right reaction scheme (e) as a function of quality of service and cost is nontrivial [8].
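The estimation and forecasting steps (c) and (d) can be illustrated with a minimal sketch: a least-squares AR(1) fit to a hypothetical monitoring series (the variable and its values are invented for illustration; real studies use richer ARMA or SVM models), iterated forward so that a forecast crossing a threshold could be treated as a failure warning.

```python
def fit_ar1(series):
    """Least-squares fit of an AR(1) model: x[t] = a * x[t-1] + b."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

def forecast(series, a, b, steps):
    """Iterate the fitted model forward to predict future values."""
    out, x = [], series[-1]
    for _ in range(steps):
        x = a * x + b
        out.append(x)
    return out

# Hypothetical kernel-memory readings trending upward (fault symptom).
mem = [10.0, 12.0, 14.4, 17.3, 20.7]
a, b = fit_ar1(mem)
print(forecast(mem, a, b, 3))
```

A reaction scheme (e) would compare the forecast against a criticality threshold and trigger, for example, a proactive restart or failover before the predicted failure.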
Reliability block diagrams, fault trees, Hidden Markov/Markov/semi-Markov chains, stochastic Petri nets and their combinations have been used for reliability and availability modeling. Such probability models can sometimes be solved in closed form, but will commonly be solved numerically and sometimes by discrete-event simulation. The key difficulty with such models is their parameterization and validation. For online failure prediction we have developed two models: 1) the Universal Basis Function (UBF) model, based on function approximation using selected variables, such as kernel memory fill-up or the number of semaphores per second, as fault symptoms [9]; and 2) the Hidden Semi-Markov Model (HSMM), in which error logs in the space and time domains are analyzed using pattern recognition methods [10]. The challenge is how to integrate such models of non-functional properties into the design and development process. Should we pose a question at every level: what can go wrong, and how can such a problem be avoided? This type of process requires a deep understanding of the task at hand, and that is why it can be only partially automated. The challenge of building, for example, secure systems can be even more demanding, as there is only a binary answer to the question of security. The security community at present is left with a mainly qualitative assessment of security, analyzing specific threats and evaluating whether a given system is protected from them or not.
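The payoff of failure prediction for availability can be sketched with the standard two-state (up/down) steady-state availability formula A = MTTF / (MTTF + MTTR). This is a textbook model, not the paper's UBF or HSMM models, and the numbers, including the fraction of failures assumed to be predicted in advance, are hypothetical.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Steady-state availability of a two-state (up/down) Markov model:
    A = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

def availability_with_prediction(mttf, mttr, predicted_fraction, fast_mttr):
    """If a fraction of failures is predicted in advance, those can be handled
    by a cheaper recovery (e.g., proactive failover), shrinking mean repair time."""
    effective_mttr = predicted_fraction * fast_mttr + (1 - predicted_fraction) * mttr
    return steady_state_availability(mttf, effective_mttr)

base = steady_state_availability(1000.0, 2.0)
improved = availability_with_prediction(1000.0, 2.0, 0.8, 0.1)
print(f"{base:.5f} -> {improved:.5f}")
```

Even with these toy numbers, predicting 80% of failures and handling them with a fast recovery path shortens the effective repair time substantially, which is exactly the lever the prediction-plus-recovery approach targets.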
5 Conclusions

A number of challenges have been outlined in this paper, but some of them are more pressing than others. The problem of composability/integration while preserving certain properties requires the utmost attention, as does the question of taming complexity. The issues of handling the two biggest tyrants on earth, chance and time,¹ continue to attract and fascinate researchers and engineers, but the difficulty of integrating them with other models remains. The intricacy of chance requires the ability to cope with unpredictability (faults and failures), and the main problem with time is that it cannot be stopped (or even slowed down), posing another eternal challenge. Finally, the art of creating models and integrating them will continue to evolve into a science.
References

1. Sommerville, I.: Software Engineering, 8th edn. Pearson Education, London (2006)
2. Schichl, H.: Models and the History of Modeling. In: Kallrath, J. (ed.) Modeling Languages in Mathematical Optimization. Kluwer, Boston (2006)
3. Nassi, I., Shneiderman, B.: Flowchart Techniques for Structured Programming. SIGPLAN Notices 8(8) (August 1973)
4. Alexander, C., Ishikawa, S., Silverstein, M., Jacobson, M., Fiksdahl-King, I., Angel, S.: A Pattern Language: Towns, Buildings, Construction. Oxford University Press, New York (1977)
5. Cox, B.J., Novobilski, A.J.: Object-Oriented Programming: An Evolutionary Approach, 2nd edn. Addison-Wesley, Reading (1991)
¹ "The two biggest tyrants on Earth: chance and time" (Die zwei größten Tyrannen der Erde: der Zufall und die Zeit), Johann Gottfried von Herder (1744–1803)
6. Szyperski, C.: Component Software: Beyond Object-Oriented Programming, 2nd edn. Addison-Wesley Professional, Boston (2002)
7. Harel, D.: Statecharts: A Visual Formalism for Complex Systems. Science of Computer Programming 8. North-Holland, Amsterdam (1987)
8. Hoffmann, G.A., Trivedi, K.S., Malek, M.: A Best Practice Guide to Resource Forecasting for Computing Systems. IEEE Transactions on Reliability 56(4) (2007)
9. Hoffmann, G.A., Malek, M.: Call Availability Prediction in a Telecommunication System: A Data Driven Empirical Approach. In: IEEE Symposium on Reliable Distributed Systems (SRDS 2006), Leeds, United Kingdom (2006)
10. Salfner, F., Malek, M.: Using Hidden Semi-Markov Models for Effective Online Failure Prediction. In: Proceedings of the 26th IEEE Symposium on Reliable Distributed Systems (SRDS 2007), Beijing, China (2007)
11. Brüning, S., Weißleder, S., Malek, M.: A Fault Taxonomy for Service-Oriented Architecture. In: Proceedings of the High Assurance Systems Engineering Symposium, Dallas, Texas (2007)
Modeling Services – An Inter-disciplinary Perspective

Stefan Tai and Steffen Lamparter
Karlsruhe Institute of Technology (KIT), Universität Karlsruhe (TH)
Karlsruhe Service Research Institute
76128 Karlsruhe, Germany
[email protected], [email protected]
www.ksri.uni-karlsruhe.de
1 Introduction

Service engineering is receiving increasing attention in both the service economics and service computing communities. This trend is due to two observations:

1. From an economics viewpoint, services today contribute the majority of jobs, GDP, and productivity growth in Europe and in other countries worldwide. This includes all activities of service sector firms, services associated with physical goods production, as well as services of the public sector.
2. From an ICT viewpoint, the evolution of the Internet enables the provision of software-as-a-service on the Web, and is thus changing the way distributed computing systems are being architected. Software systems are designed as service-oriented computing architectures consisting of loosely coupled software components and data resources that are accessible using standard Web technology.

The notion of "service" used in the two communities differs; however, the two notions are not independent but have a strong impact on each other. From an economics viewpoint, services are increasingly ICT-enabled; therefore, new ways of business process management, organization and value co-creation emerge for both the service provider and the service consumer. From an ICT viewpoint, the engineering and use of computing services require careful consideration of the business context, including business requirements and opportunities, business transformation, and social, organizational and regulatory policies.

In this extended abstract, we explore the question of modeling services – business services and computing services. We argue for an inter-disciplinary approach to modeling and engineering services, and discuss major challenges for model-driven service engineering.
The Service Lifecycle comprises the phases of service inception and strategy, service design, service realization, service deployment, service operation and use, and service evaluation and continuous improvement.

Service Engineering describes the activities in support of the entire lifecycle of a service, with the objective to establish, sustain, and grow the service in a market (from the provider's viewpoint) and to effectively use the service (from a consumer's viewpoint).

Service Modeling describes the engineering activities, in all phases of the service lifecycle, of creating abstractions to reason about services.
3 Service Modeling

Using the above definitions, three complementary views on service modeling can be distinguished:

1. Modeling software as (Web) services
2. Modeling (business) services as software (and thus, as Web services)
3. Modeling (business) services that use software

Model-driven software engineering, and in particular the OMG's Model-Driven Architecture (MDA), has focused on modeling software and – to some limited extent – on modeling software as Web services (1). In this context, MDA suggests that a software component (at first designed using a platform-independent model, PIM) can become a Web service by making its interfaces publicly available via Web interface language and protocol standards (using a platform-specific model, PSM). This simplistic approach applies primarily to software design in the forward engineering of services from a provider's viewpoint.

There are many common and important scenarios, however, which are far more complex. Consider, for example, the case of modeling a software application that aims to dynamically select (at runtime) a service from a set of available services. The selection may need to incorporate functional and non-functional criteria, some of which are only available at runtime. The service may further be provided by a services marketplace which acts as an intermediary, and contracts must be established between the consumer and the marketplace prior to using the service. To conceptualize the problem and to design a software solution, platform-independent and platform-specific information, as well as operational runtime information, must be considered. A simplistic PIM-to-PSM transformational approach is neither appropriate nor sufficient.

Model-driven software engineering also stops short of modeling business services as software (2) and of modeling business services that use software (3). MDA suggests business process modeling with a subsequent PIM and PSM software design, using model transformation and code generation.
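The dynamic-selection scenario can be made concrete with a small sketch (all service names, criteria and weights are invented for illustration): functional fit and a hard latency bound act as filters, and the remaining candidates are ranked by a weighted combination of runtime non-functional properties.

```python
from dataclasses import dataclass

@dataclass
class ServiceOffer:
    name: str
    supports_operation: bool  # functional criterion: can it do the job?
    latency_ms: float         # non-functional, observed only at runtime
    cost_per_call: float      # non-functional, from the marketplace contract

def select_service(offers, max_latency_ms, latency_weight=0.5):
    """Filter on functional fit and a hard latency bound, then rank the
    remaining offers by a weighted latency/cost score (lower is better)."""
    feasible = [o for o in offers
                if o.supports_operation and o.latency_ms <= max_latency_ms]
    if not feasible:
        return None
    score = lambda o: (latency_weight * o.latency_ms
                       + (1 - latency_weight) * o.cost_per_call * 1000)
    return min(feasible, key=score)

offers = [ServiceOffer("A", True, 120.0, 0.01),
          ServiceOffer("B", True, 40.0, 0.05),
          ServiceOffer("C", False, 10.0, 0.001)]
print(select_service(offers, max_latency_ms=100.0).name)  # "B"
```

The point of the sketch is that latency and cost are only known operationally, at runtime, so no purely design-time PIM-to-PSM transformation could have produced this decision.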
However, we question the transformational aspect and argue again for an inter-disciplinary approach, where a set of appropriate business and ICT models are used in parallel. Model transformation and code generation tend to introduce artificial orderings and dependencies, and the code generated is often of rather poor quality and thus applicable to short-lived applications at best.
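The runtime service selection scenario sketched above — choosing among candidate services by functional and non-functional criteria — can be illustrated with a small program. This is a minimal sketch of the idea only; all service attributes, weights, and names below are hypothetical and not part of MDA or the paper:

```python
# Sketch of runtime service selection over functional and
# non-functional criteria. All names and weights are hypothetical
# illustrations of the scenario described in the text.

def select_service(candidates, required_ops, weights):
    """Pick the best service offering all required operations,
    scoring non-functional properties by a weighted sum."""
    eligible = [s for s in candidates
                if required_ops <= set(s["operations"])]  # functional match
    if not eligible:
        return None
    # Non-functional score: higher availability wins; price and latency
    # are penalties (some of these values would only be known at runtime).
    def score(s):
        return (weights["availability"] * s["availability"]
                - weights["price"] * s["price_per_call"]
                - weights["latency"] * s["latency_ms"])
    return max(eligible, key=score)

candidates = [
    {"name": "A", "operations": ["store", "fetch"],
     "availability": 0.999, "price_per_call": 0.01, "latency_ms": 120},
    {"name": "B", "operations": ["store", "fetch", "backup"],
     "availability": 0.99, "price_per_call": 0.002, "latency_ms": 80},
]
best = select_service(candidates, {"store", "fetch"},
                      {"availability": 10.0, "price": 100.0, "latency": 0.01})
print(best["name"])  # prints "B": cheaper and faster outweighs availability
```

The point of the sketch is that the selection logic mixes platform-independent concerns (required operations) with operational runtime data (latency, price), which is exactly what a pure PIM-to-PSM transformation cannot capture.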
S. Tai and S. Lamparter
4 Example: Cloud Computing Services

We illustrate our discussion of service modeling for the case of cloud computing services. In support of the trend stated at the beginning of this paper, we can see a change in the middleware (software and data integration) market towards services. The traditional, heavy-weight middleware stack is being replaced by a more lightweight stack as middleware functionality is moved into "the cloud" – a network of remote servers. Several Web-based middleware services are emerging; examples are message queuing services, data storage and backup services, and the (scalable) provision of entire computing resources and infrastructure as services, such as Amazon's Elastic Compute Cloud (EC2). Common to all these middleware services is that they are both business services and Web services in the sense of our definition. Cloud computing services like EC2 allow the service provider to better utilize available compute resources by means of virtualization and by selling partial infrastructure use as a service to multiple consumers. From the viewpoint of consumers, and small and medium-sized companies in particular, (scalable) compute infrastructure and data centers are now accessible without the need to purchase and maintain them. Programming specifics, such as the protocols required to reserve compute capacity and the formats required for data exchange, are based on standard Web services programming models, but must be carefully considered to reason about the applicability and profitability of the cloud computing service. Modeling cloud computing services from a provider or consumer viewpoint must address all relevant challenges. Technical provider challenges include the need for sophisticated resource and network management for service provisioning, and the need to ensure availability, reliability, and security.
Economic challenges range from competitive operation and consumer pricing models to business insight generation based on monitoring and interpreting service usage patterns. Additional (often key) problems lie in understanding and solving physical constraints, such as server storage space and electricity needs. For the consumer, a major challenge lies in understanding the business implications of outsourcing middleware and the business transformation needed. Organizational and possibly governmental and other regulatory policies must be considered. Further, the application programming model for using middleware services in the cloud is different from that for using local middleware, and varies depending on the type of middleware functionality that is provided as a service. Model-driven service engineering for cloud computing consequently requires appropriate business and ICT models in support of the above challenges. Technical and economic questions go hand-in-hand; economic models stemming from market theory, for example, and software models are each insufficient in isolation and must be combined. The state of the art in model-driven software engineering focuses on software design, but not on business service design – what we need are methods and tools for model-driven service engineering.
5 Summary and Outlook

Emerging services such as cloud computing services introduce new and complex economic and technical challenges. These are fundamentally changing the way that businesses can operate and the way distributed computing systems are designed. Service
engineering requires modeling software-as-services as well as modeling services-as-software. Traditional methods of model-driven software engineering have limitations in this context, and new methods and tools in support of the entire service lifecycle are needed. In addition, the Internet has evolved into a people collaboration platform with continuous on-line access, and is changing the way people interact with each other. Wikis and digital communities are examples of Web 2.0 technologies that have a significant impact on how we exchange and work with all kinds of data – and services. We believe that these developments will grow further and into the services market. Service evaluations (ratings, recommendations) and community-driven support services for business and Web services are likely to emerge and contribute to the value of a service. Without careful consideration of all relevant economic, technical, and social collaboration challenges, services such as cloud computing services and their adoption in markets cannot succeed. Modeling such services thus places new requirements on the engineering methodology and requires focusing on the challenges that really matter.
Data Mashups for Situational Applications Volker Markl, Mehmet Altinel, David Simmen, and Ashutosh Singh IBM Almaden Research Center, San Jose, CA, USA {marklv,maltinel,simmen}@us.ibm.com, [email protected]
Abstract. Situational applications require business users to create, combine, and catalog data feeds and other enterprise data sources. Damia is a lightweight enterprise data integration engine inspired by the Web 2.0 mashup phenomenon. It consists of (1) a browser-based user interface that allows for the specification of data mashups as data flow graphs using a set of Damia operators, following programming-by-example principles, (2) a server with an execution engine, as well as (3) APIs for searching, debugging, executing and managing mashups. Damia provides a base data model and primitive operators based on the XQuery Infoset. A feed abstraction built on that model enables combining, filtering and transforming data feeds. This paper presents an overview of the Damia system as well as a research vision for data-intensive situational applications. A first version of Damia realizing some of the concepts described in this paper is available as a web service [17] and for download as part of IBM's Mashup Starter Kit [18].
Damia goes beyond these efforts in several ways:
1. Damia has a principled data model of tuples of sequences of XML, which is more general than, for instance, Yahoo Pipes' feed model.
2. Damia's focus on enterprise data allows for ingestion of a larger set of data sources, such as Notes, Excel, and XML, as well as data from emerging information marketplaces like StrikeIron [10].
3. Damia's data model allows for generic joins of web data sources.
4. Damia's entity model allows for dynamic entity resolution, thus enabling semantic joins between feeds with different representations of the same entity.
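The "tuples of sequences of XML" data model of point 1 can be sketched in a few lines. This is our own illustration of the concept; the class and field names are hypothetical, and the actual Damia implementation (in PHP) differs:

```python
# Hedged sketch of Damia's data model: each tuple holds named
# sequences, and each sequence is an ordered list of items that may
# be atomic values or XML nodes. Names are our own illustration.
import xml.etree.ElementTree as ET

class DamiaTuple:
    """A tuple of sequences; each sequence is an ordered list of items."""
    def __init__(self, **sequences):
        self.sequences = {name: list(items) for name, items in sequences.items()}

    def __getitem__(self, name):
        return self.sequences[name]

entry = ET.fromstring("<entry><title>Quarterly report</title></entry>")
t = DamiaTuple(title=["Quarterly report"],   # sequence of atomic values
               body=[entry],                 # sequence of XML nodes
               tags=["finance", "q3"])       # a sequence may hold many items
print(len(t["tags"]))  # prints 2
```

Because a sequence can hold any mix of atomic values and XML nodes, this model subsumes a flat feed model (a feed is simply a sequence of entry nodes), which is what makes generic joins over heterogeneous web sources possible.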
2 Damia Architecture

This section gives an overview of the Damia web application and execution engine.
Fig. 1. Damia Architecture
The Damia platform is a web application that allows for searching, debugging, compiling, executing and managing data mashups.

2.1 The Damia Web Application

Situational applications are typically created by departmental users with little programming knowledge; consequently, visualization of data mashup operations is critical for any mashup platform. With this key aspect in mind, we developed a browser-based user interface that allows Damia users to perform the major operations easily and intuitively. The GUI provides facilities for composing new data mashups, searching data sources or existing mashups, and managing stored mashups. Mashup composition follows a programming-by-example model, which makes the development process more natural and less error-prone. Figure 2 shows a snapshot of the Damia Mashup Editor, which was implemented with the Dojo toolkit [4]. It communicates with the server through a set of REST API interfaces, as illustrated in Figure 1. The GUI allows users to drag and drop boxes, representing Damia operators, onto a palette and to connect them with edges representing the flow of data between those operators.

Fig. 2. Damia Mashup Editor

Users can use a "preview" feature to see the result of the dataflow at any point in the process. The Damia GUI interacts with a metadata repository in order to suggest options for creating "touch points" for joins, grouping, and other operations that require that data first be put in a standard representation. Once the data flow is completed, the result is serialized as an XML document and delivered to the server for further processing.

2.2 The Damia Execution Engine

Figure 3 provides a high-level view of the Damia execution engine. The Damia engine contains four layers: (1) At the core, Damia relies on the PHP runtime environment [8]. PHP was an attractive choice for our runtime engine as it has abundant features for accessing web data and for processing XML data. (2) Primitive Damia operators¹ are implemented as PHP classes. They can be wired to each other to function in a pipelined (pull-based) or streaming (push-based) fashion. The primitive operators are designed with the goal of providing advanced users with almost the full power of XQuery in a data-flow fashion. (3) Feed operators provide a feed-friendly abstraction on top of the primitive operators. (4) Higher-value features are implemented as extension modules within the engine.

Fig. 3. Damia Features

¹ Due to space limitations we show a partial operator list including only the most commonly used ones.

2.3 Damia Operators and Data Model

The Damia operators produce and consume tuples of sequences. A sequence is an ordered set of items, wherein an item can be an atomic value or an XML node. Damia provides a closed data model, which allows new data mashups to be composed of existing ones. In general, there are three classes of Damia operators: ingestion, augmentation, and publication. Ingestion wrappers are the sources in the Damia data flow graph; they translate data sources into XML documents or feeds to be further consumed by Damia augmentation operators. Standard ingestion wrappers supported by Damia exist for REST and SOAP Web services, Excel spreadsheets, and Lotus Notes databases, as well as screen scraping for HTML pages. The overall system of ingestion operators is extensible, as any data source can be ingested into Damia simply by providing a SOAP or REST wrapper. In terms of the Damia data model, an ingestion operator takes a data object and translates it into a tuple of sequences of XML data. Augmentation operators are the internal nodes of the data flow graph. Damia is extensible in that user-defined operators can be written in PHP, called as web services, or written using Damia's primitives, and can be plugged into the engine. Publication operators are the sinks of the data flow graph. They transform the result of the mashup into common output formats like JSON, HTML, and various XML formats (e.g. Atom, RSS). Damia publication operators can be extended to produce other formats.
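The pipelined (pull-based) wiring of operators described in Section 2.2 — sources, internal augmentation operators, and sinks pulling data on demand — can be sketched with generators. This is our own illustration; Damia's real operators are PHP classes, and the operator names here are simplified stand-ins:

```python
# Sketch of pull-based (pipelined) operator wiring: each operator
# pulls tuples from its input only when the consumer asks for them.
# Illustration only; Damia's operators are implemented in PHP.

def ingest(records):                 # source: ingestion operator
    for r in records:
        yield r

def filter_feed(tuples, predicate):  # augmentation operator
    for t in tuples:
        if predicate(t):
            yield t

def transform_feed(tuples, fn):      # augmentation operator
    for t in tuples:
        yield fn(t)

feed = ingest([{"title": "a", "score": 3}, {"title": "b", "score": 9}])
pipeline = transform_feed(
    filter_feed(feed, lambda t: t["score"] > 5),
    lambda t: {**t, "title": t["title"].upper()})
result = list(pipeline)  # tuples only flow when the sink pulls them
print(result)
```

The push-based (streaming) variant would instead have each source call downstream operators as new items arrive, which is the mode needed for the feed scenarios discussed in Section 3.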
2.4 Damia Metadata Services

The metadata services of Damia handle the storage and retrieval of data feeds created by the Damia community. In addition to publishing data feeds created via Damia data flows, a user can publish resources like Excel spreadsheets or XML documents to the Damia system to make them consumable by mashups. It is possible to share published resources with others or to keep them private. The Damia mashup repository stores both the graphical mashup specification serialized as XML and the PHP code that the mashup specification was compiled into. Users execute stored mashups by calling a REST URL provided by the Damia system.
3 Important Features of a Mashup System

This section summarizes advanced features offered by an enterprise data mashup system and gives insights into how the Damia research team at IBM plans to address them.
Ingestion of "Enterprise Data": A common characteristic of mashup applications available on the Web today is that they consume URL-addressable resources. However, this does not apply to enterprise data sources, as they are exposed in many different forms such as office documents, email, and pure databases. Therefore, the Damia system must provide tools and mechanisms to tap into non-URL-addressable sources and make them available in the server. We developed a set of predefined wrappers for this purpose. Furthermore, we consider utilizing existing page-scraping tools (e.g. Kapow [5] or Lixto [6]) to turn web pages (inside and outside of enterprise domains) into ready-to-use data sources in the Damia server. Another unique characteristic of creating enterprise mashups is that a multitude of authentication models are in use in enterprise systems. Although we utilize an LDAP-based common sign-on system in the current prototype, this is actually an open research issue, and we are investigating new techniques to embrace this heterogeneity.

Dynamic Entity Resolution: By definition, the quality of mashup applications is directly related to how they bring together data from multiple sources. Therefore, the level of sophistication in matching different entities is a key distinguishing feature for any data mashup platform. This standardization problem is usually addressed in two dimensions: (1) identification of semantically known entities, and (2) resolving differences in representations. We observe that more and more standardization services are becoming available on the Internet for the most commonly used data types². A similar trend is also visible within enterprise systems with the widespread adoption of master data management products. Therefore, we designed the dynamic entity resolution module to exploit the growing number of standardization services with little effort.
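The core idea — running both feeds through a standardization step so that differing representations of the same entity become joinable — can be sketched as follows. The lookup table below is a hypothetical stand-in for an external standardization service (such as a geocoder or a company-name normalizer); none of these names come from the Damia API:

```python
# Sketch of entity resolution before a join: both feeds are
# canonicalized, then joined on the canonical key. The CANONICAL
# table stands in for an external standardization service.

CANONICAL = {"ibm": "IBM", "i.b.m.": "IBM",
             "intl business machines": "IBM"}

def standardize(name):
    return CANONICAL.get(name.strip().lower(), name.strip())

def semantic_join(left, right, key):
    index = {}
    for row in right:
        index.setdefault(standardize(row[key]), []).append(row)
    for row in left:
        for match in index.get(standardize(row[key]), []):
            yield {**row, **match}

feed_a = [{"company": "I.B.M.", "ticker": "IBM"}]
feed_b = [{"company": "Intl Business Machines", "hq": "Armonk"}]
matches = list(semantic_join(feed_a, feed_b, "company"))
print(matches)  # one joined row despite differing company spellings
```

The "touch point" of the join is thus the canonical form produced by the standardization service, not the raw values in either feed.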
During the mashup design, we show applicable standardization services to users and expect them to choose the right ones for their mashup³. At runtime, the system can find possible touch points between entities and can generate the right set of transformation functions to perform the join. For the cases where the Damia server cannot detect the entities, there are facilities for users to introduce the entities into the system and select/register suitable standardization services for entity matching. We are investigating how to exploit existing tools and platforms (e.g. ClearForest [2], UIMA [11]) for detecting known entities in input sources. Once the entity matching is finished, these steps are remembered to help future users when they try to use the same sources in their mashups. Such a "folksonomy-based" approach aims to take advantage of the collective power of the Damia community to deliver a promising framework for large-scale data integration.

Streaming: RSS and Atom feeds are the norm on the Internet; a huge amount of information today is available through feed interfaces. Hence, we designed the Damia system to understand and consume feeds effortlessly. Feeds are inherently "push-based", i.e. their content is dynamically updated at the source without any notification. To cope with this streaming aspect, the Damia system includes mechanisms to detect and process the changes, and to deliver notifications to its applications. In enterprise domains, there are many useful and critical mashup scenarios where this feature is extremely valuable, particularly for reporting and dashboarding applications.

² Geocoding service for address types is the prime example. Many Internet companies, including Yahoo and Google, provide this service for free.
³ The programming-by-example model enables us to perform this task.

Search: Mashup applications are typically created in large numbers, since each is only useful for a specific situation concerning a small number of users. Hence, a common problem is how to find the right mashup application or data source when it is needed. An effective search mechanism for mashups requires understanding not only the properties of the input sources but also how they are processed in the data flow. We performed an initial investigation of this problem and developed a preliminary search mechanism. We believe this is an important research problem, and we are working on enhancing the current solution.

Lineage: Mashups on the Web usually do not provide any metrics on data quality, as this issue is not considered central. However, this is not always true for enterprise mashup applications, as important business decisions may sometimes be made based on their outcome. In such cases, when an enterprise data mashup is composed with other mashups, it becomes very critical to be able to provide lineage information for its users to assess its data quality. Again, this is an active research area, and we have only implemented basic mechanisms to address this issue.

Uncertainty: It is not always the case that all operations return exact results in mashups. There are situations where uncertain results are inevitable: (1) Data sources may return probabilistic (or ranked) data; a typical example is when a search engine result is fed into the mashup. (2) Entity matching may not yield exact results in some cases, so the result of joins may become probabilistic. When uncertainty is introduced into the system, it has to be understood and modeled in the data flow. Like the lineage problem, this aspect is mostly ignored in Web mashups, but it may be crucial in some classes of enterprise mashups. We implemented a set of operators (join, sort, aggregate, etc.)
which can deal with probabilistic results in the data flow. We are actively exploring new probabilistic models, improving the existing operators and adding new ones.
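An uncertainty-aware join of the kind just described can be sketched as follows: every tuple carries a probability, and the join combines the input probabilities. The multiplication rule below assumes independence and is our own illustration, not necessarily the probabilistic model Damia's operators use:

```python
# Sketch of a probability-aware join: each tuple carries a "prob"
# field, and a joined tuple combines the input probabilities by
# multiplication (assuming independence). Illustration only.

def prob_join(left, right, key, threshold=0.0):
    for l in left:
        for r in right:
            if l[key] == r[key]:
                p = l["prob"] * r["prob"]  # independence assumption
                if p >= threshold:
                    yield {**l, **r, "prob": p}

# Ranked search-engine hits joined against an exact feed:
search_hits = [{"doc": "d1", "city": "Berlin", "prob": 0.9},
               {"doc": "d2", "city": "Bern", "prob": 0.4}]
weather = [{"city": "Berlin", "forecast": "rain", "prob": 1.0}]
results = list(prob_join(search_hits, weather, "city", threshold=0.5))
print(results)
```

A threshold lets downstream operators prune low-confidence tuples early, which matters once several uncertain sources are composed and probabilities shrink multiplicatively.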
4 Conclusions

The proliferation of Web 2.0 technologies and the ever-increasing number of available Web data services have given rise to the phenomenal growth of mashup applications on the Internet. We anticipate that enterprise systems will greatly benefit from this new trend by means of enterprise-oriented mashup development platforms. We are laying the foundations of such a platform in the Damia project [17,18]. In this paper, we presented our initial Damia prototype, which includes a mashup-oriented data flow engine enriched with novel, enterprise-specific advanced functionalities. The most notable features include a folksonomy-based standardization service, an intuitive GUI front-end, a hosted, scalable mashup server architecture, unique ingestion facilities, and utilities for mashup search and management. Going forward, we aim to use this prototype as a playground to explore the many interesting research directions outlined in this paper.
Acknowledgements

We thank the CTO office of IBM Information Management as well as the Mashup Hub and Damia teams from IBM Research, IBM Emerging Technologies, and IBM Software Group for their help in making our vision of Damia happen.
References

[1] Jhingran, A.: Enterprise Information Mashups: Integrating Information, Simply. In: VLDB 2006, pp. 3–4 (2006)
[2] ClearForest Inc., http://www.clearforest.com/
[3] DB2 pureXML™ technology, http://www-306.ibm.com/software/data/db2/xml/
[4] Dojo, the JavaScript toolkit, http://dojotoolkit.org/
[5] Kapow Technologies, http://www.kapowtech.com/
[6] Lixto Software GmbH, http://www.lixto.com/
[7] PEAR – PHP Extension and Application Repository, http://pear.php.net/
[8] PHP: Hypertext Preprocessor, http://www.php.net/
[9] Programmable Web, http://www.programmableweb.com/
[10] StrikeIron Inc., http://www.strikeiron.com/
[11] Unstructured Information Management Architecture (UIMA), IBM Research, http://www.research.ibm.com/UIMA/
[12] Yahoo Pipes, http://pipes.yahoo.com/pipes/
[13] Ennals, R., Garofalakis, M.N.: MashMaker: mashups for the masses. In: SIGMOD Conference 2007, pp. 1116–1118 (2007)
[14] Tatemura, J., Sawires, A., Po, O., Chen, S., Candan, K.S., Agrawal, D., Goveas, M.: Mashup Feeds: continuous queries over web services. In: SIGMOD Conference 2007, pp. 1128–1130 (2007)
[15] Maximilien, M.: Web Services on Rails: Using Ruby and Rails for Web Services Development and Mashups. IEEE SCC (2006)
[16] Wong, J., Hong, J.I.: Making mashups with Marmite: towards end-user programming for the web. In: CHI 2007, pp. 1435–1444 (2007)
[17] IBM Damia Service on IBM AlphaWorks Services, http://services.alphaworks.ibm.com/damia
[18] IBM Mashup Starter Kit on IBM AlphaWorks, http://www.alphaworks.ibm.com/tech/ibmmsk
Combining Effectiveness and Efficiency for Schema Matching Evaluation

Alsayed Algergawy, Eike Schallehn, and Gunter Saake

Department of Computer Science, Otto-von-Guericke University, 39106 Magdeburg, Germany
{alshahat,eike,saake}@iti.cs.uni-magdeburg.de
Abstract. Schema matching plays a central role in many applications that require interoperability among heterogeneous data sources. A good evaluation of the different capabilities of schema matching systems has become vital as the complexity of such systems increases. The capabilities of matching systems incorporate different (possibly conflicting) aspects, among them match quality and match efficiency. The analysis of the efficiency of a schema matching system, if it is done at all, tends to be done separately from the analysis of effectiveness. In this paper, we present the trade-off between schema matching effectiveness and efficiency as a multi-objective optimization problem. This representation enables us to obtain a combined measure as a compromise between them. We combine both performance aspects in a weighted-average function to determine the cost-effectiveness of a schema matching system. We apply our proposed approach to evaluate two mainstream schema matching systems, namely COMA++ and BTreeMatch. Experimental results on both small-scale and large-scale schemas show that it is necessary to take the response time of the matching process into account, especially for large-scale schemas.

Keywords: Schema matching, Schema matching performance, Effectiveness, Efficiency, Cost-effectiveness.
1 Introduction
Schema matching is the task of identifying semantic correspondences between the elements of two or more schemas and plays a central role in many data application scenarios [6,13,10]: in data integration, to identify and characterize interschema relationships between multiple (heterogeneous) schemas; in E-business, to help exchange messages between different XML formats; in semantic query processing, to map user-specified concepts in the query to schema elements; in the semantic Web, to establish semantic correspondences between concepts of different websites' ontologies; and in data migration, to migrate legacy data from multiple sources into a new one [7]. To identify a solution for a particular match problem, it is important to understand which of the proposed techniques performs best.

R.-D. Kutsche and N. Milanovic (Eds.): MBSDI 2008, CCIS 8, pp. 19–30, 2008. Springer-Verlag Berlin Heidelberg 2008

The performance
of a schema matching system comprises two equally important factors, namely effectiveness and efficiency [15,8]. Effectiveness is concerned with the accuracy and correctness of the match result, while efficiency is concerned with the resources consumed (time, memory, ...) by the match system. Most existing system evaluations focus on analyzing the effectiveness of a schema matching system [2,16]. The analysis of the efficiency of a schema matching system, if it is done at all, tends to be done separately from the analysis of effectiveness [8]. In other words, effectiveness and efficiency are usually considered two very different dimensions, and the trade-off between these two dimensions has not been investigated in the schema matching community. Many real-world problems, such as the schema matching problem, involve multiple measures of performance which should be optimized simultaneously. Optimal performance according to one objective, if such an optimum exists, often implies unacceptably low performance in one or more of the other objective dimensions, creating the need for a compromise. In the schema matching problem, the performance of a matching system involves multiple aspects, among them effectiveness and efficiency. Optimizing one aspect, for example effectiveness, will affect the other aspects, such as efficiency. Hence, we need a compromise between them, and we can consider the trade-off between effectiveness and efficiency as a multi-objective problem. In practice, multi-objective problems have to be re-formulated as a single-objective problem. To this end, in this paper, we propose a method for computing the cost-effectiveness of a schema matching system. Such a method is intended to be used in a combined evaluation of schema matching systems. This evaluation concentrates on the cost-effectiveness of schema matching approaches, i.e. the trade-off between effectiveness and efficiency.
To motivate this, suppose we have a schema matching problem P and two matching systems A and B, where system A is more effective than system B, while system B is more efficient than system A. The question is: which system should be used to solve the given problem? So far, most existing matching systems [3,10,12] evaluate their performance only with respect to effectiveness, so they would all choose system A (the more effective one). This paper introduces a combined approach to evaluating the cost-effectiveness of matching systems based on the multi-objective optimization problem (MOOP). We apply the proposed approach to evaluate and compare two well-known systems, namely COMA++ [3] and BTreeMatch [9]. The rest of this paper is organized as follows: the next section introduces an overview of schema matching. In the following section, we present schema matching performance, focusing on effectiveness and efficiency evaluations. Section 4 presents a combined measure for schema matching performance, the cost-effectiveness measure. In Section 5, experiments and results are discussed. Section 6 gives concluding remarks and our proposed future work.
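The idea of scalarizing the two objectives into one weighted-average score can be sketched numerically. The normalization and weights below are our own illustration of the general approach, not necessarily the exact cost-effectiveness formula this paper derives:

```python
# Minimal sketch of combining effectiveness and efficiency into one
# weighted-average score. Normalization and weights are our own
# illustrative choices, not the paper's exact formula.

def cost_effectiveness(overall, seconds, max_seconds, w_eff=0.5):
    """Combine effectiveness (e.g. the overall measure, in [0,1])
    and efficiency (normalized response time) into one score."""
    efficiency = 1.0 - min(seconds, max_seconds) / max_seconds
    return w_eff * overall + (1.0 - w_eff) * efficiency

# System A: more effective; system B: more efficient.
a = cost_effectiveness(overall=0.80, seconds=90, max_seconds=100)
b = cost_effectiveness(overall=0.70, seconds=10, max_seconds=100)
print(a, b)  # with equal weights, B wins despite lower effectiveness
```

Shifting `w_eff` toward 1 recovers the effectiveness-only ranking used by most existing evaluations, which is exactly the degenerate case the paper argues against for large-scale schemas.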
2 Schema Matching: An Overview
In this section, we present the main definitions used in this paper.

Definition 1 (Schema). A schema is a description of the structure and the content of a model and consists of a set of related elements such as tables, columns, classes, or XML elements and attributes. By schema structure and schema content, we mean its schema-based properties and its instance-based properties, respectively.

Definition 2 (Match). Match is a function that takes two or more schemas as input and produces a mapping as output.

Definition 3 (Mapping). A mapping is a set of mapping elements specifying how the elements of the schemas correspond to each other. Each mapping element is a 5-tuple < ID, S.si, T.tj, R, semrel >, where ID is an identifier for the mapping element, si is an element of the first schema, tj is an element of the second one, and R indicates the similarity value between 0 and 1. The value 0 means strong dissimilarity, while the value 1 means strong similarity. semrel is the semantic relationship between two (or more) elements, such as equivalence, synonymy, etc.

To identify the correspondences among schema elements, various matching algorithms have been proposed and numerous schema matching systems have been developed. The current matching algorithms can be classified by either the information they exploit or their methodologies. According to the information they exploit [13], matchers can be individual matchers, which exploit only one type of element properties in a single algorithm, or combining matchers, which can be one of two types: hybrid matchers (integrating multiple criteria [10]) and composite matchers (combining the results of independently executed matchers [5,3]). The information exploited by a matcher is called element properties. These properties can be classified as atomic or structural properties; schema-based or instance-based properties; and auxiliary properties.
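The 5-tuple of Definition 3 can be written down as a small data type. The field names follow the definition; the class itself and its validation are our own illustration:

```python
# The 5-tuple mapping element of Definition 3 as a data class.
# Field names follow the definition; the class is our illustration.
from dataclasses import dataclass

@dataclass
class MappingElement:
    id: str            # ID: identifier for the mapping element
    source: str        # s_i: element of the first schema S
    target: str        # t_j: element of the second schema T
    similarity: float  # R in [0, 1]; 1 means strong similarity
    semrel: str        # semantic relationship, e.g. "equivalence"

    def __post_init__(self):
        if not 0.0 <= self.similarity <= 1.0:
            raise ValueError("similarity must lie in [0, 1]")

m = MappingElement("m1", "Customer.name", "Client.fullName",
                   0.85, "equivalence")
print(m.similarity)  # prints 0.85
```

A mapping in the sense of Definition 3 is then simply a set (or list) of such elements, one per correspondence found by the match function.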
According to the methodologies of the matching algorithms, they can be classified as either rule-based or learner-based [6,15]. Table 1 summarizes the advantages and disadvantages of both kinds of systems.

Table 1. Comparison of rule-based and learner-based systems

Criteria                   Rule-based     Learner-based
exploited information      schema-based   instance-based
nature of schema elements  more static    more dynamic
schema size                small size     large size
training phase             not needed     needed
response time              less time      more time
3 Schema Matching Performance
In order to motivate the importance of trading off performance aspects during schema matching evaluation, we state the schema matching problem as follows: given two schemas S and T having n and m elements respectively, we can distinguish between two types of matching, simple and complex:
– Simple matching: for each element s of S, find the most semantically similar element t of T. This problem is referred to as one-to-one matching.
– Complex matching: for each element s (or a set of elements) of S, find the most semantically similar set of elements t1, t2, ..., tk of T.
The solution to the above problems is not unique, due to inherent difficulties in schema matching. In [17], the authors introduce the concepts of matching state and matching space. A matching state represents a possible matching result over the n × m candidate element correspondences, and the matching space is the set of all possible matching states, of order 2^(n×m). For example, consider n = 5 and m = 4; then, to identify a suitable match result for a given schema matching problem, a schema matching system has to search a matching space of 2^(5×4) = 1,048,576 states (a very large matching space, although the schemas are very small). Therefore, the schema matching problem is an optimization problem which searches for the best solution (matching state) among a vast number of available solutions (the matching space). However, the problem is not only identifying the best solution but also obtaining this solution in a reasonable time. Unfortunately, most previous evaluations ignore the response time of the match system and often assume that matching is an off-line process. Hence, in order to cope with large-scale schemas, we should take the processing time into account. In the following subsection, we consider both performance aspects, effectiveness and efficiency. For a better overview of the current state of the art in evaluating schema matching approaches, good reviews can be found in [2,16,8].
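The doubly exponential growth of the matching space is easy to check numerically. A quick sketch (a subset of the n·m candidate correspondences counts as one matching state):

```python
# Quick check of the matching-space size discussed above: with n and
# m schema elements there are n*m candidate correspondences, and
# every subset of them is a possible matching state.

def matching_space(n, m):
    return 2 ** (n * m)

print(matching_space(5, 4))  # 1048576 states for two tiny schemas
```

Already for two modest schemas of 20 elements each, the space has 2^400 states, which is why exhaustive search is hopeless and response time becomes a first-class evaluation criterion.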
3.1 Performance Evaluation
To unify the performance evaluation of schema matching systems, the following criteria should be taken into account:
– Input: what kind of input information has been used (schema-based, instance-based, auxiliary information)?
– Output: what information has been included in the mappings, i.e. the type of output, the output formats, and the complexity of the mappings?
– Performance measures: what metrics have been chosen to quantify the match result?
– Effort: what kind of manual effort has been measured? To assess the manual effort, one should consider both the pre-match effort required before an automatic matcher can run (such as training of learner-based matchers or specifying auxiliary information) as well as the post-match effort to add the false negatives to and remove the false positives from the final match result.
Combining Effectiveness and Efficiency for Schema Matching Evaluation
Effectiveness Measures: First, the match task is manually solved to obtain the real mappings Rm. Then, the matching system solves the same problem to obtain the automatic mappings Am. After identifying both real and automatic mappings, we can define the terms used in computing match effectiveness. False negatives A = Rm − Am are the needed matches not identified by the system; true positives B = Rm ∩ Am are the correct matches correctly identified by the system; false positives C = Am − Rm are the false matches nevertheless identified by the system; and true negatives D are the false matches correctly discarded by the system.

Precision and Recall: Based on the real and automatic mappings, two measures can be computed: precision and recall, which originate from the information retrieval (IR) field [14]. Precision is computed as P = |B| / (|B| + |C|) and recall as R = |B| / (|B| + |A|). However, neither precision nor recall alone can accurately assess the match quality. Hence, it is necessary to consider a trade-off between them. There are several methods to handle such a trade-off; one of them is to combine both measures. The most used combined measures are:
– F-Measure: the weighted harmonic mean of precision and recall; the traditional F-measure or balanced F-score is

F = 2 · (P · R) / (P + R) = 2 · |B| / ((|B| + |A|) + (|B| + |C|))    (1)
– Overall: developed specifically in the schema matching context; it embodies the idea of quantifying the effort needed to add false negatives and to remove false positives. It was introduced in the Similarity Flooding (SF) system [12] and is given by

OV = 1 − (|A| + |C|) / (|A| + |B|) = R · (2 − 1/P)    (2)

To determine which one to use as a measure for the effectiveness of a schema matching system, we compare the two combined measures. Figure 1 as well as equations (1) and (2) provide a good basis for this comparison. Since the overall measure is more sensitive to precision than to recall, we consider, in this paper, the overall (OV) measure as the indicator for schema matching effectiveness.

Efficiency Evaluation: Efficiency mainly comprises two properties: speed (the time it takes for an operation to complete) and space (the memory or non-volatile storage used). In order to obtain a good efficiency measure for schema matching systems, we should consider the following factors:
– A first factor to consider when evaluating schema matching efficiency is the critical phase of a matching process. A matching process consists of many phases, and each phase contains multiple steps.
A. Algergawy, E. Schallehn, and G. Saake
Fig. 1. F-Measure and Overall against Precision and Recall
– A second factor is its automatability. Human effort is very expensive; matching approaches that require excessive human interaction are impractical.
– A third factor that impacts schema matching efficiency is the type of methodology used to compute similarity values.
Recent schema matching systems [16,8] introduce the time measure as a criterion of performance evaluation. In this paper, we take the time measure (T) as the indicator for schema matching efficiency. To sum up, with the emergence of applications that need fast matching systems, such as incremental schema matching systems [1], and applications that match large schemas [4], the need to improve schema matching efficiency increases. For instance, consider a crisis management information system: it is not sufficient to provide its users with the best possible mappings; it is also necessary to obtain these mappings within a reasonable time.
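To make the effectiveness measures above concrete, the following sketch (ours, not part of the paper) computes precision, recall, F-measure and overall from the real and automatic mapping sets; the element pairs are invented for illustration:

```python
# Illustrative sketch: effectiveness measures of Section 3.1, computed from
# a manually derived real mapping Rm and an automatic mapping Am, both given
# as sets of (source element, target element) pairs.

def effectiveness(real, auto):
    """Return precision, recall, F-measure and overall for two mapping sets."""
    B = real & auto   # true positives: correct matches found by the system
    A = real - auto   # false negatives: needed matches the system missed
    C = auto - real   # false positives: wrong matches the system proposed
    precision = len(B) / (len(B) + len(C)) if auto else 0.0
    recall = len(B) / (len(B) + len(A)) if real else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    # Overall (eq. 2): OV = 1 - (|A| + |C|) / (|A| + |B|) = R * (2 - 1/P)
    overall = 1 - (len(A) + len(C)) / (len(A) + len(B)) if real else 0.0
    return precision, recall, f_measure, overall

# Hypothetical match task: 4 real correspondences, 4 proposed by a matcher.
Rm = {("name", "fullname"), ("addr", "address"),
      ("zip", "postcode"), ("city", "town")}
Am = {("name", "fullname"), ("addr", "address"),
      ("zip", "postcode"), ("city", "country")}
p, r, f, ov = effectiveness(Rm, Am)   # P = R = F = 0.75, OV = 0.5
```

With three correct pairs, one missed and one wrong, the example also shows why OV is harsher than F: both errors are charged against the match result.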
4 Combining Effectiveness and Efficiency
From the above criteria, we conclude that the trade-off between effectiveness and efficiency of a schema matching system can be considered a multi-objective optimization problem (MOOP). In this section, we present a definition of the MOOP and the approaches used to solve it [18,11]. In the following definitions we assume minimization (without loss of generality).

Definition 4 (Multi-objective Optimization Problem). An MOOP is defined as: find x that minimizes F(x) = (f1(x), f2(x), ..., fk(x))^T s.t. x ∈ S, with x = (x1, x2, ..., xn)^T, where f1(x), f2(x), ..., fk(x) are the k objective functions, (x1, x2, ..., xn) are the n optimization parameters, and S ⊆ R^n is the solution space.

In our approach, we have two objective functions: overall as a measure of effectiveness and time as a measure of efficiency. Therefore, we can rewrite the multi-objective function as CE = (f1(OV), f2(T)), where CE is the cost-effectiveness, which is to be maximized here. In a multi-objective problem, the optimum consists of a set of solutions rather than a single solution as
in global optimization. This optimal set is known as the Pareto optimal set and is defined as P := {x ∈ S | ¬∃ x′ ∈ S : F(x′) ⪯ F(x)}, where ⪯ denotes Pareto dominance. Pareto optimal solutions are also known as non-dominated or efficient solutions. There are many methods available to tackle multi-objective optimization problems. Among them, we choose a priori articulation of preference information. This means that, before the actual optimization is conducted, the different objectives are aggregated into one single figure of merit. This can be done in many ways; we choose a weighted-sum approach.

Weighted-Sum Approaches: The easiest and perhaps most widely used method is the weighted-sum approach. The k-objective function is formulated as a weighted function:

min (or max)  Σ_{i=1..k} wi · fi(x)   s.t. x ∈ S, with wi > 0 and Σ_i wi = 1

By choosing different weightings for the different objectives, the preference of the application domain is taken into account. As the objective functions are generally of different magnitudes and units, they have to be normalized first.

4.1 The Cost-Effectiveness of Schema Matching
Consider two schema matching systems A and B solving the same matching problem. Let OVA and TA denote the overall and time measures of system A, respectively, while OVB and TB denote the same measures for system B. To analyze the cost-effectiveness of a schema matching system, we make use of the MOOP and one of its solution methods, namely the weighted-sum approach. Here, we have two objectives: effectiveness (measured by the overall OV) and efficiency (measured by the response time T). Obviously, we cannot directly add an overall value to a response time value, since the resulting sum would be meaningless due to the different dimensional units. The overall value of a schema matching system is a normalized value, i.e., its range is between 0 and 1, while the processing time is measured in seconds. Therefore, before summing (e.g., as a weighted average) the two quantities, we have to normalize the processing time. To normalize the response time, the response time of the slower system (here TA) is normalized to the value 1, while the response time of the faster system (TB) is normalized to a value in the range [0,1] by dividing TB by TA, i.e., TB/TA. We call the objective function of a schema matching system the cost-effectiveness (CE); it should be maximized. The cost-effectiveness is given by

CE = Σ_{i=1..2} wi · fi(x) = w1 · OVn + w2 · (1/Tn)    (3)
where w1 is the weight for the overall objective, denoted wov, and w2 is the weight for the time objective, denoted wt. In the case of comparing two schema matching systems, we have the following normalized quantities:
OVAn, OVBn, TAn, and TBn, where OVAn = OVA, OVBn = OVB, TAn = 1, and TBn = TB/TA. We now endeavor to come up with a single formula involving the two quantities, namely the normalized overall OVn and the normalized response time Tn, where each quantity is associated with a numerical weight to indicate its importance in the evaluation of the overall performance and to increase the flexibility of the method. We write the equations that describe the cost-effectiveness (CE) for each system as follows:

CEA = wovA · OVAn + wtA · (1/TAn)    (4)

CEB = wovB · OVBn + wtB · (1/TBn)    (5)

where wov and wt are the numerical weights for the overall and response time quantities, respectively. If we set the time weights to zero, i.e., wt = 0, then the cost-effectiveness reduces to the usual evaluation considering only the effectiveness aspect (wov = 1). The most cost-effective schema matching system is the one having the larger CE as measured by the above formulas. Equations 4 and 5 present a simple but valuable method to combine the effectiveness and the efficiency of a schema matching system. Moreover, this method is based on and supported by a proven and verified framework: multi-objective optimization. Although the method is effective, it still has an inherent problem: it is difficult to determine good values for the numerical weights, since the relative importance of overall and response time is highly domain-dependent and, of course, rather subjective. For example, when we are dealing with small-scale schemas, the overall measure is more dominant than the response time; hence, we may select wov = 0.8 and wt = 0.2. For time-critical systems, the response time may have the same importance as the overall measure, so we may choose wov = wt = 0.5. To overcome this problem, we would need an optimization technique that enables us to determine the optimal (or close to optimal) numerical weights. In this paper, we set these values manually in the selected case studies; the automatic determination of numerical weight values is left for future work.
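For illustration, the comparison expressed by equations (3)–(5) can be written as a small function; this is a sketch of ours, not the authors' implementation, and all numeric values below are hypothetical:

```python
# Sketch of the cost-effectiveness comparison (equations 3-5): overall values
# are already in [0, 1]; response times are normalized against the slower
# system, so the slower system gets Tn = 1 and the faster one Tn = T/T_slow.

def cost_effectiveness(ov_a, t_a, ov_b, t_b, w_ov, w_t):
    """Return the CE values of two matching systems A and B."""
    t_slow = max(t_a, t_b)
    t_an, t_bn = t_a / t_slow, t_b / t_slow
    ce_a = w_ov * ov_a + w_t * (1 / t_an)
    ce_b = w_ov * ov_b + w_t * (1 / t_bn)
    return ce_a, ce_b

# Hypothetical systems: A is more effective but slower, B less effective
# but five times faster; equal weights as for time-critical settings.
ce_a, ce_b = cost_effectiveness(ov_a=0.9, t_a=10.0,
                                ov_b=0.6, t_b=2.0,
                                w_ov=0.5, w_t=0.5)
# ce_a = 0.45 + 0.5 = 0.95; ce_b = 0.3 + 0.5 * (1 / 0.2) = 2.8
```

Note that 1/Tn ≥ 1 for the faster system, so CE is not bounded by 1; with equal weights, a large speed advantage can outweigh a moderate loss of effectiveness.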
5 Experimental Evaluation
We evaluate our approach by comparing two recent well-known schema matching systems, namely COMA++ and BTreeMatch. We obtained both systems from their open-source distributions. All the experiments were performed using the XBenchMatch tool developed in [8]. However, our proposed approach can easily be applied to other schema matching systems; the problem is that it is hard to find available matching prototypes to test. We first briefly describe the two evaluated systems according to the performance criteria described above in the
paper. Then we describe the data sets used. Finally, we present the results of applying the proposed approach to evaluate the different schema matching systems.

5.1 Evaluated Systems
Motivated by the fact that the two matching systems COMA++ and BTreeMatch are available through their open distributions, we use them to validate our approach. The two systems share some features and differ in others. The shared features are: both are schema-based approaches; they utilize rule-based algorithms; they accept XML schemas as input and produce element-level (one-to-one) mappings; they need pre-match effort, e.g., tuning match parameters and defining the match strategy; and they evaluate matching effectiveness using precision, recall, and F-measure. The two systems differ in the following points: COMA++ exploits an external dictionary as auxiliary information; it uses a rich library of matchers including simple, hybrid, fragment, and context matchers; and it did not consider matching efficiency in its evaluation. BTreeMatch, on the other hand, does not utilize any auxiliary information sources; it uses a hybrid matcher based on a B-tree index; and it deals with large-scale schemas, measuring response time as the measure for matching efficiency.

5.2 Data Set
We used the same data sets described in [8]. To make this paper self-contained, we summarize the properties of the data sets in Table 2. The first one describes a person, the second is related to business orders, the third one represents university courses, and the last one comes from the biology domain. These data sets are run on the COMA++ and BTreeMatch systems to determine their cost-effectiveness and to compare them.

Table 2. Data set details from [8]

                      Person   University   Order    Biology
No. nodes (S1/S2)     11/10    18/18        20/844   719/80
Avg. no. nodes        11       18           432      400
Max. depth (S1/S2)    4/4      5/3          3/3      7/3
No. mappings          5        15           10       57
5.3 Experimental Results
In this section we show the experimental results of applying our approach to COMA++ and BTreeMatch using the XBenchMatch tool developed in [8].

Small-scale Schemas: The cost-effectiveness of the tested matchers on small-scale schemas, such as the university and person schemas, can be computed by the following equations:

CE_COMA++s = wOV · OV_COMA++ + wt · (1/T_COMA++n)    (6)

CE_BTMs = wOV · OV_BTM + wt · (1/T_BTMn)    (7)

where OV_COMA++ = 0.8, OV_BTM = 0.3, T_COMA++ = 0.9s, T_BTM = 0.6s, wOV = 0.8 (for small-scale schemas), and wt = 0.2. Then CE_COMA++s = 0.64 + 0.2/1 = 0.84 and CE_BTMs = 0.24 + 0.2/(0.6/0.9) ≈ 0.5.
Large-scale Schemas: The cost-effectiveness of the tested matchers on large-scale schemas, such as the biology schema, can be computed by the following equations:

CE_COMA++l = wOV · OV_COMA++ + wt · (1/T_COMA++n)    (8)

CE_BTMl = wOV · OV_BTM + wt · (1/T_BTMn)    (9)

where OV_COMA++ = 0.4, OV_BTM = 0.8, T_COMA++ = 4s, T_BTM = 2s, wOV = 0.6 (for large-scale schemas), and wt = 0.4. Then CE_COMA++l = 0.24 + 0.4/1 = 0.64 and CE_BTMl = 0.48 + 0.4/(2/4) = 1.28.

5.4 Discussion
The experiments show that each schema matching prototype is best suited for certain situations. For example, as Table 3 shows, the cost-effectiveness of COMA++ is good for small-scale schemas, where BTreeMatch performs poorly; for large-scale schemas, however, the cost-effectiveness increases for BTreeMatch and decreases for COMA++.

Table 3. Summary of results

                          OV                        T                         CE
Evaluated System   small-scale  large-scale  small-scale  large-scale  small-scale  large-scale
COMA++             0.8          0.4          0.9s         4s           0.84         0.64
BTreeMatch         0.3          0.8          0.6s         2s           0.5          1.28
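As a sanity check of ours (not from the paper), the large-scale cost-effectiveness values in Table 3 can be reproduced directly from equations (8) and (9):

```python
# Reproducing the large-scale CE values of Table 3 with the paper's weights
# wOV = 0.6 and wt = 0.4 for large-scale schemas.

def ce(ov, t_norm, w_ov, w_t):
    """Weighted-sum cost-effectiveness for one system (eq. 3)."""
    return w_ov * ov + w_t * (1 / t_norm)

# COMA++ is the slower system (4s), so T_COMA++n = 1;
# BTreeMatch is normalized to 2s / 4s = 0.5.
ce_coma = ce(ov=0.4, t_norm=1.0, w_ov=0.6, w_t=0.4)   # 0.24 + 0.4 = 0.64
ce_btm  = ce(ov=0.8, t_norm=0.5, w_ov=0.6, w_t=0.4)   # 0.48 + 0.8 = 1.28
```

Both values agree with the CE column of Table 3 for the large-scale case.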
We also study the relationship between the cost-effectiveness and both performance aspects (overall and response time). Figure 2 illustrates this relationship, where the squared line represents the overall only, the dashed line represents the response time, and the solid line represents both. Figure 2(a) is drawn for the small-scale case (i.e., wOV = 0.8 and wt = 0.2), while Fig. 2(b) is drawn for large-scale schemas (wOV = 0.5 and wt = 0.5). In the case of small-scale schemas, the cost-effectiveness is biased more towards the overall measure than the response time of the system, while in the case of large-scale schemas, the cost-effectiveness is influenced by both performance aspects.
Fig. 2. Performance Aspects with Cost-Effectiveness: (a) small-scale schemas, (b) large-scale schemas
6 Summary and Future Work
In this paper, we presented an approach where both effectiveness and efficiency are taken into account. We introduced the trade-off between the aspects of schema matching performance as a multi-objective optimization problem and made use of the weighted-sum approach with a priori articulation of preference information. The cost-effectiveness is taken as a measure of the overall performance. This measure combines (as a weighted sum) the two performance aspects in a single formula, which contains the overall measure as an indicator for effectiveness and the normalized response time as an indicator for efficiency. To increase the flexibility of the method, each quantity is associated with a numerical weight to indicate its importance in the evaluation of the overall performance. In this paper, we set the numerical weights manually, based on our experience. We applied the proposed method to fairly evaluate and compare two well-known schema matching systems (COMA++ and BTreeMatch) and discussed the effect of schema size on match performance. For small-scale schemas, match performance is affected more by match effectiveness, while for large-scale schemas both performance aspects have equal effect. Our proposed approach can be integrated with recent schema matching benchmarks. Moreover, our ongoing work is to build a unified evaluation process in order to decide on schema matching performance. The impact of the numerical weight values and identifying optimal values automatically are part of our future work.
References

1. Bernstein, P.A., Melnik, S., Churchill, J.E.: Incremental schema matching. In: VLDB, Korea (2006)
2. Do, H.H., Melnik, S., Rahm, E.: Comparison of schema matching evaluations. In: 2nd Int. Workshop on Web Databases (2002)
3. Do, H.H., Rahm, E.: COMA - a system for flexible combination of schema matching approaches. In: VLDB, pp. 610–621 (2002)
4. Do, H.-H., Rahm, E.: Matching large schemas: Approaches and evaluation. Information Systems 32(6), 857–885 (2007)
5. Doan, A., Domingos, P., Halevy, A.: Reconciling schemas of disparate data sources: A machine-learning approach. In: SIGMOD, pp. 509–520 (2001)
6. Doan, A., Halevy, A.: Semantic integration research in the database community: A brief survey. AAAI AI Magazine 25(1), 83–94 (2005)
7. Drumm, C., Schmitt, M., Do, H.-H., Rahm, E.: QuickMig - automatic schema matching for data migration projects. In: Proc. ACM CIKM 2007, Portugal (2007)
8. Duchateau, F., Bellahsene, Z., Hunt, E.: XBenchMatch: a benchmark for XML schema matching tools. In: VLDB 2007, Austria, pp. 1318–1321 (2007)
9. Duchateau, F., Bellahsene, Z., Roche, M.: An indexing structure for automatic schema matching. In: SMDB Workshop, Turkey (2007)
10. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with Cupid. In: VLDB, Italy, pp. 49–58 (2001)
11. Marler, R.T., Arora, J.S.: Survey of multi-objective optimization methods for engineering. Struct. Multidisc. Optim. 26, 369–395 (2004)
12. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In: ICDE 2002 (2002)
13. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB Journal 10(4), 334–350 (2001)
14. van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworths, London (1979)
15. Smiljanic, M.: XML Schema Matching: Balancing Efficiency and Effectiveness by Means of Clustering. PhD thesis, University of Twente (2006)
16. Yatskevich, M.: Preliminary evaluation of schema matching systems. Technical Report DIT-03-028, University of Trento (2003)
17. Zhang, Z., Che, H., Shi, P., Sun, Y., Gu, J.: Formulation of the schema matching problem as a combinatorial optimization problem. IBIS 1(1), 33–60 (2006)
18. Zitzler, E., Thiele, L.: Multiobjective evolutionary algorithms: A comparative case study and the strength Pareto approach. IEEE Transactions on Evolutionary Computation 3(4), 257–271 (1999)
Model-Driven Development of Complex and Data-Intensive Integration Processes

Matthias Böhm, Dirk Habich, Wolfgang Lehner, and Uwe Wloka
Abstract. Due to the changing scope of data management from centrally stored data towards the management of distributed and heterogeneous systems, the integration takes place on different levels. The lack of standards for information integration as well as application integration resulted in a large number of different integration models and proprietary solutions. With the aim of a high degree of portability and the reduction of development efforts, the model-driven development—following the Model-Driven Architecture (MDA)—is advantageous in this context as well. Hence, in the GCIP project (Generation of Complex Integration Processes), we focus on the model-driven generation and optimization of integration tasks using a process-based approach. In this paper, we contribute detailed generation aspects and finally discuss open issues and further challenges.

Keywords: Model-Driven Architecture, Integration Processes, GCIP, Federated DBMS, Enterprise Application Integration, Extraction Transformation Loading.
R.-D. Kutsche and N. Milanovic (Eds.): MBSDI 2008, CCIS 8, pp. 31–42, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

The scope of data management continuously changes from centrally stored data to the integration of distributed and heterogeneous systems. In fact, integration is realized on different levels of abstraction: we distinguish between information integration (function integration and data integration), application integration, process integration, and partially also GUI integration. Due to missing standards for information integration and application integration, numerous different integration systems with overlapping functionality exist. As a consequence, only a low degree of portability of integration task specifications can be reached. However, portability is strongly needed, in particular in the context of extending existing integration projects. Typically, an enterprise IT infrastructure comprises multiple integration systems. Assume that two relational databases are currently integrated with a federated DBMS for reasons of performance. If this integration process is to be extended with the aim of integrating an SAP R/3 system, functional restrictions make it necessary to transfer the integration process to the existing EAI server, with the lowest possible development effort. Aside from the main problem of portability, there are three more major problems. First, extensive efforts are
needed in order to specify integration tasks. This includes, for example, mapping specifications between heterogeneous data schemas. Although there are approaches to minimize these efforts, like automated schema matching or data integration with uncertainty (e.g., dataspaces), these approaches are not applicable in real-world enterprise scenarios due to insufficient exactness. Second, very little work exists on the optimization of integration tasks. However, this is crucial, e.g., in the context of application integration, where complete business processes depend on it. Third, the decision about the optimal integration system (w.r.t. performance, functionality and development effort) is made based on subjective experience rather than on objective integration system properties.

Hence, our GCIP project (Generation of Complex Integration Processes) addresses the mentioned problems using a model-driven approach to generate and optimize integration processes. Basically, integration processes are modeled as platform-independent models (PIM) using graphical process description notations like UML and BPMN. Those are transformed into an abstract platform-specific model (A-PSM). Based on this central representation, different platform-specific models (PSM) can be generated with the help of specific platform models. Currently, we provide platform models for federated DBMS, EAI servers and ETL tools. Finally, each PSM can be transformed into multiple declarative process descriptions (DPD), which represent the code layer.

The contribution of this paper mainly comprises three issues. First, after surveying related work in this research area in Section 2, we show how integration processes can be modeled in a platform-independent manner in Section 3. Second, in Section 4, we discuss generation aspects, including the A-PSM, PSM and DPD specifications. Third, in Section 5, we enumerate open issues and research challenges related to our approach.
Finally, we conclude the paper in Section 6 and give an overview of our future work on the optimization of integration processes.
2 Related Work

A lot of work on MDA techniques already exists [1,2]. Further, sophisticated tools and frameworks for model-driven software engineering have been developed. Therefore, we survey related work in three steps. First, we point out major MDA techniques. Second, we show that suitable MDA support exists for application development as well as for data modeling. Third, we illustrate the lack of model-driven techniques and approaches in the context of integration processes. The core concepts of the Model-Driven Architecture (MDA)—specified by the Object Management Group (OMG)—are MOF [3], standardized meta models like UML [4], and model-to-model transformation techniques. In accordance with [5,6], it can be stated that QVT (Query/View/Transformation) and TGG (Triple Graph Grammars) are currently the most promising techniques for such model-to-model transformations. Due to its bidirectional mappings and the possibility of incremental model changes, TGG in particular is seen as the most suitable solution for model transformations. A TGG comprises three graphs: the left-side graph (representing the first model), the right-side graph (representing the second model) and, finally, a correspondence graph between the two models. Based on the given correspondence graph, a TGG rule interpreter is able to process the graph transformations in both directions.
Further, the MDA paradigm is widely used in the area of database applications for database creation. First, model-driven data modeling and the generation of normalized database schemas should be mentioned. This approach is well known, and many different systems and tools exist. Second, there is the generation of full database applications, including the data schema as well as data layer code, business logic layer code, and even user interface code. In this area, CASE tools in particular have to be named. However, there are still open research items in this application area; for example, logical database optimization is not realized adequately. The GignoMDA project [7,8] addresses this research item by allowing hint specification. Third, it can be stated that MDA is also increasingly used in the application area of data warehouse schema creation. In accordance with the MDA paradigm, the CWM (Common Warehouse Metamodel) [9] should be mentioned here. However, in contrast to normalized data modeling, there is not as much MDA support for data warehouse modeling.

Within the context of integration processes, there is insufficient support for model-driven development. Basically, two different groups of related work should be distinguished in this area. First, there is the modeling of workflow processes. With WSBPEL [10], there is a promising standard and several extension proposals. However, this language has deficits concerning the modeling of data-intensive integration processes; moreover, it is a platform-specific model. In accordance with the MDA paradigm, the existing process description languages can be classified into three layers: (1) the graphical representation layer, (2) the description layer and (3) the execution layer. The graphical notations of UML 2.0 and BPMN 2.0 [11] belong to (1). Further, in (2), only WS-CDL [12] should be mentioned. Finally, (3) comprises WSBPEL, XPDL [13] and ebBP [14].
With this logical stratification in mind, the model-driven development of workflow processes is possible. In contrast to this, there are only few contributions on generic integration process generation. One of these is the RADES approach [15], which tries to give an abstract view on EAI solutions using technology-independent and multi-vendor-capable model-driven engineering methodologies. However, this approach is very specific to EAI solutions. Although some work on ETL process modeling [16,17,18] and ETL model transformation [19,20,21] exists, most data integration research addresses schema matching automation [22,23] or static meta data management [24] rather than the model-driven generation of integration tasks for different target integration systems. Thus, we are not aware of any solution for the generic generation of data-intensive integration processes. However, we are convinced that such a solution is required as a logical consequence of the historical evolution in this area. Finally, we want to refer to the Message Transformation Model (MTM), a conceptual model for data-centric integration processes in the area of message-based application integration, which was briefly introduced in [25]. This model is used as our abstract platform-specific model and thus as the core of the whole generation process.
3 Integration Process Modeling

Due to the lack of solutions for generic model-driven integration process generation, the main approach of modeling complex integration processes is introduced in this section. As already mentioned, integration processes can be specified with
platform-independent models (PIM). Therefore, the PIM has to be derived from the computation-independent model (CIM), which is represented by textual specifications. It is the first formal specification of the integration process and thus the starting point for automated generation. In order to allow for different target platforms, no technology is chosen at this level. Thus, the platform independence of the model is achieved by restricting the modeling to graphical notations like BPMN [11] and UML [4]. The procedure for PIM modeling is equivalent to the procedure of structured analysis and design (SAD). The single steps—introduced in the following—allow for different degrees of abstraction during modeling time. Note that different persons might model the different degrees of abstraction.

1. Determination of terminators (external systems): First, all external systems, also known as the terminators of a process description, and their types are determined. This is derived from the overall system architecture.
2. Determination of interactions with the terminators: Second, the type of interaction is specified for each interaction between the integration system and an external system. This comprises the determination of whether these are read (pull), write (push) or mixed interaction forms.
3. Control flow modeling: Third, the detailed control flow is modeled, including structural aspects (e.g., alternatives), time aspects (e.g., delays and asynchronous flows) as well as signal handling (e.g., errors).
4. Data flow modeling: Fourth, the abstract data flow modeling is applied. This means the general specification of data flow activities like filters, data transformations and the transactional behavior.
5. Detailed model parameterization: Finally, all control flow and data flow activities are parameterized in detail using annotations.
Thereby, condition evaluations are specified and the data transformation is described on the schema mapping level. These five modeling steps result in a single platform-independent model representing the integration process. It would be possible to separate aspects into different model views (terminator interactions, control and data flow, parameterization and configuration); due to the increasing complexity, we explicitly do not use multiple models. We now introduce the example processes P13 and P02 from the DIPBench (Data-Intensive Integration Process Benchmark) specification [26,27]. Obviously, these are not complex integration processes, but we use these process types as running examples throughout the whole paper. Figure 1 shows PIM P13 and PIM P02, modeled with StarUML using the supported UML activity diagrams. Process type P13 basically describes the extraction of movement data from a consolidated database and its loading into a global data warehouse. Let us apply the introduced procedure for PIM process modeling: First, the two external systems cs1.cdb.DBA and cs1.dwh.DBA are determined ("service") and identified as RDBMS. Second, the interaction types ("operation") have to be detected: at the beginning of the process, a stored procedure is called on cs1.cdb.DBA in order to realize the data cleansing [28]. Furthermore, two different datasets are queried from cs1.cdb.DBA. These datasets are finally inserted into cs1.dwh.DBA. Third, the control flow can be determined as a simple sequence of process steps. Fourth, the data flow—specified by
Model-Driven Development of Complex and Data-Intensive Integration Processes
Fig. 1. Example PIM P13 and PIM P02
rectangles and dashed arrows—has to be provided. As illustrated, the specific datasets are transferred from the extracting to the loading activities. Fifth, and thus finally, the model is enriched with technical details (in the form of UML annotations) like table names, procedure names and so on. The process type P02 focuses on the reception of customer master data (XML messages), its translation to relational data, and the content-based insertion into one of three target systems, based on the extracted sitekey.
4 Integration Process Generation

Based on the modeled platform-independent integration processes, our GCIP Framework supports the generation of several platform-specific models. Due to the complexity of this generation, we describe it from various perspectives.

4.1 A-PSM Generation

In contrast to other approaches, we use an abstract platform-specific model (A-PSM) between the PIMs and the PSMs. This is motivated by four facts. First, it reduces the transformation complexity between n PIMs and m PSMs from n · m to n + m. Second, it separates the most general PIM representations from the context-aware A-PSM of integration processes, which is still independent of any integration system type. Third, it offers the possibility of applying context-aware optimization techniques in a unique (normalized) manner. And fourth, it increases the simplicity of model transformations by using small, well-defined transformation steps. Basically, the Message Transformation Model (MTM) is used as the A-PSM. This model represents the starting point of all transformations into and from platform-specific models. Obviously, we are only able to generate integration processes which can be expressed adequately with the MTM. Although the MTM was already briefly introduced in [25], its importance drives us to give a short overview of this meta model. The MTM is a conceptual model for data-intensive integration processes and is separated into a conceptual message model and a conceptual process model.
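The reduction of transformation complexity from n · m to n + m can be illustrated by enumerating the required transformation pairs; the notation and platform names below are illustrative only:

```python
# Direct PIM->PSM transformations need one transformation per
# (source notation, target platform) pair.
pim_notations = ["UML-AD", "BPMN"]        # n = 2 source notations (illustrative)
psm_platforms = ["FDBMS", "ETL", "EAI"]   # m = 3 target platforms (illustrative)

direct = [(s, t) for s in pim_notations for t in psm_platforms]
assert len(direct) == len(pim_notations) * len(psm_platforms)    # n * m = 6

# With the A-PSM (MTM) as an intermediate hub, each notation maps to the
# A-PSM once, and the A-PSM maps to each platform once.
via_apsm = [(s, "A-PSM") for s in pim_notations] + \
           [("A-PSM", t) for t in psm_platforms]
assert len(via_apsm) == len(pim_notations) + len(psm_platforms)  # n + m = 5
```

With two source notations and three target platforms the saving is small, but it grows quickly as further PIM notations or PSM platforms are added.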
M. Böhm et al.
The conceptual message model was designed with the aim of logical data independence and represents the static aspects of a message transformation. In accordance with the molecule atom data model (MAD), the MTM meta message model can be seen as a molecule type message and thus as a recursive, hierarchical, object-oriented and redundancy-free structure. The molecule type message is composed of two atom types: the header segment and the data segment. Furthermore, there is a unidirectional, recursive self-reference of the atom type data segment, with a 1:CN cardinality, which represents the molecule type data segment. The header segment is composed of k name-value pairs, whereas the data segment is a logical table with l attributes and n tuples. The option of nested tables ensures a dynamic and structure-carrying description of all data representations. The conceptual process model (the execution model for the defined messages) addresses the dynamic aspects of a transformation process and was designed with the aim of independence from concrete process description languages. Basically, a graph-oriented base model Directed Graph is used. It is limited to three components: the node (generalized process step), the transition (edge between two nodes) and the hierarchical process. A single node may have multiple leaving transitions, and during the runtime of one process instance, there may be multiple active leaving transitions, but not more than the total number of its leaving transitions. Each transition has exactly one target node; however, multiple transitions may refer to the same node. The process is a hierarchical element and contains a start node, several intermediate nodes and an end node. In fact, such a process is also a specialized node, so that recursive execution with any hierarchy level is possible.
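To make the message structure concrete, here is a hypothetical Python sketch of a molecule type message with a nested data segment (the type and attribute names are illustrative, not part of the MTM):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class DataSegment:
    """Logical table with l attributes and n tuples; a tuple value may itself
    be a DataSegment (the recursive 1:CN self-reference, i.e. nested tables)."""
    attributes: List[str]
    tuples: List[list] = field(default_factory=list)

@dataclass
class Message:
    """Molecule type 'message': one header segment plus one data segment."""
    header: Dict[str, str] = field(default_factory=dict)  # k name-value pairs
    data: Optional[DataSegment] = None

# An orders table whose second attribute nests an orderline table per tuple.
orderlines = DataSegment(["Linenumber", "Quantity"], [[1, 10], [2, 5]])
orders = DataSegment(["Orderkey", "Orderlines"])
orders.tuples.append([4711, orderlines])          # nested table inside a tuple

msg = Message(header={"msgid": "msg3"}, data=orders)
```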
A node receives a set of input messages, executes several processing steps specified by its node type and its parameterization, and finally returns a set of output messages. The actual process model is defined (with the aim of a low degree of redundancy) on top of the base model Directed Graph. Operators are defined as specialized process steps and thus as node types. Basically, these are divided into three categories: interaction-oriented operators (Invoke, Receive and Reply), control-flow-oriented operators (Switch, Fork, Delay and Signal) and data-flow-oriented operators (Assign, Translation, Selection, Projection, Join, Setoperation, Split, Orderby, Groupby, Window, Validate, Savepoint and Action).

Definition 1. A process type P is defined with P = (N, S, F) as a 3-tuple representation of a directed graph, where N = {n1, . . . , nk} is a set of nodes, S = {s1, . . . , sl} is a set of services, including their specific operations si = {o1, . . . , om}, and F ⊆ (N × (S ∪ N)) is a set of flow relations between two nodes or a node and a service. Each node has a specific node type as well as an identifier NID (unique within the process type) and is either of an atomic or a complex type. Each process type P, with P ⊆ N, is also a node. A process p with P ⇒ p has a specific state z(p) = {z(n1), . . . , z(nk)}. Thus, the process state is an aggregate of the specific single node states z(ni).

Basically, two different event types initiating such integration processes have to be distinguished. The specific event type has a high impact on the process modeling and on the generation of platform-specific models for different target integration systems. The main event types can be distinguished as follows:
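Definition 1 can be mirrored directly in code. The following hypothetical sketch represents a process type P = (N, S, F) and an aggregated process state (all identifiers are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    nid: str     # identifier NID, unique within the process type
    ntype: str   # node type, e.g. 'Receive', 'Assign', 'Invoke'

# Process type P = (N, S, F): nodes, services with operations, flow relations.
N = {Node("n1", "Receive"), Node("n2", "Assign"), Node("n3", "Invoke")}
S = {"cs1.dwh.DBA": {"QUERY", "INSERT"}}       # service -> its operations
by_id = {n.nid: n for n in N}
F = {(by_id["n1"], by_id["n2"]),               # node -> node flow
     (by_id["n2"], by_id["n3"]),
     (by_id["n3"], "cs1.dwh.DBA")}             # node -> service flow

# F is a subset of N x (S union N): every flow target is a node or a service.
assert all(t in N or t in S for (_, t) in F)

# A process instance aggregates the single node states z(n_i).
z = {n.nid: "finished" for n in N}
```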
– Message stream: Processes are initiated by incoming messages. In accordance with the area of data streaming, such a stream is an event stream. Processes of this event type have a RECEIVE operator and are able to reply to the invoking client.
– External events and scheduled time events: Processes are initiated depending on a time-based schedule or by external schedulers. These processes do not have a RECEIVE operator and are not able to reply synchronously to an invoking client.

An XML representation of the conceptual MTM was defined. Thus, the external XML representation of the PIM can be transformed into the XML representation of the A-PSM using transformation templates. Figure 2 shows the A-PSM P13, derived from the PIM P13:

  msg1/pname = 'sp_runMovementDataCleansing'
  Service{ cs1.cdb.DBA }; Operation{ CALL }; IN msg1/pname;
  msg2/tblname1 = 'Orders'
  msg2/tblname2 = 'Orderline'
  Service{ cs1.cdb.DBA }; Operation{ QUERY }; IN msg2/tblname1; OUT msg3/dataset
  Service{ cs1.cdb.DBA }; Operation{ QUERY }; IN msg2/tblname2; OUT msg4/dataset
  msg3/tblname = 'Orders'
  msg4/tblname = 'Orderline'
  Service{ cs1.dwh.DBA }; Operation{ INSERT }; IN msg3/tblname, msg3/dataset;
  Service{ cs1.dwh.DBA }; Operation{ INSERT }; IN msg4/tblname, msg4/dataset;

Fig. 2. Example A-PSM P13

Basically, the MTM is message-based and thus uses message variables instead of the explicit data flow used in the PIM. Furthermore, detailed parameters like table names and stored procedure names have to be assigned to the input messages. That is why the logical sequence is extended with several ASSIGN operators. In contrast to this, the interaction-oriented operator INVOKE is simply mapped. Finally, note that technical details, like schema mappings and configuration properties, can be annotated at the PIM as well as the A-PSM level, while all following models are generated in a fully automated manner.
4.2 PSM Generation

From the unique A-PSM, multiple platform-specific models (PSM), including PSMs for FDBMS, ETL tools and EAI servers, can be generated. For this transformation, the defined platform models (PM) are used. Here, a PM represents a meta model for an integration system type, such as the type ETL tool. We describe in detail only the PM for FDBMS, including the resulting PSM, while the integration system types ETL tools and EAI servers are only discussed from a high-level perspective.

Federated DBMS PSM. The PSM for federated DBMS exhibits structural as well as semantic differences from the A-PSM (MTM). In contrast to the MTM, it is hierarchically structured rather than a graph-based model. Thus, some structured components recursively include other structured components as well as atomic components. Furthermore, the two different process-initiating event types are expressed with two different possibilities for the root component. First, there is the Trigger, which is bound to a queuing table and represents the event type message stream. Second, there is the
Table 1. Interaction-Oriented Operators

Name           Description
Call           Invoke a persistently stored module (e.g., procedures, functions)
Insert         Load a dataset into a specified relation
Delete         Delete a specified dataset
Update         Set a specified attribute value
Resource Scan  Extract a dataset from a specified relation

Table 2. Control-Flow-Oriented Operators

Name       Description
If         Execute all path children if the path condition is true
Delay      Interrupt the execution based on a timestamp or for a specified time
Signal     Terminate the integration process or raise a defined error
Iteration  Loop over all children while the specified condition is true

Table 3. Data-Flow-Oriented Operators

Name          Description
Assign        Value assignment of atomic or complex objects (different query language)
Translation   Execution of elementary schema translations
Selection     Choice of tuples in dependence on a specific condition
Projection    Choice of attributes in dependence on a specific attribute list
Join          Compound of multiple datasets depending on conditions and types
Setoperation  Use of the set operations union, intersection and difference
Split         Decomposition of a large XML document into multiple rows
Orderby       Sorting of a dataset depending on a specified attribute
Groupby       Partitioning of tuples with grouping attributes and aggregate functions
Window        Partitioning of tuples for ordering and correlations, without grouping
Validate      Constraint validation on a dataset depending on a specific condition

Table 4. Transaction-Oriented Operators

Name        Description
BeginTX     Start a new transactional context
CommitTX    End the current transactional context in a successful way
RollbackTX  End the current transactional context in a failed way
Savepoint   Write intermediate results for recovery processing
Procedure, representing the event type external events and scheduled time events, where no data-intensive parameters are specified for such a procedure. Both of these root component types implicitly include the handling of the transactional context. Before revisiting our example, the components of the PM FDBMS should be mentioned. Basically, they are divided into four groups: interaction-oriented operators, control-flow-oriented operators, data-flow-oriented operators and transaction-oriented operators. These groups are explained in detail in Tables 1 to 4.
Using the introduced platform model, the PSM FDBMS is derived from the A-PSM. Figure 3 shows this PSM FDBMS representation of the example process P13. The process is transformed into the hierarchical structure Procedure.

Fig. 3. Example FDBMS PSM P13

Here, a BeginTX is implicitly inserted at BOT and a CommitTX is inserted at EOT. The first invoke and all correlated properties are represented by the Call element. Finally, there are two query trees, constructed of Insert and Resource Scan elements, which copy the two movement data relations from the consolidated database to the data warehouse.

ETL Tool PSM. Similar to the first-mentioned PSM, the PSM for ETL tools has some semantic differences from the MTM. This platform model is based on tuple routing between data-flow-oriented process steps, where the edges between these steps contain, on the conceptual layer, queues for buffering tuples. Furthermore, it is an acyclic model, which is typical for the concentration on data flow semantics. In our generic platform model of ETL tools, no special process steps are included directly in the model because these are mostly proprietary definitions. Using the mentioned platform model, the introduced example A-PSM P13 can be transformed to the ETL PSM. The derived model includes specialized process steps, which are specific to the used source and target systems. Further, the parameter specifications are directly included in the used steps. This approach promises high performance but causes a higher data dependence than an EAI server would.

EAI Server PSM. Many commercial EAI servers use XML technologies and standard workflow languages like WSBPEL [10] or XPDL [13] for integration process specification. This abstract definition and the well-known adapter concept realize the needed data independence.
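The tuple routing with buffering queues described for the ETL platform model can be sketched as follows; the selection and projection steps stand in for tool-specific process steps and are invented for illustration:

```python
from collections import deque

def run_pipeline(source_tuples, steps):
    """Route tuples through data-flow steps; each edge buffers tuples in a queue.
    steps: functions mapping a tuple to a tuple, or to None to filter it out."""
    queue = deque(source_tuples)       # queue on the edge entering the first step
    for step in steps:
        next_queue = deque()           # queue on the step's outgoing edge
        while queue:
            out = step(queue.popleft())
            if out is not None:
                next_queue.append(out)
        queue = next_queue
    return list(queue)

# Illustrative steps: a selection (drop empty orders) and a projection.
rows = [("Orders", 4711, 10), ("Orders", 4712, 0)]
result = run_pipeline(rows, [
    lambda t: t if t[2] > 0 else None,   # selection on the quantity field
    lambda t: (t[1], t[2]),              # projection to key and quantity
])
assert result == [(4711, 10)]
```

The acyclic chain of steps connected only through queues is what gives this model its pure data-flow semantics.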
Thus, the derivation of the EAI server PSM implies the mapping from A-PSM process types to WSBPEL process descriptions. Although this is a straightforward mapping for interaction and control flow operators, it is recommended that the specific target integration systems support the WSBPEL extension WSBPEL-MT. Otherwise, the data-flow-oriented operators defined for the MTM cannot be used. However, for the example process P13, this extension support is not required because only interaction-oriented operators, control-flow-oriented operators and the data-flow-oriented operator Assign are used.

4.3 DPD Generation

In contrast to the MDA paradigm, where the lowest representation level is the Code layer, we name the lowest MDA-relevant level of the integration process representation the Declarative Process Description (DPD). This decision was made in order to distinguish (1) the integration process specifications (DPD) deployed in the specific integration system and (2) the actual generated integration code, internally produced by the integration system. However, the mentioned process descriptions are usually
specified in a declarative way because this leaves enough room for physical optimization techniques. Let us consider the concrete DPDs for federated DBMS, ETL tools and WSBPEL-conforming EAI servers. In the case of FDBMS, the PSM P13 is generated into a DPD represented by several DDL statements and an SQL stored procedure. In contrast, P02 is generated in the form of several DDL statements and a trigger, due to the different event type. Both examples are included in Appendix A. The majority of ETL tools work based on XML specifications, which may be written directly or specified via an additional GUI. The EAI server works similarly, except for the standardized language specification. Further, there are two main types of XML specification usage for these two integration system types. Some tools statically generate imperative code in the form of internal execution plans. Other tools dynamically interpret the XML specifications, where created object graphs are used. The decision is made based on performance aspects as well as flexibility requirements.
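The two usage types, static generation of imperative code versus dynamic interpretation of the specification, can be contrasted with a minimal sketch (the operator specification format below is invented for illustration, not any tool's actual format):

```python
# A tiny declarative process description: a selection followed by a projection.
spec = [("selection", lambda r: r["qty"] > 0),
        ("projection", ["key", "qty"])]

# (a) Dynamic interpretation: walk the spec (an object graph) per execution.
def interpret(spec, rows):
    for op, arg in spec:
        if op == "selection":
            rows = [r for r in rows if arg(r)]
        elif op == "projection":
            rows = [{k: r[k] for k in arg} for r in rows]
    return rows

# (b) Static generation: emit an imperative execution plan once, then run it.
def generate(spec):
    src = ["def plan(rows):"]
    for i, (op, _) in enumerate(spec):
        if op == "selection":
            src.append(f"    rows = [r for r in rows if PRED{i}(r)]")
        elif op == "projection":
            src.append(f"    rows = [{{k: r[k] for k in COLS{i}}} for r in rows]")
    src.append("    return rows")
    env = {f"PRED{i}": a for i, (op, a) in enumerate(spec) if op == "selection"}
    env.update({f"COLS{i}": a for i, (op, a) in enumerate(spec) if op == "projection"})
    exec("\n".join(src), env)
    return env["plan"]

rows = [{"key": 1, "qty": 10, "note": "x"}, {"key": 2, "qty": 0}]
assert interpret(spec, rows) == generate(spec)(rows) == [{"key": 1, "qty": 10}]
```

Both variants produce the same result; the generated plan avoids per-tuple dispatch over the specification (the performance argument), while the interpreter keeps full flexibility.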
5 Open Issues and Challenges

Due to the complexity of model-driven generation, there are major challenges and open research issues to be solved. Basically, we see the following five points:

– Schema mapping extraction: We extract schema mapping information from XSLT and STX stylesheets. Due to the complex functionality of these languages, there are cases where schema mapping extraction cannot be realized. Hence, sophisticated techniques for schema mapping generation have to be developed, including levels of uncertainty when no exact information can be provided.
– Round-trip engineering: Our approach works in a top-down direction, so all changes should be made at the PIM or the A-PSM level. In order to support incremental changes and the migration of legacy integration tasks, reverse engineering would be advantageous. However, there are many challenges correlated with this issue.
– Intra-system process optimization: The model-driven generation leaves enough room for logical process optimization, rewriting process plans with rule-based and workload-based optimization techniques.
– Inter-system process optimization: Aside from the aforementioned optimization, the most powerful global optimization technique is the decision on the optimal integration system (w.r.t. performance). Thus, the major challenge is the workload-based decision on the chosen PSM and DPD.
– Supporting different execution models: There are integration system types which have a completely different execution model (e.g., subscription systems like replication servers), and thus supporting those creates further challenges.
6 Summary and Conclusion

In this paper, we addressed two major problems within the area of integration processes. First, there was the problem of a low degree of portability. Second, large efforts were needed in order to set up and maintain integration scenarios. We conceptually showed (and evaluated with the implemented GCIP Framework) that a model-driven generation approach can dramatically reduce these two problems and is thus advantageous for this
application context as well. However, there are further major problems and many open research challenges. In our future work, we will focus on the logical optimization of integration processes using our model-driven generation approach.
References

1. Kleppe, A., Warmer, J., Bast, W.: MDA Explained. The Model Driven Architecture: Practice and Promise. Addison-Wesley, Reading (2003)
2. Thomas, D., Barry, B.M.: Model driven development: the case for domain oriented programming. In: OOPSLA (2003)
3. OMG: Meta-Object Facility (MOF), Version 2.0 (2003)
4. OMG: Unified Modeling Language (UML), Version 2.0 (2003)
5. Königs, A.: Model transformation with triple graph grammars. In: MODELS (2005)
6. Czarnecki, K., Helsen, S.: Feature-based survey of model transformation approaches. IBM Syst. J. 45(3) (2006)
7. Habich, D., Richly, S., Lehner, W.: GignoMDA - exploiting cross-layer optimization for complex database applications. In: VLDB (2006)
8. Richly, S., Habich, D., Lehner, W.: GignoMDA - generation of complex database applications. In: Grundlagen von Datenbanken (2006)
9. OMG: Common Warehouse Metamodel (CWM), Version 1.0 (2001)
10. OASIS: Web Services Business Process Execution Language, Version 2.0 (2006)
11. BMI: Business Process Modelling Notation, Version 1.0 (2006)
12. W3C: Web Service Choreography Description Language, Version 1.0 (2005)
13. WfMC: Process Definition Interface - XML Process Definition Language 2.0 (2005)
14. OASIS: ebXML Business Process Specification Schema, Version 2.0.1 (2005)
15. Dorda, C., Heinkel, U., Mitschang, B.: Improving application integration with model-driven engineering. In: ICITM (2007)
16. Simitsis, A., Vassiliadis, P.: A methodology for the conceptual modeling of ETL processes. In: CAiSE Workshops (2003)
17. Trujillo, J., Luján-Mora, S.: A UML Based Approach for Modeling ETL Processes in Data Warehouses. In: Song, I.-Y., Liddle, S.W., Ling, T.-W., Scheuermann, P. (eds.) ER 2003. LNCS, vol. 2813, pp. 307–320. Springer, Heidelberg (2003)
18. Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Conceptual modeling for ETL processes. In: DOLAP (2002)
19. Hahn, K., Sapia, C., Blaschka, M.: Automatically generating OLAP schemata from conceptual graphical models. In: DOLAP (2000)
20. Mazón, J.N., Trujillo, J., Serrano, M., Piattini, M.: Applying MDA to the development of data warehouses. In: DOLAP (2005)
21. Simitsis, A.: Mapping conceptual to logical models for ETL processes. In: DOLAP (2005)
22. Dessloch, S., Hernandez, M.A., Wisnesky, R., Radwan, A., Zhou, J.: Orchid: Integrating schema mapping and ETL. In: ICDE (2008)
23. Melnik, S., Rahm, E., Bernstein, P.A.: Rondo: A programming platform for generic model management. In: SIGMOD (2003)
24. Göres, J., Dessloch, S.: Towards an integrated model for data, metadata, and operations. In: BTW (2007)
25. Böhm, M., Habich, D., Wloka, U., Bittner, J., Lehner, W.: Towards self-optimization of message transformation processes. In: ADBIS (2007)
26. Böhm, M., Habich, D., Lehner, W., Wloka, U.: DIPBench: An independent benchmark for data intensive integration processes. In: IIMAS (2008)
27. Böhm, M., Habich, D., Lehner, W., Wloka, U.: DIPBench Toolsuite: A framework for benchmarking integration systems. In: ICDE (2008)
28. Rahm, E., Do, H.H.: Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. (2000)
A Example FDBMS DPD

EXEC sp_addserver cs1, ...
EXEC sp_addexternlogin cs1, ...
CREATE PROXY TABLE rOrdersCDB EXTERNAL TABLE AT "cs1.cdb.DBA.Orders"
CREATE PROXY TABLE rOrderlineCDB EXTERNAL TABLE AT "cs1.cdb.DBA.Orderline"
CREATE PROXY TABLE rOrdersDWH EXTERNAL TABLE AT "cs1.dwh.DBA.Orders"
CREATE PROXY TABLE rOrderlineDWH EXTERNAL TABLE AT "cs1.dwh.DBA.Orderline"

CREATE PROCEDURE P13 AS
BEGIN
  BEGIN TRANSACTION
  EXEC cs1.cdb.DBA.sp_runMovementDataCleansing
  INSERT INTO rOrdersDWH SELECT * FROM rOrdersCDB
  INSERT INTO rOrderlineDWH SELECT * FROM rOrderlineCDB
  COMMIT TRANSACTION
END

Listing 1.1. Example FDBMS DPD P13

...
CREATE TRIGGER P02 ON P02Queue FOR INSERT AS
BEGIN
  DECLARE @customerkey BIGINT,
    ...
    @sitekey INTEGER, ...
  BEGIN TRANSACTION
  SELECT @tid = (SELECT TID FROM inserted)
  SELECT @i = 1
  SELECT @xpath = "//customer[1]"
  WHILE NOT (SELECT xmlextract(@xpath,
      (SELECT convert(VARCHAR(16384), MSG) FROM P02Queue WHERE TID = @tid)
      RETURNS VARCHAR(16384))) IS NULL
  BEGIN
    SELECT @rowptr = xmlextract(@xpath,
      (SELECT convert(VARCHAR(16384), MSG) FROM P02Queue WHERE TID = @tid)
      RETURNS VARCHAR(16384))
    SELECT @customerkey = xmlextract('/customer/@Customerkey', @rowptr RETURNS BIGINT),
      @lastname = xmlextract('/customer/@Lastname', @rowptr RETURNS VARCHAR(40)),
      ...
      @lastmodified = xmlextract('/customer/@LastModified', @rowptr RETURNS DATETIME)
    IF (@sitekey = 2 OR @sitekey = 3)
    BEGIN
      IF NOT EXISTS (SELECT 1 FROM rCompanyBP WHERE Companykey = @companykey)
      BEGIN
        INSERT INTO rCompanyBP (Companykey, Name, ImportanceFlag)
          VALUES (@companykey, @companyname, 1)
      END
      INSERT INTO rCustomerBP (Customerkey, Lastname, Firstname, AddressString,
          Zipcode, Sitekey, Phone1, Phone2, Companykey, Birthday, Created, LastModified)
        VALUES (@customerkey, @lastname, @firstname, @address, @zip, @sitekey,
          @phone1, @phone2, @companykey, @birthday, @created, @lastmodified)
    END
    ELSE IF (@sitekey = 4)
    BEGIN
      IF NOT EXISTS (SELECT 1 FROM rCompanyT WHERE Companykey = @companykey)
      BEGIN
        INSERT INTO rCompanyT (Companykey, Name, ImportanceFlag)
          VALUES (@companykey, @companyname, 1)
      END
      INSERT INTO rCustomerT (Customerkey, Lastname, Firstname, AddressString,
          Zipcode, Sitekey, Phone1, Phone2, Companykey, Birthday, Created, LastModified)
        VALUES (@customerkey, @lastname, @firstname, @address, @zip, @sitekey,
          @phone1, @phone2, @companykey, @birthday, @created, @lastmodified)
    END
    SELECT @i = @i + 1
    SELECT @xpath = "//customer[" || convert(VARCHAR(20), @i) || "]"
  END
  DELETE FROM P02Queue WHERE TID = @tid
  COMMIT TRANSACTION
END

Listing 1.2. Example FDBMS DPD P02
Towards a Metrics Suite for Object-Relational Mappings

Stefan Holder1,*, Jim Buchan2, and Stephen G. MacDonell2

1 Max-Planck-Institut für Informatik, Campus E1 4, 66123 Saarbrücken, Germany
[email protected]
2 School of Computing and Mathematical Sciences, Auckland University of Technology, Private Bag 92006, Auckland 1142, New Zealand
{jbuchan,smacdone}@aut.ac.nz
Abstract. Object-relational (O/R) middleware is frequently used in practice to bridge the semantic gap (the 'impedance mismatch') between object-oriented application systems and relational database management systems (RDBMSs). If O/R middleware is employed, the object model needs to be linked to the relational schema. Following the so-called forward engineering approach, the developer is faced with the challenge of choosing from a variety of mapping strategies for class associations and inheritance relationships. These mapping strategies have different impacts on the characteristics of application systems, such as their performance or maintainability. Quantifying these mapping impacts via metrics is considered beneficial in the context of O/R mapping tools since such metrics enable an automated and differentiated consideration of O/R mapping strategies. In this paper, the foundation of a metrics suite for object-relational mappings and an initial set of metrics are presented.

Keywords: Object-oriented software development; Relational database; Impedance mismatch; Object-relational mapping; Software metrics.
1 Introduction

Applications designed using object-oriented (OO) principles often achieve persistence of the object model using a relational database management system (RDBMS). This introduces a so-called 'impedance mismatch' between an application based on an OO design paradigm and an RDBMS designed according to quite different principles from relational theory. Object-relational (O/R) middleware is often used to bridge this semantic gap by providing a mechanism to map the object model to relations in the database. This layered approach to O/R mapping using middleware achieves the design goal of loose coupling between application and relational schema, making it possible to change the relational schema without the need to also change the application source code. If O/R middleware is employed in this way, the object model needs to be linked to the relational schema using the mapping mechanism offered by the O/R middleware. In the case where a developer creates the object model first and then creates the relational
* The work reported here was conducted primarily while the first author was at Auckland University of Technology on exchange from Fulda University of Applied Sciences, Germany.
schema by mapping the object model to relations, the so-called forward engineering approach, the developer is faced with the challenge of choosing from a variety of mapping strategies for class associations and inheritance relationships. These mapping strategies have different impacts on the non-functional properties of application systems, such as performance and maintainability [9, 11]. The problem for the developer, then, becomes selecting the mapping strategy that best suits the desired non-functional requirement priorities for a particular application. There are a number of approaches to addressing this problem and there is some tool support to automate the selection and generation of the mapping in the O/R middleware. In [8], for example, a single inheritance mapping strategy is used and a fixed mapping strategy for one-to-one, one-to-many, and many-to-many associations is applied, respectively. This approach is straightforward; however, the different impacts of alternative mapping strategies are not considered. Philippi [11] therefore suggests a model-driven generation of O/R mappings. In his approach, the developer defines mapping requirements in terms of quality trade-offs. The mapping tool then automatically selects mappings that fulfill the specified requirements based on general heuristics regarding the impacts of mapping strategies. However, mapping impacts strongly depend on concrete object schema characteristics such as inheritance depth and the number of class attributes [9, 13], properties not considered in [11]. Furthermore, while model-driven generation of O/R mappings is said to ease mapping specification for the developer, such an approach reduces the developer’s control over the mapping process and may result in a sub-optimal mapping for a given application. The developer should therefore have the option to define and refine mappings manually when required. 
We therefore suggest that concrete schema characteristics such as inheritance depth and the number of class attributes should be considered in the selection of O/R mappings. We propose an approach that incorporates schema characteristics into metrics that will provide more accurate and sensitive measures of the impacts of a given mapping specification on non-functional requirements. We further suggest the application of metrics for O/R mappings in order to give the developer sophisticated feedback on the impacts of the chosen mapping. The quantification of mapping impacts using metrics is considered beneficial in the context of O/R mapping tools since these metrics provide a semi-automated mechanism for a developer to evaluate the appropriateness of a selected mapping in terms of its likely impact on desired application non-functional requirements such as performance and maintainability. As its main contribution, this paper provides the foundation for a metrics suite for inheritance mappings and defines several initial metrics. While it is feasible to measure the impact of both association and inheritance mapping strategies with O/R metrics, we focus here on metrics for inheritance mapping strategies, for two reasons. First, the three basic inheritance mapping strategies (described in the following section) are applicable to all inheritance relationships, so the developer will always be required to choose one of them for each inheritance relationship. This is in contrast to association mapping strategies, whose applicability depends on the cardinality of the association, and so sometimes there is only limited choice or even no choice. Second, the impacts of inheritance mapping strategies strongly depend on the characteristics of the object model, such as inheritance depth and number of attributes. Thus, the drivers for the application of O/R metrics for inheritance relationships are more compelling.
The remainder of this paper is structured as follows. In Section 2, basic strategies for mapping entire inheritance hierarchies are described. In Section 3, the semantics of mapping strategies for individual inheritance relationships are defined. The goals of measurement are set in Section 4 before an initial metrics suite for inheritance mappings is proposed in Section 5, based on the defined inheritance mapping semantics. In Section 6, the coverage of the proposed metrics suite is explained. Finally, a conclusion to this paper is given in Section 7.
2 Basic Strategies for Mapping Inheritance Hierarchies

In the literature, inheritance mapping strategies are generally described as being applicable to whole class hierarchies. These 'pure' mapping strategies are explained in this section using the notation suggested in [9].

2.1 One Class – One Table

Following the 'one class – one table' mapping strategy, there is a one-to-one mapping between classes and tables. The corresponding table of a class contains a relational field for each non-inherited class attribute. Thus, object persistence is achieved by distributing object data over multiple tables. In order to link these tables, all tables share the same primary key. In addition, the primary keys are also foreign keys that mimic the inheritance relationships of the object schema [9]. Each row in a table, then, maps to objects in that table's corresponding class and subclasses. It can be seen that for this mapping strategy, only one table needs to be accessed in order to identify the objects that match the query criteria of polymorphic queries. Whereas a non-polymorphic query only returns the objects of the class against which the query is issued, a polymorphic query returns the objects of the specified class and its subclasses for which the query criteria match. This mapping strategy is commonly recommended if the object model's changeability (i.e. ability to easily change) is of primary importance because new classes can be added easily, without the need to modify existing tables. A significant drawback of this mapping strategy, however, is that multiple joins are needed to assemble all attribute data of an object. Moreover, if no views are used, it is relatively difficult to formulate ad-hoc queries because multiple tables need to be accessed to retrieve all object data [9].

2.2 One Inheritance Tree – One Table

Following the 'one inheritance tree – one table' mapping strategy, all classes of an inheritance hierarchy are mapped to the same relational table.
This mapping strategy requires an additional relational field in the shared table, which indicates the type of each row. This mapping strategy offers the best performance for polymorphic queries and allows easy ad-hoc reporting since only a single table needs to be accessed [9]. The changeability of the object model is reduced compared to the ‘one class – one table’ mapping strategy, however, because any object schema modification forces a modification of the sole inheritance table, which may already contain data. Finally, if objects are stored using this mapping strategy, all relational fields that are not needed to store an object must contain null values. This especially applies to
46
S. Holder, J. Buchan, and S.G. MacDonell
relational fields corresponding to attributes of subclasses, since these fields are not used when storing objects of other subclasses.

2.3 One Inheritance Path – One Table

The ‘one inheritance path – one table’ mapping strategy maps only each concrete class to a table. Thereby, all inherited and non-inherited attributes of a class are mapped to the same table. Each table then only contains instances of its corresponding concrete class.

Under this mapping strategy, non-polymorphic queries run as quickly as they do under the ‘one inheritance tree – one table’ mapping strategy because only one table needs to be accessed. In contrast, a polymorphic query against a class needs to access the table that corresponds to this class and all tables that correspond to its subclasses, which can result in a significant performance overhead. Moreover, this mapping strategy implies a duplication of relational fields, which results in multiple updates of these fields if the corresponding class attribute is changed.

The three inheritance mapping strategies just described are restrictive in that they are only applicable to entire inheritance hierarchies. In the next section, we introduce finer-grained inheritance mapping strategies that can be used in combination on elements of an inheritance hierarchy in order to produce an optimal mapping.
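The three pure strategies can be contrasted on a small example. The following sketch is our own illustration (hypothetical table and column names, assuming a Person superclass with Student and Employee subclasses); it lists the DDL each strategy would produce:

```python
# Illustrative sketch (not from the paper): relational schemas produced by the
# three 'pure' inheritance mapping strategies for the hierarchy
#   Person(id, name) <- Student(university), Employee(salary)

ddl = {
    # 2.1 One class - one table: one table per class; subclass primary keys
    # double as foreign keys to the superclass table.
    "one_class_one_table": [
        "CREATE TABLE person  (id INT PRIMARY KEY, name VARCHAR(50))",
        "CREATE TABLE student (id INT PRIMARY KEY REFERENCES person(id), university VARCHAR(50))",
        "CREATE TABLE employee(id INT PRIMARY KEY REFERENCES person(id), salary DECIMAL)",
    ],
    # 2.2 One inheritance tree - one table: a single shared table with a type
    # discriminator; fields of other subclasses hold NULL.
    "one_tree_one_table": [
        "CREATE TABLE person (id INT PRIMARY KEY, type VARCHAR(20), "
        "name VARCHAR(50), university VARCHAR(50), salary DECIMAL)",
    ],
    # 2.3 One inheritance path - one table: one table per concrete class, with
    # inherited attributes (here: name) duplicated in each table.
    "one_path_one_table": [
        "CREATE TABLE student (id INT PRIMARY KEY, name VARCHAR(50), university VARCHAR(50))",
        "CREATE TABLE employee(id INT PRIMARY KEY, name VARCHAR(50), salary DECIMAL)",
    ],
}

for strategy, statements in ddl.items():
    print(strategy, "->", len(statements), "table(s)")
```

The trade-offs described above are visible directly in the table counts: one table per class, one table overall, or one table per concrete class with duplicated inherited fields.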
3 Semantics of Mapping Strategies for Individual Inheritance Relationships

In practice, current O/R middleware products such as Hibernate [6] support the mixing of different inheritance mapping strategies for one inheritance hierarchy, thereby providing a finer level of granularity in the mapping strategy selection than the basic mapping strategies described in Section 2. However, the opportunity to mix inheritance mapping strategies means that the developer must decide what mix of mapping strategies will be optimal for a given object model structure. This requirement strengthens the likely usefulness of a set of mapping metrics that reflect this finer granularity. Hence, the ability to use such a suite of metrics to inform this decision, as proposed by this paper, should ease the developer’s effort and result in a better-quality decision.

Before discussing the development of these metrics, a method of clearly representing the semantics of individual inheritance mapping strategies is needed. The following notation will be used and follows the inheritance mapping model definitions suggested by [4]. In the following definitions, P denotes the superclass (parent class) at the superclass-end of an example inheritance relationship, while C denotes the subclass (child class) at the subclass-end of this inheritance relationship.

Union superclass: If UC denotes the set of classes that are reachable from C (including C) via ‘union superclass’ inheritance relationships, the attributes defined by all classes in UC are mapped to a table corresponding to the most general class in UC. Moreover, a type indicator field is needed in this table in order to determine the class type of each row.

Joined subclass: The attributes defined by C, as well as the primary key attributes inherited from P or another superclass, are mapped to a table of its own, T. The primary key fields of T carry a foreign key constraint referencing the same primary key fields in the table to which P is mapped.
Towards a Metrics Suite for Object-Relational Mappings
47
Union subclass: If C is abstract and all inheritance relationships to its direct subclasses are mapped with ‘union subclass’ (or C does not have any subclasses), then no corresponding table is created for C. Otherwise, the attributes of C and the attributes of all superclasses of C are mapped to a table of its own.

Figure 1, adapted from an example in [4], shows a sample mapping that represents the above definitions. In it, white boxes denote classes and grey boxes denote tables. The mapping from classes to tables is indicated with black arrows, and inheritance relationships (white arrows) are labeled with the mapping strategies applied to them. Having introduced the semantics of mapping strategies for individual inheritance relationships, in the next section we describe the derivation of measurement goals relevant to the assessment of the impact of these mapping strategies.

[Figure 1: the class Person (id, name) has subclasses Student (university) and Employee (salary, department); Employee has subclasses Manager (bonus) and Clerk (occupation). Student and Manager are mapped via ‘union subclass’ to the tables student (id, name, university) and manager (id, name, salary, department, bonus); Employee is mapped via ‘union superclass’ to the shared table person (id, name, salary, department, type); Clerk is mapped via ‘joined subclass’ to the table clerk (id, occupation), whose primary key is a foreign key to person.]

Fig. 1. Example of mixed inheritance mapping strategies
4 Measurement Goals

Bearing in mind that different mapping strategies have different impacts in terms of the non-functional characteristics of applications, our intent in this section is to identify measures that would be useful in guiding the developer’s choice of mapping. In order to identify appropriate metrics, we follow a simplified variant of the commonly employed Goal/Question/Metric (GQM) approach [2, 3]. The GQM approach defines a framework for identifying metrics by defining goals, asking questions related to how these goals could be achieved, and defining metrics that are intended to answer the posed questions. In the first step of identifying relevant measurement goals, we consider software quality (non-functional) characteristics that are influenced by O/R mappings. The result of our top-down approach to identifying software quality characteristics for O/R mappings is depicted in Figure 2.
[Figure 2: the ISO/IEC 9126 quality characteristics (e.g. efficiency, with the sub-characteristics time behaviour and resource utilisation, the latter covering secondary storage) are refined into O/R-mapping-specific characteristics: polymorphic queries, non-polymorphic queries, DB inserts and updates, additional null values, redundancy, and change propagation, among others.]

Fig. 2. Quality characteristics of O/R mappings

Table 1. Description of quality characteristics

Time behavior of polymorphic database queries: mainly depends on the number of tables that need to be accessed / the number of joins that need to be performed.
Time behavior of non-polymorphic database queries: mainly depends on the number of tables that need to be accessed / the number of joins that need to be performed.
Time behavior of database inserts and updates: higher redundancy can improve the time behavior of polymorphic database queries but can negatively affect the time behavior of database inserts and updates [9].
Additional null values: null values that solely result from the applied mapping strategy.
Redundancy: the degree of redundancy that is caused by the applied mapping strategy.
Change propagation: extent to which it is necessary to adapt the relational schema and the O/R mappings to changes in the object model.
Change isolation: depends on whether existing tables need to be modified for adding/deleting classes.
Constraint assurance: depends on the ability of the relational schema to enforce integrity constraints.
Mapping uniformity: uniformity of applied mapping strategies; refers to the overall mapping and is not applicable to individual mapping strategies.
Schema correspondence: extent to which the object model resembles the relational schema.
Query complexity: effort required to formulate ad-hoc queries.
This network of quality characteristics subsumes and extends the classifications of O/R mapping impacts suggested by [9, 11]. It starts with the quality characteristics efficiency, maintainability, and usability, which are a subset of the software quality characteristics defined by the ISO/IEC 9126-1 standard [7]. The other high-level software quality characteristics of the ISO/IEC 9126-1 standard – functionality, reliability, and portability – are not considered to be significantly influenced by O/R mappings, and so are not included here. Efficiency, maintainability, and usability are then split into quality sub-characteristics, also defined by the ISO/IEC 9126-1 standard. Finally, these sub-characteristics are refined into specific quality characteristics relevant to O/R mappings. In Section 5, we propose metrics to measure these specific quality characteristics. In Table 1, the specific quality characteristics are listed and explained. This builds on the works of Keller [9] and Philippi [11] by considering additional characteristics, namely Redundancy, Change isolation, Constraint assurance, and Mapping uniformity, and we further extend their work by proposing metrics for some of these. The next section describes and justifies the metrics developed for these quality characteristics and provides examples of their use.
5 Metrics for Inheritance Mappings

Metrics have been suggested for object-oriented design [5] and for relational database systems [12], as well as for object-relational database systems [1]. These metrics, while useful, are considered insufficient for measuring the impacts of O/R mappings for two reasons. First, the available metrics for relational schemas do not sufficiently cover the suggested network of quality characteristics (see Section 4). Second, the metrics focus on either object-oriented design or relational schemas but not on the mapping between them. Therefore, the metrics suite suggested here, comprising four metrics at the level of individual classes, is complementary to those described elsewhere, in that it explicitly addresses the measurement of O/R mappings.

5.1 Table Accesses for Type Identification (TATI)

Polymorphic queries against a class C return objects whose most specific class is C or one of its subclasses. Before an object can be completely retrieved, it is necessary to identify the most specific class of this object. It should be noted that identifying the most specific class is equivalent to identifying the tables that need to be queried in order to retrieve the requested object. In contrast, for non-polymorphic queries, the most specific class of the requested object is the same class against which the query is issued.

There are two strategies to identify the most specific class: either each possible table is queried individually and the search is stopped as soon as the most specific class is identified, or all possible tables are queried with one query. While the former strategy means that the search is completed as soon as the most specific class is found, the latter strategy allows the database system to query the tables in parallel. The latter strategy can be accomplished by sending multiple queries to the database at the same time or by using the SQL UNION clause.
The maximum number of table accesses measured with this metric is therefore only relevant if the former strategy is applied.
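As an illustration, the two identification strategies can be sketched as query generators. This is our own sketch; the table names are hypothetical, borrowed from the running example:

```python
# Sketch (hypothetical table/column names): the two strategies for locating the
# most specific class of an object with key `oid` across candidate tables.
tables = ["person", "student", "manager", "clerk"]

def union_query(oid):
    # Strategy 2: probe all candidate tables in a single UNION query, which the
    # database system may evaluate in parallel.
    parts = [f"SELECT '{t}' AS tbl FROM {t} WHERE id = {oid}" for t in tables]
    return "\nUNION\n".join(parts)

def sequential_probes(oid):
    # Strategy 1: one query per table; the caller stops at the first hit, so
    # the number of probes actually issued varies up to this maximum.
    return [f"SELECT 1 FROM {t} WHERE id = {oid}" for t in tables]

print(len(sequential_probes(42)))  # worst case: every candidate table probed
```

The worst-case length of the sequential probe list is exactly the quantity the TATI metric below captures.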
Definition: For queries that are issued against a class C, TATI(C) equals the maximum number of tables that have to be accessed in order to identify the most specific class of the requested object. The maximum number of tables that need to be accessed for a query issued against a class C equals the number of different tables that correspond to C and all of its subclasses.

In Figure 1, TATI(Person) = 4 since Person is the root class of the inheritance hierarchy and the inheritance hierarchy is mapped to 4 tables altogether. Similarly, TATI(Employee) = 3 since Employee and its subclasses Manager and Clerk are mapped to the 3 tables person, manager, and clerk.

5.2 Number of Corresponding Tables (NCT)

In contrast to polymorphic queries, the most specific class of a queried object in a non-polymorphic query is known. Therefore, no queries are necessary in this case to determine the most specific class. The performance of non-polymorphic queries therefore mainly depends on the number of tables that contain data of the requested object. For polymorphic queries, it is possible to retrieve the complete object data while querying tables in order to identify the most specific class of the requested object.

We therefore propose the metric NCT(C), which equals the number of tables that contain data from instances of a class C. This number depends on the inheritance mapping strategies that are used for the inheritance relationships on the path from class C to the root class of the inheritance hierarchy. In particular, the application of the ‘joined subclass’ strategy results in increasing values of NCT. As already indicated, this metric is a measure of object retrieval performance. In addition, this metric is a measure of query complexity in the context of ad-hoc queries, since we consider the number of involved tables to be an appropriate measure of the user’s effort in formulating a query.
However, using the number of tables to measure ad-hoc queries assumes that no views are employed to ease query formulation.

Definition: NCT is formally defined by equation (1), where the function p(C) returns the direct superclass of C.

NCT(C) = 1               if C is the root class or (p(C), C) is mapped via ‘union subclass’;
NCT(C) = NCT(p(C))       if C is mapped to the same table as its superclass;
NCT(C) = NCT(p(C)) + 1   if C is mapped to a table of its own.   (1)
In Figure 1, NCT(Clerk) = 2 because the tables clerk and person contain data that are necessary to assemble objects of Clerk. In contrast, NCT(Manager) = 1, as all defined and inherited attributes of the class Manager are mapped to relational fields of the table manager.

5.3 Number of Corresponding Relational Fields (NCRF)

The Number of Corresponding Relational Fields (NCRF) gives a measure of the degree of change propagation for a given O/R mapping. More specifically, this metric reflects the effort required to adapt the relational schema after inserting, modifying, or deleting a class attribute. This effort is mainly influenced by the application of the ‘union subclass’
mapping strategy, since applying this mapping strategy typically results in the duplication of relational fields (see Section 2.3). Because of these duplications, changes in the object model result in multiple changes to the relational schema. In contrast, the duplication of relational fields does not occur when the ‘joined subclass’ or the ‘union superclass’ mapping strategies are applied. (Note: primary key fields are not considered by this metric because they should be resistant to changes.)

Definition: For a class C, NCRF(C) equals the number of relational fields, across all tables, that correspond to each non-inherited non-key attribute of C. If C does not have any non-inherited non-key class attributes, NCRF(C) equals the number of relational fields to which each non-inherited non-key class attribute of C would be mapped.

In Figure 1, NCRF(Person) = 3 since the class attribute Person.name is mapped to the three relational fields person.name, student.name, and manager.name. NCRF(Employee) = 2 since each of the attributes Employee.salary and Employee.department is mapped to one relational field in the table person and one relational field in the table manager. Finally, NCRF(Student) = NCRF(Manager) = NCRF(Clerk) = 1 since the only non-inherited class attribute of each of these three classes (Student.university, Manager.bonus, Clerk.occupation) is mapped to 1 relational field.

5.4 Additional Null Values (ANV)

ANV measures additional storage space in terms of null values that result when different classes are stored together in the same table using the ‘union superclass’ mapping strategy (see Section 2.2). For a definition of ANV(C), the following is considered. Let AC be the set of non-inherited attributes of class C and let FC be the set of corresponding relational fields in the shared table. Applied to a particular class C, the aim of ANV(C) is to give a measure of the number of null values that occur at the relational fields FC.
An important observation is that null values at the relational fields FC occur if and only if instances of classes different from C and different from subclasses of C are stored in the shared table. More precisely, if instances of two distinct classes B and C are stored together in a shared table and B is not a subclass of C, then each row in the shared table that represents an instance of B contains a null value at each relational field in FC. ANV(C) therefore increases with the number of classes, other than C and its subclasses, that are mapped to the same table as C.

The number of null values depends on the number of instances of each class, something that may be unknown at the stage of mapping specification. In order to give an approximation of additional null values, it is assumed that there is the same number of instances for all (concrete) classes. Furthermore, ANV is normalized by assuming that there is only one instance of each class. This assumption also ensures that ANV only depends on a given object model and is thus in line with the previously described metrics. Note, however, that this metric could easily be generalized to take the number of instances per class into account if this is known.

Definition: ANV(C) equals the number of non-inherited attributes in C multiplied by the number of concrete classes, excluding C and all of its subclasses, that are mapped to the same table as C.
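The structural metrics defined so far can be computed mechanically. The following sketch is our own illustration, not tooling from the paper: it encodes the Figure 1 mapping, and the propagation rule in `ncrf` is an assumption we derived from the strategy semantics in Sections 2 and 3 (inherited fields are duplicated through ‘union subclass’ tables, carried through shared ‘union superclass’ tables, and stopped by ‘joined subclass’ tables). The printed values match those quoted in the text:

```python
# Figure 1 mapping: class -> (parent, strategy of the inheritance
# relationship from the parent, corresponding table).
classes = {
    "Person":   (None,       None,               "person"),
    "Student":  ("Person",   "union subclass",   "student"),
    "Employee": ("Person",   "union superclass", "person"),
    "Manager":  ("Employee", "union subclass",   "manager"),
    "Clerk":    ("Employee", "joined subclass",  "clerk"),
}

def children(c):
    return [k for k, (p, _, _) in classes.items() if p == c]

def descendants(c):
    out = []
    for s in children(c):
        out += [s] + descendants(s)
    return out

def tati(c):
    """Distinct tables corresponding to C and all of its subclasses (5.1)."""
    return len({classes[d][2] for d in [c] + descendants(c)})

def nct(c):
    """Tables holding data of one instance of C (5.2, equation (1))."""
    parent, strategy, _ = classes[c]
    if parent is None or strategy == "union subclass":
        return 1
    if strategy == "union superclass":  # shares the superclass's table
        return nct(parent)
    return nct(parent) + 1              # joined subclass: a table of its own

def ncrf(c):
    """Tables containing C's non-inherited non-key attributes (5.3)."""
    tables = {classes[c][2]}
    def walk(d):
        for s in children(d):
            strat, table = classes[s][1], classes[s][2]
            if strat == "union subclass":      # duplicates inherited fields
                tables.add(table); walk(s)
            elif strat == "union superclass":  # same table; keep descending
                walk(s)
            # 'joined subclass' stores no inherited non-key fields: stop
    walk(c)
    return len(tables)

print(tati("Person"), tati("Employee"))  # 4 3
print(nct("Clerk"), nct("Manager"))      # 2 1
print(ncrf("Person"), ncrf("Employee"))  # 3 2
```

Reproducing the quoted values (TATI(Person) = 4, NCT(Clerk) = 2, NCRF(Person) = 3, and so on) is a useful sanity check on the encoded semantics.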
[Figure 3: the abstract class Person (id, name) has subclasses Student (university, majorSubject, isEnrolled) and Employee (salary); Employee has subclasses Manager (bonus) and Clerk (occupation). Student, Employee, and Clerk are mapped via ‘union superclass’ to the shared table person (id, name, university, majorSubject, isEnrolled, salary, occupation, type); Manager is mapped via ‘joined subclass’ to the table manager (id, bonus).]

Fig. 3. Example mapping to illustrate the ANV metric
For the example mapping shown in Figure 3, ANV(Student) is calculated as follows. The concrete classes that are mapped to the same table as Student are Employee and Clerk. These two classes do not inherit the attributes declared by Student; therefore, additional null values are contained in the rows that correspond to instances of Employee and Clerk. If one instance of each class were stored, in total there would be 2·3 = 6 null values in the rows of the table person at the fields university, majorSubject, and isEnrolled. Thus, ANV(Student) = 6.
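This calculation can be sketched directly. The encoding below is our own normalized illustration of the Figure 3 mapping, assuming one instance per concrete class as in the definition:

```python
# Sketch of ANV (Section 5.4) for the Figure 3 mapping, normalized to one
# instance per concrete class.
own_attrs = {  # non-inherited, non-key attributes per class
    "Person": ["name"],
    "Student": ["university", "majorSubject", "isEnrolled"],
    "Employee": ["salary"], "Manager": ["bonus"], "Clerk": ["occupation"],
}
table_of = {"Person": "person", "Student": "person", "Employee": "person",
            "Manager": "manager", "Clerk": "person"}
abstract = {"Person"}
subclasses = {"Person": {"Student", "Employee", "Manager", "Clerk"},
              "Student": set(), "Employee": {"Manager", "Clerk"},
              "Manager": set(), "Clerk": set()}

def anv(c):
    # Concrete classes sharing C's table, excluding C and its subclasses.
    others = [d for d in table_of
              if d != c and d not in subclasses[c] and d not in abstract
              and table_of[d] == table_of[c]]
    return len(own_attrs[c]) * len(others)

print(anv("Student"))  # 6: three attributes times two other concrete classes
```

Manager, stored in a table of its own, correctly yields ANV(Manager) = 0, since no foreign rows share its table.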
6 Metric Coverage

Table 2 shows the coverage of the defined quality characteristics by the proposed metrics. This table shows that more metrics are needed in order to more fully measure the relevant quality characteristics of mapping strategies.

Table 2. Metric coverage for inheritance mapping strategies

Quality characteristic     TATI  NCT  NCRF  ANV
Polymorphic queries         X     X
Non-polymorphic queries           X
DB inserts and updates                  X
Additional null values                        X
Change propagation                      X
Change isolation
Constraint assurance
Mapping uniformity
Schema correspondence
Query complexity                  X
The quality characteristic Redundancy is not included in the table since this characteristic is only applicable to association mapping strategies and not to inheritance mapping strategies [10]. It should be noted that although the ‘one inheritance path – one table’ mapping strategy leads to a duplication of relational fields, it does not imply redundancy.
7 Conclusions and Further Research

The application of metrics in O/R mapping tools provides significant potential for supporting the developer in the difficult task of mapping specification. Developers would benefit from a facility that supports the manual specification of O/R mappings by giving feedback on the impacts of the chosen mapping strategies. Since the impacts of mapping strategies strongly depend on the concrete characteristics of the object model, metrics are considered an appropriate means to convey these schema characteristics. Moreover, the adoption of O/R metrics in the model-driven generation of O/R mappings should enable a more appropriate selection of mapping strategies that leads to better fulfillment of non-functional requirements.

As mapping impacts differ particularly for inheritance mappings, we have focused on developing a set of metrics for inheritance mapping strategies. This metrics suite is based on a novel inheritance mapping model that supports the mixing of inheritance mapping strategies in inheritance hierarchies. We plan to empirically evaluate the suggested metrics in terms of their utility in giving feedback about the impacts of mapping strategies to developers. We will extend this further by investigating algorithms that automatically determine the appropriate mapping strategy based on the requirements of the developer. To achieve this goal, normalization of the metrics will be required to ensure that metric values can be weighted and compared appropriately. Finally, as object-relational database management systems (ORDBMSs) become increasingly prevalent, support for mapping from object-oriented programming languages to ORDBMS schemas also becomes more important. We will therefore further investigate how mapping specification for ORDBMS schemas can be supported by the application of metrics.
References

1. Baroni, A.L., Calero, C., Piattini, M., Abreu, F.B.: A Formal Definition for Object-Relational Database Metrics. In: 7th International Conference on Enterprise Information Systems (ICEIS 2005), Miami, pp. 334–339 (2005)
2. Basili, V.R., Caldiera, G., Rombach, H.D.: The Goal Question Metric Approach. Encyclopedia of Software Engineering 1, 528–532 (1994)
3. Basili, V.R., Rombach, H.D.: The TAME Project: Towards Improvement-Oriented Software Environments. IEEE Transactions on Software Engineering 14(6), 758–773 (1988)
4. Cabibbo, L., Carosi, A.: Managing Inheritance Hierarchies in Object/Relational Mapping Tools. In: Pastor, Ó., Falcão e Cunha, J. (eds.) CAiSE 2005. LNCS, vol. 3520, pp. 135–150. Springer, Heidelberg (2005)
5. Chidamber, S.R., Kemerer, C.F.: A Metrics Suite for Object Oriented Design. IEEE Transactions on Software Engineering 20(6), 476–493 (1994)
6. Hibernate Open Source Software, http://www.hibernate.org
7. ISO/IEC 9126-1:2001 Software Engineering – Product Quality – Part 1: Quality Model, http://www.iso.org
8. Keller, A.M., Jensen, R., Keene, C.: Persistence Software: Bridging Object-Oriented Programming and Relational Databases. In: Proc. of the 1993 ACM SIGMOD Int. Conf. on Management of Data, Washington, D.C., pp. 523–528 (1993)
9. Keller, W.: Mapping Objects to Tables: A Pattern Language. In: Proc. of the 1997 European Conf. on Pattern Languages of Programming (EuroPLoP 1997), Irsee, Germany (1997)
10. Oertly, F., Schiller, G.: Evolutionary Database Design. In: 5th International Conference on Data Engineering (ICDE), Los Angeles, pp. 618–624 (1989)
11. Philippi, S.: Model Driven Generation and Testing of Object-Relational Mappings. Journal of Systems and Software 77(2), 193–207 (2005)
12. Piattini, M., Calero, C., Genero, M.: Table Oriented Metrics for Relational Databases. Software Quality Journal 9(2), 79–97 (2001)
13. Rumbaugh, J., Blaha, M.R., Premerlani, W., Eddy, F., Lorensen, W.: Object-Oriented Modelling and Design. Prentice-Hall, Englewood Cliffs (1991)
View-Based Integration of Process-Driven SOA Models at Various Abstraction Levels

Huy Tran, Uwe Zdun, and Schahram Dustdar

Distributed Systems Group, Institute of Information Systems
Vienna University of Technology, Austria
{htran,zdun,dustdar}@infosys.tuwien.ac.at
Abstract. SOA is an emerging architectural style for achieving loose coupling and high interoperability of software components and systems by using message exchanges via standard public interfaces. In SOAs, software components are exposed as services and typically coordinated by using processes that enable service invocations from corresponding activities. These processes are described in high-level or low-level modeling languages. The extreme divergence in terms of syntax, semantics, and levels of abstraction of existing process modeling languages hinders the interoperability and reusability of software components or systems built upon, or relying on, such models. In this paper we present a novel approach that provides an automated integration of modeling languages at different abstraction levels using the concept of architectural views. Our approach is realized as a view-based reverse engineering tool-chain in which process descriptions are mapped onto appropriate high-level or low-level views, offered by a view-based modeling framework.
1 Introduction
In a Service-Oriented Architecture (SOA), software components are often exposed as services that have standard interfaces and can be invoked by message exchanges. A number of relevant services can be coordinated to achieve a specific business functionality. The integration and interoperability of the software components or systems are accomplished by orchestrating them using a process, which is deployed in a process engine. Each process typically consists of a control flow, a number of service invocations, and other activities for data processing, fault and transaction handling, and so on. Processes are often developed using modeling languages such as EPC [4, 13], BPMN [9], UML Activity Diagram extensions [8], BPEL [7], or XPDL [15]. Business analysts usually design processes in high-abstraction languages, such as BPMN, EPC, or UML Activity Diagrams, and developers implement them using executable languages, such as BPEL/WSDL.

An important issue that hinders the interoperability and the reusability of existing process models is the huge divergence of these modeling languages. This issue occurs because there is no explicit link between two modeling languages at the same or different abstraction levels. For instance, developers cannot reuse or integrate the

R.-D. Kutsche and N. Milanovic (Eds.): MBSDI 2008, CCIS 8, pp. 55–66, 2008.
© Springer-Verlag Berlin Heidelberg 2008
whole or part of a process described using BPEL in another process developed using BPMN or EPC, and vice versa. The most popular solution for this issue is to define direct transformations between the different process modeling languages [11, 6, 16, 5]. These approaches, even though they partially solve the problem, pose a serious limitation regarding extensibility. First, they focus on one concern, the control flow of the process models, and ignore other crucial concerns, such as collaborations, data processing, fault handling, and so on. Second, each of these transformation approaches only provides the integration of two specific kinds of process models; it provides neither interoperability with process models realized in languages other than those two nor the reusability of these models.

To overcome this issue and the limitations of transformation-based approaches, we present a novel integration approach for enhancing the interoperability and reusability of process-driven SOA models. Our approach exploits the concept of architectural views during bottom-up analysis, and then maps the process descriptions onto relevant high-level or low-level views. Using a view-based modeling framework (VbMF), originally introduced in our previous work [12], we can provide interoperability and reusability between these views, or in other words, between different process modeling languages.

In this paper, we first present an overview of VbMF in Section 2. Section 3 describes the model integration approach we propose in this paper in terms of our view-based reverse engineering tool-chain. Section 4 presents an empirical analysis of this approach, applying it to process descriptions in popular modeling languages, namely BPEL and WSDL, via a simple but realistic example. Finally, we compare our approach to related work in Section 5 and conclude.
2 Overview of View-based Modeling Framework (VbMF)
Figure 1(a) gives an overview of the concepts in VbMF. A view is defined according to a corresponding meta-model, which is a (semi-)formalized representation of a particular process concern. At the heart of VbMF is the Core meta-model (see Figure 1(b)) from which each view meta-model is derived. Example meta-models that we have derived from the Core include the following views [12]: Collaboration (see Figure 2(b)), Control Flow (see Figure 2(a)), and Information. Based on the meta-models, we derive particular views. For specific technologies, for instance, BPEL and WSDL, we provide extension meta-models which comprise additional details required to depict the specifics of these technologies. Figure 3 depicts the BPEL-specific extension of the collaboration view in Figure 2(b). Hence, we can use the distinction of Core meta-model, generic view meta-models, and extension view meta-models to handle different abstraction levels, including business-level concerns and technical concerns.

Fig. 2. The control flow view and collaboration view meta-models

The main task of the Core meta-model (shown in Figure 1(b)) is to provide integration points for the various meta-models defined as parts of VbMF, as well as extension points for enabling the extension with views for other concerns or more specific view meta-models, such as those for specific technologies. Consider the control flow view meta-model in Figure 2(a) as an example: a control flow view comprises many activities and control structures. The activities are process tasks, such as service invocations or data handling, while control structures describe the execution order of the activities. The control flow view meta-model is defined by extending the View and ExtensibleElement meta-classes from Core. All other view meta-models are essentially defined in the same way: by extending the model elements provided as integration and extension points in the Core meta-model. The more specific extension view meta-models are also defined in the same way: they extend the elements of their root meta-models and of the Core meta-model. For instance, a BPEL-specific control flow meta-model with
[Figure 3: the BPEL-specific collaboration view extension adds, among others, a CorrelationSet element and typed Variables (with in/out roles) to the Interaction element of the collaboration view.]

Fig. 3. BPEL extension of the collaboration view meta-model
[Figure 4: view-based interpreters read process descriptions (BPEL, WSDL, etc.) and produce high-level and low-level views that conform to the framework meta-models. The high-level views correspond to high-level languages and the low-level views to low-level languages, providing a ‘virtual’ integration and refinement relationship between the representations in the various languages.]

Fig. 4. Integration of various modeling languages using view-based reverse engineering
extra elements, such as while, wait, terminate, etc., can be defined by extending the elements from Figures 1(b) and 2(a).

In our implementation of these concepts, we exploit the model-driven software development (MDSD) paradigm [14] to separate the platform-neutral views from the platform-specific views. Platform-specific models or executable code, for instance, Java code or BPEL and WSDL descriptions, can be generated from the views by using model-to-code transformations. Our prototype is realized using openArchitectureWare (oAW) [10], a model-driven software development tool, and the Eclipse Modeling Framework [3]. These tools allow stakeholders of a process to view only a specific perspective, by examining a single view, or to analyze any combination of views (i.e., to produce an integrated view).
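As a rough illustration of such a model-to-code transformation, a minimal control-flow view can be rendered into a BPEL-like skeleton. This is our own sketch: the view structure, process name, and activity names are hypothetical, and the generator is a plain string template rather than the actual VbMF or oAW machinery:

```python
# Sketch: rendering a minimal control-flow view into a BPEL-like skeleton.
# The dictionary structure stands in for a platform-neutral view model.
view = {
    "process": "OrderProcess",
    "flow": [("receive", "receiveOrder"), ("invoke", "checkCredit"),
             ("invoke", "shipOrder"), ("reply", "confirmOrder")],
}

def to_bpel(view):
    # Emit one element per activity, wrapped in a <sequence>.
    body = "\n".join(f'    <{kind} name="{name}"/>' for kind, name in view["flow"])
    return (f'<process name="{view["process"]}">\n'
            f'  <sequence>\n{body}\n  </sequence>\n'
            f'</process>')

print(to_bpel(view))
```

Real model-to-code transformations in oAW work from typed models and template languages, but the shape of the mapping, from view elements to target-language constructs, is the same.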
3
View-Based Model Integration
In this section, we present a view-based reverse engineering approach for addressing the divergence issue of modeling languages and for overcoming the limitations of transformation-based solutions. The ultimate goal of this approach is
to map the process descriptions onto high-level or low-level views, appropriate for different stakeholders (see Figure 4). To demonstrate our approach, we have exemplified it using the combination of BPEL and WSDL, which are possibly the most popular process and service modeling descriptions used by numerous companies today. The same approach can be taken for any other process-driven SOA technologies. In addition, due to space limitations, we only present the extraction and the integration of two basic views in VbMF: the control flow view and the collaboration view. Other views, for instance, the Information View, the Transaction View, etc., can be integrated using the same approach. The tool-chain consists of a number of view-based interpreters, such as the control flow interpreter, the information view interpreter, the collaboration view interpreter, and so on. Each interpreter is responsible for extracting one view from the process descriptions. Therefore, an interpreter for a certain view must be defined based on the meta-model to which that view conforms. For instance, the control flow view consists of elements such as Activity, Flow, Sequence, Switch, etc. (see also Figure 2(a)). To extract the control flow view from process descriptions, the interpreter walks through the input descriptions, picking only these elements and ignoring the others. As the modeling framework grows with additional views, the reverse engineering tool-chain can be scaled to fit the growing framework. To add a new view to the framework, an adequate meta-model of this view has to be specified using extension mechanisms [12]. Next, we develop an interpreter based on the new meta-model specification and hook it into the tool-chain.
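The walk-and-pick behaviour of such an interpreter can be sketched as follows; the element set and the parsing with Python's standard library are illustrative assumptions, not the oAW-based implementation:

```python
# Minimal sketch of a control-flow view interpreter: walk a BPEL document,
# pick only control-flow elements, ignore everything else (e.g. partnerLinks).
import xml.etree.ElementTree as ET

# BPEL constructs that belong to the control flow view (assumed element set).
CONTROL_FLOW = {"sequence", "flow", "switch", "case", "otherwise",
                "invoke", "receive", "reply", "assign"}

def extract_control_flow(bpel_source):
    root = ET.fromstring(bpel_source)
    picked = []
    for elem in root.iter():
        tag = elem.tag.split("}")[-1]   # drop an XML namespace, if any
        if tag in CONTROL_FLOW:
            picked.append(tag)
    return picked

bpel = """<process name="Shopping">
  <partnerLinks><partnerLink name="Banking"/></partnerLinks>
  <sequence>
    <receive name="ReceiveOrder"/>
    <invoke name="VerifyCreditCard"/>
    <switch><case><invoke name="Ship"/></case></switch>
  </sequence>
</process>"""
```

The interpreter visits elements in document order, so the extracted list preserves the nesting order of the structured activities.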
4
Empirical Analysis of Model Integration
In this section, we exemplify our approach for the combination of the process and service modeling languages BPEL and WSDL. However, the concepts are not specific to BPEL/WSDL, but applicable in the same way to other process-driven modeling languages. To demonstrate the realization of the aforementioned concepts, we present a simple but realistic case study, namely, a Shopping process, which depicts a typical e-commerce scenario. The Shopping process is represented in BPEL and WSDL. The process starts when the customer's purchase order arrives together with their billing information (e.g., credit card). Then, the process invokes a Banking service to validate the customer's credit card through the VerifyCreditCard activity. If the validation is successful, the customer will obtain the purchased items, which are shipped by a delegated Shipping service. After that, the process will send the order invoice to the customer. Otherwise, the order will be canceled, as a negative confirmation is received from the Banking service. An excerpt from the BPEL/WSDL code is shown on the left-hand side of Figures 5 and 6.

4.1
Control Flow View Mapping
According to the specification of the control flow view meta-model (see Figure 2(a)), a control flow view consists of elements which represent the control
<process name="Shopping">
  <partnerLinks>
    <partnerLink name="Seller" partnerLinkType="Seller" myRole="Seller"/>
    <partnerLink name="Approver" partnerRole="Approver" partnerLinkType="Approver"/>
    <partnerLink name="Payer" partnerRole="Payer" partnerLinkType="Payer"/>
    <partnerLink name="Shipping" partnerRole="Shipper" partnerLinkType="Shipping"/>
  </partnerLinks>
  <sequence>
    <switch>
      <sequence>
Fig. 5. The mapping of the Shopping process descriptions in BPEL to the collaboration view (top-right) and the control flow view (bottom-right)

Table 1. The basic mapping from BPEL onto the control flow view

BPEL element                               | Control flow view element
invoke, receive, reply, assign name="..."  | controlflow::SimpleActivity / setName()
sequence name="..."                        | controlflow::Sequence / setName()
flow name="..."                            | controlflow::Flow / setName()
switch name="..."                          | controlflow::Switch / setName()
case name="..." condition="..."            | controlflow::Case / setName(), setCondition()
otherwise name="..."                       | controlflow::Otherwise / setName()
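Table 1 amounts to a simple lookup from BPEL constructs to view meta-classes; the following sketch encodes it in Python, with the view classes referenced by name only (an assumption for illustration, not the VbMF API):

```python
# Table 1 as a lookup: each BPEL control-flow construct maps to a view
# meta-class whose name attribute is set from the BPEL name attribute.
TABLE_1 = {
    "invoke": "controlflow::SimpleActivity",
    "receive": "controlflow::SimpleActivity",
    "reply": "controlflow::SimpleActivity",
    "assign": "controlflow::SimpleActivity",
    "sequence": "controlflow::Sequence",
    "flow": "controlflow::Flow",
    "switch": "controlflow::Switch",
    "case": "controlflow::Case",
    "otherwise": "controlflow::Otherwise",
}

def map_element(bpel_tag, name, condition=None):
    """Create a view element (plain dict, for illustration) for one
    BPEL element according to Table 1."""
    element = {"class": TABLE_1[bpel_tag], "name": name}
    if bpel_tag == "case":          # case additionally carries a condition
        element["condition"] = condition
    return element
```

Because the mapping is a pure lookup, adding support for further BPEL constructs means extending the table rather than changing the interpreter logic.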
flow of a business process. The hierarchy and the execution order of the control flow are defined using basic structured activities. These structured activities include: sequence, for defining a sequential execution order; flow, for concurrent executions; and switch-case-otherwise, for conditional branches. Structured activities can be nested and combined in arbitrary manners to represent various complex control flows in BPEL processes. In addition, BPEL also has primitive activities, such as invoke, a service invocation; receive, waiting for a message
from partners; reply, sending back a response to a certain partner; and assign, assigning values to BPEL variables. The interpreter walks through the process description in BPEL and collects the information on atomic and structured activities. Then, it creates the corresponding elements in the view and assigns relevant values (e.g., the name attribute) to their attributes. We describe in Table 1 the mapping specification between BPEL and the control flow view elements/attributes. The BPEL mapping is illustrated in Figure 5.

4.2
Collaboration View Mapping
The collaboration view interpreter is realized using the same approach as the control flow interpreter. However, the collaboration view comprises not only elements from BPEL but also from WSDL. Hence, first of all, the interpreter has to collect all service interface, message, role, and partnerLinkType descriptions from WSDL. Then, the interpreter creates the relevant elements in the collaboration view according to the mapping rules given in Table 2. Figure 6 depicts the mapping of the Shopping process in WSDL to the corresponding elements in the collaboration view. Next, the interpreter walks through the BPEL code to extract collaborative elements in a similar manner. The basic activities, namely, invoke, receive, and reply, appear on the collaboration view with the same names as in the control flow view. However, these activities contain additional collaborative attributes, as depicted in Table 3. The BPEL mapping is illustrated in Figure 5.

Table 2. The basic mapping from WSDL onto collaboration view elements

WSDL element
definition
message name="..."
portType name="..."
operation name="..."
input, output name="..." message="..."
plnk::partnerLinkType name="..."
plnk::Role name="..."
service name="..."
Besides generating the collaboration view's elements, the interpreter uses the information collected in the former step to establish the necessary relationships between these elements. For instance, the relationship between collaboration::Interaction and collaboration::PartnerLink elements is derived from the association between the communication activities (e.g., invoke, receive, reply) and the partnerLink, and the relationship between collaboration::PartnerLink and collaboration::PartnerLinkType is derived from the association between the partnerLinkType elements in WSDL and the partnerLink elements in BPEL, and so on.
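This relationship derivation can be sketched as reading the partnerLink attribute off each communication activity; the code and the BPEL fragment are illustrative, not the actual interpreter:

```python
# Sketch of relationship derivation: the link between an Interaction and a
# PartnerLink in the collaboration view is read off the partnerLink
# attribute of the communication activities (invoke, receive, reply) in BPEL.
import xml.etree.ElementTree as ET

def derive_interactions(bpel_source):
    """Return a mapping: interaction (activity) name -> partnerLink name."""
    root = ET.fromstring(bpel_source)
    links = {}
    for elem in root.iter():
        tag = elem.tag.split("}")[-1]
        if tag in ("invoke", "receive", "reply"):
            links[elem.get("name")] = elem.get("partnerLink")
    return links

bpel = """<process name="Shopping">
  <sequence>
    <receive name="ReceiveOrder" partnerLink="Seller"/>
    <invoke name="VerifyCreditCard" partnerLink="Approver"/>
  </sequence>
</process>"""
```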
<definitions>
  <message name="PurchaseOrder"/>
  <message name="OrderResponse"/>
  <portType name="Shopping">
  <partnerLinkType name="Seller">
    <portType name="Shopping"/>
  <partnerLinkType name="Approver">
    <portType name="CreditCard"/>
  <partnerLinkType name="Payer">
    <portType name="CreditCard"/>
  <partnerLinkType name="Shipping">
    <portType name="Shipping"/>
Fig. 6. The mapping of the Shopping process descriptions in WSDL to the collaboration view
4.3
BPEL-extension Collaboration View Mapping
According to the collaboration view meta-model, we map the Shopping process onto a high-abstraction view as shown in Figures 6 and 5. This view is suitable for communication with business analysts, but it does not provide appropriate information for IT experts. As we mentioned in Section 2, the collaboration view can be refined into a lower-abstraction view, namely, the BPEL collaboration view (see Figure 3). To demonstrate the concept of view refinement, we present the mapping of the Shopping process in BPEL to the corresponding BPEL-specific collaboration view in Figure 7, according to the specification in Table 3. For the sake of readability, we omit the elements inherited from the collaboration view and depict an excerpt of the BPEL-extension collaboration view together with its additional features.

4.4
Discussion
In the previous empirical analysis, we illustrated the mapping of process descriptions onto high-level or low-level views. The high-level views in our approach are platform-independent models designed to capture abstract features of a process. Corresponding to these views are high-level modeling languages, such as EPC, BPMN or UML Activity Diagrams. In this paper, we illustrated a mapping of process descriptions into our framework's high-level view models, which reflect the concepts of the low-level models rather closely. This has a number of advantages. Firstly, it uses one kind of modeling approach for all types of views. Secondly,
Fig. 7. The mapping of the Shopping process descriptions in BPEL to the BPEL collaboration view
it avoids any semantic mismatch or transformation between modeling concepts. On the other hand, this approach has the disadvantage that existing modeling language code (say, realized in EPCs, BPMN or UML Activity Diagrams) would have to be mapped to our high-level models, which could be a considerable effort for huge existing process repositories. In general, however, this is possible and can even be largely automated, because our control flow view represents five basic patterns that exist in any modeling language, and the collaboration view describes generic interactions that typically occur between a process and its partners [12]. Hence, an adequate interpreter for a certain language can distill these views from the process descriptions in that language using our approach. Alternatively, our approach can of course also be extended with a respective new view model, such as an EPC or BPMN control flow view. A high-level view can be refined into a low-level view which captures the specifics of a particular technology. For example, the collaboration view is extended and refined into the BPEL collaboration view that embodies several BPEL-specific features. Using the same approach, we can define appropriate meta-models and interpreters for pulling out corresponding views from process implementations in low-level, executable languages such as BPEL. Using VbMF, the stakeholders can work on a particular view or can examine any combination of several views instead of manipulating various kinds of process descriptions or digging into the implementation code in executable languages. Our approach can help the stakeholders to quickly understand and grasp the information in (different) modeling languages as well as to re-use adequate views.
5
Related Work
In the software engineering area, reverse engineering is the process of analyzing a system to identify the system's components and their relationships and to create representations of the system in another form or at a higher level of abstraction [1, 2]. We devised a novel view-based reverse engineering approach that supports extracting relevant abstraction levels of process representations in terms of architectural views. Various modeling languages can be integrated into VbMF and manipulated or re-used in other process models.

Table 3. The basic mapping from BPEL descriptions onto the collaboration view

Existing modeling languages provide high-level abstractions, such as EPC [13, 4], BPMN [9], or UML Activity Diagram extensions [8], or low-level and executable descriptions, such as BPEL [7] or XPDL [15]. There are several efforts to transform process models described in one language into models represented in another language. For instance, Mendling et al. [6] present the mapping of
BPEL to EPCs; Ziemann et al. [16] report on an approach to model BPEL processes using EPC-based models; Recker et al. [11] translate between BPMN and BPEL; Mendling et al. [5] discuss X-to-BPEL and BPEL-to-Y transformations. These transformation-based approaches mostly focus on one concern of the process models, namely, the control flow. There is no support for handling or integrating other process concerns, such as service interactions, data processing, or transaction handling. Moreover, each of these approaches only provides the integration of a certain pair of process modeling languages, but does not offer the interoperability of process models in other languages, or the reusability of these models to develop other processes. Zou et al. [17] propose an approach for extracting business logic, in terms of workflows, from existing e-commerce applications. The analysis process is guided by documented workflows to identify the business logic. Then, the business logic is captured in terms of control flows using the concept of process algebra. This approach aims at providing high-level representations of processes and maintaining the relationships among different abstraction levels to quickly react to changes in business requirements. The approach only focuses on control flow and does not target other concerns. In addition, there is no support for the interoperability and the re-usability of different software components.
6
Conclusion
Interoperability and reusability suffer from the heterogeneous nature of the participants of a software system. SOA partially reconciles this heterogeneity by defining standard service interfaces as well as messaging mechanisms for communicating between services. Process-driven SOAs provide an efficient way of coordinating various services in terms of processes to accomplish a specific business goal. However, the huge divergence of process modeling languages raises a critical issue that deteriorates the interoperability and the reusability of software components or systems. Our approach, presented in this paper, exploits the concept of architectural views and a reverse engineering tool-chain to map high-level or low-level descriptions of processes into appropriate views. The resulting views are integrated into the view-based modeling framework and can be manipulated or re-used to develop other processes.
References

1. Biggerstaff, T.J.: Design recovery for maintenance and reuse. IEEE Computer 22(7), 36–49 (1989)
2. Chikofsky, E.J., Cross, J.H.I.: Reverse engineering and design recovery: A taxonomy. IEEE Software 7(1), 13–17 (1990)
3. Eclipse: Eclipse Modeling Framework (2006), http://www.eclipse.org/emf/
4. Kindler, E.: On the semantics of EPCs: A framework for resolving the vicious circle. In: Business Process Management, pp. 82–97 (2004)
5. Mendling, J., Lassen, K.B., Zdun, U.: Transformation strategies between block-oriented and graph-oriented process modelling languages. Technical Report JM-2005-10-10, WU Vienna (2005)
6. Mendling, J., Ziemann, J.: Transformation of BPEL processes to EPCs. In: Proc. of the 4th GI Workshop on Event-Driven Process Chains (EPK 2005), December 2005, vol. 167, pp. 41–53 (2005)
7. OASIS: Business Process Execution Language (WSBPEL) 2.0 (May 2007), http://docs.oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.pdf
8. OMG: Unified Modelling Language 2.0 (UML) (2004), http://www.uml.org
9. OMG: Business Process Modeling Notation (February 2006), http://www.bpmn.org/Documents/OMG-02-01.pdf
10. openArchitectureWare.org (August 2002), http://www.openarchitectureware.org
11. Recker, J., Mendling, J.: On the translation between BPMN and BPEL: Conceptual mismatch between process modeling languages. In: Eleventh Int. Workshop on Exploring Modeling Methods in Systems Analysis and Design (EMMSAD 2006), June 2006, pp. 521–532 (2006)
12. Tran, H., Zdun, U., Dustdar, S.: View-based and Model-driven Approach for Reducing the Development Complexity in Process-Driven SOA. In: Intl. Working Conf. on Business Process and Services Computing (BPSC 2007), Lecture Notes in Informatics, vol. 116, September 2007, pp. 105–124 (2007)
13. van der Aalst, W.: On the verification of interorganizational workflows. Computing Science Reports 97/16, Eindhoven University of Technology (1997)
14. Völter, M., Stahl, T.: Model-Driven Software Development: Technology, Engineering, Management. Wiley, Chichester (2006)
15. WfMC: XML Process Definition Language (XPDL) (April 2005), http://www.wfmc.org/standards/XPDL.htm
16. Ziemann, J., Mendling, J.: EPC-based modelling of BPEL processes: a pragmatic transformation approach. In: Proc. of the 7th Int. Conference Modern Information Technology in the Innovation Processes of the Industrial Enterprises (MITIP 2005) (2005)
17. Zou, Y., Hung, M.: An approach for extracting workflows from e-commerce applications. In: ICPC 2006: Proc. of the 14th IEEE Int. Conf. on Program Comprehension, Washington, DC, USA, pp. 127–136. IEEE Computer Society (2006)
Model-Driven Development of Composite Applications Susanne Patig University of Bern, IWI, Engehaldenstrasse 8, CH-3012 Bern, Switzerland [email protected]
Abstract. In service-oriented architectures, composite applications (CA) are created by assembling existing software services. Model-driven development does not implement a CA directly, but starts from models that describe the services and their interactions (and map to source code). This article classifies existing approaches for the model-driven development of CAs. Based on a small example it is demonstrated that current approaches do not support the development of CAs where the order of service calls is not constrained and depends on user input. To solve this problem, a new approach for the composition of web services is presented, which combines the Service Component Architecture (SCA) and state transition models. Keywords: model-driven development, web services, composition
1 Composite Applications by Example

In service-oriented architectures, composite applications (CA) are developed by assembling existing services [8]; this is also called service aggregation [12]. Services are self-contained, self-describing, stateless pieces of software that offer some functionality [23]. Based on the description of their functionality, services can be found and accessed by other services. This paper focuses on web services, which rely for their description and interaction on non-proprietary internet standards like XML, HTTP, SOAP and WSDL [29]. Service composition can be static or dynamic. Static compositions select and connect services at design time and execute this configuration at run time, whereas in dynamic compositions [2] the actual services invoked or their interactions are determined during the execution of the CA. Here, static service composition is examined, since dynamic composition requires (semantic) web service discovery, which still is an open issue in practice. Static web service composition is not necessarily model-driven. Imagine the following example: A small holiday planning CA should be implemented by assembling web services that allow users to look for airports in some country (getAirportInformationByCountry(country:String)::ListAirport), to check the temperature in a town (getWeather(town:String)::Temperature) and to look for a guide book (getBook(town:String)::ListBook). The strings set in courier font are in fact operations of the web services Airport and GlobalWeather (http://www.webservicex.net/) as well as of the Amazon AWSECommerceService (http://soap.amazon.com/onca/soap?Service=AWSECommerceService). Direct implementation of such a CA usually proceeds as follows: (1) Determine the web services and their WSDL files. In this respect, the sample CA is simple, since the location of the web services is known and does not have to be looked up in a UDDI registry. (2) For each web service involved, generate a local proxy from its WSDL. (3) Analyze the signatures of the web service operations, i.e., the inbound and outbound messages, and code the GUI and interaction logic accordingly. Fig. 1 shows a screenshot of the resulting CA, implemented with Eclipse 3.2.1 and JDK 1.5.0.
Fig. 1. GUI of the sample CA
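Step (3) of such a direct implementation can be made concrete: without a generated proxy, the client has to assemble the SOAP message for an operation by hand. The namespace URI and the message layout below are assumptions for illustration, not the actual service contract:

```python
# Hand-coded SOAP request construction, as a generated proxy would do it
# internally. Namespace and parameter names are illustrative assumptions.
def soap_request(operation, params, ns="http://www.webserviceX.NET"):
    """Build a SOAP 1.1 envelope string for one operation call."""
    body = "".join(f"<{k}>{v}</{k}>" for k, v in params.items())
    return (
        '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
        f'<soap:Body><{operation} xmlns="{ns}">{body}</{operation}>'
        "</soap:Body></soap:Envelope>"
    )

request = soap_request("GetWeather",
                       {"CityName": "Bern", "CountryName": "Switzerland"})
```

The amount of such boilerplate per operation is exactly the technical code that model-driven development aims to generate from models.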
From the functional point of view (non-functional requirements are not considered in this paper), the sample CA is characterized as follows:
1. The services require input data (country and airport town).
2. The input data can be provided by users or by other services, e.g., the town can be entered as free text or selected from a list that is the outbound message of getAirportInformationByCountry.
3. The services are not ordered (unconstrained composition [12]). For example, if the decision on the travel destination mainly depends on the weather, users will start with getWeather. Alternatively, since nice weather is useless if the town does not have an airport, the web service operation getAirportInformationByCountry could be invoked first. Moreover, searching for a guide book (getBook) can be done at any planning stage (to get information about the town with the nice weather).
Model-driven development (MDD) of web service CAs suggests itself for two reasons: First, there are several standard programming activities, like the processing of WSDL, XML and SOAP and GUI development, that produce a considerable amount of technical code. MDD aims at generating such technical code from models. Secondly, in direct implementations the 'application logic' of the CA is interwoven with the technical code and, hence, difficult for non-developers to maintain. This obstructs the vision of Enterprise SOA, according to which business consultants should be able to
assemble new applications from existing services [23]. MDD supports this vision by modelling the application logic (and by subsequently transforming these models). Surprisingly, finding an appropriate approach for the model-driven development of CAs with the characteristics of the sample turned out to be difficult. Section 2 sketches the existing approaches to model services and their interactions. Since none of them is completely capable of modelling the sample CA, in Section 3 a new approach is presented, which combines the programming specifications of SCA (Service Component Architecture [19]) and state transition diagrams.
2 Model-Driven Approaches for Service Composition

2.1 Model-Driven Development and (Web) Services

In model-driven development (MDD), a model is an abstract description of some software system [13]. The content of the model either hides all details of the implementation platform (PIM: Platform Independent Model) or not (PSM: Platform Specific Model); in other words, models can be more (PIM) or less (PSM) abstract. Basically, MDD consists in transforming more abstract models into more specific ones, with source code being the most concrete description of a software system. MDD models are created by applying some dedicated modelling language (metamodel). As a result of the Model-driven Architecture (MDA), which was proposed by the OMG in 2001 [16], the Unified Modelling Language (UML) became a de facto standard for modelling in MDD. However, MDD and its manifestation MDA are not tied to UML, since the Meta-Object Facility (MOF), another standard included in the MDA recommendation, allows one to define new metamodels. Consequently, this paper does not restrict the discussion of model-driven approaches for the development of composite applications to UML-based ones. Because of the numerous standards involved, describing the composition of web services by models seems to be both possible and worthwhile. The most important standard providing access as well as usage information on existing web services is the Web Services Description Language (WSDL) [27], [28]. This paper heavily relies on the WSDL definition of services, which separates the abstract functionality of a web service from its mapping to particular implementations. The abstract part of a WSDL 2.0 definition consists of an interface (port type in WSDL 1.1), which is a collection of the abstract operations provided by the web service (and optional faults).
Each operation is defined by its signature, i.e., its inbound and outbound messages, and a mandatory message exchange pattern (MEP), e.g., in-only, out-only, in-out, out-in [26]. In contrast to WSDL 1.1, WSDL 2.0 does not specify messages explicitly under a corresponding tag, but uses the XML type system. The implementation part of a WSDL definition comprises binding and endpoint. A binding specifies concrete message formats as well as transport protocols and supplies this information for each operation in the interface. An endpoint (port in WSDL 1.1) relates a binding to the network address of some implemented service. Altogether, a WSDL 2.0 service consists of a single interface and a list of endpoints. Platform independence as the distinctive feature of the web service standards in general and the WSDL in particular is in line with the MDA distinction between
PIM and PSM. In MDD, the transformation from PIM to PSM is strictly forward. If software systems are developed that use web services, one must distinguish between (1) setting up a (web) service-oriented architecture and (2) reusing an existing one. In contrast to the first case, which involves only forward transformations (e.g., from UML to WSDL to Java), the second case, which is the relevant one for CA development, may also require reverse transformations. Apart from the direction of the transformations, both cases differ in the content to be modelled: To set up an architecture, the services and their allowed message-based interactions must be described. In the CA case, the service-oriented architecture is given and restricts the possible interactions, which are the focus of modelling. Service interactions come in three facets: conversation, choreography and orchestration. The particular messages exchanged between services and their order form a conversation [25]. Choreography describes all valid conversations by stating temporal and logical dependencies between messages (featuring sequencing rules, correlation, exception handling and transactions) [24]. Orchestrations differ from choreographies in the objects modelled (activities instead of messages) and in the view on behaviour (central control instead of decentralization), since they define an order for the activities (services) within a process [22]. Both static aspects (services and their interfaces) and dynamic ones (orchestration, choreography) must be described when CAs are developed; Section 2.2 and Section 2.3, respectively, summarize corresponding approaches. Data-centric approaches (see Section 2.4) connect static and dynamic aspects by the data exchanged.

2.2 Static Modelling Approaches

Static aspects of a CA can be modelled by either UML 2.0 profiles or SCA.
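A reverse transformation starts from the WSDL of the existing services; a minimal sketch, assuming a WSDL 1.1-style fragment and plain XML parsing (not a real WSDL toolkit), recovers the abstract operations of a port type:

```python
# Sketch of the reverse direction: recovering the abstract part of a service
# model (operations with inbound/outbound messages) from a WSDL fragment.
import xml.etree.ElementTree as ET

def operations(wsdl_source):
    """Return a mapping: operation name -> {input/output: message name}."""
    root = ET.fromstring(wsdl_source)
    ops = {}
    for op in root.iter():
        if op.tag.split("}")[-1] == "operation":
            msgs = {child.tag.split("}")[-1]: child.get("message")
                    for child in op}
            ops[op.get("name")] = msgs
    return ops

wsdl = """<definitions>
  <portType name="GlobalWeather">
    <operation name="GetWeather">
      <input message="GetWeatherSoapIn"/>
      <output message="GetWeatherSoapOut"/>
    </operation>
  </portType>
</definitions>"""
```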
UML 2.0 profiles, which follow the MDA tradition, are specializations of the UML metamodel, i.e., the elements of a profile are created by adding domain-specific semantics to standard UML elements. In detail, profiles are UML packages that collect a set of stereotypes. Stereotypes are domain-specific metaclasses carrying meta-attributes (formerly ‘tagged values’), which are common in some domain [17]. Here, software architectures that rely on web services form the domain. Consequently, the corresponding UML 2.0 profiles relatively closely mirror the WSDL (see Table 1). Some service-relevant elements (e.g., interface, operation, message, and port) are already contained in the UML metamodel [OMG04] and, hence, need not be defined in a profile.

Table 1. Selected UML 2.0 profiles for software services ([27], [9], [2])
The UML 2.0 class diagrams created by applying these profiles can be easily transformed into WSDL files and skeleton code for software components that provide web services (see, e.g., the profile usage in the IBM Rational Software Architect [9]). However, the described profiles at best rudimentarily support the modelling of dynamic aspects of CAs: For example, profile elements (e.g., “collaboration” in the IBM profile) are provided that establish a link to processes in BPEL4WS (Business Process Execution Language for Web Services) [10].

The Service Component Architecture (SCA) is a set of specifications that describe how software systems can be built out of existing services and how service components can be created ([11], [19]). A SCA (service) component, the basic modelling element, is a configured instance of an implementation that provides a business function (service). Implementations may depend on services provided by other SCA components (modelled by references). Moreover, SCA components can have properties whose values influence the service provided by the implementation. Property values are set during configuration, which also wires the references to other services. The set of SCA specifications consists of the assembly specification (defining how SCA components can be combined to form a SCA system), the policy framework (constraints and capabilities related to security, reliability and transactions), the client implementation specification (guidelines for building clients for SCA components by using particular implementation technologies, e.g., Enterprise Java Beans or C++) and the binding specification (how to access service components by using specific protocols). Currently, SCA is more a set of programming guidelines than an MDD approach, since it does not provide a metamodel, but only a weak concrete syntax (notation) for SCA components and their constituents.
However, a translational semantics [15] of the missing metamodel can be constructed from the SCA specifications that map SCA elements to implementation technologies. Such a metamodel also enables transformations from models to source code, which is the core of MDD. By including attributes that refer to the policy framework, SCA (in contrast to UML 2.0 profiles for software services) goes beyond architecture and functional requirements. Moreover, SCA explicitly considers the development of clients for web services. Dynamic aspects of CAs must be modelled by SCA components that are BPEL4WS processes [11], [19].

2.3 Dynamic Modelling Approaches

For CAs, the static aspects are already given by the WSDL definitions of the services. The application logic of a CA refers to the dynamic aspects, since it consists in the required or allowed interactions between services. To support non-developers in assembling a CA, these dynamic aspects should be represented by models. Approaches that model dynamic aspects of composite applications aim at separating the design time of a CA from its run time: During design, the interactions are described by some model. At runtime, this model is executed by an execution engine, which invokes the services and does the message handling [20]. Models of dynamic aspects of CAs describe either the activities (services) that constitute a process (orchestration) or the messages exchanged between services (choreography). For both, UML or alternative modelling languages are used.
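The SCA notions used above (components as configured instances of implementations, with properties and wired references) can be sketched as plain data; the class and the wiring syntax are illustrative assumptions, not the SCA assembly XML:

```python
# SCA sketched as plain data: a component is a configured instance of an
# implementation; properties parameterize it, references are wired to
# services of other components. All names below are illustrative.
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    implementation: str               # e.g. a Java class or a BPEL process
    properties: dict = field(default_factory=dict)
    references: dict = field(default_factory=dict)  # reference -> target service

banking = Component("BankingService", "impl.java:BankingImpl")
shopping = Component(
    "ShoppingProcess", "impl.bpel:Shopping.bpel",
    properties={"currency": "EUR"},
    references={"banking": "BankingService/VerifyCreditCard"},
)
```

Configuration (setting property values and wiring references) is what turns a reusable implementation into a concrete component of the assembled system.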
S. Patig
If UML is applied, orchestrations are expressed by activity diagrams ([7], [21]), where activities represent (particular operations of) services, and their inbound and outbound messages are visualized by pins (see Fig. 2). Consequently, message exchange between services corresponds to data flows; any other service interaction is a control flow. Occasionally, additional stereotypes (marked by <<stereotypeName>> in UML diagrams) are used to distinguish activities that are service calls from activities that perform data transformations [21]. Analogously to UML activity diagrams, other process modelling languages like BPMN (Business Process Modelling Notation [18]) or YAWL (Yet Another Workflow Language2) equate services with activities in some process and show them as nodes in a graph; directed arcs between the nodes represent data and control flows. If the process modelling language does not have its own execution engine3, the resulting process models are mapped to BPEL4WS for their execution [20]. To model choreographies, UML sequence diagrams can be directly applied, since they show the exchange of messages between lifelines. In the CA context, the lifelines correspond to services (see Fig. 3; the lifelines are enclosed in boxes). Once more, BPEL4WS provides the runtime; hence, the UML sequence diagram must be transformed [3]. By definition, choreographies concern more than one web service. For individual services, their handling of inbound and outbound messages can be described by automata-like modelling languages (e.g., Mealy implementations) [4]. The concrete syntax of these metamodels consists of graphs, where inbound and outbound messages correspond to transitions (labelled, directed arcs) between states (nodes).

Fig. 2. UML activity diagram for the sample CA (activities: <<user>> decideOnCountry, <<service>> getAirportInformationbyCountry, <<user>> decideOnAirport, <<service>> getWeather, <<service>> getBook; pins: Country, ListAirport, Airport)
Since these automata can be directly composed [8], the modelling approach is applicable to CAs. However, so far, automata have been mainly used to obtain theoretical results concerning service composition (e.g., on composability); execution environments and mappings to source code are missing. Additionally, the deterministic Mealy automata underlying the sketched modelling approaches are not expressive enough for real-world CAs that involve user interaction: often, such CAs are more adequately described by non-deterministic automata (although any non-deterministic automaton can be transformed into an equivalent deterministic one [15]).
2 http://www.yawl-system.com/
3 As opposed to BPMN, YAWL has its own execution environment.
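The automata-based view of service message handling can be made concrete with a small sketch. The encoding below is hypothetical (states, message names, and the composition scheme are chosen for illustration): each transition is labelled with an inbound message and produces an outbound message, and two automata are composed by feeding one machine's outputs to the other as inputs, as in the direct composition the text refers to:

```python
class Mealy:
    """A deterministic Mealy machine: transitions map
    (state, inbound_message) -> (next_state, outbound_message)."""

    def __init__(self, start, transitions):
        self.state = start
        self.transitions = transitions

    def step(self, msg):
        self.state, out = self.transitions[(self.state, msg)]
        return out

# A toy flight-booking service's message handling.
service = Mealy("idle", {
    ("idle", "searchRequest"): ("offered", "offerList"),
    ("offered", "confirm"): ("idle", "bookingAck"),
})

# A client automaton; composing it with the service means wiring each
# machine's outbound messages to the other's inbound messages.
client = Mealy("start", {
    ("start", "offerList"): ("done", "confirm"),
})

out = service.step("searchRequest")   # -> 'offerList'
reply = client.step(out)              # -> 'confirm'
print(service.step(reply))            # -> 'bookingAck'
```

A non-deterministic variant would map a (state, message) pair to a *set* of successors, which is why determinization (as noted above) is needed before such models suit a deterministic execution environment.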
Model-Driven Development of Composite Applications
Fig. 3. UML sequence diagram for the sample CA (IBM Rational Software Architect® 7.0)
2.4 Data-Centric Modelling Approaches

Data-centric approaches for the model-driven development of CAs are the only ones for which successful applications in industry (banks, government, health care, etc.) have been reported ([14], [1]). They centre around the data exchanged by messages. The data may be of arbitrarily complex types and is usually described by XML. Data is associated with both activities (web service operations) and human user interactions. Thus, each activity is assigned at most one form, which allows users to access data (see Fig. 4). A form must contain all data that is mandatory for the activity. In the forms of the case handling approach [1], users can also insert data that is required by other (later) activities. Additionally, so-called ‘restricted data’ may only be entered during a particular activity.
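The two constraints just stated, that a form must contain all data mandatory for its activity and that restricted data may only be entered during one particular activity, can be sketched as a validation rule. Activity and field names are hypothetical examples, not taken from the case handling literature:

```python
# Sketch of the data-centric constraints, assuming each activity has at
# most one form described by the set of data fields it offers.

mandatory = {"bookFlight": {"passengerName", "flightNo"}}
restricted = {"creditCardNo": "payFlight"}  # field -> only activity allowed to enter it

def validate_form(activity, form_fields):
    # A form must contain all data that is mandatory for the activity.
    missing = mandatory.get(activity, set()) - form_fields
    if missing:
        raise ValueError(f"form for {activity} lacks mandatory data: {missing}")
    # Restricted data may only be entered during its particular activity.
    for name in form_fields:
        owner = restricted.get(name)
        if owner is not None and owner != activity:
            raise ValueError(f"{name} may only be entered during {owner}")
    return True

# Extra fields are allowed (users may insert data needed by later activities).
print(validate_form("bookFlight", {"passengerName", "flightNo", "seat"}))  # True
```

Such a check could run when a modelled form is bound to an activity, catching inconsistencies at design time rather than at run time.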