Software Language Engineering: Second International Conference, SLE 2009, Denver, CO, USA, October 5-6, 2009 Revised Selected Papers (Lecture Notes ... Programming and Software Engineering)
Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5969
Mark van den Brand, Dragan Gašević, Jeff Gray (Eds.)
Software Language Engineering Second International Conference, SLE 2009 Denver, CO, USA, October 5-6, 2009 Revised Selected Papers
Volume Editors

Mark van den Brand
Dept. of Mathematics and Computer Science, Software Engineering and Technology
Eindhoven University of Technology
Den Dolech 2, 5612 AZ Eindhoven, The Netherlands
E-mail: [email protected]

Dragan Gašević
School of Computing and Information Systems
Athabasca University
1 University Drive, Athabasca, AB T9S 3A3, Canada
E-mail: [email protected]

Jeff Gray
Department of Computer Science
University of Alabama
P.O. Box 870290, Tuscaloosa, AL, USA
E-mail: [email protected]
Library of Congress Control Number: 2010922313
CR Subject Classification (1998): D.2, D.3, I.6, F.3, K.6.3
LNCS Sublibrary: SL 2 – Programming and Software Engineering
ISSN 0302-9743
ISBN-10 3-642-12106-3 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-12106-7 Springer Berlin Heidelberg New York
Preface

We are pleased to present the proceedings of the Second International Conference on Software Language Engineering (SLE 2009). The conference was held in Denver, Colorado (USA) during October 5–6, 2009 and was co-located with the 12th IEEE/ACM International Conference on Model-Driven Engineering Languages and Systems (MODELS 2009) and the 8th ACM International Conference on Generative Programming and Component Engineering (GPCE 2009).

The SLE conference series is devoted to a wide range of topics related to artificial languages in software engineering. SLE is an international research forum that brings together researchers and practitioners from both industry and academia to expand the frontiers of software language engineering. SLE's foremost mission is to encourage and organize communication between communities that have traditionally looked at software languages from different, more specialized, and yet complementary perspectives. SLE emphasizes the fundamental notion of languages, as opposed to any realization in specific technical spaces. In this context, the term "software language" comprises all sorts of artificial languages used in software development, including general-purpose programming languages, domain-specific languages, modeling and meta-modeling languages, data models, and ontologies. Software language engineering is the application of a systematic, disciplined, quantifiable approach to the development, use, and maintenance of these languages.

The SLE conference is concerned with all phases of the lifecycle of software languages; these include the design, implementation, documentation, testing, deployment, evolution, recovery, and retirement of languages. Of special interest are tools, techniques, methods, and formalisms that support these activities. In particular, tools are often based on, or automatically generated from, a formal description of the language. Hence, the treatment of language descriptions as software artifacts, akin to programs, is of particular interest, while noting the special status of language descriptions and the tailored engineering principles and methods for modularization, refactoring, refinement, composition, versioning, co-evolution, and analysis that can be applied to them.

The response to the call for papers for SLE 2009 was quite enthusiastic. We received 79 full submissions from 100 initial abstract submissions. From those 79 submissions, the Program Committee selected 23 papers: 15 full papers, 6 short papers, and 2 tool demonstration papers, resulting in an acceptance rate of 29%. To ensure the quality of the accepted papers, each submitted paper was reviewed by at least three PC members. Each paper was discussed in detail during a week-long electronic PC meeting, as facilitated by EasyChair. The conference was quite interactive, and the discussions provided additional feedback to the authors. Accepted papers were then revised based on the reviews, in some cases a PC discussion summary, and feedback from the conference. The
final versions of all accepted papers are included in this proceedings volume. The resulting program covered diverse topics related to software language engineering. The papers cover engineering aspects in different phases of the software language development lifecycle. These include the analysis of languages in the design phase and their actual usage after deployment. The papers also represent various tools and techniques used in language implementations, including different approaches to language transformation and composition. The organization of these papers in this volume reflects the sessions in the original program of the conference.

SLE 2009 had two renowned keynote speakers: Jim Cordy (a joint keynote talk with GPCE 2009) and Jean Bézivin. They each provided informative and entertaining keynote talks. Trying to address the problems of complexity, usability, and adoption of generative and transformational techniques, Cordy's keynote suggested using generative and transformational techniques to implement domain-specific languages. Bézivin's keynote discussed the many different possibilities where model-driven research and practice can advance the capabilities for software language engineering. The proceedings begin with short papers summarizing the keynotes to provide a broad introduction to the software language engineering discipline and to identify key research challenges.

SLE 2009 would not have been possible without the significant contributions of many individuals and organizations. We are grateful to the organizers of MODELS 2009 for their close collaboration and management of many of the logistics. This allowed us to offer SLE participants the opportunity to take part in two high-quality research events in the domain of software engineering. The SLE 2009 Organizing Committee and the SLE Steering Committee provided invaluable assistance and guidance. We are especially grateful to the Software Engineering Center at the University of Minnesota for sponsoring the conference and for all the support and excellent collaboration. We must also emphasize the role of Eric Van Wyk in making this arrangement with the Software Engineering Center possible and his great help in acting as the SLE 2009 Finance Chair. We are also grateful to the PC members and the additional reviewers for their dedication in reviewing the large number of submissions. We also thank the authors for their efforts in writing and then revising their papers, and we thank Springer for publishing the papers in the proceedings. We are grateful to the developers of EasyChair for providing an open conference management system. Finally, we wish to thank all the participants at SLE 2009 for the energetic and insightful discussions that made SLE 2009 such an educational and fun event.

January 2010
Mark van den Brand
Dragan Gašević
Jeff Gray
Organization
SLE 2009 was organized by Athabasca University, Eindhoven University of Technology, and the University of Alabama. It was sponsored by the Software Engineering Center of the University of Minnesota.
General Chair
Dragan Gašević (Athabasca University, Canada)
Program Committee Co-chairs
Jeff Gray (University of Alabama, USA)
Mark van den Brand (Eindhoven University of Technology, The Netherlands)
Organizing Committee
Alexander Serebrenik (Eindhoven University of Technology, The Netherlands), Publicity Co-chair
Bardia Mohabbati (Simon Fraser University, Canada), Web Chair
Marko Bošković (Athabasca University, Canada)
Eric Van Wyk (University of Minnesota, USA), Finance Chair
James Hill (Indiana University/Purdue University, USA), Publicity Co-chair
Program Committee
Colin Atkinson (Universität Mannheim, Germany)
Don Batory (University of Texas, USA)
Paulo Borba (Universidade Federal de Pernambuco, Brazil)
John Boyland (University of Wisconsin-Milwaukee, USA)
Marco Brambilla (Politecnico di Milano, Italy)
Shigeru Chiba (Tokyo Institute of Technology, Japan)
Charles Consel (LaBRI / INRIA, France)
Stephen Edwards (Columbia University, USA)
Gregor Engels (Universität Paderborn, Germany)
Robert Fuhrer (IBM Research, USA)
Martin Gogolla (University of Bremen, Germany)
Giancarlo Guizzardi (Federal University of Espirito Santo, Brazil)
Reiko Heckel (University of Leicester, UK)
Frédéric Jouault (INRIA & Ecole des Mines de Nantes, France)
Nicholas Kraft (University of Alabama, USA)
Thomas Kühne (Victoria University of Wellington, New Zealand)
Julia Lawall (University of Copenhagen, Denmark)
Timothy Lethbridge (University of Ottawa, Canada)
Brian Malloy (Clemson University, USA)
Kim Mens (Université catholique de Louvain, Belgium)
Marjan Mernik (University of Maribor, Slovenia)
Todd Millstein (University of California, Los Angeles, USA)
Pierre-Etienne Moreau (Centre de recherche INRIA Nancy - Grand Est, France)
Pierre-Alain Muller (University of Haute-Alsace, France)
Richard Paige (University of York, UK)
James Power (National University of Ireland, Ireland)
Daniel Oberle (SAP Research, Germany)
João Saraiva (Universidade do Minho, Portugal)
Alexander Serebrenik (Eindhoven University of Technology, The Netherlands)
Anthony Sloane (Macquarie University, Australia)
Mary Lou Soffa (University of Virginia, USA)
Steffen Staab (Universität Koblenz-Landau, Germany)
Jun Suzuki (University of Massachusetts, Boston, USA)
Walid Taha (Rice University, USA)
Eli Tilevich (Virginia Tech, USA)
Juha-Pekka Tolvanen (MetaCase, Finland)
Jurgen Vinju (CWI, The Netherlands)
Eelco Visser (Delft University of Technology, The Netherlands)
René Witte (Concordia University, Canada)
Additional Reviewers
Marcel van Amstel, Emilie Balland, Olivier Barais, Paul Brauner, Behzad Bordbar, Johan Brichau, Alfredo Cadiz, Sergio Castro, Loek Cleophas, Cristobal Costa-Soria, Duc-Hanh Dang, Adwoa Donyina, Nicolas Drivalos, João Fernandes, Frederic Fondement, Xiaocheng Ge, Danny Groenewegen, Lars Hamann, Kees Hemerik, Karsten Hoelscher, Lennart Kats, Paul Klint, Dimitrios Kolovos, Mirco Kuhlmann, Nicolas Loriant, Markus Luckey, Arjan van der Meer, Muhammad Naeem, Diego Ordonez, Fernando Orejas, Nicolas Palix, Fernando Silva Parreiras, Maja Pesic, Zvezdan Protic, Alek Radjenovic, António Nestor Ribeiro, Márcio Ribeiro, Louis Rose, Christian Soltenborn, Daniel Spiewak, Tijs van der Storm, Leopoldo Teixeira, Massimo Tisi, Sander Vermolen, Nicolae Vintilla, Tobias Walter, Andreas Wübbeke, Tian Zhao
Steering Committee
Mark van den Brand (Technische Universiteit Eindhoven, The Netherlands)
James Cordy (Queen's University, Canada)
Jean-Marie Favre (University of Grenoble, France)
Dragan Gašević (Athabasca University, Canada)
Görel Hedin (Lund University, Sweden)
Ralf Lämmel (Universität Koblenz-Landau, Germany)
Eric Van Wyk (University of Minnesota, USA)
Andreas Winter (Johannes Gutenberg-Universität Mainz, Germany)
Sponsoring Institutions
Table of Contents

I Keynotes

Eating Our Own Dog Food: DSLs for Generative and Transformational Engineering ..... 1
  James R. Cordy

If MDE Is the Solution, Then What Is the Problem? ..... 2
  Jean Bézivin

II Regular Papers

Session: Language and Model Evolution

Language Evolution in Practice: The History of GMF ..... 3
  Markus Herrmannsdoerfer, Daniel Ratiu, and Guido Wachsmuth

A Novel Approach to Semi-automated Evolution of DSML Model Transformation ..... 23
  Tihamer Levendovszky, Daniel Balasubramanian, Anantha Narayanan, and Gabor Karsai

Study of an API Migration for Two XML APIs ..... 42
  Thiago Tonelli Bartolomei, Krzysztof Czarnecki, Ralf Lämmel, and Tijs van der Storm

Session: Variability and Product Lines

Composing Feature Models ..... 62
  Mathieu Acher, Philippe Collet, Philippe Lahire, and Robert France

VML* – A Family of Languages for Variability Management in Software Product Lines ..... 82
  Steffen Zschaler, Pablo Sánchez, João Santos, Mauricio Alférez, Awais Rashid, Lidia Fuentes, Ana Moreira, João Araújo, and Uirá Kulesza

Multi-view Composition Language for Software Product Line Requirements ..... 103
  Mauricio Alférez, João Santos, Ana Moreira, Alessandro Garcia, Uirá Kulesza, João Araújo, and Vasco Amaral

Session: Short Papers

Yet Another Language Extension Scheme ..... 123
  Anya Helene Bagge

Model Transformation Languages Relying on Models as ADTs ..... 133
  Jerónimo Irazábal and Claudia Pons

Towards Dynamic Evolution of Domain Specific Languages ..... 144
  Paul Laird and Stephen Barrett

ScalaQL: Language-Integrated Database Queries for Scala ..... 154
  Daniel Spiewak and Tian Zhao

Integration of Data Validation and User Interface Concerns in a DSL for Web Applications ..... 164
  Danny M. Groenewegen and Eelco Visser

Ontological Metamodeling with Explicit Instantiation ..... 174
  Alfons Laarman and Ivan Kurtev

Session: Parsing, Compilation, and Demo

Verifiable Parse Table Composition for Deterministic Parsing ..... 184
  August Schwerdfeger and Eric Van Wyk

Natural and Flexible Error Recovery for Generated Parsers ..... 204
  Maartje de Jonge, Emma Nilsson-Nyman, Lennart C.L. Kats, and Eelco Visser

PIL: A Platform Independent Language for Retargetable DSLs ..... 224
  Zef Hemel and Eelco Visser

Graphical Template Language for Transformation Synthesis ..... 244
  Elina Kalnina, Audris Kalnins, Edgars Celms, and Agris Sostaks

Session: Modularity in Languages

A Role-Based Approach towards Modular Language Engineering ..... 254
  Christian Wende, Nils Thieme, and Steffen Zschaler

Language Boxes: Bending the Host Language with Modular Language Changes ..... 274
  Lukas Renggli, Marcus Denker, and Oscar Nierstrasz

Declarative Scripting in Haskell ..... 294
  Tim Bauer and Martin Erwig

Session: Metamodeling and Demo

An Automated Process for Implementing Multilevel Domain Models
  Frédéric Mallet, François Lagarde, Charles André, Sébastien Gérard, and François Terrier

Domain-Specific Metamodelling Languages for Software Language Engineering
  Steffen Zschaler, Dimitrios S. Kolovos, Nikolaos Drivalos, Richard F. Paige, and Awais Rashid
Eating Our Own Dog Food: DSLs for Generative and Transformational Engineering

James R. Cordy
School of Computing, Queen's University, Kingston, Ontario, Canada
[email protected]
Abstract. Languages and systems to support generative and transformational solutions have been around a long time. Systems such as XVCL, DMS, ASF+SDF, Stratego and TXL have proven mature, efficient and effective in a wide range of applications. Even so, adoption remains a serious issue - almost all successful production applications of these systems in practice either involve help from the original authors or years of experience to get rolling. While work on accessibility is active, with efforts such as ETXL, Stratego XT, Rascal and Colm, the fundamental big step remains - it’s not obvious how to apply a general purpose transformational system to any given generation or transformation problem, and the real power is in the paradigms of use, not the languages themselves. In this talk I will propose an agenda for addressing this problem by taking our own advice - designing and implementing domain specific languages (DSLs) for specific generative, transformational and analysis problem domains. We widely advise end users of the need for DSLs for their kinds of problems - why not for our kinds? And we use our tools for implementing their DSLs - why not our own? I will outline a general method for using transformational techniques to implement transformational and generative DSLs, and review applications of the method to implementing example text-based DSLs for model-based code generation and static code analysis. Finally, I will outline some first steps in implementing model transformation DSLs using the same idea - retaining the maturity and efficiency of our existing tools while bringing them to the masses by “eating our own dogfood”.
If MDE Is the Solution, Then What Is the Problem?

Jean Bézivin
AtlanMod research team, INRIA and EMNantes, Nantes, France
[email protected]
For nearly ten years, modern forms of software modeling have been used in various contexts, with good apparent success. This is a convenient time to reflect on what has been achieved, where we stand now, and where we are leading to with Model-Driven Engineering (MDE). If there is apparently some consensual agreement on the core mechanisms, it is much more difficult to delimitate the scope and applicability of MDE. The three main questions we have to answer in sequence are:

1. What is a model?
2. Where are models coming from?
3. What may models be useful for?

There is now some consensus in the community about the answer to the first question. A (terminal) model is a graph conforming to another graph usually called its metamodel, and this terminal model represents a system. Terminal models and their metamodels are similarly organized and may be unified as abstract models, yielding a regular organization. In such an organization, some of the models (e.g., a transformation) may be executable. The relation of conformance between a terminal model and its metamodel provides most of the information on the first question.

The second question about the origin of models is much more difficult to answer and is still the central challenge of computer science. This is more related to the representation relation between a terminal model and a system. Different situations could be considered here (e.g., a system derived from a model, a model derived from a system, or system and model co-existence), but there are basically two possibilities to create a model: by transformation or by observation of a system, the second one being much more important and much less understood. The discovery of a terminal model from a system is always made by an observer (possibly but rarely automated), with a goal and a precise metamodel. Making explicit this discovery process represents one of the most important and urgent open research issues in MDE.

When we have answered the second question about model creation methodology, it is then easier to answer the third question about usability. There are three main categories of MDE application related to forward engineering (mainly software artifact production from models), to reverse engineering (primarily legacy code analysis) and to general interoperability problems (when two heterogeneous systems must interact). Instead of solving the direct interaction problems between the heterogeneous systems, it seems advantageous to represent these systems by models (possibly conforming to different metamodels) and to use generic Model-Driven Interoperability (MDI) techniques.
Language Evolution in Practice: The History of GMF

Markus Herrmannsdoerfer1, Daniel Ratiu1, and Guido Wachsmuth2

1 Institut für Informatik, Technische Universität München, Boltzmannstr. 3, 85748 Garching b. München, Germany
{herrmama,ratiu}@in.tum.de
2 Institut für Informatik, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany
[email protected]
Abstract. In consequence of changing requirements and technological progress, software languages are subject to change. The changes affect the language’s specification, which in turn affects language processors as well as existing language utterances. Unfortunately, little is known about how software languages evolve in practice. This paper presents a case study on the evolution of four modeling languages provided by the Graphical Modeling Framework. It investigates the following research questions: (1) What is the impact of language changes on related software artifacts?, (2) What activities are performed to implement language changes? and (3) What kinds of adaptations capture the language changes? We found out that the language changes affect various kinds of related artifacts; the distribution of the activities performed to evolve the languages mirrors the classical software maintenance activities, and most language changes can be captured by a small suite of operators that can also be used to migrate the language utterances.
1 Introduction
Software languages change [1]. A software language, as any other piece of software, is designed, developed, tested, and maintained. Requirements, purpose, and scope of software languages change, and they have to be adapted to these changes. This applies particularly to domain-specific languages that are specialized to a specific problem domain, as their specialization causes them to be vulnerable with respect to changes of the domain. But general-purpose languages like Java or the UML evolve, too. Typically, their evolution is quite slow and driven by heavy-weighted community processes.

Software language evolution implicates a threat for language erosion [2]. Typically, language processors and tools no longer comply with a changing language. But we do not want to build language processors and tools from scratch every time a language changes. Thus, appropriate co-evolution strategies are
required. In a similar way, language utterances like programs or models might become inconsistent with a changing language. But these utterances are valuable assets for language users making their co-evolution a serious issue. Software language engineering [3,4] evolves as a discipline to the application of a systematic approach to the design, development, maintenance, and evolution of languages. It concerns various technological spaces [5]. Language evolution affects all these spaces: Grammars evolve in grammarware [6], metamodels evolve in modelware [2], schemas evolve in XMLware [7] and dataware [8], ontologies evolve [9], and APIs evolve [10], too. In this paper, we focus on the technological space of modelware. There is an ever increasing variety of domain-specific modeling languages each developed by a small group of programmers. These languages evolve frequently to meet the requests of their users. Figure 1 illustrates the status quo: modeling languages come with a series of artifacts (e. g. editors, translators, code generators) centered around a metamodel that defines the language syntax. The ever increasing number of language users (usually decoupled from language developers) build many models by using these languages. As new features need to be incorporated, languages evolve, requiring the co-evolution of existing models.
Fig. 1. Development and evolution of modeling languages
In this paper, we investigate the evolution of modeling languages by reengineering the evolution of their metamodels and the migration of related software artifacts. Our motivation is to identify requirements for tools that support the (semi-)automatic coupled evolution of modeling languages and related artifacts in a way that avoids language erosion and minimizes the handwritten code for migration. As a case study we investigated the evolution of the four modeling languages provided by the Graphical Modeling Framework (GMF). We focus on the following research questions:

– RQ1) What is the impact of language changes on related software artifacts? As the metamodel is in the center of the language definition, we are interested to understand how other artifacts change when the metamodel changes.
– RQ2) What activities are performed to implement language changes? We investigate the distribution of the activities performed to implement metamodel changes in order to examine the similarities between the evolution of programs and the evolution of languages.

– RQ3) What kinds of adaptations capture the language changes? We are interested to describe the metamodel changes based on a set of canonical adaptations, and thereby to investigate the measure in which these adaptations can be used to migrate the models.

Outline. In Section 2, we introduce the Graphical Modeling Framework as our case study. We present our approach to retrace the evolution of metamodels in Section 3. In Section 4, we answer the research questions from both a quantitative and qualitative point of view. We interpret and discuss the results of the case study in Section 5 by focusing on lessons learned and threats to the study's validity. In Section 6, we present work related to the investigation of language evolution, before we conclude in Section 7.
2 Graphical Modeling Framework
The Graphical Modeling Framework (GMF)1 is a widely used open source framework for the model-driven development of diagram editors. GMF is a prime example for a Model-Driven Architecture (MDA) [11], as it strictly separates platform-independent models (PIM), platform-specific models (PSM) and code. GMF is implemented on top of the Eclipse Modeling Framework (EMF)2 and the Graphical Editing Framework (GEF)3.
2.1 Editor Models
In GMF, a diagram editor is defined by models from which editor code can be generated automatically. For this purpose, GMF provides four modeling languages, a transformator that maps PIMs to PSMs, a code generator that turns PSMs into code, and a runtime platform on which the generated code relies. The lower part of Fig. 2 illustrates the different kinds of GMF editor models. On the platform-independent level, a diagram editor is modeled from four different views. The domain model focuses on the abstract syntax of diagrams. The graphical definition model defines the graphical elements like nodes and edges in the diagram. The tool definition model defines the tools available to author a diagram. In the mapping model, the first three views are combined to an overall view which maps the graphical elements from the graphical definition model and the tools from the tool definition model onto the domain model elements from the domain model.
1 see GMF website http://www.eclipse.org/modeling/gmf
2 see EMF website http://www.eclipse.org/modeling/emf
3 see GEF website http://www.eclipse.org/gef
[Figure 2 depicts the GMF architecture on the metamodel, model, and code levels: the ecore, gmfgraph, tooldef, mappings, and gmfgen metamodels; the domain, graphical definition, tool definition, and mapping models; a transformator (Java) that turns the mapping model into the diagram generator model; and a generator (JET/Xpand) that produces the diagram editor code.]
Fig. 2. Languages involved in the Graphical Modeling Framework
The platform-independent mapping model is transformed into a platform-specific diagram generator model. This model can be altered to customize the code generation.
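The domain model referred to above is an ordinary EMF metamodel. As a hedged illustration only (our own sketch, not taken from GMF), the following Java snippet uses the standard EMF Ecore API to assemble a minimal domain metamodel programmatically; the package, class, and feature names are invented for the example.

import org.eclipse.emf.ecore.EAttribute;
import org.eclipse.emf.ecore.EClass;
import org.eclipse.emf.ecore.EPackage;
import org.eclipse.emf.ecore.EReference;
import org.eclipse.emf.ecore.EcoreFactory;
import org.eclipse.emf.ecore.EcorePackage;

public class MiniDomainModel {

    // Builds a tiny, invented domain metamodel (a state machine) with the EMF Ecore API.
    public static EPackage createStateMachinePackage() {
        EcoreFactory factory = EcoreFactory.eINSTANCE;

        EPackage pkg = factory.createEPackage();
        pkg.setName("statemachine");                      // example names, not GMF's
        pkg.setNsPrefix("sm");
        pkg.setNsURI("http://example.org/statemachine");

        EClass state = factory.createEClass();
        state.setName("State");
        EAttribute name = factory.createEAttribute();
        name.setName("name");
        name.setEType(EcorePackage.Literals.ESTRING);
        state.getEStructuralFeatures().add(name);

        EClass transition = factory.createEClass();
        transition.setName("Transition");
        EReference target = factory.createEReference();
        target.setName("target");
        target.setEType(state);
        transition.getEStructuralFeatures().add(target);

        pkg.getEClassifiers().add(state);
        pkg.getEClassifiers().add(transition);
        return pkg;
    }
}

In practice such domain models are usually drawn or written in an Ecore editor rather than built in code; the snippet only makes explicit what kind of structure the GMF tooling consumes.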
2.2 Modeling Languages
We can distinguish two kinds of languages involved in GMF. First, GMF provides domain-specific languages for the modeling of diagram editors. Each of these languages comes with a metamodel defining its abstract syntax and a simple tree-based model editor integrated in Eclipse. The upper part of Fig. 2 shows the metamodels involved in GMF. These are ecore for domain models, gmfgraph for graphical definition models, tooldef for tool definition models, mappings for mapping models, and gmfgen for diagram generator models. The mappings metamodel refers to elements in the ecore, gmfgraph, and tooldef metamodels. This kind of dependency is typical for multi-view modeling languages. For example, there are similar dependencies between the metamodel packages defining the various sublanguages of the UML. Second, GMF itself is implemented in various languages. All metamodels are expressed in ecore, the metamodeling language provided by EMF. EMF is an implementation of Essential MOF which is the basic metamodeling standard proposed by the Object Management Group (OMG) [12]. Notably, the ecore metamodel conforms to itself. Additionally, the metamodels contain context constraints which are attached as textual annotations to the metamodel elements
to which they apply. These constraints are expressed in the Object Constraint Language (OCL) [13]. The transformator from a mapping model to a generator model is implemented in Java. For model access, it relies on the APIs generated from the metamodels of the GMF modeling languages. The generator generates code from a generator model. It was formerly implemented in Java Emitter Templates (JET)4, which was later changed in favor of Xpand5. The generated code conforms to the Java programming language, and is based on the GMF runtime platform.
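The attachment of textual constraints to metamodel elements relies on EMF's generic annotation mechanism. The sketch below shows the general shape of such an attachment; the annotation source URI and detail key are our assumptions for illustration and not the exact convention used by the GMF metamodels.

import org.eclipse.emf.ecore.EAnnotation;
import org.eclipse.emf.ecore.EClass;
import org.eclipse.emf.ecore.EcoreFactory;

public class ConstraintAnnotations {

    // Attaches a textual OCL constraint to a class as an EAnnotation.
    // The source URI and the detail key are illustrative only.
    public static void addConstraint(EClass owner, String constraintName, String oclExpression) {
        EAnnotation annotation = EcoreFactory.eINSTANCE.createEAnnotation();
        annotation.setSource("http://example.org/constraints");
        annotation.getDetails().put(constraintName, oclExpression);
        owner.getEAnnotations().add(annotation);
    }
}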
2.3 Metamodel Evolution
With a code base of more than 600k lines of code, GMF is a framework of considerable size. GMF is implemented by 13 developers from 3 different countries using an agile process with small development cycles. Since starting the project, the GMF developers had to adapt the metamodels a significant number of times. As a number of metamodel changes were breaking the existing models, the developers had to manually implement a migrator. Figure 3 quantifies the metamodel evolution for the two release cycles we studied, each taking one year. The figures show the number of metamodel elements for each revision of each GMF metamodel. During the evolution from release 1.0 to release 2.1, the number of classes defined by all metamodels e. g. increased from 201 to 252. We chose GMF as a case study, because the evolution is extensive, publicly available, and well documented by means of commit comments and change requests. However, the evolution is only available in the form of revisions from the version control system, and its documentation is only informal.
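The element counts plotted in Figure 3 can be obtained with very little EMF code. The sketch below is our own illustration, not part of the study's tooling; it loads one revision of an .ecore file and tallies the metamodel elements by type, while checking out each revision from the version control system is assumed to happen elsewhere.

import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import org.eclipse.emf.common.util.URI;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.resource.Resource;
import org.eclipse.emf.ecore.resource.impl.ResourceSetImpl;
import org.eclipse.emf.ecore.xmi.impl.EcoreResourceFactoryImpl;

public class MetamodelSizeCounter {

    // Loads one revision of an .ecore file and counts its elements by type,
    // e.g. EClass, EAttribute, EReference, as plotted in Fig. 3.
    public static Map<String, Integer> countElements(String ecoreFilePath) {
        ResourceSetImpl resourceSet = new ResourceSetImpl();
        resourceSet.getResourceFactoryRegistry().getExtensionToFactoryMap()
                   .put("ecore", new EcoreResourceFactoryImpl());
        Resource resource = resourceSet.getResource(URI.createFileURI(ecoreFilePath), true);

        Map<String, Integer> counts = new TreeMap<>();
        for (Iterator<EObject> it = resource.getAllContents(); it.hasNext();) {
            String elementKind = it.next().eClass().getName();
            counts.merge(elementKind, 1, Integer::sum);
        }
        return counts;
    }
}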
3 Investigating the Evolution
Due to the considerable size of the GMF metamodels, we developed a systematic approach to investigate its evolution as presented in the following subsections.
3.1 Modeling the History
To investigate the evolution of modeling languages, we model the history of their metamodels. In the history model, we capture the evolution of metamodels as sequences of metamodel adaptations [14,15]. A metamodel adaptation is a well-understood transformation step on metamodels. We provide a metamodel for history models as depicted in Figure 4. The History of a modeling language is subdivided into a number of releases. A Release denotes a version of the modeling language which has been deployed, and for which models can thus exist. Modeling languages are released at a certain date, and are tagged by a certain version number. A Release is further subdivided into
4 see JET website http://www.eclipse.org/modeling/m2t
5 see Xpand website http://www.openarchitectureware.org
[Figure 3 consists of four charts, (a) tooldef, (b) gmfgraph, (c) mappings, and (d) gmfgen, each plotting the number of metamodel elements (EPackage, EClass, EAttribute, EReference, EOperation, EParameter, EEnum, EEnumLiteral, EAnnotation) per revision between Release 1.0 and Release 2.1.]
Fig. 3. Statistics of metamodel evolution
[Figure 4 shows the history metamodel as a class diagram: a History contains releases of type Release (date: Date, version: String); a Release contains commits of type Commit (date: Date, version: String, comment: String, author: String); a Commit contains adaptations of type Adaptation.]
Fig. 4. Modeling language history
a number of commits. A Commit denotes a version of the modeling language which has been committed to the version control system. Modeling languages are committed at a certain date, by a certain author, with a certain comment, and are tagged by a certain version number. A Commit consists of the sequence of adaptations which have been performed since the last Commit.
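Rendered as plain Java classes, the history model of Figure 4 looks roughly as follows; this is only a reading aid with the attribute names shown in the figure, whereas the actual history model is itself an EMF metamodel.

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Sketch of the history model of Fig. 4 as plain Java classes.
class History {
    final List<Release> releases = new ArrayList<>();
}

class Release {
    Date date;
    String version;
    final List<Commit> commits = new ArrayList<>();
}

class Commit {
    Date date;
    String version;
    String comment;
    String author;
    final List<Adaptation> adaptations = new ArrayList<>();
}

abstract class Adaptation {
    // One concrete subclass per operator, e.g. a rename or an extracted superclass.
}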
3.2 Operator Suite
The metamodel for history models includes an operator suite for stepwise metamodel adaptation. As is depicted in Figure 5, each operator subclasses the abstract class Adaptation. Furthermore, we classify each operator according to four different criteria: Granularity. Similar to [16], we distinguish primitive and compound operators. A Primitive supports a metamodel adaptation that can not be decomposed into
[Figure 5 is a class diagram of the operator classification: Adaptation is classified along four criteria: Granularity (Primitive, specialized into ContentPrimitive and ValuePrimitive, versus Compound), Metamodel Aspect (StructuralAdaptation, ConstraintAdaptation, APIAdaptation, DocumentationAdaptation), Language Expressiveness (Constructor, Destructor, Refactoring), and Model Migration (PreservingAdaptation versus BreakingAdaptation, the latter split into CoupledAdaptation and CustomAdaptation with an attached CustomMigration).]
Fig. 5. Classification of operators for metamodel adaptation
smaller adaptation steps. In contrast, a Compound adaptation can be decomposed into a sequence of Primitives. The required kinds of Primitive operators can be derived from the meta-metamodel. There are two basic kinds of primitive changes: ContentPrimitives and ValuePrimitives. A ContentPrimitive modifies the structure of a metamodel, i. e. creates or deletes a metamodel element. We thus need ContentPrimitives for each kind of metamodel element defined by the meta-metamodel. For classes, e.g., we need ContentPrimitives to create a class in a package and to delete it from its package. A ValuePrimitive modifies an existing metamodel element, i. e. changes a feature of a metamodel element. We thus need ValuePrimitives for each feature defined by the meta-metamodel. For classes, e.g., we need a ValuePrimitive to rename a class, and we need ValuePrimitives to add and remove a superclass. The set of primitive operators already offers a complete operator suite in the sense that every metamodel adaptation can be described by composing them. Metamodel aspects. We classify an operator according to the metamodel aspect which it addresses. The different classes can be derived from the constructs provided by the meta-metamodel to which the metamodels have to conform. An
operator concerns either the structure of models, constraints on models, the API to access models, or the documentation of metamodel elements. A StructuralAdaptation like extracting a superclass affects the abstract syntax defined by the metamodel. A ConstraintAdaptation adds, deletes, moves, or changes constraints in the metamodel. An APIAdaptation concerns the additional access methods defined in the metamodel. This includes volatile features and operations. A DocumentationAdaptation adds, deletes, moves, or changes documentation annotations to metamodel elements. Language expressiveness. According to [14], we can distinguish three kinds of operators with respect to the expressiveness of the modeling language. By expressiveness of a modeling language, we refer to the set of valid models we can express in the modeling language. Constructors increase this set, i. e. in the new version of the language we can express new models. In contrast, Destructors decrease the set, i. e. in the old version we could express models which we cannot express in the new version of the language. Finally, Refactorings preserve the set of valid models, i. e. we can express all models in the old and the new version of the language. Model migration. According to [17], we can determine for each operator to what extent model migration can be automated. PreservingAdaptations do not require the migration of models. BreakingAdaptations break the instance relationship between models and the adapted metamodel. In this case, we need to provide a migration for possibly existing models. For a CoupledAdaptation, the migration does not depend on a specific metamodel. Thus it can be specified as a generic couple of metamodel adaptation and model migration. In contrast, a CustomAdaptation is so specific to a certain metamodel that it cannot be composed of generic coupled adaptation steps. Consequently, it can only be covered by a sequence of adaptation steps and a reconciling CustomMigration6. As mentioned above, three of the criteria have its origin in existing publications, while metamodel aspects is kind of a natural criterion. There might be other criteria which are interesting in the context of modeling language evolution. Given the sequence of adaptations, it is however easy to classify them according to other criteria. The presented criteria are orthogonal to each other to a large extent. Granularity is orthogonal to all other criteria and vice versa, as we can think of example operators from each granularity for all these criteria. Additionally, language expressiveness and model migration are orthogonal to each other: the first concerns the difference in cardinality between the sets of valid models before and after adaptation, whereas the second concerns the correct migration of a model from one set to the other. However, language expressiveness and model migration both focus on the impact on models, and are thus only orthogonal to the 6
6 The categories from [17] were renamed to be more conforming to the literature: metamodel-only change was renamed to PreservingAdaptation, coupled change to BreakingAdaptation, metamodel-independent coupled change to CoupledAdaptation, and metamodel-specific coupled change to CustomAdaptation.
metamodel aspects StructuralAdaptation and ConstraintAdaptation. This is due to the fact that operators concerning APIAdaptation and DocumentationAdaptation do not affect models. Consequently, these operators are always Refactorings and PreservingAdaptations.

The operator suite necessary for our case study is depicted in Figure 9. We classify each operator in the operator suite according to the categories presented before. For example, the operator Extract Superclass creates a new common superclass for a number of classes. This operator is a Compound, since we can express the same metamodel adaptation by the primitive operators Create Class and Add Superclass. The operator is a StructuralAdaptation, since it affects the abstract syntax defined by the metamodel. It is a Constructor, because we can instantiate the introduced superclass in the new language version. Finally, it is a PreservingAdaptation, since no migration of old models to the new language version is required.
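As a rough illustration of what such an operator does to an Ecore metamodel, the sketch below implements the metamodel side of Extract Superclass with the EMF API. It is our own simplification under the assumptions stated in the comments; COPE's actual operator additionally records the step in the history model.

import java.util.List;
import org.eclipse.emf.ecore.EClass;
import org.eclipse.emf.ecore.EcoreFactory;

public class ExtractSuperclassOperator {

    // Creates a new common superclass for the given classes and registers it in
    // the package of the first class (primitive: Create Class), then links every
    // class to it (primitive: Add Superclass). As a PreservingAdaptation, no
    // model migration is needed; existing instances remain valid.
    public static EClass extractSuperclass(String superclassName, List<EClass> subclasses) {
        EClass superclass = EcoreFactory.eINSTANCE.createEClass();
        superclass.setName(superclassName);
        subclasses.get(0).getEPackage().getEClassifiers().add(superclass);
        for (EClass subclass : subclasses) {
            subclass.getESuperTypes().add(superclass);
        }
        return superclass;
    }
}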
3.3 Reverse Engineering the GMF History
Procedure. We applied the following steps to reconstruct a history model for GMF based on the available information:

Step 1. Extracting the log: We extracted the log information for the whole GMF repository. The log information lists the revisions of each file maintained in the repository.
Step 2. Detecting the commits: We grouped revisions of files which were committed together with high probability. Two revisions of different files were grouped, in case they were committed within the same time interval and with the same commit comment.
Step 3. Filtering the commits: We filtered out all commits which do not include a revision of one of the metamodels.
Step 4. Clustering the revisions: We clustered the files which were committed together into more abstract artifacts like metamodels, transformator, code generator, and migrator. This step was performed to reduce the information, as the implementation of each of the artifacts may be modularized into several files. The information available at this point can be used to answer RQ1.
Step 5. Classifying the commits: We classified the commits according to the software maintenance categories (i.e. perfective, adaptive, preventive, and corrective) [18] based on the commit comments and change requests. The information available at this point can be used to answer RQ2.
Step 6. Extracting the metamodel revisions: We extracted the metamodel revisions from the GMF repository.
Step 7. Comparing the metamodel revisions: We compared subsequent metamodel revisions with each other resulting in a difference model. The difference model consists of a number of primitive changes between subsequent metamodel revisions.
Step 8. Detecting the adaptation sequence: We detected the adaptations necessary to bridge the difference between the metamodel revisions. In contrast to the difference model, the adaptations also combine related primitive changes and are
ordered as a sequence. To find the most plausible adaptations, we also analyzed commit comments, change requests, and the co-adaptation of other artifacts. The information available at this point can be used to answer RQ3.
Step 9. Validating the adaptation sequence: We validated the resulting adaptation sequence by applying it to migrate the existing models for testing the handcrafted migrator. We set up a number of test cases each of which consists of a model before migration and the expected model after migration.

Tool Support. We employed a number of helper tools to perform the study. statCVS7 was employed to parse the log information into a model which is processed further by a handcrafted model transformation (steps 1-4). The difference models between two subsequent metamodel revisions were generated with the help of EMF Compare8 (step 7). To bridge the difference between subsequent metamodel revisions, we employed the existing tool COPE9 [15] whose user interface is depicted in Figure 6 (step 8). COPE allows the language developer to directly execute the operators in the metamodel editor and automatically records them in a history model [19]. Generic CoupledAdaptations can be invoked through an operator browser which offers all such available operators. To perform a CustomAdaptation, a custom migration needs to be attached to metamodel changes recorded in the metamodel editor. For the study, we extended COPE to support its user in letting the metamodel converge to a target metamodel by displaying the difference model as obtained from EMF Compare. From the recorded history model, a migrator can be generated which was employed for validating the adaptation sequence (step 9). The handcrafted migrator that comes with GMF was used to generate the expected models for validation.
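Step 2, detecting the commits, can be approximated with a few lines of code. The following sketch is our own reconstruction under the stated heuristic (same author and commit comment within a small time window); it is not the script used in the study, and the window size is an assumption.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CommitDetector {

    // One file revision taken from the version control log.
    record FileRevision(String file, String author, String comment, long timestampMillis) {}

    // Groups file revisions into commits: revisions with the same author and commit
    // comment that lie within a small time window are assumed to form one commit.
    public static List<List<FileRevision>> detectCommits(List<FileRevision> revisions, long windowMillis) {
        List<FileRevision> sorted = new ArrayList<>(revisions);
        sorted.sort(Comparator.comparingLong(FileRevision::timestampMillis));

        List<List<FileRevision>> commits = new ArrayList<>();
        List<FileRevision> current = new ArrayList<>();
        for (FileRevision revision : sorted) {
            if (!current.isEmpty()) {
                FileRevision last = current.get(current.size() - 1);
                boolean sameCommit = last.author().equals(revision.author())
                        && last.comment().equals(revision.comment())
                        && revision.timestampMillis() - last.timestampMillis() <= windowMillis;
                if (!sameCommit) {
                    commits.add(current);
                    current = new ArrayList<>();
                }
            }
            current.add(revision);
        }
        if (!current.isEmpty()) {
            commits.add(current);
        }
        return commits;
    }
}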
4 Results
In this section, we present the results of our case study in an aggregated manner. However, the complete history can be obtained from our web site10 . RQ1) What is the impact of language changes on related software artifacts? To answer this question, we determined for each commit which other artifacts were committed together with the metamodels. Figure 7 shows how many of the overall 124 commits had an impact on a certain artifact. The first four columns denote the metamodels that were changed in a commit, and the fifth column denotes the number of commits. For instance, row 6 means that the metamodels mappings and gmfgen changed together in 6 commits. The last three columns denote the number of commits in which other artifacts, like transformator, code generator and migrator, were changed. In the example row, 7 8 9 10
see statCVS website http://statcvs.sourceforge.net see EMF Compare website http://www.eclipse.org/emft/projects/compare Available as open source at http://cope.in.tum.de Available at http://cope.in.tum.de/pmwiki.php?n=Documentation.GMF
Language Evolution in Practice: The History of GMF
PHWDPRGHOHGLWRU
RSHUDWRUEURZVHU
GLIIHUHQFHPRGHO
13
WDUJHWPHWDPRGHO
Fig. 6. COPE User Interface
JPIJUDSK FKDQJHG FKDQJHG
0HWDPRGHOV PDSSLQJV JPIJHQ
WRROGHI FKDQJHG
FKDQJHG FKDQJHG
FKDQJHG FKDQJHG FKDQJHG
FKDQJHV FKDQJHG FKDQJHG FKDQJHG
7UDQVIRU PDWRU
*HQHUD WRU
0LJUDWRU
Fig. 7. Correlation between commits of metamodels and related artifacts
the transformator was changed 4 times, the generator 2 times, and the migrator had to be changed once. In a nutshell, metamodel changes are very likely to impact artifacts which are directly related to them. For instance, the changes to mappings and gmfgen propagated to the transformator from mappings to gmfgen, and to the generator from gmfgen to code. Additionally, metamodel changes are not always carried out on a single metamodel, but are sometimes related to other metamodels. RQ2) What activities are performed to implement language changes? To answer this question, we classified the commits into the well-known categories of maintenance activities, and we investigated their distribution over these categories. Figure 8 shows the number of commits for each category. Note that several commits could not be uniquely associated to one category, and thus had to be assigned to several categories. However, all commits could be classified into at least one of the four categories.
14
M. Herrmannsdoerfer, D. Ratiu, and G. Wachsmuth 3HUIHFWLYH 0RGHOQDYLJDWRU 5LFKFOLHQWSODWIRUP 'LDJUDPSUHIHUHQFHV 'LDJUDPSDUWLWLRQLQJ (OHPHQWSURSHUWLHV ,QGLYLGXDOIHDWXUHV
Fig. 8. Classification of metamodel commits according to maintenance categories
We classified 45 of the commits as perfective maintenance, i. e. add new features to enhance GMF. Besides a number of individual commits, there are a few features whose introduction spanned several commits. The generated diagram editor was extended with a model navigator, to run as a rich client, to set preferences for diagrams, to partition diagrams, and to set properties of diagram elements. We classified 33 of the commits as adaptive maintenance, i. e. adapt GMF to a changing environment. These commits were either due to the transition from JET to Xpand, adapted to changes to the constraints of ecore, were due to releasing GMF, or adapted the constraints to changes of the OCL parser. We classified 36 of the commits as preventive maintenance, i. e. refactor GMF to prevent faults in the future. These commits either separated concerns to better modularize the generated code, simplified the metamodels to make the transformations more straightforward, removed metamodel elements no longer used by transformations, or added documentation to make the metamodel more understandable. We classified 16 of the commits as corrective maintenance, i. e. correct faults discovered in GMF. These commits either fixed bugs reported by GMF users, corrected incorrectly spelled element names, reverted changes carried out earlier, or corrected invalid OCL constraints. In a nutshell, the typical activities known from software maintenance also apply to metamodel maintenance [18]. Furthermore, similar to the development of software, the number of perfective activities (34,6%) outranges the preventive (27,7%) and adaptive (25,4%) activities which are double the number of corrective activities (12,3%). RQ3) What kinds of adaptations capture the language changes? To answer this question, we classified the operators which describe the metamodel evolution. Figure 9 shows the number and classification of each operator occurred during the evolution of each metamodel. The operators are grouped by their granularity and the metamodel aspects to which they apply. Most of the changes could be covered by Primitive adaptations: we found 379 (51,8%) ContentPrimitive adaptations, 279 (38,2%) ValuePrimitive adaptations and 73 (10,0%) Compound adaptations. Only half of the adaptations affected the structure defined by a metamodel: we identified 361 (49,4%) StructuralAdaptations, 303 (41,5%) APIAdaptations, 36 (4,9%) DocumentationAdaptations, and 31 (4,2%) ConstraintAdaptations. Most of the changes are refactorings which do not change the expressiveness of the modeling language: we found 453 (62,0%) Refactorings, 194 (26,5%) Constructors, and 84 (11,5%) Destructors. Only very few changes cannot be covered by generic coupled operators which are able to
Language Evolution in Practice: The History of GMF
Fig. 9. Classification of operators occurred during metamodel adaptation
automatically migrate models: we identified 630 (86,2%) PreservingAdaptations, 95 (13,0%) CoupledAdaptations, and 6 (0,8%) CustomAdaptations. As can be seen in Figure 9, a custom migration was necessary 4 times to initialize a new mandatory feature or a feature that was made mandatory. In these cases, the migration is associated to one Primitive, and consists of 10 to 20 lines of handwritten code. Additionally, 2 custom migrations were necessary to perform a complex restructuring of the model. In these cases, the migration is associated to a sequence of 11 and 13 Primitives, and consists of 60 and 70 lines of handwritten code. In a nutshell, a large fraction of changes can be captured by primitive changes or operators which are independent of the metamodel. A significant number of operations are known from object-oriented refactoring. Only very few changes were specific to the metamodel, denoting more complex evolution.
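The custom migrations that initialize a feature which became mandatory follow a simple pattern. The sketch below is a generic, hedged illustration using EMF's reflective API; the concrete feature names and default values in GMF's handwritten migrations are specific to its metamodels and are not reproduced here.

import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.EStructuralFeature;

public class MandatoryFeatureMigration {

    // During migration, fills a feature that became mandatory in the new metamodel
    // version with a default value whenever it is still unset.
    public static void initializeIfUnset(EObject object, String featureName, Object defaultValue) {
        EStructuralFeature feature = object.eClass().getEStructuralFeature(featureName);
        if (feature != null && !object.eIsSet(feature)) {
            object.eSet(feature, defaultValue);
        }
    }
}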
5 Discussion
We interpret and discuss the results of the case study by focusing on lessons learned and threats to the study's validity.

5.1 Lessons Learned
Based on the results of our case study, we learned a number of lessons about the evolution of modeling languages in practice. Metamodels evolve due to user requests and technological changes. On the one hand, a metamodel defines the abstract syntax of a language, and thereby metamodels evolve when the requirements of the language change. In GMF, user requests for new features imposed many of such changes to the GMF modeling languages. On the other hand, an API for model access is intimately related to a metamodel, and thereby metamodels evolve when requirements for model access change. In GMF, particularly the shift from JET to XPand as the language to implement the generator imposed many of such changes in the gmfgen metamodel. Since a metamodel captures the abstract syntax as well as the API for model access, language and API evolution interact. Changes in the abstract syntax clearly lead to changes in the API. But API changes can also require to change the abstract syntax of the underlying language: in GMF, we found several cases where the abstract syntax was changed to simplify model access. Other artifacts need to be migrated. The migration is not restricted to models, but also concerns other language development artifacts, e. g. transformators and code generators. During the evolution of GMF, these artifacts needed to be migrated manually. In contrast to models, these artifacts are mostly under control of the language developers, and thereby their migration is not necessarily required to be automated. However, automating the migration of these artifacts would further reduce the effort involved in language evolution. The model-driven development of metamodels with EMF facilitated the identification of changes
between two different versions of the metamodel. In contrast, the specification of transformators and code generators as Java code made it hard to trace the evolution. We thus need a more structured and appropriate means to describe the other artifacts depending on the metamodels. Language development could benefit from the same advantages as model-driven software development. Language evolution is similar to software evolution. This hypothesis was postulated by Favre in [1]. The answers to RQ2 and RQ3 provide evidence that the hypothesis holds. First, the distribution of activities performed by the developers of GMF to implement language changes mirrors the distribution of classical software maintenance activities (i. e. perfective and adaptive maintenance activities being the most frequent) [18]. Second, many operators to adapt the metamodels (Figure 9) are similar to operators known from object-oriented refactoring [20] (e. g. Extract Superclass). Like software evolution, the time scale for language evolution can be quite small. In the first year of the investigated evolution of GMF, the metamodels were changed 107 times, i. e. on average every four days. However, in the second year the number of metamodel changes decreased to 17, i. e. the stability of GMF increased over time. Apparently, the time scale in which the changes happen increases with the language’s maturity. The same phenomenon applies to the relation between the metamodels and the meta-metamodel, as the evolution of ecore required the migration of the GMF metamodels. However, the more abstract the level, the less frequent the changes: we identified two changes in the meta-metamodel of the investigated evolution of GMF. Operator-based coupled evolution of metamodels and models is feasible. The developers of GMF provided a migrator to automatically migrate the already existing models. This migrator allows the GMF developers to make changes that are not backward compatible, and are essential as the kinds and number of built models is not under control of the language developers. We reverse engineered the evolution of the GMF metamodels by sequencing operators. Most of the metamodel evolution can be covered by operators which are independent of the specific metamodel. Only a few custom operators were required to capture the remaining changes. The employed operators can be used to migrate the models as well. In addition, the case study provides evidence for the suitability of operator-based metamodel evolution in forward engineering like proposed in [14,15]. Operator-based forward engineering of modeling languages documents changes on a high level of abstraction which allows for a better understanding of language evolution. 5.2
Threats to Validity
We are aware that our results can be influenced by threats to construct, internal and external validity. Construct validity. The results might be influenced by the measurement we used for our case study. For our measurements, we assumed that a commit represents exactly one language change. However, a commit might encapsulate
several language changes, and one language change might be implemented by several commits. This interpretation is a threat to the results for both RQ1 and RQ2. Other case studies are required to investigate these research questions in more detail, and to increase the confidence and generality of our results. However, our results are consistent with the view that languages evolve like software, which was postulated and tacitly accepted as a fact [1]. Internal validity. The results might be influenced by the method applied for investigating the evolution. The algorithm to detect the commits (step 2) might miss artifacts which were also committed together. To mitigate this threat, we have manually validated the commits by looking into the temporal neighborhood. By filtering out the commits which did not change the metamodel (step 3), we might miss language changes not affecting the metamodel. Such changes might be changes to the language semantics defined by code generators and transformators. However, the model migration defined by the handcrafted migrator could be fully assigned to metamodel adaptations. We might have misclassified some commits, when classifying the commits according to the maintenance categories (step 5). However, the results are in line with the literature on software evolution [18]. When detecting the adaptation sequence (step 8), the picked operators might have a different intention than the developers had when performing the changes. To mitigate this threat, we have automatically validated the model migration by means of test cases. Furthermore, we have manually validated the migration of all artifacts by taking their co-adaptation into account. External validity. The results might be influenced by the fact that we investigated a single data point. The modeling languages provided by GMF are among the many modeling languages that are developed using EMF. The relevance of our results obtained by analyzing GMF can be affected when analyzing languages developed with other technologies. Our results are however in line with the literature on grammar evolution [21,6], and this increases our confidence on the fact that the defined operators are valid for many other languages. Furthermore, our past studies on the evolution of metamodels [17,15] revealed similar results.
6
Related Work
Work related to language evolution can be found in several technological spaces of software language engineering [5]. This includes grammar evolution in grammarware, metamodel evolution in modelware, schema evolution in dataware, and API evolution.

Grammar evolution has been studied in the context of grammar engineering [3]. Lämmel proposes a comprehensive suite of grammar transformation operators for the incremental adaptation of context-free grammars [16]. The proposed operators are based on sound, formal preservation properties that allow reasoning about the relationship between grammars. The operator suite proved to be valuable for the semi-automatic recovery of the COBOL grammar from an informal specification [21].
Based on similar operators, Lämmel proposes a lightweight verification method called grammar convergence for establishing and maintaining the correspondence between grammars ingrained in different software artifacts [22]. Grammar convergence proved to be useful for establishing the relationship between grammars from different releases of the Java grammar [6]. The approach presented in this paper transfers these ideas to the technological space of modelware. In contrast to the Java case study, the GMF case study provides us with intermediate revisions of the metamodels. Taking these revisions into account allows us to investigate how language changes are actually implemented.

Metamodel evolution has been mostly studied from the angle of model migration. To specify and automate the migration of models, Sprinkle introduces a visual graph-transformation-based language [23,24]. However, this language does not provide a mechanism to reuse migration specifications across metamodels. To reuse migration specifications, there are two kinds of approaches: difference-based and operator-based. Difference-based approaches try to automatically derive a model migration from the difference between two metamodel versions. Gruschko et al. classify primitive metamodel changes into non-breaking, breaking resolvable, and breaking unresolvable changes [25,26]. Based on this classification, they propose to automatically derive a migration for non-breaking and resolvable changes, and envision supporting the developer in specifying a migration for unresolvable changes. Cicchetti et al. go even one step further and try to detect compound changes in the difference between metamodel versions [27]. However, Sprinkle et al. claim that in the general case it is undecidable to automatically synthesize a model migration that preserves the semantics of the models [28]. To avoid the loss of intention during evolution, we follow an operator-based approach where the developers can perform the operators encapsulating the intended model migration [14,15]. The GMF case study continues and extends our earlier studies [17,15], which focused solely on the automatability of the model migration. Beyond that, the presented study shows that an operator-based approach can be useful in a reverse engineering process to reveal and document the intention of language evolution on a high level of abstraction. Furthermore, it provides evidence that operator-based metamodel adaptation should be used in forward engineering in order to control and document language evolution. In contrast, difference-based approaches still lack a proof of concept by means of real-life case studies, both for forward and reverse engineering.

Schema evolution has been a field of study for several decades, yielding a substantial body of research [29,30]. For the ORION database system, Banerjee et al. propose a fixed set of change primitives that perform coupled evolution of the schema and data [31]. While reusing migration knowledge in the case of these primitives, their approach is limited to local schema restructuring. To allow for non-local changes, Ferrandina et al. propose separate languages for schema and instance data migration for the O2 database system [32]. While more expressive, their approach does not allow for reuse of coupled transformation knowledge. In order to reuse recurring coupled transformations, SERF – as proposed by Claypool et al. – offers a mechanism to define arbitrary new high-level primitives
[33], providing both reuse and expressiveness. However, the last two approaches never found their way into practice, as it is difficult to perform complex migration without taking the database offline. As a consequence, it is hard to find real-world case studies which include complex restructuring.

Framework evolution can be automated by refactorings which encapsulate the changes to both the API and its clients [20]. Dig and Johnson present a case study to investigate how object-oriented APIs evolve in practice [10]. They found that a significant number of API changes can be covered by refactoring operators. In the GMF case study, we found that metamodel evolution is not restricted to the syntax of models, but also includes evolution of the APIs to access models. For the migration of client code relying on those APIs, existing work on framework evolution should provide a good starting point.
7
Conclusion
In this paper, we presented a method to investigate the evolution of modeling languages. Our approach is based on retracing the evolution of the metamodel as the central artifact of the language. For this purpose, we provide an operator suite for the stepwise transformation of metamodels from old to new versions. The operators allow us to state clearly the changes made to the language metamodel on a high level of abstraction, and to capture the intention behind the change. Furthermore, these operators can be used to accurately describe the impact of the metamodel changes on related models, and to hint at the possible effects on the related language development artifacts. Thus, we can qualify a certain change with respect to its impact on the other artifacts. This can in turn be used to predict, detect, and prevent language erosion. In the future, the operators could also support the (semi-)automatic migration of co-evolving artifacts other than models.

There is an increasing amount of related work proposing alternative approaches to metamodel evolution and model co-evolution. Real-life case studies are needed to evaluate these approaches. In [17], we presented an industrial case study for operator-based metamodel adaptation. However, the studied evolution is not publicly available due to a non-disclosure agreement. In this paper, we studied the evolution of metamodels in GMF as another extensive case study. GMF's evolution is publicly available through a version control system. The evolution is well-documented in terms of commit comments made by developers, and change requests made by users. Consequently, GMF is a good target to study different approaches to metamodel evolution, either on its own (as we did in this paper) or in comparison to each other.

But GMF is not only a case study for metamodel evolution. We consider it a general case study on software language evolution and the integration of different technological spaces in software language engineering. Not only do the modeling languages provided by the framework evolve, but so do its APIs. We revealed that a large share of the GMF metamodel changes were changes to the
API for accessing GMF editor models. Further work is needed to investigate the relationship between metamodel evolution and API evolution in frameworks. Another interesting topic for future work would be a comparison of operator-based approaches in software language engineering. As mentioned in the section on related work, there are many operator-based approaches to software language engineering in different technological spaces, e.g., for grammar evolution, metamodel evolution, schema evolution, and API evolution. It is worth investigating their common properties, facilities, and restrictions.

Acknowledgement. The work of the first two authors is supported by grants from the BMBF (Federal Ministry of Education and Research, Innovationsallianz SPES 2020), and the work of the third author is supported by grants from the DFG (German Research Foundation, Graduiertenkolleg METRIK).
References

1. Favre, J.M.: Languages evolve too! Changing the software time scale. In: IWPSE 2005: 8th Int. Workshop on Principles of Software Evolution, pp. 33–44. IEEE, Los Alamitos (2005)
2. Favre, J.M.: Meta-model and model co-evolution within the 3D software space. In: ELISA: Workshop on Evolution of Large-scale Industrial Software Applications, pp. 98–109 (2003)
3. Klint, P., Lämmel, R., Verhoef, C.: Toward an engineering discipline for grammarware. ACM Trans. Softw. Eng. Methodol. 14(3), 331–380 (2005)
4. Bézivin, J., Heckel, R.: Guest editorial to the special issue on language engineering for model-driven software development. Software and Systems Modeling 5(3), 231–232 (2006)
5. Kurtev, I., Bézivin, J., Aksit, M.: Technological spaces: An initial appraisal. In: CoopIS, DOA 2002 Federated Conferences, Industrial track (2002)
6. Lämmel, R., Zaytsev, V.: Recovering grammar relationships for the Java Language Specification. In: 9th Int. Working Conference on Source Code Analysis and Manipulation. IEEE, Los Alamitos (2009)
7. Lämmel, R., Lohmann, W.: Format evolution. In: RETIS 2001: 7th Int. Conference on Reverse Engineering for Information Systems. [email protected], OCG, vol. 155, pp. 113–134 (2001)
8. Meyer, B.: Schema evolution: Concepts, terminology, and solutions. IEEE Computer 29(10), 119–121 (1996)
9. Flouris, G., Manakanatas, D., Kondylakis, H., Plexousakis, D., Antoniou, G.: Ontology change: Classification and survey. Knowl. Eng. Rev. 23(2), 117–152 (2008)
10. Dig, D., Johnson, R.: How do APIs evolve? A story of refactoring. J. Softw. Maint. Evol. 18(2), 83–107 (2006)
11. Kleppe, A.G., Warmer, J., Bast, W.: MDA Explained: The Model Driven Architecture: Practice and Promise. Addison-Wesley, Reading (2003)
12. Object Management Group: Meta Object Facility, Core Spec., v2.0 (2006)
13. Object Management Group: Object Constraint Language, Spec., v2.0 (2006)
14. Wachsmuth, G.: Metamodel adaptation and model co-adaptation. In: Ernst, E. (ed.) ECOOP 2007. LNCS, vol. 4609, pp. 600–624. Springer, Heidelberg (2007)
15. Herrmannsdoerfer, M., Benz, S., Juergens, E.: COPE - automating coupled evolution of metamodels and models. In: Drossopoulou, S. (ed.) ECOOP 2009. LNCS, vol. 5653, pp. 52–76. Springer, Heidelberg (2009)
16. Lämmel, R.: Grammar adaptation. In: Oliveira, J.N., Zave, P. (eds.) FME 2001. LNCS, vol. 2021, pp. 550–570. Springer, Heidelberg (2001)
17. Herrmannsdoerfer, M., Benz, S., Juergens, E.: Automatability of coupled evolution of metamodels and models in practice. In: Czarnecki, K., Ober, I., Bruel, J.-M., Uhl, A., Völter, M. (eds.) MODELS 2008. LNCS, vol. 5301, pp. 645–659. Springer, Heidelberg (2008)
18. Lientz, B.P., Swanson, E.B.: Software Maintenance Management. Addison-Wesley, Reading (1980)
19. Herrmannsdoerfer, M.: Operation-based versioning of metamodels with COPE. In: CVSM 2009: Int. Workshop on Comparison and Versioning of Software Models, pp. 49–54. IEEE, Los Alamitos (2009)
20. Fowler, M.: Refactoring: Improving the Design of Existing Code. Addison-Wesley, Reading (1999)
21. Lämmel, R., Verhoef, C.: Semi-automatic grammar recovery. Softw. Pract. Exper. 31(15), 1395–1448 (2001)
22. Lämmel, R., Zaytsev, V.: An introduction to grammar convergence. In: Leuschel, M., Wehrheim, H. (eds.) IFM 2009. LNCS, vol. 5423, pp. 246–260. Springer, Heidelberg (2009)
23. Sprinkle, J.M.: Metamodel driven model migration. PhD thesis, Vanderbilt University, Nashville, TN, USA (2003)
24. Sprinkle, J., Karsai, G.: A domain-specific visual language for domain model evolution. J. Vis. Lang. Comput. 15(3-4), 291–307 (2004)
25. Becker, S., Goldschmidt, T., Gruschko, B., Koziolek, H.: A process model and classification scheme for semi-automatic meta-model evolution. In: MSI 2007: 1st Workshop MDD, SOA und IT-Management, pp. 35–46. GiTO-Verlag (2007)
26. Gruschko, B., Kolovos, D., Paige, R.: Towards synchronizing models with evolving metamodels. In: Int. Workshop on Model-Driven Software Evolution (2007)
27. Cicchetti, A., Ruscio, D.D., Eramo, R., Pierantonio, A.: Automating co-evolution in model-driven engineering. In: EDOC 2008: 12th Int. IEEE Enterprise Distributed Object Computing Conference, pp. 222–231. IEEE, Los Alamitos (2008)
28. Sprinkle, J., Gray, J., Mernik, M.: Fundamental limitations in domain-specific language evolution (2009), http://www.ece.arizona.edu/~sprinkjm/wiki/uploads/Publications/sprinkle-tse2009-domainevolution-submitted.pdf
29. Li, X.: A survey of schema evolution in object-oriented databases. In: TOOLS 1999: 31st Int. Conference on Technology of Object-Oriented Language and Systems, p. 362. IEEE, Los Alamitos (1999)
30. Rahm, E., Bernstein, P.A.: An online bibliography on schema evolution. SIGMOD Rec. 35(4), 30–31 (2006)
31. Banerjee, J., Kim, W., Kim, H.J., Korth, H.F.: Semantics and implementation of schema evolution in object-oriented databases. In: SIGMOD 1987: ACM SIGMOD Int. Conference on Management of Data, pp. 311–322. ACM, New York (1987)
32. Ferrandina, F., Meyer, T., Zicari, R., Ferran, G., Madec, J.: Schema and database evolution in the O2 object database system. In: VLDB 1995: 21st Int. Conference on Very Large Data Bases, pp. 170–181. Morgan Kaufmann, San Francisco (1995)
33. Claypool, K.T., Jin, J., Rundensteiner, E.A.: SERF: Schema evolution through an extensible, re-usable and flexible framework. In: CIKM 1998: 7th Int. Conference on Information and Knowledge Management, pp. 314–321. ACM, New York (1998)
A Novel Approach to Semi-automated Evolution of DSML Model Transformation

Tihamer Levendovszky, Daniel Balasubramanian, Anantha Narayanan, and Gabor Karsai

Vanderbilt University, Nashville, TN 37203, USA
{tihamer,daniel,ananth,gabor}@isis.vanderbilt.edu
Abstract. In the industrial applications of Model-Based Development, the evolution of modeling languages is an inevitable issue. The migration to the new language involves the reuse of the existing artifacts created for the original language, such as models and model transformations. This paper is devoted to an evolution method for model transformations as well as the related algorithms. The change description is assumed to be available in a modeling language specific to the evolution. Based on the change description, our method is able to automate certain parts of the evolution. When automation is not possible, our algorithms automatically alert the user about the missing semantic information, which can then be provided manually after the automatic part of the interpreter evolution. The algorithms have been implemented and tested in an industrial environment. The results indicate that the semi-automated evolution of model transformations decreases the time and effort required compared with a manual approach.
1
Introduction
The use of model-based software development techniques has expanded to a degree where it may now be applied to the development of large heterogeneous systems. Due to their high complexity, it often becomes necessary to work with a number of different modeling paradigms in conjunction. Model-based development tools, to a large extent, meet this challenge. However, short turnover times mean that only a limited time can be spent defining meta-models for these modeling paradigms before users begin creating domain-specific models. Deficiencies, inconsistencies and errors are often identified in the meta-models after the development is well underway and a large number of domain models have already been created. Changes may also result from an improved understanding of the domain over time, along with other modifications in the domain itself. Newer versions of meta-models must therefore be created, and these may no longer be compatible with the large number of existing models. The existing models must then be recreated or manually evolved using primitive methods, adding a significant cost to the development process. The problem is especially acute in the case of multi-paradigm approaches [MV04], where multiple modeling languages are used and evolved, often concurrently.
2
Problem Statement
The general solution for model migration is to allow the migrator to specify a general model transformation to perform the necessary migration operations. A general method has been contributed in [Spr03]. Creating a general model transformation is not an easy task; it is often quite challenging even for a domain expert. Thus, our objective is to provide an evolution method usable by domain experts and more specific to the evolution than the general approach.

Our migration method is based on the following observation motivated by our experience. In most practical cases, a modeling language does not evolve through abrupt changes, but in small steps. This also holds for UML: apart from adding completely new languages to the standard, the language has been changing in rather small steps since its first release. This assumption facilitates further automation of the model evolution by tools for metamodeled visual languages [BvKK+08]. The main concepts of a step-by-step evolution method are depicted in Fig. 1.
Fig. 1. Step-By-Step Evolution Concepts
The backbone of the diagram is a well-known DSL scenario depicted in the upper half of the figure. When a domain-specific environment is created, it consists of a meta-model (MMsrc), which may have an arbitrary number of instance models (SM1, SM2, ..., SMn). The models need to be processed or transformed ("interpreted"); therefore, an interpreter is built. The interpreter expects that its input models comply with MMsrc. In parallel, the output models of the interpreter must comply with the target meta-model MMdst. The inputs to the interpreter are MMsrc, MMdst and an input model SMi, and the interpreter produces an output model DMi. The objective is to migrate the existing models and interpreters to the evolved language. The evolved counterparts are denoted by adding a prime to the original notation. In the evolution process, we create the new (evolved) meta-model (MM'src). We assume that the changes are minor enough both in size and nature, such that they are worth being modeled and processed by a tool, rather
than writing a transformation from scratch to convert the models in the old language to models in the evolved language. This is a key point in the approach. Having created the new language by means of the evolved meta-model, we describe the changes in a separate migration DSL (Model Change Language, MCL). The MCL model is denoted by Δsrc, and it represents the differences between MMsrc and MM'src. Besides the changes, this model contains the actual mappings from the old models to the evolved ones, providing more information that describes how to evolve the models of the old language to models of the new language. Given MMsrc, MM'src and the MCL model, a tool can automatically migrate the models of the old language to models of the evolved language. The concepts are similar on the destination side. Evolving the models with MCL is described in [BvKK+08, NLBK09]. Based on MMsrc, MM'src, MMdst, and the MCL model, it is possible to evolve the model interpreter, which is the main focus of this paper. Practically, this means evolving the model transformation under the following set of assumptions.

(i) The change description is available and specific to evolution. In our implementation, this is an MCL model, but it could be any model or textual representation with at least the same information content about the changes.

(ii) The model elements left intact by the evolution should be interpreted in the same way as they were by the original interpreter. If the intent is different, manual correction is required. In our experience, this occurs rarely. Furthermore, we treat unambiguously changed elements (such as renamed classes) in the same way where possible.

(iii) The handling of missing semantic information is inevitable. It cannot be expected that methods to process the new concepts added by the evolution can be invented without human interaction. Therefore, a tool cannot achieve more than producing an initial version of the evolved interpreter and showing the missing semantic information.

(iv) We assume that the interpreter is specified by graph rewriting rules. Our implementation is based on GReAT [AKNK+06], but the algorithms can be used with any tool or theoretical framework specifying the transformation by rewriting rules, such as the AGG [Tae04], FUJABA [NNZ00], ViATRA [BV06], or VMTS [AALK+09] tools, or frameworks of the single- or double-pushout (SPO, DPO) approaches [Roz97] or High-Level Replacement Systems [EEPT06].
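As an illustration of these concepts only (and not of the actual MCL/GReAT tool chain), the following Java sketch names the artifacts involved and the two automated steps; all type names and method signatures are hypothetical.

// Hypothetical names only; this is not the MCL/GReAT API.
interface Metamodel {}                      // stands for MMsrc, MM'src, MMdst
interface Model {}                          // stands for SM1..SMn and DM1..DMn
interface ChangeModel {}                    // stands for the MCL model (Delta_src)

interface Interpreter {
    // The transformation: reads a model conforming to MMsrc and
    // produces a model conforming to MMdst (SMi -> DMi).
    Model interpret(Model source);
}

interface EvolutionToolchain {
    // Fully automatic step: adapt an existing model to the evolved meta-model MM'src.
    Model migrateModel(Model oldModel, Metamodel mmSrc, Metamodel mmSrcEvolved, ChangeModel mcl);

    // Semi-automatic step: produce an initial evolved interpreter and report
    // warnings wherever semantic information is missing.
    Interpreter evolveInterpreter(Interpreter oldInterpreter, Metamodel mmSrc,
                                  Metamodel mmSrcEvolved, Metamodel mmDst, ChangeModel mcl);
}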
3
Related Work
Providing methods for semi-automated evolution of graph-rewriting-based model transformations for DSLs is a fairly new area. Existing solutions to this problem are more or less ad-hoc techniques that often resort to directly specifying the alterations in terms of the storage format of the models. One such approach is the use of XSL transformations to evolve models stored in XML. Database schema migration techniques have been applied to the migration of models stored as relational data. These approaches are often nothing more than pattern-based replacement of specific character strings, and they do not capture
the intent driving a meta-model change [Kar00]. When dealing with complex meta-models covering multiple paradigms, comprehension is quickly lost when capturing meta-model changes using these methods.

Although semi-automated evolution of model transformations is a novel approach, it incorporates the transformation of graph rewriting rules. In [EE08], the authors assume the case in which model transformations preserve the behavior of the original models. In this framework, the behavior of the source system and the target system can be defined by transformation rules. In translating from a source to a target model, one wants to preserve the behavior of the original model. A model transformation is semantically correct if for each simulation of the source, a corresponding target simulation is obtained, and the transformation is semantically complete if the opposite holds. The authors use graphs to represent models, and graph transformation rules to represent the operational behavior and the transformation from source to target model. The operational rules of the source are also input to the transformation from source to target, and conditions are defined for model and rule transformation correctness. Our approach makes it possible to handle semantic evolution, where this constraint does not hold, and most of our evolution case studies fell into this category. The paper gives a formal description of transforming the DPO transformation rules in an exhaustive manner. Our approach does not enforce DPO rules or exhaustive rule control.

The paper [PP96] deals with multilevel graphs, where some parts of the graphs are hidden, but the hidden information can be restored by rules from another, known graph. The author claims that in many applications, it is useful to be able to represent a graph in terms of particular subgraphs and hide the details of other structures that are only needed in certain conditions. If one repeats this hiding of details, it leads to representations on more than one level of visibility. A graph representation consists of a graph and productions to restore hidden information. If suitable restrictions are placed on the restoring productions, then such a representation can produce several graphs and thus acts as a graph grammar. The paper defines morphisms between graph grammars, and shows that graph grammars and their morphisms form a finitely cocomplete category, along with other properties. The paper makes a distinction between two cases: (i) global grammar transformation, when a subgrammar is replaced with another grammar, and (ii) local transformation, when the rules are modified. Our interpreter evolution method takes the latter approach. Using the DPO approach as a theoretical framework, the author defines the rewriting of a rule by another rule. The actual rewriting takes place on the interface graph, and only these changes are "propagated" to the left-hand side and the right-hand side of the rules to make them consistent. The main results of the paper deal with the applicability and the satisfaction of the DPO gluing conditions. In our approach, GReAT and the underlying UDM framework [MBL+03] do not allow dangling edges and non-injective matches, and constantly validate the graphs and the rules at run-time.
4
Case Study
Our case study is based on a hierarchical signal flow paradigm. An example model is depicted in Fig. 2.
Fig. 2. An example of a hierarchical signal flow model
A signal flow may contain the following elements. An InputSignal represents a signal that is processed by a signal processing unit. An OutputSignal is the result of the processing operation. Signal processing components can be organized into hierarchies, which reduces the complexity of the model. A signal processing unit can be either Primitive or Compound. A Primitive can contain only elementary elements, while a Compound can also contain Primitive processing units. In our example model, Preprocessing and Controller are compound processing units, whereas the Filter1, Filter2, ControlAlgorithm, and DAC elements are primitive signal processing components. The input signals and the output signals cannot be connected directly: they need an intermediate LocalPort.

Our case study begins with a hierarchical signal flow modeling language and defines a transformation targeting a non-hierarchical signal flow language. This transformation may be useful for several reasons, but the main motivation is usually implementation-related: if one wants to generate a low-level implementation for a signal flow, some of the simulation engines do not support the concept of hierarchies. Having invested in a set of hierarchical signal flow models, we realize certain weak points in our language, and additional features and clarifications require modifications to the original language. We then modify the original hierarchical language in several ways typical of meta-model changes, including class renamings, attribute renamings, and the introduction of new meta-classes. We would like to preserve our investment; therefore, we would like to transfer the existing models to the new, evolved language. In order to migrate the now invalidated models and transformations, we define MCL rules that describe the relationships between elements in the old and new meta-models. Using these rules, our MCL language is able to migrate models,
and our interpreter evolver is able to create a new version of the transformation that translates from models conforming to the new meta-model to the same target meta-model (MM'dst = MMdst).

We begin by describing the original hierarchical language and the target non-hierarchical language, along with the transformation between the two. We then describe the updated version of the hierarchical language and the MCL rules used to migrate models corresponding to the old meta-model so that they conform to the updated meta-model. We then give details about the updated interpreter that is automatically produced using our interpreter evolver tool, including the number of rules requiring hand-modification.

4.1
Hierarchical Signal Flow
Fig. 3 shows the meta-model of the original signal flow language.
Fig. 3. The original meta-model
The Component class represents some unit of functionality performed on an input signal and contains a single integer attribute named SignalGain. The CompoundComponent class does not represent any functionality performed on signals; rather, it is used to hierarchically organize both types of components. Signals are passed between components using ports; the Port class has a single
Boolean attribute that is set to true if an instance is an input port and false if it is an output port. The LocalPort class is contained only in CompoundComponents and is used to buffer signals between Components (i.e., the LocalPort buffers between the units of functionality). Because the ports share no common base class, four types of connections are defined to represent the possible connections between each type. This is an inefficient design typically made by beginner domain experts. The evolved meta-model can improve upon this.

Fig. 2 shows an example model that represents a simple controller. The top of the figure represents a high-level view of the system. The Preprocessing and Controller elements are both CompoundComponents; the internals of both are shown in the bottom of the figure. The Preprocessing block contains two Components that represent filters that are applied to the input signal, while the Controller block contains one Component for implementing the control algorithm and another Component to convert the digital signal back to an analog signal, which is then passed out of the system through output ports. All of the ports named Forwarder are LocalPort elements representing a buffering element in between functional elements.

4.2
Original Transformation
The target meta-model of the transformation is a “flat” actor-based language without hierarchy, shown in Fig. 4.
Fig. 4. Target meta-model
The Actor class represents basic units of functionality and corresponds to the Components in the hierarchical signal flow language. The Receiver and
Transmitter classes are used to send signals to and from an Actor, respectively. The Queue class corresponds to the LocalPort class in the hierarchical language, and acts as a local buffering element between Actors. The overall goal of the transformation is to create an Actor in the target model for each Component in the input model. Receivers and Transmitters should be created inside each Actor for each Port inside the corresponding Component. The CompoundComponents in the input model are present only for organizational purposes, so their effect will be removed in the output model.

Fig. 5 shows the full transformation, with two hierarchical blocks expanded to show their full contents. The first two transformation rules (shown at the top of Fig. 5) create a RootContainer element and top-level Queues for Ports. The block that is recursively called to flatten the hierarchy is expanded on the second line of rules in Fig. 5. The first rule on the second line creates top-level Queues for each LocalPort in the input model. The third line of rules in Fig. 5 is responsible for creating temporary associations so that the hierarchy can be flattened. The transformation rule named FilterPrimitives is a conditional block that sends nested CompoundComponents back through the recursive rule and sends all of the regular Components to the final row of rules. This final row of rules is responsible for creating the Actors in the output model, along with their Receivers, Transmitters and the connections between them. Note that because of the several types of connection classes in the original meta-model, four rules are needed to translate these into the target model; these are the first four rules in the third row of Fig. 5. The transformation contains a total of twelve transformation rules, two test cases, and one recursive rule.

Fig. 6 shows the transformation rule that creates a Queue in the output model for each Port in the top-level CompoundComponents. This rule indicates that for each Port contained inside the CompoundComponent, a Queue should be created in the RootContainer of the output model (the check mark on the lower right-hand corner of the Queue indicates that it will be newly created), along with a temporary association between the Port and its corresponding Queue. The temporary association is created so that later in the transformation, other rules can find the Queue that was created in correspondence with a given Port. Also note that this transformation rule has an AttributeMapping block, which contains imperative code to set the name attribute of the newly created Queue. This imperative code uses the IsInput attribute of the Port class, which will be deleted in the evolved meta-model.

Fig. 7 shows the transformation rule that creates an Actor in the output model. The rule indicates that for each Component, an Actor should be created (again, the small check mark on the Actor indicates it should be newly created). This rule also contains an AttributeMapping block, which allows imperative code to be written for querying and setting an element's attribute values. The code inside this block is also shown in the figure. Note that this code uses
the SignalGain attribute on Component; this will be referenced later during the evolution.

Fig. 5. Entire Transformation
Fig. 6. Transformation rule to create a Queue for each Port
Fig. 7. Transformation rule to create Actor

4.3
MCL Rules and Evolved Transformation
The evolved meta-model, shown in Fig. 8, contains several changes typical of meta-model evolutions, including the following migration operations.

1. Component has been renamed to PrimitiveComponent.
2. The IsInput attribute of Port has been removed from InputPort and OutputPort.
3. The attribute SignalGain on Component has been renamed to Gain on PrimitiveComponent.
4. Port has been subtyped into InputPort and OutputPort.
5. InputPort, OutputPort and LocalPort all now share a common base class.
6. All of the connection classes have been replaced with a single connection class named Signal.

Fig. 9 shows the MCL rules to accomplish the first four points above. Component is connected to PrimitiveComponent with a MapsTo connection, which deals with the first point above. The second point above is addressed by setting the IsInput attribute to "Delete" (the delete option is not visible in the figure). Similarly, the SignalGain attribute on Component is connected to the Gain attribute on PrimitiveComponent via a MapsTo connection, which accomplishes the third point above. The Port class is connected to both InputPort
and OutputPort with two separate MapsTo connections. A Port should become an InputPort if its IsInput attribute is true, and should become an OutputPort otherwise. This conditional mapping is accomplished by including mapping conditions on the connections (not visible in the figure). The fifth item above, the introduction of a common base class, is accomplished implicitly. The last point is accomplished with four MCL rules that are all similar to the one shown in Fig. 10. This rule migrates PortToLocal connections to Signal connections. For each PortToLocal connection found in the input model, its source and destination are located, as well as the elements in the destination model to which they were mapped. Then, a Signal connection is created between these two elements.

Fig. 8. Evolved meta-model
Fig. 9. Migration rules for ports and components
Fig. 10. Migration rule for local ports
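To make the conditional Port mapping described above concrete outside of any particular tool, the following Java sketch shows the effect of the migration on a single instance: the value of IsInput selects the evolved subtype, and the attribute itself disappears. The classes are illustrative stand-ins, not the generated model API of the tool chain, and the sketch is not MCL syntax.

// Illustrative stand-ins for the model classes; not actual tool output.
class Port { boolean isInput; String name; }        // original meta-model
abstract class PortBase { String name; }            // new common base class
class InputPort extends PortBase {}                 // evolved meta-model
class OutputPort extends PortBase {}

class PortMigrationSketch {
    // A Port becomes an InputPort if IsInput is true and an OutputPort otherwise;
    // the IsInput attribute itself is dropped in the evolved language.
    static PortBase migrate(Port old) {
        PortBase evolved = old.isInput ? new InputPort() : new OutputPort();
        evolved.name = old.name;
        return evolved;
    }
}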
5
Contributions
In addition to existing models, we have also invested time and effort in the transformation described above, and we would like to save as much from the original transformation as possible. However, the solution is not as straightforward as in the case of model migration, since the MCL rules have been designed for model migration, and in most cases they do not hold all the information necessary to migrate the interpreter. Accordingly, we use three distinct categories to describe the availability of information.

Some operations, such as renaming a meta-model element or an attribute, are fully automated transformation operations. For example, in Fig. 9, SignalGain is renamed to Gain. This means that we must redirect all references to the original meta-model attribute SignalGain to the evolved attribute Gain in the transformation, and we must tokenize the attribute mappings and substitute the symbol name SignalGain with Gain (a sketch of such a substitution is given below).

If we would like to delete an attribute, information is missing. If the attribute appears in a rule, we do not know what the attribute computation involving the deleted attribute should be substituted with. We can mark the deleted attribute in the attribute mapping code of the transformation, but it is still necessary to have some corrections from the transformation developer. This category is referred to as partially automated transformation operations.

Among the transformation operations, additions pose the greatest problems. The original transformation does not include any cues as to how the added elements should be processed, and while the MCL rules sometimes contain attribute mappings to set the values of new attributes, this still does not describe how these should be introduced in the evolved transformation. Whereas in the case of partially automated operations the transformation developer needs to contribute only the part of the migration based on the semantic information he has about the new model, if additions are performed, the full semantic description of the added elements is required. Without that, these operations cannot be automated. We call these operations fully semantic transformation operations. Currently, we do not treat fully semantic operations.
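The following Java sketch illustrates the kind of identifier-aware substitution referred to above. It is our own simplification, not the code of the evolver tool, and it assumes that a whole-word textual replacement is sufficient for the attribute-mapping language at hand.

import java.util.regex.Pattern;

// Our own simplification of the symbol substitution; not the evolver tool's code.
class AttributeMappingRenamer {
    // Replaces whole-word occurrences only, so a symbol such as "SignalGainMax"
    // would be left untouched.
    static String renameSymbol(String mappingCode, String oldName, String newName) {
        return mappingCode.replaceAll("\\b" + Pattern.quote(oldName) + "\\b", newName);
    }
}

// Example: renameSymbol("actor.gain = component.SignalGain;", "SignalGain", "Gain")
// yields "actor.gain = component.Gain;".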
Accordingly, an automated pass is performed first, which is completely automatic. Second, a manual pass is required, in which the migrator performs the manual tasks that involve completing the transformation with the code and other DSML constructs for the new elements and adjusting it for the modified elements.

5.1
Automated Pass
The MCL rules discussed in Section 4.3 are given as input to the interpreter migration tool, which creates an updated version of the interpreter according to the algorithm in Section 5.3. This updated interpreter automatically reflects the first meta-model change described above: references to the Component class are now references to the PrimitiveComponent class in the new meta-model. The second meta-model change is handled semi-automatically: the IsInput attribute of Port has been removed from InputPort and OutputPort. This attribute was used in the attribute mapping code shown in Fig. 6 to set the values of attributes in the output model, and this imperative code cannot be migrated without user input because the attribute was deleted. Therefore, all uses of this attribute in the imperative code are commented out, and a warning is emitted to the user. The third change (SignalGain renamed to Gain) is handled automatically because it involves only renaming an attribute. The tool can automatically migrate any imperative attribute mapping code that uses this attribute.

Another example of how the transformation is evolved in response to the migration rules is shown in Fig. 11. This is the evolved version of the original transformation rule shown in Fig. 7. Note that this rule reflects two changes: (i) Component now has type PrimitiveComponent, and (ii) the imperative attribute mapping code now uses the Gain attribute of PrimitiveComponent, which was previously named SignalGain. The fifth change is handled implicitly, and the final migration rule (Fig. 10), which maps all connections in the original meta-model to a single type of connection in the new meta-model, is handled automatically.

5.2
Handling Missing Semantic Information
As mentioned, a typical source of missing semantic information is addition. In MCL, one can specify the addition of (i) classes, (ii) attributes, and (iii) associations. The detection of these elements is simple: they can be identified either by comparing the original and the evolved meta-models or by analyzing the MCL models. From the interpreter evolution's point of view, this means that interpreter rules or rule parts for these elements must be added in the manual pass phase.

The nodes and edges in a transformation rule reference the meta-model elements. When the transformation rules are migrated, these references must be adapted to the evolved meta-models (MM'src and MM'dst). Referenced but deleted elements mean missing semantic information for the rules. The simplest solution is to delete these nodes and edges from the rules. Our experience has shown that the topology (structure) of the rules is lost in this case, which is not the desired behavior, since the topology is usually preserved or only modified subtly.
Fig. 11. Evolved migration rule for creating actors
Therefore, such nodes are set to a null reference, which preserves the rule structure but loses the type information.

Fig. 12 shows an example of how different parts of a rule can be evolved to varying degrees. This rule is the evolved version of the original transformation rule shown in Fig. 6. There are two things to note. First, the use of the IsInput attribute of Port is automatically commented out of the attribute mapping and a warning is issued to the user. Second, the Port class from the original meta-model is still present. This is because the mapping from Port to either InputPort or OutputPort is a conditional MCL rule, and thus there is no way to automate this part of the transformation rule.

The main strength of MCL is that it not only specifies primitive operations, such as deletion, addition, and modification, but also mappings to express causal dependencies. We can use these mappings to replace certain elements with their evolved counterparts. Frequently, these mappings are split: depending on an attribute value, a concept evolves into two or more distinct concepts. This implies an ambiguous mapping. In this case it cannot be assumed that the evolved elements can be processed the same way as their predecessors, meaning that the interpretation logic must be added manually. In our case study, mapping a Port to InputPort and OutputPort is such a situation (Fig. 9). Therefore, the fourth meta-model change, the sub-typing of Port into InputPort and OutputPort, is a fully semantic change and cannot be handled by the algorithm. This is because the MCL rules describe how a given instance of a Port will be migrated to either an InputPort or an OutputPort in an instance model, but do not give enough information to decide how the meta-class Port should be evolved in the transformation. In general, this cannot be decided without user intervention.
Fig. 12. Evolved migration rule for creating queues
The warnings emitted by the evolver tool reflect the treatment of the missing semantic information well. The most important warning categories are as follows. If a model element or an attribute has been removed, then the user has to substitute the elements by hand, since automatic deletion might lead to unexpected behavior either in the pattern matching or in the actual rewriting process. The other important warning group is generated by ambiguous situations. When the evolver tool cannot make a decision, typically in the case of multiple migration mappings decided by conditions, a warning is emitted.

In the case study, the evolved transformation consisted of the same number of rewriting rules. Four pairs of rules were then manually combined due to the newly introduced common base class for InputPort and OutputPort. Another rule was split into two rules to deal with the introduction of InputPort and OutputPort. The deletion of the IsInput attribute of Port required changing the imperative attribute mapping code of one rule. The introduction of a common base class for InputPort, OutputPort and LocalPort required modifying four rules to use the new base class. Overall, three of the rules and both of the test blocks were migrated entirely automatically with no manual changes. A warning was issued about a deleted attribute in one block, which required a manual change because imperative code had been written that referenced the deleted attribute. The rest of the rules were evolved semi-automatically. Manual changes were required in all rules that used the Port class because of the conditional nature of its mapping in the MCL rules, as described above.
5.3
Implementation and Algorithm
The high-level outline of the algorithm for evolving the transformation is as follows.

ProcessRule(Rule r)
  for all (PatternClass p in r) do
    if (p.ref() is in removeClassCache) then
      DeleteAttributeReferences(p)
      p.ref() = null
    else if (p.ref() is in migrateCache and has an outgoing mapsTo) then
      MigratePatternClass(p)
    else if (Class c = evolvedMetamodelCache.find(p.ref())) then
      p.ref() = c
    else
      DeleteAttributeReferences(p)
      p.ref() = null
    end if
    if (r has changed) then
      MarkForChanges(r)
    end if
  end for

In order to accelerate the algorithm, the migration model, the evolved meta-model, the target meta-model of the transformation and the source meta-model are cached, along with the references to temporary model elements in the transformation. Moreover, the elements that are not in the target model and/or are denoted as to be deleted in the migration model are also cached. After the caching, a traversal of the transformation is performed, which takes each rule and executes the ProcessRule algorithm.

The structural part of a rule is composed of (i) pattern classes that are references to meta-model classes in the input and output meta-models of the transformation, (ii) connections referencing the associations in the input and output meta-models, and (iii) references to temporary classes and associations that store non-persistent information during the transformation. Moreover, the rules can contain attribute transformations, which query and set the attributes of the transformed model. The attributes and their types are determined by the meta-model classes referenced by the pattern classes.

The algorithm takes each pattern class and distinguishes four cases. (i) If the meta-model class referenced by the pattern class is to be deleted, then the attribute transformations are scanned, and if they reference the attributes provided by the removed class, they are commented out and a warning is emitted. (ii) If the referenced class is in the migration model, the class must be migrated as described in Section 3. If there is only one mapsTo relationship, we redirect the references to the new class, and we update the attribute transformations according to the migration rule. If there are multiple mapsTo relationships originating from the class to be migrated, we cannot resolve this ambiguous situation in the rule; thus, we emit a warning. If there are only wasMappedTo relationships, we fall back on the next case. (iii) If we can transfer the reference to the new
model with a name-based match, we do so, emitting a warning that the assignment should be described in the migration model. (iv) If none of the cases above solves the migration, we treat the referenced class as if it were to be deleted, emitting a warning that this should also be a rule in the migration model. Note that we never delete a pattern class, because that would lose the structure of the original rule. On deletion of the referenced class, the referencing pattern class is made to point to null.

Because the transformation references the meta-model elements, the references into the source meta-model should be changed to point to the elements of the evolved meta-model. This is also the simplest scenario: if the source meta-model and the evolved meta-model are models with different locations, but containing the same model elements, the references are redirected to the evolved meta-models. This redirection is performed by matching the names of the model elements. Because the algorithm traverses the rules, if a meta-model element that is not referenced by the rules is added, we will not give a warning that it should be included in the evolved transformation.
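The name-based matching used for this redirection can be pictured as a lookup table over the evolved meta-model, as in the following Java sketch. The types are hypothetical simplifications; the actual implementation operates on the UDM/GReAT object network.

import java.util.HashMap;
import java.util.Map;

// Hypothetical types; the real implementation works on the UDM/GReAT object network.
interface MetaClass { String getName(); }

class EvolvedMetamodelCache {
    private final Map<String, MetaClass> byName = new HashMap<>();

    EvolvedMetamodelCache(Iterable<MetaClass> evolvedClasses) {
        for (MetaClass c : evolvedClasses) byName.put(c.getName(), c);
    }

    // Returns the same-named class in the evolved meta-model, or null if no such
    // class exists; the caller then falls back to the "treat as deleted" case
    // and emits a warning.
    MetaClass find(MetaClass original) { return byName.get(original.getName()); }
}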
6
Conclusion
There are several reasons why DSMLs evolve. With the evolution of the language, the infrastructure must also evolve. We have developed a method for cases in which the modeling language evolves in small steps, as opposed to sudden, fundamental changes. Interpreters are huge investments when creating a DSML-based environment. In this paper, we contributed a method for interpreter evolution under certain circumstances. The discussed transformation operations and their categories are depicted in Table 1.

Table 1. Summary of the Evolved Transformation Steps

Fully Automated      Partially Automated   Fully Semantic
Rename an element    Delete class          Add new element
Change stereotype    Delete connection     Add attributes
Rename attribute     Subtyping             Change attribute type
                     Delete attribute
We investigated avionics software applications, and we found that these circumstances hold for the industrial use cases. The algorithms have been implemented in the GME/GReAT toolset, and have been tested in an industrial environment.

The drawbacks of the method include the following. Sometimes the changes might be too abrupt for MCL. In this case, our tool set still provides the fallback
to the general model transformation method. If the interpretation semantics of the existing elements change, the transformation created by the automatic pass must be modified. When many new elements must be added to the transformation, a significant amount of manual work is required.

Future work is devoted to providing tool support for the addition of the missing semantic information. First, we will identify the most prevalent scenarios and collect them into a pattern catalog. Second, we will create a tool that detects the applicability of a pattern and offers its application. Obviously, human interaction is always needed in the discussed cases, but the effort can be minimized by offering complete alternatives for the most frequent use cases.
References

[AALK+09] Angyal, L., Asztalos, M., Lengyel, L., Levendovszky, T., Madari, I., Mezei, G., Mészáros, T., Siroki, L., Vajk, T.: Towards a fast, efficient and customizable domain-specific modeling framework. In: Proceedings of the IASTED International Conference, Innsbruck, Austria, February 2009, vol. 31, pp. 11–16 (2009)
[AKNK+06] Agrawal, A., Karsai, G., Neema, S., Shi, F., Vizhanyo, A.: The design of a language for model transformations. Software and Systems Modeling 5(3), 261–288 (2006)
[BV06] Balogh, A., Varró, D.: Advanced model transformation language constructs in the VIATRA2 framework. In: ACM Symposium on Applied Computing — Model Transformation Track (SAC 2006), pp. 1280–1287. ACM Press, New York (2006)
[BvKK+08] Balasubramanian, D., van Buskirk, C., Karsai, G., Narayanan, A., Neema, S., Ness, B., Shi, F.: Evolving paradigms and models in multi-paradigm modeling. Technical Report ISIS-08-912, Institute for Software Integrated Systems (December 2008)
[EE08] Ehrig, H., Ermel, C.: Semantical correctness and completeness of model transformations using graph and rule transformation. In: Ehrig, H., Heckel, R., Rozenberg, G., Taentzer, G. (eds.) ICGT 2008. LNCS, vol. 5214, pp. 194–210. Springer, Heidelberg (2008)
[EEPT06] Ehrig, H., Ehrig, K., Prange, U., Taentzer, G.: Fundamentals of Algebraic Graph Transformation. Monographs in Theoretical Computer Science. An EATCS Series. Springer, Heidelberg (2006)
[Kar00] Karsai, G.: Why is XML not suitable for semantic translation. Research Note, Nashville, TN (April 2000)
[MBL+03] Magyari, E., Bakay, A., Lang, A., Paka, T., Vizhanyo, A., Agrawal, A., Karsai, G.: UDM: An infrastructure for implementing domain-specific modeling languages. In: The 3rd OOPSLA Workshop on Domain-Specific Modeling, OOPSLA 2003, Anaheim, California (October 2003)
[MV04] Mosterman, P.J., Vangheluwe, H.: Computer automated multi-paradigm modeling: An introduction. Simulation: Transactions of the Society for Modeling and Simulation International 80(9), 433–450 (2004); Special Issue: Grand Challenges for Modeling and Simulation
[NLBK09] Narayanan, A., Levendovszky, T., Balasubramanian, D., Karsai, G.: Automatic domain model migration to manage metamodel evolution. In: Schürr, A., Selic, B. (eds.) MODELS 2009. LNCS, vol. 5795, pp. 706–711. Springer, Heidelberg (2009)
[NNZ00] Nickel, U., Niere, J., Zündorf, A.: The Fujaba environment. In: ICSE 2000: Proceedings of the 22nd International Conference on Software Engineering, pp. 742–745. ACM, New York (2000)
[PP96] Parisi-Presicce, F.: Transformation of graph grammars. In: 5th Int. Workshop on Graph Grammars and their Application to Computer Science, pp. 428–492 (1996)
[Roz97] Rozenberg, G. (ed.): Handbook of Graph Grammars and Computing by Graph Transformation. Foundations, vol. I. World Scientific Publishing Co., Inc., River Edge (1997)
[Spr03] Sprinkle, J.: Metamodel Driven Model Migration. PhD thesis, Vanderbilt University, Nashville, TN 37203 (August 2003)
[Tae04] Taentzer, G.: AGG: A graph transformation environment for modeling and validation of software. In: Pfaltz, J.L., Nagl, M., Böhlen, B. (eds.) AGTIVE 2003. LNCS, vol. 3062, pp. 446–453. Springer, Heidelberg (2004)
Study of an API Migration for Two XML APIs

Thiago Tonelli Bartolomei(1), Krzysztof Czarnecki(1), Ralf Lämmel(2), and Tijs van der Storm(3)

(1) Generative Software Development Lab, Department of Electrical and Computer Engineering, University of Waterloo, Canada
(2) Software Languages Team, Universität Koblenz-Landau, Germany
(3) Software Analysis and Transformation Team, Centrum Wiskunde & Informatica, The Netherlands
Abstract. API migration refers to adapting an application such that its dependence on a given API (the source API) is eliminated in favor of depending on an alternative API (the target API) with the source and target APIs serving the same domain. One may attempt to automate API migration by code transformation or wrapping of some sort. API migration is relatively well understood for the special case where source and target APIs are essentially different versions of the same API. API migration is much less understood for the general case where the two APIs have been developed more or less independently of each other. The present paper exercises a simple instance of the general case and develops engineering techniques towards the mastery of API migration. That is, we study wrapper-based migration between two prominent XML APIs for the Java platform. The migration follows an iterative and test-driven approach and allows us to identify, classify, and measure various differences between the studied APIs in a systematic way.
1

Introduction

APIs are both a blessing and a curse. They are a blessing because they enable domain-specific reuse. They are a curse because they lock our software into concrete APIs. Each API is quite specific, if not idiosyncratic, and accounts effectively for a form of 'software asbestos' [KLV05]. That is, it is difficult to adapt an application with regard to the APIs it uses. We use the term API migration for the kind of software adaptation where an application's dependence on a given API (the source API) is eliminated in favor of depending on an alternative API (the target API), with the source and target APIs serving the same domain. API migration may be automated, in principle, by (i) some form of source- or bytecode transformation that directly replaces uses of the source API in the application by corresponding uses of the target API or (ii) some sort of wrapping, i.e., objects of the target API's implementation are wrapped as objects that comply with the source API's interface. In the former case, the dependence on the source API is eliminated entirely. In the latter case, the migrated application still depends on the source API but no longer on its original implementation.
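To give a first intuition of the wrapping option, the following Java sketch shows a made-up source-API interface implemented on top of a made-up target API. Neither interface corresponds to a real XML API; the example merely illustrates the adapter structure referred to above.

// Both interfaces are invented for illustration; they do not denote real XML APIs.
interface SourceElement {                        // what the application programs against
    String getTagName();
    String getChildText(String childTag);
}

interface TargetNode {                           // what the target API's implementation offers
    String name();
    TargetNode child(String name);
    String text();
}

// Wrapper: exposes the source API's interface, delegates to the target API.
class WrappedElement implements SourceElement {
    private final TargetNode node;
    WrappedElement(TargetNode node) { this.node = node; }

    public String getTagName() { return node.name(); }

    public String getChildText(String childTag) {
        TargetNode child = node.child(childTag);
        return child == null ? null : child.text();
    }
}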
Incentives for API Migration
One incentive for API migration is to replace an aged (less usable, less powerful) API by a modern (more usable, more powerful) API. The modern API may in fact be a more recent version of the aged API, or both APIs may be different developments. For instance, a C# 3.0+ (or VB 9.0+) developer may be keen to replace the hard-to-use DOM API for XML programming by the state-of-the-art API ‘LINQ to XML’. The above-mentioned transformation option is needed in this particular example; the wrapping option would not eradicate DOM style in the application code. Another incentive is to replace an in-house or project-specific API by an API of greater scope. For instance, the code bases of several versions of SQL Server and Microsoft Word contain a number of ‘clones’ of APIs that had to be snapshotted at some point in time due to alignment conflicts between development and release schedules. As the ‘live’ APIs grow away from the snapshots, maintenance efforts are doubled (think of bug fixes). Hence one would want to migrate to the live APIs at some possible synchronization point—either by transformation or by wrapping. The latter option may be attractive if the application should be shielded against evolution of the live API. Yet another incentive concerns the reduction of API diversity in a given project. For instance, consider a project that uses a number of XML APIs. Such diversity implies development costs (since developers need to master these different APIs). Also, it may imply performance costs (when XML trees need to be converted back and forth between the different object models of the APIs). Wrapping may mitigate the latter problem whereas transformation mitigates both problems. There are yet more incentives. For instance, API migration may also be triggered by license, copyright and standardization issues. As an example, consider a project where the license cost of a particular API must be saved. If the license is restricted to the specific implementation, then a wrapper may be used to reimplement the API (possibly on top of another similar API), and ideally, the application’s code will not be disturbed.
The ‘Difficulty Scale’ of API Migration
Consider API evolution of the kind where the target API is a backwards-compatible upgrade of the source API. In this case, API migration boils down to the plain replacement of the API itself (e.g., its JAR in the case of Java projects); no code will be broken. When an API evolves, one may want to obsolete some of its methods (or even entire types). If the removal of obsolete methods should be enforced, then API migration must replace calls to the obsoleted methods by suitable substitutes. In the case of obsoletion, the transformation option of API migration boils down to a kind of inlining [Per05]. The wrapping option would maintain the obsolete methods and implement them in terms of the ‘thinner’ API. Now consider API evolution of the kind where the target API can be derived from the source API by refactorings that were accumulated on an ongoing basis or automatically inferred or manually devised after the fact. The refactorings immediately feed into the transformation option of API migration, whereby they are replayed on the application [HD05, TDX07]. The refactorings may also be used to generate adapter layers (wrappers) such that legacy applications may continue to use the source API’s interface implemented in terms of the target API [ŞRGA08, DNMJ08].
Representing the evolution of an API as a proper refactoring may be hard or impossible, however. The available or conceivable refactoring operators may be insufficient. The involved adaptations may be too invasive, and they may violate semantics preservation in borderline situations in a hard-to-understand manner. Still, there may be a systematic way of co-adapting applications to match API evolution. For instance, there is work [PLHM08, BDH+09] that uses control-flow analysis, temporal logic-based matching, and rewriting in support of evolving Linux device drivers. Ultimately, we may consider couples of APIs that have been developed more or less independently of each other. Of course, the APIs still serve the same domain. Also, the APIs may agree, more or less, on features and the overall semantic model at some level of abstraction. The APIs will differ in many details however. We use the term API mismatch to refer to the resulting API migration challenge—akin to the impedance mismatch in object/relational/XML mapping [Amb06, Tho03, LM07]. Conceptually, an API migration can indeed be thought of as a mapping problem with transformation or wrapping as possible implementation strategies.
The ‘Risk’ of API Migration
The attempted transformations or wrappers for API migration may become prohibitively complex and expensive (say in terms of code size and development effort)—compared to, for example, the complexity and costs of reimplementing the source API from scratch. Hence, API migration must balance complexity, costs, and generality of the solution in a way that is driven by the actual needs of ‘applications under migration’.
Vision
API migration for more or less independently developed APIs is a hard problem. Consider again the aforementioned API migration challenge of the .NET platform. The ‘LINQ to XML’ API is strategically meant to revamp the platform by drastically improving the productivity of XML programmers. Microsoft has all reason to help developers with the transition from DOM to ‘LINQ to XML’, but no tool support for API migration has ever been provided despite strong incentive. Our work is a call to arms for making complex API migrations more manageable and amenable to tool support.
Contributions
1. We compile a diverse list of differences between several APIs in the XML domain. This list should be instrumental in understanding the hardness of API migration and sketching benchmarks for technical solutions.
2. We describe a study on wrapper-based API migration for two prominent XML APIs of the Java platform. This migration is unique and scientifically relevant in so far that the various differences between the chosen APIs are identified, classified, and measured in a systematic way. The described process allows us to develop a reasonably compliant wrapper implementation in an incremental and test-driven manner.1
1
We provide access to some generally useful parts of the study on the paper’s website: http://www.uni-koblenz.de/~laemmel/xomjdom/
Limitations
We commit to the specifics of API migration by wrapping, without discussing several complications of wrapping and hardly any specifics of transformation-based migration. We commit to the specifics of XML, particular XML APIs, and Java. We only use one application to validate the wrapper at hand. Much more research and validation is needed to come up with a general process for API migration, including guarantees for the correctness of migrated applications. Nevertheless, we are confident that our insights and results are substantial enough to serve as a useful call to arms.
Road-Map
§2 takes an inventory of illustrative API differences within the XML domain. §3 introduces the two XML APIs of the paper’s study and limits the extent of the source API to what has been covered by the reported study on API migration. §4 develops a simple and systematic form of wrapper-based API migration. §5 discusses the compliance between source API and wrapper-based reimplementation, and it provides some engineering methods for understanding and improving compliance. §6 describes related work, and §7 concludes the paper.
2 Illustrative Differences between XML APIs
We identify various differences between three major APIs for in-memory XML processing on the Java platform: DOM, JDOM and XOM. The list of differences is by no means exhaustive, but it clarifies that APIs may differ considerably with regard to sets of available features, interface and contracts for shared features, and design choices. API migration requires different techniques for the listed differences; we allude to those techniques in passing only. In the following illustrations, we will be constructing, mutating and querying a simple XML tree for a (purchase) order such as this:
<order>
  <product>4711</product>
  <customer>1234</customer>
</order>
2.1 This-Returning vs. Void Setters
Using the JDOM API, we can construct the XML tree for the order by a nested expression (following the nesting structure of the XML tree):
// JDOM -- nested construction by method chaining
Element order = new Element("order").
    addContent(new Element("product").
        addContent("4711")).
    addContent(new Element("customer").
        addContent("1234"));
This is possible because setters of the JDOM API, e.g., the addContent method, return this, and hence, one can engage in method chaining. Other XML APIs, e.g., XOM, use void setters instead, which rule out method chaining. As a result, the construction of nested XML trees has to be rendered as a sequence of statements. Here is the XOM counterpart for the above code.
// XOM -- sequential construction
Element order = new Element("order");
Element product = new Element("product");
product.appendChild("4711");
order.appendChild(product);
Element customer = new Element("customer");
customer.appendChild("1234");
order.appendChild(customer);
It is straightforward to transform XOM-based construction code to JDOM because this-returning methods can be used wherever otherwise equivalent void methods were used originally. In the inverse direction, the transformation would require a flattening phase—including the declaration of auxiliary variables. A wrapper with JDOM as the source API could easily mitigate XOM’s lack of returning this.
2.2 Constructors vs. Factory Methods
The previous section illustrated that the XOM and JDOM APIs provide ordinary constructor methods for XML-node construction. Alternatively, XML-node construction may be based on factory methods. This is indeed the case for the DOM API. The document object serves as factory. Here is the DOM counterpart for the above code; it assumes that doc is bound to an instance of type Document.
// DOM -- sequential construction with factory methods
Element order = doc.createElement("order");
Element product = doc.createElement("product");
product.appendChild(doc.createTextNode("4711"));
order.appendChild(product);
Element customer = doc.createElement("customer");
customer.appendChild(doc.createTextNode("1234"));
order.appendChild(customer);
It is straightforward to transform factory-based code into constructor-based code because the extra object for the factory could be simply omitted in the constructor calls. In the inverse direction, the transformation would be challenged by the need to identify a suitable factory object as such. A wrapper could not reasonably map constructor calls to factory calls because the latter comprise an additional argument: the factory, i.e., the document.
2.3 Identity-Based vs. Position-Based Replacement
All XML APIs have slightly differing features for data manipulation (setters, replacement, removal, etc.). For instance, suppose we want to replace the product child of an order. The XOM API provides the replaceChild method that directly takes the old and the new product:
// XOM -- replace product of order
order.replaceChild(oldProduct, newProduct);
The JDOM API favors index-based replacement, and hence the above functionality has to be composed by first looking up the index of the old product, and then setting the content at this index to the new product. Thus:
// JDOM -- replace product of order
int index = order.indexOf(oldProduct);
order.setContent(index, newProduct);
It is not difficult to provide both styles of replacements with both APIs. (Hence, a wrapper can easily serve both directions of API migration.) However, if we expect a transformation to result in idiomatic code, then the direction of going from position-oriented to identity-oriented code is nontrivial because we would need to match multiple, possibly distant method calls simultaneously as opposed to single method calls.
2.4 Eager vs. Lazy Queries
Query execution returns some sort of collection that may differ—depending on the API—with regard to typing and the assumed style of iteration. Another issue is whether queries are eager or lazy. Consider the following XOM code that queries all children of a given order element and detaches (i.e., removes) them one-by-one in a loop:
// XOM -- detach all children of the order element
Elements es = order.getChildElements();
for (int i=0; i<es.size(); i++)
    es.get(i).detach();
The above XOM code is operational because XOM’s queries are eager, and hence the query results are fully materialized before the corresponding collection can be processed. Here is the apparent JDOM counterpart:
// JDOM -- illegal detachment loop
for (Object k : order.getChildren())
    ((Element)k).detach();
Alas, the execution of this code will throw an exception because getChildren returns essentially a lazy iterator on the actual content list of order; changing that list invalidates the iterator. Hence, an operational JDOM counterpart must explicitly ‘snapshot’ the query result, say, in an extra object array as follows:
// JDOM -- detachment loop with up-front snapshot
Object[] es = order.getChildren().toArray();
for (Object k : es)
    ((Element)k).detach();
Arguably, this difference can be mitigated either by a transformation or in a wrapper. Of course, such semantic differences may go unnoticed for some time, and schemes of snapshotting may lead to noteworthy performance penalties.
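To make the wrapping option concrete, the following sketch shows how a JDOM-backed reimplementation of XOM’s getChildElements could perform the snapshot internally, so that XOM-style detachment loops keep working. This is a minimal sketch, not taken from the study’s wrapper; the wrapping constructor and the internals of the reimplemented Elements type are assumptions, not part of either API.
// Sketch: eager query on top of JDOM's lazy child list (assumed wrapper internals)
public Elements getChildElements() {
    // copy JDOM's live list up front, so later mutations cannot invalidate the result
    java.util.List snapshot = new java.util.ArrayList(wrappee.getChildren());
    Elements result = new Elements();                      // assumed internals of the reimplemented Elements type
    for (Object child : snapshot) {
        result.add(new Element((org.jdom.Element) child)); // assumed wrapping constructor and add method
    }
    return result;
}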
2.5 Un-/Availability of API Capabilities
When XML is used as a model in an MVC/GUI application, then an event system is likely needed. For instance, the DOM API allows us to register event listeners with different kinds of events. The following code fragment registers a listener with the order element, which invokes its handler for any sort of node insertion:
// DOM -- register a listener for node insertion
((EventTarget)order).addEventListener(
    "DOMNodeInserted", // mutation type
    new EventListener() {
        public void handleEvent(Event evt) {
            // ... handle event ...
        }
    },
    false);
Neither JDOM nor XOM provides an event system. More generally, we may face API couples where the target API misses some (nontrivial) capability of the source API. In some cases, the capability may be added by extension techniques (e.g., subclasses). In other cases, conservative extension techniques may be insufficient. For instance, the addition of an event system to an XML API would crosscut a considerable part of the API.
2.6 Less vs. More Strict Pre-conditions
Typically, XML APIs make an effort to quietly handle exceptional situations as long as well-formedness of XML trees is not jeopardized and no other blatant programming error would go unnoticed. Still the APIs differ as to where to draw the line. Consider the following JDOM code fragment, which attempts to remove the product child of order twice:
// JDOM -- exercise borderline case for node removal
order.removeContent(product); // properly removes.
order.removeContent(product); // quietly completes.
The above code will execute quietly because JDOM’s pre-condition is weak here: it does not insist that the argument node must be in the container on which removal is performed. In contrast, the following XOM code throws an (unchecked) exception:
// XOM -- exercise borderline case for node removal
order.removeChild(product); // properly removes.
order.removeChild(product); // throws!
Such differences in pre-conditions (likewise for post-conditions) are challenging in API migration. If these differences are simply addressed by defensive programming techniques, then code bloat and inefficiency may be the result. In particular, in the case of the transformation option of API migration, it is not straightforward to produce idiomatic (concise) code.
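For the wrapping option, such a pre-condition gap can be closed by an explicit check in the wrapper. The sketch below strengthens JDOM’s permissive removal to match XOM’s stricter contract; the unwrapping accessor getWrappee(), the use of XOM’s NoSuchChildException, and the exact signature are assumptions and simplifications of this sketch.
// Sketch: strengthening a weak pre-condition of the target API in the wrapper
public Node removeChild(Node child) {
    // JDOM quietly reports failure via its boolean result ...
    boolean removed = wrappee.removeContent((org.jdom.Content) child.getWrappee());
    if (!removed) {
        // ... whereas XOM clients expect an unchecked exception in this borderline case
        throw new NoSuchChildException("child is not contained in this parent");
    }
    return child;
}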
3 The API Couple of the Study
The reported study on API migration concerns the XOM and JDOM APIs, with the goal of reimplementing XOM in terms of JDOM.2 That is, JDOM is wrapped as XOM,
We use the current versions of those APIs: XOM 1.2.1 and JDOM 1.1.
meaning that types with the original XOM interfaces are implemented as wrappers with JDOM objects as wrappees. XOM and JDOM are two prominent XML APIs for the Java platform. They have been developed independently, say, by different software architects, in different code bases, and based on different design rationales.3 The main reason why our study considers migrating from XOM to JDOM, rather than v.v., is the availability of a comprehensive API test suite for XOM. Although wrapping an older API (JDOM) as a newer one (XOM) might appear counter-intuitive at first, such a scenario is plausible in practice since migration drivers such as legal issues do not necessarily follow technical criteria. In the sequel, we present some basic metrics and architectural details about the two APIs. We also describe the scope and some limitations of the migration and the available means for test-driven development.
3.1 API Package Structure
Table 1 lists XOM’s and JDOM’s packages. For each package, the second column gives the total number of declared types (i.e., classes and interfaces) except any descendants of Throwable. The third column is concerned with the latter, i.e., it gives the number of exception classes. The last column lists NCLOC (‘Non-Comment Lines of Code’) per package as an indication of the size (code complexity) of the packages and the APIs. Let us look at XOM’s packages first. The nu.xom package is XOM’s core package (the core API). All the other packages cover specialized feature themes: canonical XML, DOM and SAX interoperability, XInclude support, and XSLT integration. Our study only covers the core API; we omit the discussion of all other themes (packages) in the present paper. JDOM’s core resides in the org.jdom package; it matches roughly the types and features of XOM’s core, but we will discuss the correspondence more precisely below. The remaining packages cover, again, specialized feature themes: DOM interoperability, content filters for query functionality, advanced de-/serialization support, and XSLT and XPath integration.
3
See http://www.artima.com/intv/jdom.html for background on the design rationales.
Table 2. Metrics on the core XOM/JDOM classes
nu.xom                   #Implementations
Attribute                20
Attribute.Type           4
Builder                  15
Comment                  9
DocType                  18
Document                 15
Element                  38
Elements                 2
Namespace                9
Node                     8
NodeFactory              11
Nodes                    8
ParentNode               8
ProcessingInstruction    11
Serializer               35
Text                     9
XPathContext             5
Core Total               225

org.jdom                 #Implementations
Attribute                29
CDATA                    6
Comment                  6
Content                  9
Document                 41
Element                  76
JDOMFactory              25
Namespace                7
ProcessingInstruction    15
Text                     12
input.SAXBuilder         39
output.XMLOutputter      47
Core Total               312
3.2 Core API Features
Table 2 lists all types of XOM’s core and the corresponding JDOM types that were needed for XOM’s reimplementation. XOM’s core is mainly matched by JDOM’s core, but two additional types from the packages org.jdom.input and org.jdom.output are needed; c.f., the right-hand side of Table 2. This is mainly because de-/serialization is part of XOM’s core, whereas JDOM has designated packages for these functions. We omit exception types as well as package-private types in the table entirely. For each type (row), we show the number of methods that the type explicitly implements. This metric can be seen as a proxy for the effort needed in API migration. In our study, for example, each such implementation required roughly one corresponding method implementation in the wrapper. In some situations, we may want to consider additional metrics, however. One such example is an interface complexity metric, defined as the number of methods a type understands (possibly including inherited or abstract methods). The inclusion of abstract methods is of particular interest to framework APIs, which may declare operations with no framework-provided implementations. Yet other metrics could take into account the fact that polymorphic implementations of the source API may need to be migrated differently depending on the specific receiver type. For instance, a given method implementation of the source API may have different pre- and post-conditions for different receiver types. Also, a given method declaration of the source API may be implemented on a base type, whereas the target API’s class hierarchy requires implementations on derived types. Such issues break the regularity of a wrapper’s implementation. In the study, the impact of these issues was limited. The #Implementations numbers of Table 2 give an idea of the feature complexity of the core API and the relative contribution of the different API types. It is immediately obvious that XOM has fewer methods than JDOM. In fact, JDOM is known to
provide many ‘convenience methods’, which explains this difference. Interestingly, the NCLOC numbers of the core packages in Table 1 clarify that XOM is substantially more complex than JDOM (in terms of code size). This difference involves several factors—also including incidental ones such as programming style. Most importantly, however, XOM is known to make a considerable effort to guarantee XML well-formedness. It pursues this goal by means of heavy checking, which directly affects the NCLOC metric.
3.3 XOM's Test Suite
The study uses test-driven development to push for compliance of the wrapper-based reimplementation of XOM with the original XOM API. We use the excellent XOM test suite to this end. JDOM’s test suite does not have any role in this effort. Table 3 describes XOM’s test suite in more detail.
The list of test classes maps roughly to the core API classes. There are 685 additional test cases for the omitted themes of the XOM API. The TestCases are JUnit test classes with the shown number of test methods. Each test method tends to involve a small number of tests as evident from the number of assertions. Finally, we should mention that XOM also comes with a separate harness of basic benchmarks to test the speed and memory footprint of XOM programs. We have not used these benchmarks in any manner, but it would be interesting to systematically compare XOM’s performance with the one of a wrapper-based reimplementation.
4 Wrapper-Based API Migration
We will describe a simple and systematic form of wrapper-based API migration. In particular, we reimplement XOM in terms of JDOM. Hence, application code can be completely preserved because it may continue to depend on the interface of XOM.
4.1 API Mapping
We begin a wrapper-based API migration by mapping each source type and method to a suitable target type and method. Such mapping requires domain knowledge; types and methods are compared at the level of domain concepts and their operations.
Table 4. Metrics on the XOM/JDOM mapping
The table misses one core type; see Table 2 for the full list. That is, Namespace is omitted because it is only used by the original XOM implementation.
When mapping source types, we distinguish regular vs. irregular types. We say that a type is regular if it corresponds to a single target type; otherwise, the type is irregular. Indeed, some source types may need to be associated with multiple target types; yet other source types may lack a counterpart. When mapping source methods, again, we distinguish regular vs. irregular methods. We say that a method is regular if it corresponds to a single target method provided by (one of) the target type(s); otherwise, the method is irregular. Table 4 summarizes the API mapping for the XOM/JDOM study. We obtained the mapping a posteriori by inspecting the wrapper types and methods. 75% of all source methods provided by the wrapper are regular. There are 4 irregular source types. For instance, JDOM does not provide a common base class like XOM’s Node; some of its polymorphic methods have their counterparts implemented in multiple JDOM types instead. Please note that the number of source methods per type in Table 4 slightly deviates from Table 2 because the wrapper places some of the method implementations at different levels in the class hierarchy when compared to the original XOM implementation.
4.2 Wrapper Implementation
We begin with an ‘empty’ reimplementation of the source API as follows. Each interface of the source API is reused as is by the reimplementation. Each class of the source API is reimplemented with the same interface, but with ‘empty’ (exception-throwing) method implementations. This empty reimplementation is compilable by construction, and any application of the API’s original implementation remains compilable. Applications can be redirected to the new implementation by replacement of the API’s JAR, by aspect-oriented programming, or by (manually) changing package references.
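As an illustration, such an ‘empty’ class might start out as sketched below. The choice of UnsupportedOperationException and the placement of methods in the class hierarchy are assumptions of this sketch, not details given by the study.
// Sketch of the 'empty' (exception-throwing) starting point of the reimplementation
package nu.xom;

public class Element {
    public Element(String name) {
        throw new UnsupportedOperationException("nu.xom.Element: not yet reimplemented");
    }
    public void appendChild(String text) {
        throw new UnsupportedOperationException("nu.xom.Element: not yet reimplemented");
    }
    // ... one such stub per constructor and method of the original interface
}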
Study of an API Migration for Two XML APIs
53
The next step is to turn the empty types into proper wrapper types. Here we systematically apply the design pattern for object adapters, where we implement the API mapping (c.f., §4.1) as follows. Each wrapper class (i.e., each class of the reimplementation of the source API) is set up, if possible, as an object adapter with an object of the target API as the adaptee (also called the wrappee). For instance, the different Element types of XOM and JDOM would engage in a corresponding wrapper class as follows:
package nu.xom;
public class Element {
    private org.jdom.Element wrappee;
    // implement interface of wrapper in terms of wrappee
}
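One way to flesh out this skeleton is sketched below. The internal wrapping constructor and the getWrappee() accessor are assumptions of the sketch, and the placement of methods in XOM’s class hierarchy is simplified; the JDOM calls used for delegation (addContent, getName, getChild) are standard JDOM 1.1 methods.
// Sketch: filling in the object adapter (wrapping constructor and accessor are assumed internals)
package nu.xom;

public class Element {
    private org.jdom.Element wrappee;

    public Element(String name) {                       // constructor-to-constructor mapping
        this.wrappee = new org.jdom.Element(name);
    }

    Element(org.jdom.Element wrappee) {                 // internal wrapping constructor (assumed)
        this.wrappee = wrappee;
    }

    org.jdom.Element getWrappee() {                     // internal unwrapping accessor (assumed)
        return wrappee;
    }

    public String getLocalName() {                      // plain delegation
        return wrappee.getName();
    }

    public void appendChild(String text) {              // void setter on top of a this-returning setter (cf. §2.1)
        wrappee.addContent(text);
    }

    public Element getFirstChildElement(String name) {  // results of the target API need to be wrapped again
        org.jdom.Element child = wrappee.getChild(name);
        // (the sketch ignores wrapper identity: each call creates a fresh wrapper object)
        return child == null ? null : new Element(child);
    }
}
Application code in the style of the XOM listings of Section 2 would then compile against this reimplementation unchanged, since only the package-internal wrappee plumbing differs from the original.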
A few special cases should be mentioned in passing. First, abstract wrapper types may not need any wrappee type. Second, when we implement the wrapper class for a source type with multiple associated target types, the wrappee type might need to be an imprecise upper bound, such as Object, and methods may need to perform type dispatch (e.g., via instanceof) to invoke methods on the wrappee. We speak of a minor wrapping disorder if a single wrappee object per wrapper object is fundamentally insufficient for reimplementation. This could happen, for example, if the source API intrinsically assumes a richer state than the target API. For instance, a reimplementation of DOM in terms of XOM or JDOM would need to maintain extra state in order to provide an event system; c.f., §2.5. Such disorders may be encountered late during implementation efforts, and they may trigger amendments of the API mapping; c.f., §4.1. We speak of a major wrapping disorder if method invocations on the source API (handled by the wrapper) may need to be deferred or even rejected because there is yet state missing for the corresponding invocations on the target API. For instance, a reimplementation of XOM or JDOM in terms of DOM is challenging because XOM/JDOM’s constructors are not implementable in terms of DOM’s factory methods; c.f., §2.2. The XOM/JDOM study involves only one minor wrapping disorder. The type nu.xom.Serializer receives a writer through a constructor argument, whereas the associated type org.jdom.output.XMLOutputter receives the writer through method calls. Hence, the XOM type must store the writer throughout. 4.3 Levels of Adaptation Ir-/regularity of a source method is based solely on the number of its associated target methods. There is a richer scale of adaptation levels that usefully classifies reimplemented methods, however. In the following, we define the different adaptation levels for a given source method m. Adaptation level 1. m is a regular method with m as the associated target method. The reimplementation of m only performs basic delegation of m to m on the wrappee (including wrapping and unwrapping). Argument positions may also be filled in by defaults. this-returning may be turned into void methods and v.v.; c.f., §2.1. Adaptation level 2. Additional adaptations are involved in comparison to level 1. That is, arguments may be pre-processed (converted or checked); results may be
post-processed (c.f., §2.4); exceptions may be translated; error codes may be converted into exceptions and v.v.; the delegation may also be conditional, subject to simple tests of the arguments; c.f., §2.6. Adaptation level 3. m is an irregular method. Its implementation may invoke any number of target methods, but without reimplementing any functionality of the target API. In informal terms, a level 3 method is one that is effectively missing in the target API but which can be recomposed from other methods of the target API. Adaptation level 4. The level 3 condition of ‘not reimplementing any methods of the target API’ must be violated. In informal terms, level 4 methods violate the ‘intention of reuse’ for reimplementing the source API in terms of the target API. Table 5 shows the methods per type and adaptation level for the study. We have assigned these levels manually (by categorizing the implementation) and recorded them through method annotations on the wrapper types. The shown numbers depend on a ‘judgement call’ for the required compliance of the wrapper as discussed in the next section. The more one pushes for full compliance, the more methods would be pushed upwards on the level scale; also, the more complex some method implementations would get.
Table 5. Adaptations per level for XOM/JDOM
Basic delegation (level 1) suffices for a bit less than half of all methods; more than a quarter requires some pre-/post-processing (level 2); the remainder needs to be composed from other methods (level 3) or developed from scratch (level 4). It turns out, however, that all level 4 methods were not at all complex and could be implemented without problems. There are a few methods of the Serializer class that are not associated with an adaptation level. These methods were not implemented because there was no straightforward way of doing so, and the sample application used in the study did not exercise these methods.
We would like to generally avoid method implementations at the adaptation level 4. That is, any substantial violation of the ‘intention of reusing’ the target API runs fundamentally counter to the motivation of API migration. Likewise, we would like to avoid complicated or inefficient method implementations at the adaptation levels 2–3.
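As a concrete illustration of the middle of this scale, the identity-based replacement of §2.3 is effectively missing in JDOM and would presumably end up at adaptation level 3 in such a wrapper: it can be recomposed from JDOM’s indexOf and setContent. The unwrapping accessor, the exception choice, and the exact signature below are assumptions of this sketch, not code from the study.
// Sketch: recomposing XOM-style identity-based replacement from JDOM methods (cf. §2.3)
public void replaceChild(Node oldChild, Node newChild) {
    int index = wrappee.indexOf((org.jdom.Content) oldChild.getWrappee()); // position-based lookup
    if (index < 0) {
        throw new NoSuchChildException("oldChild is not a child of this parent");
    }
    wrappee.setContent(index, (org.jdom.Content) newChild.getWrappee());   // position-based replacement
}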
5 API Compliance
In simple terms, the wrapper-based reimplementation of the source API should be ‘fully compliant’ with the original (implementation of the) source API. Compliance could
be interpreted in the sense of contract-based equivalence for the original implementation and the wrapper. In practice, APIs often lack comprehensive contracts (pre-/post-conditions and invariants). Hence, test-based methods are needed. Using such test-based methods, ‘compliance issues’ are gradually discovered, and possibly resolved. In the following, we clarify the process for discovering compliance issues; we categorize these issues; and we defend the idea that some issues may remain unresolved. The XOM/JDOM study continues to serve as the running example.
5.1 Test Suite-Based Compliance
A strong test suite for the source API appears to be a reasonable tool in establishing compliance of the original API and the wrapper-based reimplementation. However, an important insight of our work is that it may be prohibitively expensive to achieve full compliance with regard to such a test suite (because it may approximate contract-based compliance at a very detailed, idiosyncratic level). Indeed, in the study, we have ultimately accepted partial compliance with approx. 40 % of all test cases not producing the expected result with the wrapper:
– # XOM test suite – all test cases: 697
– # XOM test suite – compliant test cases: 417
– # XOM test suite – non-compliant test cases: 280
In general, a strong test suite for the source API may be the initial driver in pushing the wrapper towards some basic compliance. Such a test suite is even more useful if it clearly identifies mainstream API-usage scenarios that must not be disturbed by non-compliance. To limit effort, one would initially concentrate on a smaller core API and important API-usage scenarios, indeed. In the study, initially, we used a considerably smaller core of XOM. For instance, we left out Serializer because XOM has already a serialization capability through its toXML method. Also, we left out DocType (i.e., DTD) support because it seemed difficult to provide such support in the view of JDOM’s lack of comprehensive DocType support. Ultimately, API migration is driven by the actual ‘application under migration’. The application may call for an extension of the initially covered API and for the inclusion of more API-usage scenarios. In the study, we picked an application under migration by searching the SourceForge repository for an application that both makes substantial use of XOM and references XOM in (say, JUnit-based) test cases. The best fit was CDK.4 In general, one needs to push the wrapper towards full compliance with the application’s test suite—potentially balancing the wrapper development effort and the degree of automation of migration. In the study, we reached full compliance without any need for manual adaptations of the application except for 3 test cases whose dependence on the order of XML attributes had to be relaxed. The following numbers only cover CDK’s test cases that use XOM.
4
Chemistry Development Kit (CDK) is a Java library for structural chemo- and bioinformatics; c.f., http://sourceforge.net/apps/mediawiki/cdk/. The used checkout of CDK does not pass all of its test suite even with the original XOM implementation. We have only looked into compliance for test cases that passed with the original XOM implementation.
– # CDK test suite – all test cases: 752
– # CDK test suite – compliant test cases: 752
– # CDK test suite – non-compliant test cases: 0
One of the reasons of compliance with the application’s test suite vs. non-compliance with the API’s test suite is of course that any given application will exercise the source API only in a limited manner. However, this may be even true for a reasonable test suite of an API. Consider the following numbers that we determined in the study:
– # all implementations of the wrapper: 277
– # XOM test suite – exercised method implementations: 156
– # CDK test suite – exercised method implementations: 35
Hence, about 3/5 of all method implementations were exercised by the API’s test suite, and only about 1/10 were exercised by the application’s test suite. Inspection reveals that the API’s test suite specifically misses many of the more trivial methods (such as getters and setters and diversely overloaded constructors).
5.2 Compliance Levels
It is now a central question whether or not the application runs into any of the compliance issues manifested by the API’s test suite. The following method can be applied in this context. Each API method can be associated with a compliance level relative to any test suite as follows:
– always: it is exercised in compliant test cases only.
– sometimes: it is exercised in both compliant and non-compliant test cases.
– never: it was exercised but never in compliant test cases.
– unused: it is not exercised at all in any test cases.
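The four levels can be read as a simple classification over two per-method counters, namely how many compliant and how many non-compliant test cases exercise the method. The following helper is hypothetical (the study records levels as annotations rather than computing them this way), but it transcribes the definitions directly.
// Hypothetical helper: classifying a method by its test-case statistics
enum ComplianceLevel { ALWAYS, SOMETIMES, NEVER, UNUSED }

static ComplianceLevel classify(int compliantCases, int nonCompliantCases) {
    if (compliantCases == 0 && nonCompliantCases == 0) return ComplianceLevel.UNUSED;  // not exercised at all
    if (nonCompliantCases == 0)                        return ComplianceLevel.ALWAYS;  // compliant cases only
    if (compliantCases == 0)                           return ComplianceLevel.NEVER;   // never in a compliant case
    return ComplianceLevel.SOMETIMES;                                                  // both kinds of cases
}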
The status of each method with regard to the application’s test suite can now be compared with its status with regard to the API’s test suite. This comparison is visualized for the study in Table 6.
Table 6. Compliance levels in the XOM/JDOM study
nu.xom  #always  #sometimes  #never  #unused
Attribute  13 / 3 [ ,11]  4 [1, 3]  11 / 25
Attribute.Type  3 [ ,3]  1/4
Builder  1 / 2 [ ,1]  7 [2, 5]  7 / 13
Comment  7 [ ,7]  2 [0, 2]  4 / 13
DocType  8 [ ,8]  5 [0, 5]  9 / 22
Document  7 / 1 [ ,7]  12 [1, 11]  8 / 26
Element  15 / 21 [ ,9]  28 [13, 15]  -  8 / 30 [2, ]
Elements  0 / 2 [ ,0]  2 [2, 0]
Node  2/2
NodeFactory  4 [0, 4]  1 [0, 1]  6 / 11
Nodes  2 / 2 [ ,2]  3 [2, 1]  4/7
ParentNode
ProcessingInstruction  9 [ ,9]  1 [0, 1]  7 / 17
Serializer  3 / 3 [ ,3]  8 [3, 5]  3 [0, 3]  2 / 13
Text  7 [ ,7]  1 [0, 1]  5 / 13
XPathContext  0 / 1 [ ,0]  5 / 4 [1, ]
Total  75 / 35 [ ,67]  77 [24, 53]  4 [0, 4]  79 / 200 [3, ]
XOM/CDK: The first number in each cell shows the compliance level for XOM’s test suite. The number after the slash (if any) shows the compliance level for CDK’s test suite. Note that all CDK test cases succeed; hence there are no methods at levels #sometimes or #never. [moves to #always, moves to #unused]: The numbers in square brackets (if any) describe the moves between the levels with the ‘initial’ position defined by XOM’s test suite and the ‘final’ position defined by CDK’s test suite. For example, Attribute had 11 methods moved from #always to #unused, 1 from #sometimes to #always, and 3 from #sometimes to #unused.
Table 7. Samples of compliance issues in the XOM/JDOM study
Type | Methods | Issue type | Domain | Status | Comment
Attribute | toXML() | Post | Serialization | resolved | JDOM's escaping is different from XOM's
Attribute | Attribute(String,String) | Pre | | resolved | XOM allows colonized names in the first argument whereas JDOM does not
Element | detach() | Invariant | | resolved | A root element must always remain attached.
Element | addAttribute(Attribute) | Throws | | resolved | XOM throws MultipleParentException if argument is parented whereas JDOM throws IllegalAddException
Element | setBaseURI(String) | Pre | BaseURI | unresolved | XOM aggressively checks URI for well-formedness and throws accordingly
Element | getBaseURI() | Post | BaseURI | unresolved | In XOM the result is absolutized and converted from IRI to URI if needed
The table illustrates that several methods with compliance issues with regard to the API’s test suite are used without problems in the application. Incidentally, there are even implementations that were not exercised by the API’s test suite but are exercised (and found compliant) by the application’s test suite. (See the numbers in bold face in the table for both of these effects.)
5.3 Discovery of Compliance Issues
In the test-driven process of pushing the wrapper towards compliance, one could simply focus on the number of compliant test cases. However, such plain focus would provide little insight into the underlying causes for failing test cases and the actual API mismatch. Also, it would provide no guidance with regard to the prioritization of non-compliant test cases. Instead, test-driven development is to be refined such that non-compliant test cases are incrementally examined and some API method is to be ‘blamed’ to have a compliance issue. Table 7 shows a few samples of documented compliance issues in the study. The format of these entries will be clarified gradually. All discovered issues are recorded by means of method annotations on the wrapper types. As an issue is discovered, a decision must be made whether or not effort is to be spent (immediately) on its resolution. If the issue was discovered through an ambitious test suite for an API, then it may be reasonable to refuse resolution—because the issue is considered either a) less relevant for actual applications, or b) too complicated for an automated approach, calling for a case-by-case migration instead. Table 8 summarizes all resolved and unresolved issues in the study. This relatively small number of issues was indeed discovered incrementally, and about half of the issues remained unresolved, while the ‘application under migration’ is still fully compliant.
5.4 Generic Compliance Issues
Compliance issues can be caused by differences in pre-/post-conditions, invariants, and throwing behavior. We call these issues generic in the sense that they are meaningful for APIs of any domain. The following definitions assume two APIs α and α′ with identical
Table 8. Number of resolved and unresolved XOM/JDOM issues
(a) #resolved
Type  #Pre  #Post  #Inv  #Throws
Attribute  3  1  4
Attribute.Type
Builder
Comment  2
DocType  1
Document  6  4
Element  5  1  8
Elements
Node
NodeFactory
Nodes
ParentNode
ProcessingInstruction
Serializer
Text  2
XPathContext
16  1  2  18
(b) #unresolved
Type  #Pre  #Post  #Inv  #Throws
Attribute
Attribute.Type
Builder  5  1  7
Comment  1
DocType  7  1  1
Document  1
Element  3  4
Elements
Node
NodeFactory  1
Nodes
ParentNode
ProcessingInstruction
Serializer
Text  1
XPathContext  1
15  10  1  8
interface. In the wrapping context, α is the original implementation of the source API, whereas α′ is the wrapper (at a given stage of development). We say that method m has a PRE issue if its pre-condition is stronger in α′ than in α. If we think of α′ as the intended replacement of α, then such an issue violates design-by-contract rules. The opposite situation also needs to be considered: we also say that m has a PRE issue if its pre-condition is weaker in α′ than in α. In this case, no violation of design-by-contract rules is present, but α′ is more (too) permissive than α. In the latter case, the issue can be addressed by adding extra checked assertions to the too permissive implementation. In the former case, a more complex implementation may be needed. Table 7 shows two examples of PRE issues in the study. In fact, the one on Attribute is about a too strong pre-condition (because JDOM rejects colonized names where XOM does not); the one on Element is about a too weak pre-condition (because JDOM checks less for well-formedness than XOM). As it is clear from the table, one of the issues was not resolved—well-formedness checking is particularly difficult to add to JDOM without leading to code bloat and possibly adaptation level 4. Likewise, we say that m has a POST issue if its post-condition in α′ is weaker than the one in α. Further, we say that class c has an INV issue if the invariant of c in α′ does not imply the one in α. Both kinds of issues violate design-by-contract rules. Yet another kind of generic compliance issue concerns exceptions. We say that m has a THROWS issue if for the case that the implementations α and α′ agree on whether or not to throw, the thrown exceptions are different (in terms of their types or observable content). This kind of issue happens when source and target APIs use API-specific exception types or differ in the use of reusable exception types.
5.5 Domain-Specific Compliance Issues
The generic categories are designed to fully cover all possible compliance issues. In any given API migration project, one may be able to categorize the nature of an issue at the
domain level. This categorization might help in stating arguments in favor of or against resolving certain issues, based on the given category’s relevance to the application being migrated. In the sequel, we sketch two of the categories of domain-specific issues that we discovered in the study; c.f., Table 7 for illustrations. Serialization. XML can be serialized in different, semantically equivalent ways. In particular, XOM and JDOM may produce serialization results that are equivalent under XML’s infoset semantics but different in terms of string-based comparison. These differences in serialization behavior are hard to neutralize by a wrapper or a transformation, but it is often easy to make applications (and their test cases) robust to such details by applying a sort of canonicalization or refraining from string-based comparison. BaseURI. XOM’s ‘base URI’ handling is considerably more advanced than JDOM’s handling. A full reproduction of XOM’s semantics on top of JDOM would account for complex method implementations. However, base URI handling is rarely used in XML processing code.5
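To connect these categories back to the wrapper code, the sketch below shows how a THROWS issue like the addAttribute sample of Table 7 might be resolved and recorded. The @ComplianceIssue annotation is hypothetical (the paper only states that issues are recorded through method annotations), and the unwrapping accessor and exception constructors are assumptions of the sketch.
// Sketch: resolving and recording a THROWS issue (cf. the addAttribute sample in Table 7)
@interface ComplianceIssue { String type(); String status(); }   // hypothetical annotation type

@ComplianceIssue(type = "Throws", status = "resolved")
public void addAttribute(Attribute attribute) {
    try {
        wrappee.setAttribute(attribute.getWrappee());             // delegate to JDOM (assumed accessor)
    } catch (org.jdom.IllegalAddException e) {
        // translate the target API's exception into the one XOM clients expect
        throw new MultipleParentException(e.getMessage());
    }
}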
6 Related Work
Wrapping is an established technique, in software re-engineering in particular [SM98]; legacy software is often wrapped for use in a new architecture, such as SOA [CFFT08]. We make a contribution to wrapping in so far that we leverage an API-type mapping and classification schemes for method implementations and compliance issues. In the introduction, we already referred to related work on API migration, and our discussion was meant to reveal that all such previous work focused on API evolution in the sense of migrating from one version of an API to the next version. There has been effort to facilitate refactoring in API evolution [HD05, Per05, TDX07, ŞRGA08, DNMJ08]. Some of these approaches use wrapping (adapters) as an implementation technique [ŞRGA08, DNMJ08]. Those wrappers are straightforwardly derived from refactorings; in contrast, our wrappers are the actual representations of relatively heterogeneous API mappings. Several approaches go beyond the limits of refactoring by providing some general means of transformation [CN96, KH98, BTF05, PLHM08]. Again, the showcases for all these approaches concern API evolution or migration between very much similar APIs. For instance, [BTF05] describes a rewriting-based approach for API migration that has been applied to the types Vector and ArrayList of the Java Core API, where the latter type is essentially a ‘careful redesign’ of the former. Nevertheless, the transformation techniques from such previous work are important ingredients of a general approach to API migration. Our efforts to gather metadata about APIs, such as API-type mappings or compliance issues, are well in line with other recent efforts on understanding APIs at an ontology level [RJ08]. We are also inspired by other related uses of metadata in program comprehension, reverse engineering and re-engineering [BCPS05, BGGN08].
5
Among all of the 43 SourceForge projects that use Subversion as repository and that use XOM, there is apparently only a single project that performs nontrivial base URI handling.
7 Conclusion We have researched API migration with specific interest in couples of source and target APIs that were developed independently of each other. We have engineered the process of API migration in this context and reported on one study concerning two popular XML APIs of the Java platform. The various differences between the chosen APIs were identified, classified, and measured in a systematic way. Our work shows that API migration for independently developed APIs may be manageable. Despite the many semantical and contractual differences, despite different features and designs, one can construct a reasonably compliant wrapper for API migration in a systematic, incremental, and test-driven manner. The use of a strong test suite for the API and a useful test suite for the application under migration are indeed critical. Our experiments substantiate that a wrapper-based reimplementation of an API may lack full compliance with the API’s test suite, while it can be still fully compliant with the test suite of the application under migration. One area of future work concerns the provision of a more general wrapping technique that can deal with all forms of subtyping, callbacks, and extensions points in APIs (and frameworks). We also need to generalize the described approach by applying it to other domains such as GUI or database programming. Further, we would like to abstract from the low-level approach of specifying API migrations as metadata-annotated wrapper implementations. That is, we seek an appropriate transformation language that can perhaps even be executed in two manners: either as a source-code transformation or as a wrapper generator. Finally, any resolved issue, say for a given method m, adds complexity to the API migration. A wrapper seems to hide that complexity ‘inside’, except perhaps for the implied performance penalty. Worse, the transformation option of API migration incurs the added complexity for every call to m. Hence, it is important to find an effective way of deciding on whether or not a given compliance issue needs to be dealt with for a given source location that calls m. Acknowledgements. This work is partially supported by IBM Centers for Advanced Studies, Toronto.
References
[Amb06] Ambler, S.W.: The Object-Relational Impedance Mismatch (2006), http://www.agiledata.org/essays/impedanceMismatch.html
[BCPS05] Bruno, M., Canfora, G., Di Penta, M., Scognamiglio, R.: An Approach to support Web Service Classification and Annotation. In: 2005 IEEE International Conference on e-Technology, e-Commerce, and e-Services (EEE 2005), Proceedings, pp. 138–143. IEEE Computer Society, Los Alamitos (2005)
[BDH+09] Brunel, J., Doligez, D., Hansen, R.R., Lawall, J.L., Muller, G.: A foundation for flow-based program matching: using temporal logic and model checking. In: Proceedings of the 36th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2009, pp. 114–126. ACM, New York (2009)
[BGGN08] Brühlmann, A., Gîrba, T., Greevy, O., Nierstrasz, O.: Enriching Reverse Engineering with Annotations. In: Czarnecki, K., Ober, I., Bruel, J.-M., Uhl, A., Völter, M. (eds.) MODELS 2008. LNCS, vol. 5301, pp. 660–674. Springer, Heidelberg (2008)
[BTF05] Balaban, I., Tip, F., Fuhrer, R.: Refactoring support for class library migration. In: OOPSLA 2005: Proceedings of the 20th annual ACM SIGPLAN conference on Object oriented programming, systems, languages, and applications, pp. 265–279. ACM, New York (2005)
[CFFT08] Canfora, G., Fasolino, A.R., Frattolillo, G., Tramontana, P.: A wrapping approach for migrating legacy system interactive functionalities to Service Oriented Architectures. Journal of Systems and Software 81(4), 463–480 (2008)
[CN96] Chow, K., Notkin, D.: Semi-automatic update of applications in response to library changes. In: ICSM 1996: Proceedings of the 1996 International Conference on Software Maintenance, p. 359. IEEE Computer Society, Los Alamitos (1996)
[DNMJ08] Dig, D., Negara, S., Mohindra, V., Johnson, R.: Reba: refactoring-aware binary adaptation of evolving libraries. In: ICSE 2008: Proceedings of the 30th International Conference on Software Engineering, pp. 441–450. ACM, New York (2008)
[HD05] Henkel, J., Diwan, A.: CatchUp!: capturing and replaying refactorings to support API evolution. In: ICSE 2005: Proceedings of the 27th International Conference on Software Engineering, pp. 274–283. ACM, New York (2005)
[KH98] Keller, R., Hölzle, U.: Binary component adaptation. In: Jul, E. (ed.) ECOOP 1998. LNCS, vol. 1445, pp. 307–329. Springer, Heidelberg (1998)
[KLV05] Klusener, A.S., Lämmel, R., Verhoef, C.: Architectural modifications to deployed software. Science of Computer Programming 54(2-3), 143–211 (2005)
[LM07] Lämmel, R., Meijer, E.: Revealing the X/O impedance mismatch (Changing lead into gold). In: Backhouse, R., Gibbons, J., Hinze, R., Jeuring, J. (eds.) SSDGP 2006. LNCS, vol. 4719, pp. 285–367. Springer, Heidelberg (2007)
[Per05] Perkins, J.H.: Automatically generating refactorings to support API evolution. In: PASTE 2005: Proceedings of the 6th ACM SIGPLAN-SIGSOFT workshop on Program Analysis for Software Tools and Engineering, pp. 111–114. ACM, New York (2005)
[PLHM08] Padioleau, Y., Lawall, J.L., Hansen, R.R., Muller, G.: Documenting and automating collateral evolutions in linux device drivers. In: Proceedings of the 2008 EuroSys Conference, pp. 247–260. ACM, New York (2008)
[RJ08] Ratiu, D., Juerjens, J.: Evaluating the Reference and Representation of Domain Concepts in APIs. In: 16th International Conference on Program Comprehension (ICPC 2008), pp. 242–247. IEEE Computer Society, Los Alamitos (2008)
[SM98] Sneed, H.M., Majnar, R.: A case study in software wrapping. In: International Conference on Software Maintenance (ICSM 1998), Proceedings, pp. 86–93. IEEE Computer Society, Los Alamitos (1998)
[ŞRGA08] Şavga, I., Rudolf, M., Götz, S., Aßmann, U.: Practical refactoring-based framework upgrade. In: GPCE 2008: Proceedings of the 7th international conference on Generative Programming and Component Engineering, pp. 171–180. ACM, New York (2008)
[TDX07] Taneja, K., Dig, D., Xie, T.: Automated detection of API refactorings in libraries. In: ASE 2007: Proceedings of the twenty-second IEEE/ACM international conference on Automated Software Engineering, pp. 377–380. ACM, New York (2007)
[Tho03] Thomas, D.: The Impedance Imperative: Tuples + Objects + Infosets = Too Much Stuff! Journal of Object Technology 2(5), 7–12 (2003)
Composing Feature Models Mathieu Acher1 , Philippe Collet1 , Philippe Lahire1 , and Robert France2 1
University of Nice Sophia Antipolis, I3S Laboratory (CNRS UMR 6070), 06903 Sophia Antipolis Cedex, France {acher,collet,lahire}@i3s.unice.fr 2 Computer Science Department, Colorado State University, Fort Collins, CO 80523, USA [email protected] Abstract. Feature modeling is a widely used technique in Software Product Line development. Feature models allow stakeholders to describe domain concepts in terms of commonalities and differences within a family of software systems. Developing a complex monolithic feature model can require significant effort and restrict the reusability of a set of features already modeled. We advocate using modeling techniques that support separating and composing concerns to better manage the complexity of developing large feature models. In this paper, we propose a set of composition operators dedicated to feature models. These composition operators enable the development of large feature models by composing smaller feature models which address well-defined concerns. The operators are notably distinguished by their documented capabilities to preserve some significant properties.
1
Introduction
Clements et al. define a software product line (SPL) as "a set of softwareintensive systems that share a common, managed set of features satisfying the specific needs of a particular market segment or mission and that are developed from a common set of core assets in a prescribed way" [1]. SPL engineering involves managing common and variable features of the family during different development phases (requirements, architecture, implementation), to ensure that family instances are correctly configured and derived [2]. In this context, Model-Driven Engineering is gaining more attention as a provider of techniques and tools that can be used to manage the complexity of SPL development. In model-based development of SPLs, feature models (FMs) [3, 4] are widely used to capture SPL requirements in terms of common and variable features. From an early stage (e.g. requirements elicitation) to components and platform modeling, FMs can be applied to any kind of artefacts (code, documentation, models) and at any level of abstraction. As a result, FMs can play a central role in managing variability and product derivation of SPLs (e.g., see [5, 6, 7]).
This work was partially funded by the French ANR TL FAROS project.
Like other model-based approaches, SPL engineering now faces major scalability problems and FMs with thousands of features are not uncommon [8, 9]. Creating and maintaining such large FMs can then be a very complex activity [10, 11, 12, 13, 14, 15]. This problem indicates a need for tools that developers can use to better manage complexity. One way that this can be done is to provide the means to separate the concerns or the business domains in an SPL. Our work focuses on an approach that puts FMs at the center of SPL management. The separation of concerns approach we propose enables stakeholders to manage and maintain FMs that are specific to a business domain, a technological platform or a crosscutting concern. In this paper, we propose generic composition operators to compose FMs in order to produce a new FM. The proposed operators have been determined through a classification of possible manipulations when composing elements of two FMs. This classification is inspired by the similar distinctions made when composing models (introduction, merging, modification, extension) [16]. The proposed insert operator supports different ways of inserting features from a crosscutting FM into a base FM. Depending on the inserted and targeted feature nodes, we determine whether the insertion preserves the set of configurations determined by the input FMs. This preservation property is called the generalization property. We also propose a merge operator that is capable of combining matching features in two input FMs. This operator is defined using the insert operator and similar properties are also determined. The remainder of this paper is organized as follows. Section 2 describes the motivation for separating and composing FMs through an example. Section 3 sets out the rationale behind the design of the proposed composition operators and discusses properties that are used to characterize the provided operators. Section 4 and Section 5 detail the insert and merge operators and illustrate their use on the example presented in Section 2. Section 6 discusses related work. Section 7 describes future work and concludes this paper.
2 Motivation
The plethora of feature definitions [17] suggests that FMs can be used at different stages of the SPL development, from high-level requirements to code implementation. In this paper, FMs are considered from a general perspective in that FMs are not restricted to a specific development phase. As a result, a FM can just as well describe a family of software programs, a family of requirements or a family of models.
2.1 Feature Model
FMs organize a hierarchy of features while explicitly specifying the variability [18]. Features of a FM are nodes of a tree represented by strings and related by various types of edges [19]. The edges are used to progressively decompose features into more detailed subfeatures. (The tree structure starts from the root
feature, which is then the parent of its child features and so on.) Some mechanisms are also used to express variabilities in a FM. Hence, a group of child features can form an And-, Xor-, or Or-group. Features in an And-group can be either mandatory or optional subfeatures of the parent feature. There are some rules that determine whether a FM is well-formed or not. For example, there cannot be an And-, Or- or Xor-group with only a single child. In Fig. 1, the concept of person is represented as a FM, whose root feature is Person. Information associated with a person includes housing, transport and telephone, which are mandatory features. The transport feature consists of either a car or an other kind of transport. These child features are mutually exclusive and thus are organized in a Xor-group. The housing feature is composed of any combination of an address, a street name or a street number feature. Since their original definition by Kang et al. [3], several FM notations have been proposed [19]. The FM language used throughout this paper supports the standard structures previously described, but we do not consider directed acyclic graph structures and do not deal with constraints defined across features, whether they are internal to a FM or between several FMs. Nevertheless, taking constraints on FMs into account is part of our future work (see Section 7).
Fig. 1. A feature model representing the concept of person
A FM is a representation of a family and describes the set of valid feature combinations. Every member of a family is thus represented by a unique combination of features (a member of a family can be an "instance", a "product", a "program", etc.; these terms are equivalent, and their use depends on the kind of family represented). In the remainder of the paper, a combination of selected features is called a configuration of a FM. A configuration is valid if the selection of all features it contains, together with the deselection of all other features, is allowed by the FM. The validity of a configuration is determined by the semantics of the FM, which prevents the derivation of illegal configurations. A FM is thus a characterization of a set of valid configurations. The semantics of a FM can be expressed in terms of the following rules: i) if a feature is selected, its parent must also be selected; the root feature is thus
always included in any configuration; ii) if a parent is selected, all the mandatory features of its And-group are selected; iii) if a parent is selected, exactly one feature of its Xor-group must be selected; iv) if a parent is selected, at least one feature of its Or-group must be selected (it is also possible to select more than one feature of its Or-group). A valid configuration of the FM depicted in Fig. 1 follows:
{Person, housing, telephone, transport, address, streetName, areaCode, car}
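To make these rules concrete, the following small sketch (ours, not the authors' tooling) checks a configuration against them; the encoding of the Fig. 1 tree is a simplifying assumption, e.g. the children of housing are taken loosely from the text.

# A minimal illustrative encoding: an And-group maps children to "mand"/"opt",
# Xor/Or groups list their children; leaf features need no entry.
FM = {
    "Person":    ("and", {"housing": "mand", "transport": "mand", "telephone": "mand"}),
    "housing":   ("or",  ["address", "streetName", "areaCode"]),
    "transport": ("xor", ["car", "other"]),
}

def children(f):
    kind, spec = FM.get(f, ("and", {}))
    return list(spec)

def parent_of(f):
    return next((p for p in FM if f in children(p)), None)

def is_valid(config, root="Person"):
    """Check rules i)-iv) for a set of selected feature names."""
    if root not in config:                                   # rule i): the root is always selected
        return False
    for f in config:
        p = parent_of(f)
        if p is not None and p not in config:                # rule i): parent of a selected feature
            return False
        kind, spec = FM.get(f, ("and", {}))
        selected = [c for c in children(f) if c in config]
        if kind == "and" and any(spec[c] == "mand" and c not in config for c in spec):
            return False                                     # rule ii): mandatory children selected
        if kind == "xor" and len(selected) != 1:
            return False                                     # rule iii): exactly one child of a Xor-group
        if kind == "or" and len(selected) < 1:
            return False                                     # rule iv): at least one child of an Or-group
    return True

print(is_valid({"Person", "housing", "telephone", "transport", "address", "streetName", "areaCode", "car"}))  # True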
2.2 A Running Example
We use the following example to illustrate the FM composition operators described in this paper. The example is complex enough to illustrate composition needs. In Fig. 2, the concept of person is designed from a general perspective and described as a FM. It acts as a base or primary model that may not provide all the elements required by an application or system concerned with a person, that is, it may be augmented with other features describing different aspects of a person. We explain how this base model can be composed incrementally with other FMs describing different aspects of features in the base model. These other FMs are called aspects. Let us take a first aspect called Service Provided, which deals with the services that may be offered to a person, and another aspect called Transport, which addresses the kinds of transport that may be used by a person. These two aspects are orthogonal to the concept of person. Furthermore, they are not specifically tied to the concept of person and can thus be composed with other base models, e.g., representing a hotel or a nursing home. Additionally, the concept of person is enriched using two other aspects. The first aspect describes features that provide information about the living
Fig. 2. Integrating several feature models
environment of a person, while the second aspect describes features that define its economic characteristics. These aspects may be considered as different viewpoints that represent the concept of person from the perspective of stakeholders' interests. Fig. 2 shows the four aspects to be composed with the base FM depicted in Fig. 1. The Service Provided and Transport aspects are orthogonal to the concept of person, whereas the Economical and Living Environment aspects are additional facets of the concept of person.
2.3 Requirements
The example presented above highlights the need for compositional operators that can i) add information (e.g. a subset of the features of a FM) to an existing feature, ii) refine some features with more detailed information, and iii) merge the contents of several features. The operators should work at the feature level to enable a modeler to compose only part of a FM with another FM. This should also enable reuse of part of an input FM when creating a larger composed FM. Additionally, one may need to reuse more than one part of a FM, or the same part several times. One should also be able to preselect some of the features of one aspect before the composition is performed. The running example shown in Fig. 2 illustrates a sequence of introductions and mergings of features. These requirements mean that composing two models can correspond to a wide range of situations, from the single use of one operator on the roots of two models to be merged, to multiple uses of one or several operators on various features of these aspects. In addition, taking into account the expressiveness of FMs, there are several ways to introduce one feature into another one or to merge them. Previous work has pointed out that dealing with large, monolithic FMs is problematic; in particular, FM maintenance is a difficult process [11, 12, 10]. As in our running example, an appealing approach is rather to use multiple FMs during SPL development. A first challenge is to allow different stakeholders or software suppliers, at different stages of the software development, to focus on their expertise and integrate their specific concerns. Another challenge is to manage the evolution of FMs [13, 14, 15]. In order to ensure that software products are well maintained, some relevant properties of the models have to be preserved over time. A primary issue in all these works is to define some compositional mechanisms. But, to the best of our knowledge, they do not i) provide a set of composition operators, ii) define the semantics of these operators according to the expressed configurations, or iii) propose a systematic technique to implement them.
3 Rationale
In order to meet the requirements above, we first identify some relevant semantic properties regarding composition operators. Then we discuss our main design choices regarding the proposed operators. These operators aim to compose two
concerns represented in two FMs. We then distinguish the aspect concern from the base concern. The result of the composition is described according to the set of configurations of the base concern.
3.1 Characterizing the Result of a Compositional Operator
Let f be a base FM and let f′ be the FM that results from applying an operator op to f using an aspect FM g. The semantics of the operator op is expressed in terms of the relationship between the configuration sets of the input models (f and g) and of the resulting model f′. In [14], the authors distinguish and classify four FM adaptations²: a refactoring adds no new configurations and removes no existing ones (the configuration sets of f and f′ are equal); a specialization removes some existing configurations and adds none (the configuration set of f′ is a strict subset of that of f); a generalization adds new configurations and removes none (the configuration set of f is a strict subset of that of f′); an arbitrary edit is a change that is neither a refactoring, a specialization nor a generalization. The classification proposed in [14] covers all the changes a designer can produce on a FM, and the formalization provided in [14] is a sound basis for reasoning about these changes. We rely on these four categories of FM adaptations in order to characterize the semantics of the insert and merge operators (see Sections 4 and 5).
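Expressed over configuration sets given extensionally, this classification is a direct comparison of sets; the following sketch is ours (the sets are illustrative, not taken from the paper).

# Classify an FM adaptation by comparing the configuration sets of f and f'.
def classify(base_configs, result_configs):
    if result_configs == base_configs:
        return "refactoring"
    if result_configs < base_configs:       # strict subset: configurations only removed
        return "specialization"
    if result_configs > base_configs:       # strict superset: configurations only added
        return "generalization"
    return "arbitrary edit"

base = {frozenset({"Person", "transport", "car"}),
        frozenset({"Person", "transport", "other"})}
extended = base | {frozenset({"Person", "transport", "car", "other"})}
print(classify(base, extended))             # generalization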
3.2 Main Design Choices
The composition of an aspect and a base concern may correspond either to the single use of the two proposed compositional operators (insert or merge), or to any combination of these two operators. Any of the two compositional operators ensure that the result of a successful composition is a well-formed FM (see Section 2). Scope of an operator. An operator specifies what feature(s) g of the aspect concern is to be composed with features in the base concern, and where (i.e. which feature in the base model f ) it is going to be inserted or merged with3 . All features of the aspect concern not included in the hierarchy starting with g are not involved in the composition process and are not included in its result. 2
3
The author use the term “edits” because the focus seems to be on local edits on FM. An example of edit given in the paper is “moving a feature from one branch to another”. To choose the root feature is equivalent to consider the whole FM.
An aspect concern is either strongly or loosely related to the base concern. It can participate in the description of the same concept while considering another facet of the information (another viewpoint), or its purpose is orthogonal to the concept described in the base concern. For example, the concern dealing with the economic information of a person corresponds to the first case, whereas the kind of transport that may be offered in general (i.e. not only to a person) corresponds to the second case. Let us now address how to compose a FM g with f, and let us emphasize why both insert and merge are needed. The insert operator makes it possible to specify any applicable FM operator (i.e. And-, Xor-, or Or-group) to compose g and f. It is more suited to the case of loosely connected aspects. Merge determines the FM operator to be used and corresponds to the composition of two views of the same concept. Merge is higher level, and we show that it may be implemented using the insert operator (see Sections 4 and 5).
Renaming. When two features are merged, two typical cases may occur: two features with the same name (resp. different names) in the base and aspect models may not address the same meaning (resp. may correspond to the same meaning). We provide an operator rename that allows the user to align the two FMs before composition. For the sake of brevity, the renaming operator is not detailed in this paper.
Limits. We might have included more operators, as proposed in several approaches coming from the Aspect-Oriented Modeling community [20]. Mainly, they deal with two other kinds of operators: replace and delete. We chose not to do so, but not for the same reasons. Instead of proposing a new operator for deleting features in the base model⁴, we propose that i) the semantics of merge may rely either on the semantics of the intersection (to keep only the common features) or of the union (to keep all features) and ii) more generally, an operator may perform some deletion according to its semantics and in order to guarantee that the resulting FM is well-formed. We consider replace only as a special case of merge with some possible renamings before composition.
4 Insert Operator
The insert operator aims at introducing newly created elements into any base element or at inserting elements from the aspect model into the base model. For example, a stakeholder can extend the transport feature associated with a Person (left part of Fig. 3(a)) by including the urban transport information, represented in an aspect FM (right part of Fig. 3(a)). The dotted arrow indicates that the feature urbanTransport is inserted below the feature transport; it does not indicate how the feature tree will be inserted (e.g. which variability information will be associated with the feature tree).
4 According to what has been said at the beginning of the section, there is no need for such operators for the aspect concern.
(a) Insertion of the Urban transport aspect
(b) A possible resulting FM
Fig. 3. Example of insertion of FM
The stakeholder needs syntactic mechanisms to define precisely how the insertion is achieved.
4.1 Syntactic Definition
The insert operator is syntactically defined as follows:
insert (aspectFeature: Feature, joinpointFeature: Feature, operator: Operator)
It takes three arguments: the feature to be inserted (a feature in the aspect model), the targeted feature (a feature in the base model) where the insertion needs to be done, and the operator (e.g. Xor-group) specified by the user. The precondition of the insert operator requires that the intersection between the set of features of the base FM and that of the aspect FM is empty. This condition preserves the well-formedness property of the composed FM, which states that each feature's name is unique. The insert's parameters allow the stakeholder to control the insertion by addressing the three following issues: Where will the aspect FM be inserted into the base FM? The joinpointFeature is a feature of the base FM and describes where the aspectFeature should be inserted into the base FM. What feature(s) of the aspect FM will be inserted into the base FM? The aspectFeature feature is inserted and comes with its child features. If the aspectFeature feature is the root of an aspect FM, the aspect FM is entirely inserted into the base FM. Otherwise, only the subtree starting at aspectFeature is inserted. How will the insertion be done? What are the effects on the properties of the composed model? According to the third argument operator (e.g. Xor-group) and the group (e.g. Or) of the joinpointFeature in the base FM, the insertion can change the group of the aspectFeature to be inserted. The remainder of this section defines the semantics and the rules to implement it.
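A minimal sketch of the disjointness precondition, on a toy (name, group, children) tuple encoding of FMs that is an assumption of ours, not the operator's actual implementation:

def names(fm):
    name, _group, children = fm
    result = {name}
    for child in children:
        result |= names(child)
    return result

def insert_precondition(aspect_fm, base_fm):
    # feature names must be unique in the composed FM, so the two name sets must be disjoint
    return names(aspect_fm).isdisjoint(names(base_fm))

base   = ("Person", "and", [("transport", "xor", [("car", "and", []), ("other", "and", [])])])
aspect = ("urbanTransport", "xor", [("bike", "and", []), ("twoWheeledVehicle", "and", [])])
print(insert_precondition(aspect, base))    # True: the insertion of Fig. 3 is allowed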
4.2 Semantics
The semantics of the insert operator is represented by the relationship that exists between the new composed model and the base/primary model, so that it refers to the properties preserved or not by the composed model according to its set of configurations. The insert operator should respect one (or more) of the properties defined in Section 3.1 (generalization, specialization, refactoring or none of these) considering the composed model and the base model. A stakeholder can thus anticipate the changes to the base model while applying the insertion. Intuitively, if an aspect model is added somewhere in a base model Base, the set of configurations of Base should grow. The new version of Base which results from applying the insert operation can produce a generalization: new configurations are added and no existing configurations are removed. But the situation corresponding to an arbitrary edit may also happen, depending on the operator that is passed as parameter of insert: some new configurations are added while some others are removed. The refinement of a FM can indeed alter the existing configurations such that they become deprecated. According to their definition (see Section 3.1), specialization and refactoring are not possible because they correspond to situations that are not compatible with the meaning of an insertion. This simply follows the rationale behind the insert operator, which is to add details and to populate the base model with additional information. In the remainder of this section, Base FM corresponds to the (sub-)tree of the base FM whose root is joinpointFeature, while Aspect FM corresponds to the (sub-)tree of the aspect FM whose root is aspectFeature. More formally, the semantics of insert is defined as follows:
– The set of configurations of the FM after insertion (Result) is at least the set of configurations of the Base FM. This can be expressed as follows:

Base ⊂ Result    (I1)

– or the set of configurations of Result is at least the set of configurations of the cross product of Base and Aspect. This can be expressed as follows:

Base ⊗ Aspect ⊆ Result    (I2)
where the cross product is defined as follows (A and B being sets of sets): A ⊗ B = {a ∪ b | a ∈ A, b ∈ B}. The two relations (I1) and (I2) define the semantics. The former states that the Result FM is a generalization of the Base FM. The latter ensures that each configuration of the Base FM is supplemented by the features of the Aspect FM. The insert operator may, in some situations, respect i) only one of the relations (i.e. (I1) or (I2)) or ii) both of them (i.e. (I1) and (I2)). A supporting tool can easily exploit this information to produce appropriate warnings when an insertion only preserves one relation and thus assist modelers in reasoning during composition. As an example, let us consider the set of configurations of the base FM included in the left part of Fig. 3(a), Base,
Base = {{Person, transport, car}, {Person, transport, other}}
the set of configurations of the aspect FM included in the right part of Fig. 3(a), Aspect,
Aspect = {{urbanTransport, bike}, {urbanTransport, twoWheeledVehicle}}
and the set of configurations of the composed FM, corresponding to an insertion using the Xor operator and described in Fig. 3(b), Result:
Result = {{Person, transport, car}, {Person, transport, other}, {Person, transport, urbanTransport, bike}, {Person, transport, urbanTransport, twoWheeledVehicle}}
The relationships between Base, Aspect and Result respect only the relation (I1). As a result, the composed FM of Fig. 3(b) is a generalization of the base FM from the left part of Fig. 3(a).
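These relationships can be checked mechanically over the configuration sets listed above; the helper below is ours (cross implements the ⊗ product defined earlier).

def cross(A, B):
    return {a | b for a in A for b in B}     # A ⊗ B = {a ∪ b | a ∈ A, b ∈ B}

Base = {frozenset({"Person", "transport", "car"}),
        frozenset({"Person", "transport", "other"})}
Aspect = {frozenset({"urbanTransport", "bike"}),
          frozenset({"urbanTransport", "twoWheeledVehicle"})}
Result = Base | {frozenset({"Person", "transport", "urbanTransport", "bike"}),
                 frozenset({"Person", "transport", "urbanTransport", "twoWheeledVehicle"})}

print(Base <= Result)                        # True: (I1) holds, the insertion is a generalization
print(cross(Base, Aspect) <= Result)         # False: (I2) does not hold for this Xor insertion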
4.3 Rules
In this subsection, we describe the rules associated with an insertion. They define when and how the operator passed as an argument preserves (or not) the previously described properties on the base FM. The rules are given on a base model called Base, which has a root feature B and one or several children B1, B2, ..., Bn. The model to be inserted is called Aspect; it has a root feature A whose child features are A1, A2, ..., An.
(a) Base FM
(b) Aspect FM
(c) One possible resulting FM
Fig. 4. Rule for insertion of FM
Let us consider the insertion of Aspect (Fig. 4(b)) into Base (Fig. 4(a)). If the operator passed to insert is "And with the mandatory status", the feature A is inserted as a child feature of B with the mandatory status (Fig. 4(c)). For this example, the sets of configurations of Base, Aspect, and Result are:
Base = {{B, B1, B2}, {B, B2}}
Aspect = {{A, A1}, {A}}
Result = {{B, A, B1, B2, A1}, {B, A, B2, A1}, {B, A, B1, B2}, {B, A, B2}}
Consequently, the relation (I1) does not hold. For instance, {B, B1, B2} is not a member of Result. Nevertheless, the relation (I2) is satisfied and the resulting FM is an arbitrary edit of the Base FM. On the contrary, if the stakeholder wants to preserve the (I1) property, the feature A should be inserted as a child feature of B with the optional status.
Overview of the table of rules. The result of an insertion of a given feature only depends on i) the operator passed as argument of insert and ii) the operator associated with the feature where the insertion is made. All combinations are given in Table 1. We distinguish the cases where no FM operator is associated with a feature of the base FM (it is a leaf) and those where there is either an And, Or or Xor operator. Insert may accept the following operators: And with mandatory (resp. optional) sub-features, Or and Xor. The table summarizes the properties that are verified by the Result FM for each combination. When "=" is set, this means that the set of configurations of the Result FM is strictly equal to Base ⊗ Aspect. Note that the insertion of one single feature with an Or or Xor operator into a leaf feature is forbidden, as it would generate badly-formed FMs. Nevertheless, this is possible when the insertion deals with a set of features of the aspect model (i.e. the parameter aspectFeature is a set and not a single feature).

Table 1. Insertion rules

Base / Operator   And-Mandatory   And-Optional   Xor   Or
Leaf              = I2            I1 and I2      I1    I1 and I2
And               = I2            I1 and I2      I1    I1 and I2
Xor               = I2            I1 and I2      I1    I1 and I2
Or                = I2            I1 and I2      I1    I1 and I2
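The same kind of check confirms the "= I2" entry of Table 1 for the And-Mandatory insertion of Fig. 4; the cross helper is repeated here so the sketch (ours) stands alone.

def cross(A, B):
    return {a | b for a in A for b in B}

Base = {frozenset({"B", "B1", "B2"}), frozenset({"B", "B2"})}
Aspect = {frozenset({"A", "A1"}), frozenset({"A"})}
Result = {frozenset({"B", "A", "B1", "B2", "A1"}), frozenset({"B", "A", "B2", "A1"}),
          frozenset({"B", "A", "B1", "B2"}), frozenset({"B", "A", "B2"})}

print(Base <= Result)                        # False: (I1) is violated ({B, B1, B2} is missing)
print(Result == cross(Base, Aspect))         # True: matches the "=" entry for And-Mandatory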
5 Merge Operator
When two FMs share several features and are different views of an aspect of a system, it is necessary to merge the overlapping parts of the two FMs to obtain a single model that presents an integrated view of the system. Let us consider the example of a base FM (left part of Fig. 5(a)). The root feature is the Person feature, which has a child feature transport with two alternative features car and other. The aspect FM (right part of Fig. 5(a)) describes the concept of Person from another perspective. In that case, a person also has the feature meansOfTransport, but the set of alternatives is structured in an Or-group and also includes additional features such as bike, publicService and twoWheeledVehicle. The merge operator can then be used to unify the two viewpoints from the FMs. A mapping can be specified by the stakeholder (e.g. to relate the feature transport of the base FM and the feature meansOfTransport of the aspect FM). More importantly, the merged FM should verify some properties such as the preservation of configurations. This requires solving some of the variability issues in each FM.
(a) Base and Aspect FMs to be merged
(b) Merged FM
Fig. 5. Merging of two FMs
For example, in Fig. 5(a), the features car and other cannot be concurrently selected in the Base FM, whereas the selection of both of them is allowed by the Aspect FM.
5.1 Syntactic Definition
The merge operator is syntactically defined as follows:
merge (aspectFeature: Feature, baseFeature: Feature, mode: Mode)
It takes three arguments: the feature to be merged (a feature of the aspect model), the feature in the base model where the merge is done, and the mode specified by the user. This mode indicates how the merge has to be done in terms of union or intersection of configurations (see below). As for the insert operator, the merge's parameters allow the stakeholder to answer the same three questions: Where are the features of the aspect FM and the base FM such that the two FMs match? To merge FMs we thus need to first identify match points (similar to joinpoints in aspect terminology). The stakeholder can thus specify the feature aspectFeature of the aspect FM and the feature baseFeature of the base FM. They are not necessarily the roots of the FMs.
What are the features of the aspect FM and base FM that will appear in the merged model? Two FMs are merged by applying the operator recursively to their subtrees, starting from the match points (aspectFeature and baseFeature). If two features have been merged, the whole process proceeds with their child features. If not, they are inserted as separate child features. The variability information associated with features in the merged model should also be set. How are features merged by the operator? It uses name-based matching: two features match if and only if they have the same name. If so, they are merged to form a new feature. Features with different names can be bound to each other thanks to an explicit renaming (see Section 3). Finally, a set of rules resolves possible variability mismatches between features of the two FMs according to the mode (i.e. the third argument of the merge operator).
5.2 Semantics
Like for the insert operator, the semantics of merge is defined according to the relationship which exists between the FM resulting from the merge and the two input FMs. It is based on the union or the intersection of the two configuration sets.
Union. When transport is merged with meansOfTransport (see Fig. 5), the original information from the base model must be preserved while adding information from the aspect model. The sets of configurations of the base and aspect FMs should then be preserved in the merged FM. The union of two FMs, Base and Aspect, is a new FM where each configuration that is valid either in Base or in Aspect is also valid. More formally, the result of a merge in the union mode has the following properties:
– The set of configurations of the FM after merging (Result) is at least the set of configurations of the Base FM (i.e. the Result FM is a generalization or a refactoring of the Base FM). This can be expressed as follows:

Base ⊆ Result    (M1)

– The set of configurations of Result is at least the set of configurations of the Aspect FM (i.e. the Result FM is a generalization or a refactoring of the Aspect FM). This can be expressed as follows:

Aspect ⊆ Result    (M2)

Note that if the relations (M1) and (M2) are met, the following relationship holds:

Base ∪ Aspect ⊆ Result
This means that the merged FM may allow some configurations that are included neither in the set of configurations of the base FM nor in that of the aspect FM. In order to restrict these configurations, we propose to reinforce the constraints on the merged FM with an additional property (see (M3)). It states that the set of configurations of Result is at least the set of configurations of the cross product of Base and Aspect. This can be expressed as follows:

Base ⊗ Aspect ⊆ Result    (M3)

(M3) can hold concurrently with (M1) and (M2), individually, or not at all.
Intersection. When transport is merged with meansOfTransport (see Fig. 5), only the common information of the base model and the aspect model is retained: the intersection of two FMs, Base and Aspect, is a new FM where each configuration that is valid both in Base and in Aspect is also valid. In the intersection mode, the relationship between the merged FM Result, the base FM Base and the aspect FM Aspect can be expressed as follows:

Base ∩ Aspect = Result    (M4)

Besides, if the following condition holds:

Base ∩ Aspect = ∅    (M5)

the FM Result defines no configuration at all and can be considered as an inconsistent or unsatisfiable FM [8].
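With configuration sets represented extensionally, the union- and intersection-mode properties become simple set checks that a supporting tool could run; the sets below are illustrative (ours), not those of Fig. 5.

def cross(A, B):
    return {a | b for a in A for b in B}

Base   = {frozenset({"T", "car"}), frozenset({"T", "other"})}
Aspect = {frozenset({"T", "car"}), frozenset({"T", "car", "other"})}

# hypothetical configuration set of a FM merged in union mode (an Or-group over car/other)
Union = {frozenset({"T", "car"}), frozenset({"T", "other"}), frozenset({"T", "car", "other"})}
print(Base <= Union, Aspect <= Union)        # (M1) and (M2) hold
print(cross(Base, Aspect) <= Union)          # (M3) holds

# intersection mode keeps only the configurations valid in both input FMs
Intersection = Base & Aspect
print(Intersection == {frozenset({"T", "car"})})   # (M4); an empty result would trigger (M5)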
5.3 Merging Rules
We now describe the rules for merging FMs. These rules aim at resolving variabilities in each FM such that the expected properties are met. For example, in Fig. 5, features car and other do not exhibit the same variability, as they belong to a Xor-group in the base FM whereas they belong to an Or-group in the aspect FM. Not surprisingly, the sets of configurations of the base FM and the aspect FM are not the same, and some configurations are valid in one FM but not in the other. For example, {Person, meansOfTransport, car, housing} is only valid in the aspect FM (since the feature housing is included in all its configurations). Yet, the merged FM should be able to express the set of configurations of both FMs. To tackle this issue, we propose i) to make an explicit difference between common and non-common features of the two FMs and ii) to (re-)use the insert operator at each step of the merge. As the common features of the two FMs can belong to different groups, a new variability operator has to be chosen in accordance with the intended semantic properties (i.e. merge in the union or
intersection mode). We thus propose to organize the rules to compute the variability operator into predominance tables. Tables 2 and 3 assume that the same set of features is shared by the base and aspect FMs.

Table 2. Merge in union mode - relations (M1) and (M2) are satisfied

Base / Aspect    And-Mandatory   And-Optional   Xor            Or
And-Mandatory    And-Mandatory   And-Optional   Or             Or
And-Optional     And-Optional    And-Optional   And-Optional   And-Optional
Xor              Or              And-Optional   Xor            Or
Or               Or              And-Optional   Or             Or

Table 3. Merge in intersection mode - relation (M4) is satisfied

Base / Aspect    And-Mandatory   And-Optional   Xor            Or
And-Mandatory    And-Mandatory   And-Mandatory  And-Mandatory  And-Mandatory
And-Optional     And-Mandatory   And-Optional   Xor            Or
Xor              And-Mandatory   Xor            Xor            Xor
Or               And-Mandatory   Or             Xor            Or

Fig. 6. Merging example

In Fig. 6, features car and other are child features of transport. They belong either to a Xor-group in the Base FM or to an Or-group in the Aspect FM. In this case, the predominant operator is an Or-group, that is, the features car and other can both be selected at the same time (i.e. (M1) is respected), or car and other can each be selected alone (i.e. (M2) is respected). As a result, the relations (M1) and (M2) indeed hold for the merged FM depicted in the bottom left part of Fig. 6. Moreover, the relation (M3) holds too.
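Tables 2 and 3 translate directly into lookup tables. The encoding below is ours (abbreviated operator names); it relies on the tables being symmetric in their two arguments.

UNION = {   # union mode, preserving (M1) and (M2)
    ("and-mand", "and-mand"): "and-mand",
    ("and-mand", "and-opt"):  "and-opt",
    ("and-mand", "xor"):      "or",
    ("and-mand", "or"):       "or",
    ("and-opt",  "and-opt"):  "and-opt",
    ("and-opt",  "xor"):      "and-opt",
    ("and-opt",  "or"):       "and-opt",
    ("xor",      "xor"):      "xor",
    ("xor",      "or"):       "or",
    ("or",       "or"):       "or",
}

INTERSECTION = {   # intersection mode, satisfying (M4)
    ("and-mand", "and-mand"): "and-mand",
    ("and-mand", "and-opt"):  "and-mand",
    ("and-mand", "xor"):      "and-mand",
    ("and-mand", "or"):       "and-mand",
    ("and-opt",  "and-opt"):  "and-opt",
    ("and-opt",  "xor"):      "xor",
    ("and-opt",  "or"):       "or",
    ("xor",      "xor"):      "xor",
    ("xor",      "or"):       "xor",
    ("or",       "or"):       "or",
}

def compute_operator(base_op, aspect_op, mode):
    table = UNION if mode == "union" else INTERSECTION
    return table.get((base_op, aspect_op)) or table[(aspect_op, base_op)]  # tables are symmetric

print(compute_operator("xor", "or", "union"))         # 'or'  (example of Fig. 6, union mode)
print(compute_operator("xor", "or", "intersection"))  # 'xor' (example of Fig. 6, intersection mode)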
Merging in the intersection mode the features car and other of the aspect FM (which belong to an Or-group) with the features car and other of the base FM (which belong to a Xor-group) gives the predominant operator Xor (see the bottom right part of Fig. 6). The relation (M4) indeed holds.

Algorithm 1. Merging algorithm

merge (aspectFeature: Feature, baseFeature: Feature, mode: Mode)
begin
  if ¬matching(aspectFeature, baseFeature) then "error" fi
  new := newFM(newFeature(baseFeature.getName()))
  predominanceOp := computeOperator(baseFeature, aspectFeature, mode)
  base := extractChild(baseFeature)
  aspect := extractChild(aspectFeature)
  foreach N ∈ (base ∩ aspect) do
    res := merge(aspectFeature::N, baseFeature::N, mode)   /* recursively */
    stackFeatures.push(res)                                /* pushes the merged feature */
  od
  /* insert the set of features of the stack */
  insertmulti(stackFeatures, new, predominanceOp)
  /* the following loops are not executed in the intersection mode */
  foreach N ∈ (base \ (base ∩ aspect)) do
    insert(N, new.getRoot(), predominanceOp)
  od
  foreach N ∈ (aspect \ (base ∩ aspect)) do
    insert(N, new.getRoot(), predominanceOp)
  od
  return new
end
We define an algorithm for the merge that implements the principles above (see Algorithm 1). As an illustration, let us consider the merge of the Base Model and the Aspect Model depicted at the top of Fig. 6. The merge operator is used with the first parameter being the transport feature of the base FM, the second parameter being the transport feature of the aspect FM, and the third parameter being the union mode.
Algorithm for the merge. First, a new FM is created with one single feature called "transport", which becomes its root and acts as a temporary FM where the features of the base and aspect FMs will be incrementally inserted. The predominant operator is computed using the predominance table corresponding to the mode. In the example, we obtain an Or-group with the union table (see the bottom left part of Fig. 6). The common features of the two FMs (i.e. car and other) are merged recursively. Then, they are inserted all together with the predominant operator. At this stage, the connection between the transport root feature of the temporary FM and its group of children car and other is an Or-group. The next step is to insert the non-common features urbanTransport and publicService with the Or-operator into the root feature of the temporary FM, transport. The insertion of a feature with an Or-operator into a feature which is
connected to its group of children by an Or-group respects (I1) and (I2). As a result, urbanTransport and publicService also belong to an Or-group. In the intersection mode, the algorithm is executed when the condition (M5) does not hold. Only the set of common features is considered. In the example, only the features car and other are merged. The result is depicted in the bottom right part of Fig. 6. The predominant operator is the Xor-group.
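A compact executable rendering of Algorithm 1 (ours, heavily simplified) on a toy (name, group, children) encoding; the predominance lookup below only covers the cases needed for Fig. 6, not the full Tables 2 and 3.

def compute_operator(base_op, aspect_op, mode):
    if base_op == aspect_op:
        return base_op
    return "or" if mode == "union" else "xor"   # e.g. Xor/Or gives Or (Table 2) or Xor (Table 3)

def merge(aspect, base, mode):
    a_name, a_op, a_children = aspect
    b_name, b_op, b_children = base
    assert a_name == b_name, "features must match (possibly after renaming)"
    op = compute_operator(b_op, a_op, mode)
    a_map = {c[0]: c for c in a_children}
    b_map = {c[0]: c for c in b_children}
    children = [merge(a_map[n], b_map[n], mode) for n in b_map if n in a_map]  # common features
    if mode == "union":                       # non-common features are only kept in union mode
        children += [b_map[n] for n in b_map if n not in a_map]
        children += [a_map[n] for n in a_map if n not in b_map]
    return (b_name, op, children)

base   = ("transport", "xor", [("car", "and", []), ("other", "and", [])])
aspect = ("transport", "or",  [("car", "and", []), ("other", "and", []),
                               ("urbanTransport", "and", []), ("publicService", "and", [])])
print(merge(aspect, base, "union"))           # Or-group over car, other, urbanTransport, publicService
print(merge(aspect, base, "intersection"))    # Xor-group over car, other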
6 Related Work
Several previous works consider some forms of composition for FMs. Alves et al. motivate the need to manage the evolution of FMs (or more generally of an SPL) and extend the notion of refactoring to FMs [13]. The authors provide a catalog of sound FM refactorings, which has been verified in Alloy by automatically checking properties of resulting FMs [21]. Although their work is focused on refactoring single FMs, they also suggest using these rules to merge FMs. Our proposal goes further in this direction by providing mechanisms to implement the merge and by clarifying the semantics (as in [14], our terminology is to consider a unidirectional refactoring as a generalization and a bidirectional refactoring as a refactoring). Segura et al. provide a catalogue of visual rules to describe how to merge FMs [15]. The authors emphasize the need to provide a formal semantics for their approach. To the best of our knowledge, their rules implement the merge in the union mode, while the merge in the intersection mode is not taken into account. Schobbens et al. identify three operations to merge FMs – intersection, union (a.k.a. disjunction) or reduced product of two FMs [19] – but do not provide mechanisms to implement the merging. Czarnecki et al. propose to construct FMs from propositional formulas and suggest using their algorithm to merge FMs, but without further detail [22]. Computing the intersection or union at the propositional logic level is not without problems. It is necessary to generate a FM from the new propositional formula, and a major issue is then to take additional structuring information into account. In [23], a feature is represented by a FST (Feature Structure Tree), roughly a stripped-down abstract syntax tree. The authors propose to use superimposition to compose features. A FM is a "hierarchy of features with variability" [18] and can be seen as a FST plus variability. As a result, the superimposition mechanism has to be adapted to resolve variability mismatches. In SPL engineering, reusable software assets must be composed to derive specific products according to a particular set of features. One approach is to use FMs to specify the variability and then to relate FMs to architectural or design models (e.g. UML models) [6, 24, 7, 5]. A configuration of the FM can correspond to the removal or the activation of some elements of a model [5, 6]. Another option is to associate each feature with some model artefacts which are then inserted in a primary design model [7] or composed together [25, 6, 24]. Our work focuses strictly on the composition of the variability models, i.e. FMs. Our proposal is not incompatible with the approaches described, as the composed FM can be related to other models and thus be used during the derivation process.
Aspect-Oriented Modeling (AOM) allows developers to isolate and address separately several aspects of a system by providing techniques to achieve separation and composition of concerns [20]. Existing AOM approaches notably focused on the composition of UML models such as UML class diagrams (e.g. [26]) or UML state and sequence diagrams (e.g. [27]). To the best of our knowledge, no existing approach proposes to compose FMs.
7 Conclusion and Future Work
In this paper, we proposed two main operators to compose feature models (FMs). Each operator is described by stating where it is applied, what features will be composed and how the composition is made. Each composition is defined by rules that formally describe the structure of the resulting FM. Depending on the composed and the targeted features, some properties regarding the expressed set of configurations are made explicit for each operator. A first insert operator enables developers to insert features from a crosscutting FM into a base FM. Each insertion can then be characterized by its ability to preserve or not the set of configurations expressed by the base FM. Building on this operator, the proposed merge operator makes it possible to put together features from two separate FMs, when none of the two clearly crosscuts the other. The result is also characterized through the set of expressed configurations, and is parameterized to enable developers to choose between the union or the intersection of the configurations. The two operators cover different use cases but always ensure the well-formedness of the resulting FM. When using the provided operators, developers can choose to perform insertions or merges while preserving the expression of the original set of configurations. This enables them to compose FMs at a large scale. On the contrary, when the need to make more important changes appears, developers can then use all the presented forms of insertion and merge, while being aware of whether the original semantics of the base FM is preserved or not. Future work aims at tackling current restrictions and at validating the scalability and usability of the proposed operators. These operators are currently under validation with the construction and usage of a large SPL which is dedicated to medical imaging services on the grid. The services are part of a service-oriented architecture in which data-intensive workflows are built to conduct numerous computations on very large sets of images [28, 29]. This SPL is decomposed into several FMs, which are then to be composed using the proposed operators. Moreover, some of the designed FMs are planned to be reused in another SPL that deals with video surveillance systems [30]. Some features related to QoS and imaging are likely to be common. The two case studies and SPLs are intended to be complementary and yet different, to determine in what sense the merging operators can actually help to scale feature modeling (from the users' perspective). They can also help to determine whether an arbitrarily decomposed FM can be relevant to all stakeholders or not. Another interest is to quantify the amount of information needed to apply the merging operators in order to assess their ease of use. To achieve these goals, we will raise the limitation
on the hierarchy regularity of the composed FMs. Currently, the considered FMs cannot include any constraints between features, e.g. a constraint stating that selecting a feature requires another one to be selected or excluded. Taking such constraints into account will oblige us to tackle issues on how to reuse consistency checking in a modular way. But as a result, this should also solve some of the scalability issues that FM checking techniques currently face [8, 9].
References 1. Clements, P., Northrop, L.M.: Software Product Lines: Practices and Patterns. Addison-Wesley Professional, Reading (2001) 2. Pohl, K., Böckle, G., van der Linden, F.J.: Software Product Line Engineering: Foundations, Principles and Techniques. Springer, Heidelberg (2005) 3. Kang, K., Cohen, S., Hess, J., Novak, W., Peterson, S.: Feature-Oriented Domain Analysis (FODA) Feasibility Study. Technical Report CMU/SEI-90-TR-21, Software Engineering Institute (November 1990) 4. Czarnecki, K., Eisenecker, U.: Generative Programming: Methods, Tools, and Applications. Addison-Wesley Professional, Reading (2000) 5. Czarnecki, K., Antkiewicz, M.: Mapping features to models: A template approach based on superimposed variants. In: Glück, R., Lowry, M. (eds.) GPCE 2005. LNCS, vol. 3676, pp. 422–437. Springer, Heidelberg (2005) 6. Sanchez, P., Loughran, N., Fuentes, L., Garcia, A.: Engineering languages for specifying Product-Derivation processes in software product lines. In: Software Language Engineering (SLE), pp. 188–207 (2008) 7. Voelter, M., Groher, I.: Product line implementation using aspect-oriented and model-driven software development. In: SPLC 2007: Proceedings of the 11th International Software Product Line Conference, pp. 233–242. IEEE, Los Alamitos (2007) 8. Batory, D., Benavides, D., Ruiz-Cortés, A.: Automated analysis of feature models: Challenges ahead. Communications of the ACM (December 2006) 9. Mendonca, M., Wasowski, A., Czarnecki, K., Cowan, D.: Efficient compilation techniques for large scale feature models. In: GPCE 2008: Proceedings of the 7th international conference on Generative programming and component engineering, pp. 13–22. ACM, New York (2008) 10. Reiser, M.O., Weber, M.: Multi-level feature trees: A pragmatic approach to managing highly complex product families. Requir. Eng. 12(2), 57–75 (2007) 11. Czarnecki, K., Helsen, S., Eisenecker, U.: Staged Configuration through Specialization and Multilevel Configuration of Feature Models. Software Process: Improvement and Practice 10(2), 143–169 (2005) 12. Hartmann, H., Trew, T.: Using feature diagrams with context variability to model multiple product lines for software supply chains. In: SPLC 2008: Proceedings of the 2008 12th International Software Product Line Conference, pp. 12–21. IEEE, Los Alamitos (2008) 13. Alves, V., Gheyi, R., Massoni, T., Kulesza, U., Borba, P., Lucena, C.: Refactoring product lines. In: GPCE 2006: Proceedings of the 5th international conference on Generative programming and component engineering, pp. 201–210. ACM, New York (2006) 14. Thüm, T., Batory, D., Kästner, C.: Reasoning about edits to feature models. In: Proceedings of the 31th International Conference on Software Engineering (ICSE 2009). IEEE Computer Society, Los Alamitos (2009)
15. Segura, S., Benavides, D., Ruiz-Cortés, A., Trinidad, P.: Automated merging of feature models using graph transformations. In: Lämmel, R., Visser, J., Saraiva, J. (eds.) Generative and Transformational Techniques in Software Engineering II. LNCS, vol. 5235, pp. 489–505. Springer, Heidelberg (2008) 16. Lahire, P., Morin, B., Vanwormhoudt, G., Gaignard, A., Barais, O., Jézéquel, J.M.: Introducing Variability into Aspect-Oriented Modeling Approaches. In: Engels, G., Opdyke, B., Schmidt, D.C., Weil, F. (eds.) MODELS 2007. LNCS, vol. 4735, pp. 498–513. Springer, Heidelberg (2007) 17. Classen, A., Heymans, P., Schobbens, P.: What’s in a Feature: A Requirements Engineering Perspective. In: Fiadeiro, J.L., Inverardi, P. (eds.) FASE 2008. LNCS, vol. 4961, pp. 16–30. Springer, Heidelberg (2008) 18. Czarnecki, K., Kim, C.H.P., Kalleberg, K.T.: Feature models are views on ontologies. In: SPLC 2006: Proceedings of the 10th International on Software Product Line Conference, pp. 41–51. IEEE Computer Society, Los Alamitos (2006) 19. Schobbens, P.Y., Heymans, P., Trigaux, J.C., Bontemps, Y.: Generic semantics of feature diagrams. Comput. Netw. 51(2), 456–479 (2007) 20. Aspect-Oriented Modeling Workshop Series, http://www.aspect-modeling.org/ 21. Gheyi, R., Massoni, T., Borba, P.: A theory for feature models in alloy. In: Proceedings of First Alloy Workshop, pp. 71–80 (2006) 22. Czarnecki, K., Wasowski, A.: Feature diagrams and logics: There and back again. In: SPLC 2007: Proceedings of the 11th International Software Product Line Conference, pp. 23–34 (2007) 23. Apel, S., Lengauer, C., Möller, B., Kästner, C.: An algebra for features and feature composition. In: Meseguer, J., Roşu, G. (eds.) AMAST 2008. LNCS, vol. 5140, pp. 36–50. Springer, Heidelberg (2008) 24. Perrouin, G., Klein, J., Guelfi, N., Jézéquel, J.M.: Reconciling automation and flexibility in product derivation. In: SPLC 2008: Proceedings of the 2008 12th International Software Product Line Conference, pp. 339–348. IEEE, Los Alamitos (2008) 25. Jayaraman, P.K., Whittle, J., Elkhodary, A.M., Gomaa, H.: Model composition in product lines and feature interaction detection using critical pair analysis. In: Engels, G., Opdyke, B., Schmidt, D.C., Weil, F. (eds.) MODELS 2007. LNCS, vol. 4735, pp. 151–165. Springer, Heidelberg (2007) 26. Reddy, Y.R., Ghosh, S., France, R.B., Straw, G., Bieman, J.M., McEachen, N., Song, E., Georg, G.: Directives for composing aspect-oriented design class models. In: Rashid, A., Aksit, M. (eds.) Transactions on Aspect-Oriented Software Development I. LNCS, vol. 3880, pp. 75–105. Springer, Heidelberg (2006) 27. Kienzle, J., Al Abed, W., Jacques, K.: Aspect-oriented multi-view modeling. In: AOSD 2009: Proceedings of the 8th ACM international conference on Aspectoriented software development, pp. 87–98. ACM, New York (2009) 28. Acher, M., Collet, P., Lahire, P.: Issues in Managing Variability of Medical Imaging Grid Services. In: Olabarriaga, S., Lingrand, D., Montagnat, J. (eds.) MICCAIGrid Workshop (MICCAI-Grid), New York, NY, USA (September 2008) 29. Acher, M., Collet, P., Lahire, P., Montagnat, J.: Imaging Services on the Grid as a Product Line: Requirements and Architecture. In: Service-Oriented Architectures and Software Product Lines - Putting Both Together (SOAPL 2008), associated workshop issue of SPLC 2008. IEEE, Los Alamitos (2008) 30. Acher, M., Lahire, P., Moisan, S., Rigault, J.P.: Tackling High Variability in Video Surveillance Systems through a Model Transformation Approach. 
In: MiSE 2009: Proceedings of the International Workshop on Modeling in Software Engineering at ICSE 2009, Vancouver, Canada. IEEE Computer Society, Los Alamitos (2009)
VML* – A Family of Languages for Variability Management in Software Product Lines∗ Steffen Zschaler1, Pablo Sánchez2, João Santos3, Mauricio Alférez3, Awais Rashid1, Lidia Fuentes2, Ana Moreira3, João Araújo3, and Uirá Kulesza3 1
Computing Department, Lancaster University, Lancaster, United Kingdom {zschaler,awais}@comp.lancs.ac.uk 2 Dpto. de Lenguajes y Ciencias de la Computación, Universidad de Málaga, Málaga, Spain {pablo,lff}@lcc.uma.es 3 Computer Science Department, Universidade Nova de Lisboa, Lisbon, Portugal {jps,mauricio.alferez,amm,ja}@di.fct.unl.pt, [email protected]
Abstract. Managing variability is a challenging issue in software-product-line engineering. A key part of variability management is the ability to express explicitly the relationship between variability models (expressing the variability in the problem space, for example using feature models) and other artefacts of the product line, for example, requirements models and architecture models. Once these relations have been made explicit, they can be used for a number of purposes, most importantly for product derivation, but also for the generation of trace links or for checking the consistency of a product-line architecture. This paper bootstraps techniques from product-line engineering to produce a family of languages for variability management, easing the creation of new members of this language family. We show that developing such language families is feasible and demonstrate the flexibility of our language family by applying it to the development of two variability-management languages. Keywords: Software Product Lines, Family of Languages, Domain-specific Languages, Variability Management.
1 Introduction
Software Product Line Engineering (SPLE) is seen as a promising approach to increasing the productivity and quality of software, especially where essentially similar software needs to be provided for a variety of contexts and customers, each requiring customizations and variations for their specific conditions [1-2]. In SPLE, features [3] are used to capture commonalities or discriminate among products, i.e. capture variabilities, in an SPL. SPL features are often modelled using feature models [3-4]. Management of variability throughout the product line is a key challenge in SPLE.
∗ The work reported in this paper was supported by the EC FP7 STREP project AMPLE: Aspect-Oriented Model-Driven Product Line Engineering (www.ample-project.net).
An important part of variability management is to make explicit the relation between the variability model (e.g., the feature models referred to in the previous paragraph) and other models and artefacts of the SPL. Once this relation has been explicitly represented, it can be used for a number of purposes, most importantly to automatically derive product instances based on product-configuration specifications, but also for other purposes such as trace-link generation and consistency checking of SPL models. Due to its relevance, this topic is currently an area of intensive research and a number of approaches have been proposed [5-9]. Initial research focused on using general-purpose model transformations to encode product derivation [10-11]. Later it was argued that this placed too heavy a burden on SPL engineers, as they would now also have to learn the intricacies of model transformations. Consequently, a number of approaches that hide the model transformations from the SPL engineers have recently been developed [6-7, 12]. Czarnecki et al. and Heidenreich et al. [6-7] propose generic techniques that associate features with arbitrary combinations of model elements and generate a standard model transformation for product derivation from this. In contrast, we have argued before [12, 13] that transformation actions that are specific to the types of models used for describing the SPL are more useful, as they provide a terminology already known to SPL engineers, allow consideration of model semantics in the definition of transformations, and allow avoiding some inconsistencies (e.g., dangling references) in product models by design. This requires new languages to be developed for each type of model that may be used in describing an SPL—a costly and error-prone task. To make development of such languages feasible, this paper proposes VML*¹, a family of languages—or a language product line—for variability management, showing that developing such languages is a feasible goal. Individual members of the family are described using a domain-specific language (DSL). Based on such a specification, a generator produces the complete infrastructure for the specified language. Such a generative approach has the added benefit of making it easier to support other evaluations beyond product derivation: they can be implemented in additional code generators from the language specification. The key contribution of this paper is, thus, in the domain of software-language engineering, where it applies ideas from SPLE and model-driven development to the development of VML* languages. This enables us to efficiently build new VML* languages for new SPL contexts, and thus improves over our previous work [12], which was limited to copy-and-paste-based reuse, limiting efficiency and increasing error-proneness of language development. A secondary contribution is that this new approach to language development allows us to support additional evaluations for VML* languages, such as generation of trace links or SPL consistency checking. Section 2 further discusses the motivation for building custom languages instead of one generic language and derives a set of challenges to be overcome to enable efficient development of such languages. Section 3 then presents how we applied SPLE techniques to construct a family of languages for variability management and is followed by Sect. 4, which shows how concrete languages have been developed based on our approach. Section 5 reviews some related work and Sect. 6 concludes the paper and points out directions for future work.
1 For Variability Management Languages.
2 Motivation
This section describes the motivation that led to the creation of the VML* family of languages. First, we provide some background on VML languages and then we present the motivation of this paper.
2.1 Managing Variability Using Target-Model–Specific Languages
This section explains why we choose to model SPL variability using target-model–specific languages rather than a single generic language. We use as an example an architectural model of a lock control framework for a Smart Home Software Product Line (SPL) [1, 14]. Smart Home applications aim at automating and controlling houses and buildings in order to improve the comfort and security of their inhabitants. The lock control is placed on doors of rooms whose access must be controlled. Several options are available to end users acquiring a specific Smart Home software installation:
- Different authentication mechanisms can be used: identification cards, fingerprint scanners or a simple numeric keypad.
- Doors are opened manually and users have a time period to authenticate before triggering the alarms. Optionally, it is possible to select a computer-controlled door lock control (Automatic Lock), which will be released upon successful authentication.
- Automatic sliding doors can also be used (Door Opener). This option requires that the Automatic Lock control of the door lock be selected.
Fig. 1. A software architecture for the lock control framework
Figure 1 depicts a software architectural design for this lock control framework. This architectural design comprises three different parts, which are explained in the following.
Firstly, variability inherent to the domain is expressed using a feature model [4, 15] (Fig. 1 (top)). This feature model represents the variability specification or problem space. It specifies which features of the system are variable and the reasons why. For instance, the AuthenticationDevice to be used is a variable feature because there are several alternative devices available but only one must be selected. AutomaticLock and DoorOpener are variable features because they are options that may be included in a specific lock control application or not.
Secondly, once variability has been identified, the software architecture is designed using the component model of UML 2.0 (Fig. 1 (bottom)). This represents the variability realization or solution space. The mechanism selected for supporting variability in the architectural design is plugin components. The LockControlMng component is the central component of this architecture. Each alternative for authentication is designed as a pair of plugin components: one for controlling the physical device that serves to authenticate users (e.g. KeypadReader), and the other one encapsulating the logic of the authentication algorithm (e.g. KeypadAuth). These plugin components communicate with the LockControlMng through the IAccess interface, in the case of reader components, and the IVerify interface, in the case of authenticator ones. All plugin components must register in the LockControlMng component using the interface IRegister. The LockControlMng receives data from the reader components and, with the data received, it calls the authenticator component. The latter is in charge of checking whether the user has access to the room or not. If the user is authentic, the LockControlMng component invokes the LockControl component, which releases the lock. This invocation is performed only if the automatic lock control option has been selected. If the door is a sliding one, the LockControlMng should also invoke the DoorActuator component for automatic opening of the door.
Thirdly, we must specify the links between the variability specification and the variability design, or problem space and solution space, indicating how the components of the architectural model must be composed according to the selected features. In our case, for instance, when a specific authentication device is selected, the corresponding reader component must be connected to the LockControlMng through the IAccess interface. In the same way, the LockControlMng component must be connected to the corresponding authenticator component through the IVerify interface. Both the authenticator and the reader components must also be connected to LockControlMng through the IRegister interface. The components corresponding to non-selected alternatives must simply be removed. Similarly, the DoorActuator and LockControl components are adequately connected if the corresponding optional features are selected; otherwise, they should be removed.
These relationships can be expressed using general-purpose model transformation languages, as demonstrated in [10-11]. Nevertheless, as previously discussed in [10], these have the following shortcomings:
- Metamodel Burden. A model transformation language is often based on abstract syntax manipulations. According to Jayaraman et al. [16], "Most model
Thirdly, we must specify the links between variability specification and variability design, or problem space and solution space, indicating how the components of the architectural model must be composed according to the selected features. In our case, for instance, when a specific authentication device is selected, the corresponding reader component must be connected to LockControlMng through the IAccess interface. In the same way, LockControlMng must be connected to the corresponding authenticator component through the IVerify interface. Both the authenticator and the reader components must also be connected to LockControlMng through the IRegister interface. The components corresponding to non-selected alternatives must simply be removed. Similarly, the DoorActuator and LockControl components are connected if the corresponding optional features are selected; otherwise, they should be removed.

These relationships can be expressed using general-purpose model transformation languages, as demonstrated in [10-11]. Nevertheless, as previously discussed in [10], such languages have the following shortcomings:

- Metamodel Burden. A model transformation language is often based on abstract syntax manipulations. According to Jayaraman et al. [16], "Most model developers do not have this knowledge. Therefore, it would be inadvisable to force them to use the abstract syntax of the models".

- Language Overload and Abstraction Mismatch. There are different kinds of model transformation languages [16], and each of them is based on a specific computing model. They range from rule-based languages (e.g. ATL [17]) to expression-based languages (e.g. xTend [18]) and graph-based languages (e.g. AGG [19]). When employing a model transformation language, software product line engineers must also understand the underlying computing style (e.g. rule-based) and learn the language syntax. As a result, software product line engineers are forced to rely on abstractions that are not naturally part of the abstraction level at which they work.

To overcome these shortcomings, we proposed [12] to create dedicated languages for specifying product derivation processes, that is, for specifying how features map to software models. These dedicated languages follow a very basic computation style: based on a selection of features, a small sequence of simple commands is executed. Moreover, these commands use a syntax familiar to the modeler, building on the concrete syntax of the model rather than its abstract syntax. These user-friendly, high-level specifications are then translated into a set of low-level, general-purpose model transformations, which automate the product derivation process. The SPL engineer can thus enjoy the benefits of model-driven techniques without paying the associated cost, i.e. without needing to learn the intricacies of model transformation languages.

Table 1. Part of the VML4Arch Specification for Smart Home

    01 import features <"/SmartHome.fmp">;
    02 import core <"/SmartHome.uml">;
    03 ...

    06 variant for FingerprintScanner {
    07     connect("FingerprintReader", "LockControlMng", "IAccess");
    08     connect("FingerprintReader", "LockControlMng", "IRegister");
    09     connect("FingerprintAuth", "LockControlMng", "IRegister");
    10     connect("LockControlMng", "FingerprintAuth", "IVerify");
    11 } // Fingerprint scanner

    13 variant for not (FingerprintScanner) {
    14     remove('FingerprintReader');
    15     remove('FingerprintAuth');
    16 } // not FingerprintScanner

Table 1 provides an example of such a dedicated language for manipulating UML component models. This specification establishes that whenever the FingerprintScanner option is selected (lines 06-11), the FingerprintReader and FingerprintAuth components must be connected to the LockControlMng component through the corresponding interfaces, as previously described. The connect operator is an intuitive composition mechanism specifying that two components must be connected using the interface given as a parameter. The first parameter of the connect operator is the component that requires the interface, while the second parameter is the component that provides it. In the case where the FingerprintScanner variant is not selected (lines 13-16), the FingerprintReader and FingerprintAuth components are removed from the architecture using the remove operator.
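By way of contrast, the following self-contained Java sketch shows roughly the kind of low-level manipulation that a single connect(...) call is compiled into. It operates on a deliberately simplified in-memory component model defined in the snippet itself, not on the real UML 2.0 metamodel, so all class and method names are illustrative assumptions; the point is only that the generated transformation has to navigate abstract syntax, which is exactly what the VML4Arch specification hides from the SPL engineer.

    import java.util.*;

    // Deliberately simplified stand-in for a component model's abstract syntax;
    // a real implementation would manipulate the UML 2.0 metamodel instead.
    class Model {
        Map<String, Component> components = new HashMap<>();
        List<String[]> connectors = new ArrayList<>();   // {requirer, provider, interface}
    }

    class Component {
        String name;
        Set<String> provided = new HashSet<>();
        Set<String> required = new HashSet<>();
        Component(String name) { this.name = name; }
    }

    class ConnectTransformation {
        // Roughly what connect("FingerprintReader", "LockControlMng", "IAccess")
        // is translated into: look up both components by name, check the interface
        // on each side, and create the connector element.
        static void connect(Model m, String requirerName, String providerName, String iface) {
            Component requirer = m.components.get(requirerName);
            Component provider = m.components.get(providerName);
            if (requirer == null || provider == null)
                throw new IllegalArgumentException("unknown component");
            if (!requirer.required.contains(iface) || !provider.provided.contains(iface))
                throw new IllegalArgumentException(iface + " not required/provided as expected");
            m.connectors.add(new String[] { requirerName, providerName, iface });
        }
    }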
2.2 Automating the Generation of New VML Languages

Beyond the language illustrated in Table 1, a wide range of languages for managing variability in other kinds of target modeling languages needs to be constructed. For instance, we need dedicated languages with specific operators for managing variability in use case models, activity models, business process models, or any other kind of architectural description language. Developing such languages is cost-intensive and error-prone, especially as, so far, there is no support for reuse between different such languages beyond a copy-and-paste approach. This is a serious barrier to the adoption of our approach in SPL projects. To make developing such languages feasible, we need to solve the following three challenges:

1. Support reuse between different languages. The support infrastructure should be easily reused for new languages. Reuse should not be based on copying an existing language implementation and adjusting it, removing unneeded actions and adding new ones. Otherwise, if errors are found and fixed in the infrastructure for one language, these corrections would have to be manually transferred to all other language infrastructures. The same would be true for new features of the infrastructure, for example, evaluations of specifications other than product derivation.

2. Allow the type of variability model to vary. Different approaches to modelling variability have been proposed: very often, feature trees [4] or cardinality-based feature models [20] are used, but DSLs have also been used to represent variability [21]. Any variability management language should be easily adapted to any type of variability model.

3. Support easy customisation of target-model element access. Target-model elements need to be accessed from a specification based on a textual reference (e.g., their fully qualified name or some pattern matching a number of names). Depending on the target model, different forms of such textual references may be useful. The evaluation of such textual references should therefore be implemented separately from the individual actions, to allow for easy exchange and customisation of this feature.

In this work, we present a generative infrastructure for creating new VML languages for a concrete target model that tackles these issues.
3 The VML* Family of Languages In response to the challenges identified in the previous section, we propose to bootstrap SPLE techniques using a model-driven and generative approach for creating the infrastructure (e.g., parser, editor, evaluation engine) for a specific VML* language. To this end, we have developed the VML* family of languages, which consists of:
Fig. 2. Common metamodel for VML languages. Variation points have been highlighted in dark grey.
1. A common metamodel for VML* languages, including variation points that can be customised for describing specific VML* languages. This provides the concepts common to all VML* languages.
2. A DSL for specifying the choices a specific language makes for each variation point.
3. A generator-based infrastructure that can instantiate all custom elements of the process from [12] for any VML* language.

A working prototype of this system is available as a set of Eclipse plugins [22].

3.1 A Common Metamodel for VML* Languages

Figure 2 shows the general concepts required for expressing variability in product-line models. This metamodel has been developed as a generalisation of the metamodels of VML4Architecture (or simply VML4Arch) [12-13] and VML4Requirements (or simply VML4RE) [23-24], two variability management languages we have previously developed. VML4Arch is a language for relating feature models and UML 2.0 architectural models of an SPL. VML4RE is a language for relating feature models and UML 2.0 use case and activity models. These languages have been developed in parallel, but independently. They have a number of differences, but they also share a large number of commonalities, enabling us to derive a common metamodel for VML* languages.

The metamodel shown in Figure 2 is independent of both the specific models used for variability modelling (e.g., feature models, domain-specific languages) and the specific target models (e.g., UML, architecture description models, generation workflow models). Consequently, a number of concepts are abstract in this metamodel. To
apply the metamodel for a specific combination of target model and variability model, these concepts (highlighted in dark grey in Figure 2) need to be specialised (how to specify such specialisations will be discussed in the next section). In the following, we discuss each of the metamodel concepts in more detail.

VMLModel. A VML model relates a variability model and a target model, using a set of variants to describe how the target model needs to vary as each of the concerns of the variability model is selected or unselected.

VariabilityModel. A variability model is the central artefact in variability modelling. VariabilityModel and Variability Unit serve as adapters to the specific form of variability modelling employed in a specific scenario.

Variability Unit. These are the units of variability in variability modelling. A variability model describes what variability units a potential product may have and what constraints govern the selection of combinations of variability units for individual products. From the perspective of variability management, we are mainly interested in the name of a variability unit and whether it has been selected for a specific product configuration. Notice that for the purposes of our metamodel we do not care about how variability units are expressed in a variability model. They may be represented as explicit features in a feature model [4] or more implicitly in a DSL [21], or in any other form that is convenient for modelling variability in a specific project. To enable our metamodel to relate to all these different kinds of representations, we standardise on the common notion of Variability Unit and require adapters that extract these from any of the representations discussed above.

TargetModel. Target models describe a product line. There are a large number of potential target models—for example, requirements models, architecture models, or code-generation-workflow models.

ModelElement. Model elements represent arbitrary elements of the target model. This concept serves as an adapter to actual model elements and needs to be specialised for each kind of target model (thereby defining the concrete model elements available). The model elements are typed using metaclasses imported from the target metamodel.

Variant. A variant describes how the target models must be varied when a certain combination of variability units is selected or unselected. Notice that for product derivation it is sufficient to provide a variant for each non-mandatory variability unit, as we can assume the unvaried target model to represent the model for all the mandatory variability units. For some other evaluations (e.g., trace-link generation), however, a variant must be provided for each variability unit, including mandatory ones. Each variant defines two sets of actions for its variability units: a set of onSelect actions defines how to vary the target model when the variability units are selected; a set of onUnSelect actions defines what to do when the variability units are not selected.

ConcernExpression. For certain use cases it is not sufficient to map variability units directly onto modifications of the target model, as has also been previously discussed in the literature [6-7]. Therefore, we define variants for so-called concern expressions, logic expressions over variability units. We support And, Or, and Not expressions as well as atomic terms.

VariantOrdering.
Sometimes the order in which the actions of different variants are executed during product derivation is important, as actions for one variant may rely on model elements created by actions for another variant. VariantOrdering
provides SPL developers with a means of defining a partial order of execution over variants using pairs of variants. The infrastructure will guarantee that all actions of the first variant in a pair are executed before any action of the second variant of that pair is executed.

Action. Actions are used to describe modifications to the target model. These need to be customised for each kind of target model, depending on the kinds of variations that make sense at the level of abstraction the target model covers. For example, if the target model is a use case model, one particular action may be to connect an actor and a use case, while for an architectural model a possible action could be to connect two components. Actions may add, update or remove model elements in the target model, and may create, update or remove links between existing or newly added model elements.

PointcutExpression. A pointcut expression is an expression that identifies a model element or a set of model elements. It is constructed from atomic designators, pointcut references and combining operators (Not, And, and Or).

Pointcut. A pointcut identifies a model element or set of model elements. The model elements are denoted by a pointcut expression. The main purpose of the Pointcut concept is to allow particular pointcut expressions to be named. A named Pointcut can then be reused using a PointcutReference.

PCOperator. Operators enable the construction of pointcut expressions combining the sets of elements returned from more than one element pointcut. Here, we define only two operators, namely and and or, which represent intersection and union of the sets of model elements of their element expressions, respectively.

Designator. A designator is a piece of text that is used to identify a model element or a set of model elements. It may be a name (possibly qualified), a signature, a wildcard expression, or anything else that makes sense in the target model. As resolution of designator text into actual model elements is specific to the target model, the designator concept needs to be customised for each target model.

3.2 A DSL for Specifying Individual VML* Languages

To enable succinct description of the specificities of a certain VML* language, we have defined a metamodel and concrete syntax for language-instance description. Figure 3 shows the key concepts. Based on an instance of this metamodel—a VML* language description—we can then generate an appropriate infrastructure customised for that specific VML* language. The individual concepts in the language-description metamodel are:

LanguageInstanceModel. The central metaclass of VML* language descriptors, binding together the other parts of a VML* language descriptor.

VariabilityModelImport. This provides information about the type of variability model to be supported by the VML* language. The key interface between VML* and a variability model is the set of features defined. The language descriptor, therefore, contains a snippet of model-query code that serves as an adapter between the variability model and a VML* specification. (Our prototype uses openArchitectureWare's (oAW) xTend language to express model queries and model transformations. These xTend snippets can be kept as operations in a separate xTend file and referenced from the language instance descriptor, allowing language designers to take full advantage of oAW's checking capabilities.)
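As an illustration of what such an adapter has to achieve, the following sketch is written in plain Java against the EMF API rather than in xTend, and it assumes that the variability metamodel exposes its units as instances of a metaclass with a "name" attribute. Both choices are assumptions made for illustration and are not part of the VML* prototype.

    import java.util.ArrayList;
    import java.util.List;
    import org.eclipse.emf.common.util.TreeIterator;
    import org.eclipse.emf.ecore.EObject;
    import org.eclipse.emf.ecore.EStructuralFeature;

    // Illustrative adapter in the spirit of the 'getAllFeatures' function referenced
    // from a VML* language descriptor: it walks an arbitrary variability model and
    // collects the names of all instances of a given metaclass ("Feature" would be
    // one plausible choice for FMP-style models; DSL-based variability models would
    // plug in their own metaclass name).
    public class VariabilityModelAdapter {
        public static List<String> getAllVariabilityUnits(EObject variabilityModel,
                                                          String unitMetaclass) {
            List<String> names = new ArrayList<>();
            TreeIterator<EObject> it = variabilityModel.eAllContents();
            while (it.hasNext()) {
                EObject obj = it.next();
                if (unitMetaclass.equals(obj.eClass().getName())) {
                    EStructuralFeature nameAttr = obj.eClass().getEStructuralFeature("name");
                    if (nameAttr != null) {
                        names.add(String.valueOf(obj.eGet(nameAttr)));
                    }
                }
            }
            return names;
        }
    }

A ConfigurationImport adapter (the getAllSelectedFeatures function referenced later in Table 2) could be written in the same style, additionally filtering on whatever attribute the configuration model uses to mark a unit as selected.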
Fig. 3. Metamodel for VML* language instance descriptions
This snippet is the only place where knowledge about the variability-model metamodel is located in a VML* language descriptor.

TargetModelImport. This provides information about the type of target model to be supported by the VML* language. Mainly, this defines how pointcut designators should be evaluated for a specific target model. Depending on the specific kind of target model, different pointcut designators may be required. While, for example, use-case models require only simple qualified names (possibly using wildcards for quantification) to identify individual actors, use cases, or activities, architectural models may additionally require pointcut designators for operation signatures or for the provided or required interfaces of components. Therefore, both the syntax of pointcut designators and their interpretation are specific to the kind of target model. In all VML* languages, pointcut designators are syntactically represented as simple string values. They are then passed to a piece of model-query code that interprets them and returns a set of model elements from a given target model. This piece of code is defined for a specific VML* language using TargetModelImport.

ActionDescriptor. Each action descriptor provides general syntactic information about one action. This includes the name of the action and the number of parameters it takes. The concrete syntax for action invocation in the generated VML* language is '<action-name> (param1, ..., paramn)'. For each parameter, users of the VML* language can provide a pointcut expression.

EvaluationAspect. Every evaluation aspect describes one form of evaluation of a VML* specification. The VML* family can be extended with a number of these evaluation aspects (currently only one aspect—product derivation—has been implemented, but we are working on an implementation for trace-link generation and are planning to work on consistency evaluation), which can be supported for every
concrete VML* language, but not all VML* languages will need support for all evaluation aspects. A VML* language description can, therefore, include only those evaluation aspects that are actually required for this VML* language, providing an additional opportunity for optimisation. Notice that making such a selection manually, based on the architecture presented in the previous subsection, can be very difficult, as the different evaluation aspects actually overlap in some elements of the architecture (for example, in plugin configuration files). The model-driven approach not only allows a selection of one aspect or another, it additionally allows this selection to be changed flexibly, even experimentally.

TransformationAspect. If present, this enables product derivation for target models. For each ActionDescriptor it defines an ActionTransformation specifying the model transformation encapsulated by this action. Furthermore, a ConfigurationImport defines an adapter for configuration models.

ConfigurationImport. For the construction of models for specific products, the VML* infrastructure requires access to the set of features selected in a specific product configuration. To avoid polluting the VML* infrastructure with knowledge about the inner structure of product configurations, ConfigurationImport provides a snippet of model-query code that serves as an adapter to product-configuration specifications by extracting the set of selected features from a product configuration.

ActionTransformation. This provides additional information for an action pertaining to the transformation of target models by this action. For every ActionDescriptor there needs to be a corresponding ActionTransformation instance. In particular, this includes a snippet of model-transformation code that implements the action. In this code, the parameters can be referenced as 'param1' through 'paramn'. The type of each parameter is defined in the ActionTransformation.

TracingAspect. If present, this enables the generation of trace links from a VML* specification. Such trace links connect selected features and added or removed model elements of the target model. The tracing aspect is specified by naming the model-transformation operations that create or remove model elements; wildcards may be used to provide these names. VML* will then generate an aspect for the model transformation that advises these operations and creates appropriate trace links using the AMPLE Tracing Framework (ATF) [25].

3.3 Generation of VML* Language Infrastructure

Instances of this metamodel can be defined using a textual concrete syntax. Table 2 shows an excerpt of the language descriptor for VML4RE (cf. Sect. 4). Mapping this concrete syntax to the abstract syntax discussed above is rather straightforward, so we will not discuss it in any more detail here. It is worth noting, though, that this language descriptor does not contain complicated model-transformation code; all that is specified are the names of some functions. The functions with the actual model-transformation code are contained in an external file (an oAW xTend file, in the case of our prototype), allowing standard editors and error highlighting to be used when writing the code.
Table 2. Excerpt from the language descriptor for VML4RE
    vml instance vml4req {    // Define a new language called vml4req

        // This section defines the type of variability model and how to access it
        features {
            metamodel "/bin/fmp.ecore"
            // Extracts all variability units from a variability model
            function "getAllFeatures"
        }

        // This section defines the type of target model and how to access it
        target model {
            metamodel "UML2"
            type "uml::Package"      // Metamodel type of a model
            // Function to interpret pointcut designators
            function "dereferenceElement"
        }

        // Importing plugins and external specifications
        bundles: "unl.vml4req", "ca.uwaterloo.gp.fmp", ...
        extensions: "unl::vml4req::library::umlUtil"

        // Syntactical definition of available actions
        actions:
            createInclude { params "List[uml::UseCase]" "List[uml::UseCase]" }
            insertUseCase { params "String" "uml::Package" }
            ...

        // Definition of available evaluation aspects
        aspects:
            transformation {    // Evaluation for product derivation
                // Defines adapter for product-configuration access
                features { type "String" function "getAllSelectedFeatures" }
                // Definition of the semantics of actions as model transformations
                createInclude { function "createIncludes" }
                insertUseCase { function "createUseCase" }
                ...
            }
    }
Including the fully qualified name of the external file in the list after the "extensions" keyword ensures that the extension can be accessed from all relevant places in the generated code. Similarly, the "bundles" keyword lists other plugins that should be made available to any generated plugins. Here we include the plugin project containing our extension and the FMP plugin [26], which provides support for cardinality-based feature models.

Furthermore, we have developed a generator that takes language descriptors such as the one shown in Table 2 and generates a set of Eclipse plugins containing the infrastructure for this language. The operational prototype can be obtained from [27]. The code generated by this generator is based on the work previously presented in [12]. The generation is completely automatic; the only manual input provided by language developers is the language instance descriptor and the implementations of the actions, provided in a separate file. The complete infrastructure for editing, compiling, and executing specifications of the new VML language is encapsulated in the generator and can thus be reused for each new language.
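To give an idea of what a dereferenceElement function such as the one referenced in Table 2 has to do, the following Java sketch resolves a qualified-name designator with simple '*' wildcards against an arbitrary EMF-based target model. The use of a 'name' attribute and of '::' as the qualifier separator are assumptions for illustration; a real implementation for UML2 models would use the UML2 API instead.

    import java.util.ArrayList;
    import java.util.List;
    import org.eclipse.emf.common.util.TreeIterator;
    import org.eclipse.emf.ecore.EObject;
    import org.eclipse.emf.ecore.EStructuralFeature;

    // Illustrative pointcut-designator resolution: "Notification::Send*" would match
    // every element whose qualified name starts with "Notification::Send".
    public class DesignatorResolver {
        public static List<EObject> resolve(EObject targetModel, String designator) {
            // translate the simple wildcard syntax into a regular expression
            // (assumes the designator contains no other regex metacharacters)
            String regex = designator.replace("*", ".*");
            List<EObject> matches = new ArrayList<>();
            TreeIterator<EObject> it = targetModel.eAllContents();
            while (it.hasNext()) {
                EObject obj = it.next();
                String qualifiedName = qualifiedNameOf(obj);
                if (qualifiedName != null && qualifiedName.matches(regex)) {
                    matches.add(obj);
                }
            }
            return matches;
        }

        private static String qualifiedNameOf(EObject obj) {
            EStructuralFeature nameAttr = obj.eClass().getEStructuralFeature("name");
            if (nameAttr == null || obj.eGet(nameAttr) == null) return null;
            String name = String.valueOf(obj.eGet(nameAttr));
            EObject parent = obj.eContainer();
            String prefix = parent == null ? null : qualifiedNameOf(parent);
            return prefix == null ? name : prefix + "::" + name;
        }
    }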
4 Example Languages from the VML* Family

We have re-implemented both VML4Arch and VML4RE based on our new infrastructure. As VML4Arch has already been discussed extensively in [12], here we focus on VML4RE; for VML4Arch we only give a brief discussion of what needed to be changed to make it compatible with VML*. Both implementations can be downloaded from [24].

4.1 VML4RE

Requirements are most recurrently documented in a multi-view fashion [28-29]. Their description is typically based on considerably heterogeneous languages, such as use cases, activity diagrams, goal models, and natural language. Initial work on compositional approaches for early development artefacts does not clearly define composition operators for combining common and varying requirements based on different views or models. Therefore, a key problem in SPLE remains how to specify and apply the composition of elements defined in separate and heterogeneous requirements models. With the Variability Modelling Language for Requirements (VML4RE) [23] we propose an initial solution to this problem by introducing a new requirements composition language for SPLs.

VML4RE is a textual language with two main goals: (i) to support the definition of relations between SPL features expressed in feature models and requirements expressed in multiple views (based on a number of UML diagram types, such as use case diagrams and activity diagrams); and (ii) to specify the compositions of requirements models for specific products of an SPL. VML4RE supports composition operators for UML use case and activity models. It has been applied to case studies in domains such as home automation [23] and mobile applications [30], and has shown great flexibility in specifying composition rules and references to different kinds of elements in heterogeneous requirements models. The results of these experiments are encouraging and comparable with other approaches that support semi-automatic generation of trace-link relationships and composition between model elements in SPLs.
Table 3. Selected VML4RE actions for Use Case Models
- A new use case named name is inserted into package p.
- A new package named name is inserted into package p.
- A new connection is created between each of the actors and each of the use cases.
- A new <<include>> dependency is created between each of the source use cases and each of the target use cases.
Table 3 shows an overview of some of the available actions of the VML4RE language for use cases; a more complete list can be found in [23]. VML4RE provides another set of actions for activity models, which are not shown here due to space restrictions.

Table 2 shows an excerpt from the language descriptor for VML4RE. It has been defined to map from feature models expressed using the FMP metamodel [26] to UML2 use case and activity models. This is expressed in the two sections named 'features' and 'target model', respectively, which also reference the functions used to adapt to the feature model and to dereference pointcut designators in the target model. The actual dereferencing code is implemented in the extension referenced through the 'extensions' keyword. The full language descriptor also specifies a tracing aspect; this is not shown in Table 2 for lack of space.

Table 4. Part of the VML4RE Specification for Smart Home
Finally, Table 4 shows an excerpt of a VML4RE specification for the Smart Home case study [23]. Lines 7 to 20 show the additional use cases needed when the Security feature is selected in a product configuration. Notice the use of wildcards on Line 13 to select all use cases in a package. Whether, and which, wildcards are supported and how they are evaluated is defined in the dereferenceElement operation invoked from the language instance descriptor in Table 2. Further, notice the use of a slightly more complex pointcut expression on Lines 16 to 19 of Table 4. This pointcut expression results in a set of two use cases: Notification::SendSecurityNotification and WindowsManagement::OpenAndCloseWindowsAutomatically.

4.2 VML4Arch

Re-implementing VML4Arch on top of the VML* infrastructure proved surprisingly easy. However, as any product line requires a certain amount of streamlining between individual products to maximise reuse, there were some minor adjustments we had to make to fit VML4Arch into the family of languages. These adjustments, however, did not affect the functionality provided by VML4Arch. In detail, we had to:

• Adjust the syntax of some VML4Arch operators. VML4Arch originally had operators like connect c1, c2 using interface i, whose concrete syntax differed slightly from the standard concrete syntax for VML* operators. We had to adjust the concrete syntax of these operators to fit the standard scheme generated by VML*. For example, the connect operator from above is now expressed as connect (c1, c2, i).

• Extend some operator definitions to allow pointcut expressions as parameters. VML4Arch originally used direct references to model elements rather than pointcut expressions. This meant that we had to modify some of the operator definitions so that they can deal with receiving sets of model elements as parameters rather than individual model elements only.
5 Related Work

The work presented in this paper is related to work in two areas of research: systematic development of families of languages, and support for variability management in SPLE. As the main focus of this paper is on constructing a family of languages, we begin by discussing literature from this area.

A number of research projects—for example, CAFÉ, Families, or ESAPS—have explored the notion of software system families (or product lines). In this work, we extend these ideas to families of software languages, specifically for the case of VML languages. Families of languages have been presented in the research literature for a range of domains: Voelter presents an approach for a family of languages for architecture design at different levels of abstraction [31]; Akehurst et al. [32] present a redesign of the Object Constraint Language as a family of languages of different complexity; and Visser et al. [33] present WebDSL, a family of interoperating languages for the design
of web applications. All these approaches, including ours, use very different kinds of technologies for their specific case: Voelter uses conditional compilation to construct an appropriate infrastructure, Akehurst et al. use a special parser technology that enables modular language specification, Visser et al. use rewriting of abstract syntax trees, and our approach generates a monolithic infrastructure for each language. Equally, all approaches focus on different purposes of the language family: the different members of the family presented by Voelter are architectural languages at different levels of abstraction; the family presented by Akehurst et al. modularises different features of the OCL language, so that specific languages can be constructed as required for a project; WebDSL is a set of interoperating languages with purposes ranging from data modelling to workflow specification; and the family of languages presented in our paper consists of languages that share a common set of core concepts, but adapt these to the different languages with which they interface. At this point, an overview of the different potential uses of families of languages begins to emerge. What is needed next is research into the systematic development of such language families beyond individual examples.

Ziadi et al. [10] and Botterweck et al. [11] both propose implementing product derivation processes as model transformations. Their proposal relies on the realization of product derivations via a model transformation language. This strategy requires SPL engineers to deal with low-level details of model transformation languages. Our approach provides syntax and abstractions familiar to SPL engineers. This eliminates the burden of understanding the intricacies associated with model transformation languages and metamodels. A VML* specification is automatically compiled into an implementation of the product derivation process in a model transformation language, but SPL engineers need not be aware of this generation process.

In [12] we have presented a process for developing variability management languages. The structure of these languages has some similarities to the languages developed using VML*; in fact, VML4Arch was originally developed based on this process. However, focusing on process rather than infrastructure, [12] falls short of solving the issues discussed in the introduction. In particular, reuse between individual languages is only possible based on a copy-and-paste approach, and variability-model and target-model access are closely intertwined with the other infrastructure code, making it difficult to modify them independently. In contrast, in this paper we have presented an infrastructure that tackles all of these issues. The code generated for a specific VML* language is partially based on code developed for VML4Arch following the process from [12].

Czarnecki and Antkiewicz [6] present an approach similar to ours based on using feature models to model variability. They create a template model, which models all products in the product line. Elements of this model are annotated with so-called presence conditions. Given a specific configuration, each presence condition evaluates to true or false. If a presence condition evaluates to false, its associated model elements are removed from the model. Thus, such a template-based approach is specific to negative variability, which might be critical when a large number of variations affect a single diagram.
Our approach can also support positive variability by means of actions such as connect or merge. Moreover, presence conditions imply introducing
annotations into the SPL model. Therefore, the actions associated with a feature selection are scattered across the model, which could also lead to scalability problems. In our approach, they are well encapsulated in a VML* specification, where each variant specifies the actions to be executed.

FeatureMapper [7] is another approach, similar to that of Czarnecki and Antkiewicz and to ours, that avoids polluting the SPL model with variability annotations. FeatureMapper is generic for all EMF-based models and integrates generically into GMF-based editors. In contrast, our approach is based on languages that are specific to a kind of feature model and a kind of target model; genericity is achieved through a generative approach to creating the infrastructure for these languages from a set of common core concepts. The actual variability model in FeatureMapper is created implicitly by the designer selecting model elements in an editor and associating them with so-called feature expressions, which determine when a model element should be present in a product model. Negative variability is easily supported by this approach, as model elements can simply be removed if their feature expression is not satisfied by a specific configuration. Positive variability is more difficult to implement: instead of mapping features to target model elements, they need to be mapped to elements of a model transformation, again requiring SPL designers to have sufficiently detailed knowledge of that model-transformation language and the metamodels involved. In contrast, in our approach, designers of a specific VML* language can provide powerful actions that support both negative and positive variability (or any mixture of the two) in a systematic manner.

Finally, Haugen et al. [34] define the common variability language (CVL), a generic extension to DSLs for expressing variability. It provides three generic operators, but using these to express variability can lead to comparatively complex models. On the flip side, a VML* language is potentially less flexible than the two other approaches discussed in this paragraph, as it can only support the variability mechanisms for which a corresponding action has been defined.

A completely different approach to SPLE is followed in the feature-oriented software development community. Here, features are directly related to separate modules implementing each feature, and these feature modules can be understood as program or model transformations (e.g., [35]). This implies that no mapping from features to target models is required. Instead, the programming or modelling language must be sufficiently powerful to support modularising features as coherent, well-encapsulated units of composition. In another publication [36], we have presented a feature-oriented approach to SPL development. In this context, we also noted that a pure feature-oriented approach can lead to a large number of small feature modules, negatively impacting the scalability and comprehensibility of the approach, especially where features are associated with fine-grained changes to the architecture or implementation. Thus, for such cases, an approach with an explicit mapping may be beneficial.

Generally, all SPL approaches face the problem of ascertaining that only consistent and well-formed product models and implementations can be constructed.
This problem becomes even worse when several interconnected types of models representing different views of the system are used—for example, activity diagrams and class diagrams. As a consequence, there is a need to analyse the changes of each view and
the inconsistencies that these may cause with other views when instantiating a product model. In our work on VML*, we have not discussed this issue so far, but some previous work on this topic exists from other groups—for example, [37-38].
6 Conclusions

This paper presented a generative approach to building a family of languages for specifying the relationship between variability models and other models in software product line engineering. Our experience shows that the proposed infrastructure is powerful enough to support generating different language instances (in addition to the two languages presented here, we are currently developing VML* languages for mapping to openArchitectureWare workflows as well as a number of project-specific DSLs) and that it can reduce the effort required to learn about the support infrastructure for such languages.

Specifically, regarding the challenges we identified in Sect. 2.2, our generative approach to the family of VML* languages provides the following solutions: reuse is substantially improved over a copy-and-paste approach, as all reusable parts of the infrastructure are encoded in the generator and all variable parts are explicitly configured through language descriptors (Challenge 1). Because all dependencies on varying variability and target models have been made explicit in the language descriptor, model access code could be completely disentangled from the actual model manipulation code (Challenges 2 and 3).

In implementing our prototype, we identified a need for aspect-oriented code generation beyond what is offered by current code-generation engines. Our system is structured such that the code generators for the basic VML* infrastructure and for each evaluation aspect are kept in separate modules. This is sensible because evaluation aspects can be included or excluded from a specific VML* language as required. For some generated files (for example, the plugin descriptors contained in plugin.xml files) there is a conflict between the code generators for the evaluation aspects: each evaluation aspect needs to contribute to the final contents of the file. Using separate code-generation templates for each evaluation aspect would result in a file containing only the contributions from one evaluation aspect. Aspect-oriented code generation could provide a solution here: it effectively allows the results of two or more different generators to be merged into one output file.

However, all current aspect-oriented code generators [18, 39] support only asymmetric aspect orientation. This requires one template to be declared as the base template, while the other templates are aspect templates; these aspect templates can then manipulate generation rules in the base template, providing before, after, and around advice for code generation. For our purposes this is not appropriate: because evaluation aspects may be included or excluded as required, we cannot rely on any one of them being present. Consequently, no template defined for an evaluation aspect can be made the base template. As the basic VML* generator does not provide a template for plugin.xml, this cannot be designated as the base template either. For our prototype, this problem has been solved by breaking the encapsulation of the evaluation-aspect
code generators in a controlled way. However, a cleaner solution using a more symmetric approach to aspect-oriented code generation remains for future work.
References

[1] Pohl, K., et al.: Software Product Line Engineering: Foundations, Principles and Techniques. Springer, Berlin (2005)
[2] Clements, P., Northrop, L.M.: Software Product Lines: Practices and Patterns. Addison-Wesley, Boston (2002)
[3] Czarnecki, K., Eisenecker, U.W.: Generative Programming: Methods, Tools, and Applications. ACM Press/Addison-Wesley Publishing Co. (2000)
[4] Kang, K., et al.: Feature-Oriented Domain Analysis (FODA) Feasibility Study. Technical Report CMU/SEI-90-TR-021, Software Engineering Institute (1990)
[5] Alférez, M., et al.: A Model-Driven Approach for Software Product Lines Requirements Engineering. In: Proceedings of the 20th International Conference on Software Engineering and Knowledge Engineering, San Francisco Bay, USA, July 2008, pp. 779–784 (2008)
[6] Czarnecki, K., Antkiewicz, M.: Mapping Features to Models: A Template Approach Based on Superimposed Variants. In: Glück, R., Lowry, M. (eds.) GPCE 2005. LNCS, vol. 3676, pp. 422–437. Springer, Heidelberg (2005)
[7] Heidenreich, F., et al.: FeatureMapper: Mapping Features to Models. In: Companion of the 30th International Conference on Software Engineering, Leipzig, Germany (2008)
[8] Batory, D., Azanza, M., Saraiva, J.: The Objects and Arrows of Computational Design. In: Czarnecki, K., Ober, I., Bruel, J.-M., Uhl, A., Völter, M. (eds.) MODELS 2008. LNCS, vol. 5301, pp. 1–20. Springer, Heidelberg (2008)
[9] Soares, S., et al.: Supporting Software Product Lines Development: FLiP – Product Line Derivation Tool. In: Companion to the 23rd ACM SIGPLAN Conference on Object-Oriented Programming Systems Languages and Applications, Nashville, TN, USA (2008)
[10] Ziadi, T., Jézéquel, J.M.: Software Product Line Engineering with the UML: Deriving Products. In: Software Product Lines 2006, pp. 557–588 (2006)
[11] Botterweck, G., et al.: Model-Driven Derivation of Product Architectures. In: Proceedings of the 22nd International Conference on Automated Software Engineering (ASE), Atlanta, Georgia, USA, November 2007, pp. 469–472 (2007)
[12] Sánchez, P., et al.: Engineering Languages for Specifying Product-Derivation Processes in Software Product Lines. In: Software Language Engineering 2008, Toulouse, France (2008)
[13] Loughran, N., Sánchez, P., Garcia, A., Fuentes, L.: Language Support for Managing Variability in Architectural Models. In: Pautasso, C., Tanter, É. (eds.) SC 2008. LNCS, vol. 4954, pp. 36–51. Springer, Heidelberg (2008)
[14] Voelter, M., Groher, I.: Product Line Implementation Using Aspect-Oriented and Model-Driven Software Development. In: Proceedings of the 11th International Software Product Line Conference (SPLC), Kyoto, Japan, September 2007, pp. 233–242 (2007)
[15] Czarnecki, K., Eisenecker, U.W.: Generative Programming: Methods, Tools, and Applications. Addison-Wesley, Reading (2000)
[16] Jayaraman, P., Whittle, J., Elkhodary, A.M., Gomaa, H.: Model Composition in Product Lines and Feature Interaction Detection Using Critical Pair Analysis. In: Engels, G., Opdyke, B., Schmidt, D.C., Weil, F. (eds.) MODELS 2007. LNCS, vol. 4735, pp. 151–165. Springer, Heidelberg (2007)
[17] Jouault, F., Kurtev, I.: Transforming Models with ATL. In: Bruel, J.-M. (ed.) MoDELS 2005. LNCS, vol. 3844, pp. 128–138. Springer, Heidelberg (2006)
[18] OpenArchitectureWare, http://www.openarchitectureware.org/
[19] Taentzer, G.: AGG: A Graph Transformation Environment for Modeling and Validation of Software. In: Pfaltz, J.L., Nagl, M., Böhlen, B. (eds.) AGTIVE 2003. LNCS, vol. 3062, pp. 446–453. Springer, Heidelberg (2004)
[20] Czarnecki, K., Helsen, S., Eisenecker, U.W.: Staged Configuration Using Feature Models. In: Nord, R.L. (ed.) SPLC 2004. LNCS, vol. 3154, pp. 266–283. Springer, Heidelberg (2004)
[21] Völter, M., Stahl, T.: Model-Driven Software Development. Wiley, Glasgow (2006)
[22] VML* Download, http://www.steffen-zschaler.de/publications/vmlstar/
[23] Alférez, M., et al.: A Metamodel for Aspectual Requirements Modelling and Composition. AMPLE D1.3 (2007), http://ample.holos.pt/gest_cnt_upload/editor/File/public/AMPLE_WP1_D13.pdf
[24] Alférez, M., et al.: Multi-View Composition Language for Software Product Line Requirements. In: Proceedings of the 2nd International Conference on Software Language Engineering (SLE), Denver, USA (2009)
[25] Sousa, A.: AMPLE Traceability Framework Frontend Manual (2008), http://ample.di.fct.unl.pt/Front-End_Framework/ATF%20Front-end%20Manual.pdf
[26] Generative Software Development Group, University of Waterloo: Feature Modelling Plugin (FMP) for Eclipse, http://gsd.uwaterloo.ca/projects/fmp-plugin/
[27] VML* Download (2009)
[28] Kotonya, G., Sommerville, I.: Requirements Engineering: Processes and Techniques. John Wiley, Chichester (1998)
[29] Sommerville, I., Sawyer, P.: Requirements Engineering: A Good Practice Guide. John Wiley and Sons, Chichester (1997)
[30] Young, T.: Using AspectJ to Build a Software Product Line for Mobile Devices. University of Waterloo (2005), http://www.cs.ubc.ca/grads/resources/thesis/Nov05/Trevor_Young.pdf
[31] Voelter, M.: A Family of Languages for Architecture Description. Presented at the Conference on Object-Oriented Programming, Systems, Languages, and Applications, Orlando, Florida (2008)
[32] Akehurst, D.H., et al.: Supporting OCL as Part of a Family of Languages. In: Proceedings of the MoDELS 2005 Conference Workshop on Tool Support for OCL and Related Formalisms – Needs and Trends (2005)
[33] Visser, E.: WebDSL: A Case Study in Domain-Specific Language Engineering. In: Lämmel, R., Visser, J., Saraiva, J. (eds.) Generative and Transformational Techniques in Software Engineering II. LNCS, vol. 5235, pp. 291–373. Springer, Heidelberg (2008)
[34] Haugen, Ø., et al.: Adding Standardized Variability to Domain Specific Languages. In: Proceedings of the Conference on Software Product Lines (SPLC 2008), pp. 139–148 (2008)
[35] Batory, D., et al.: Scaling Step-Wise Refinement. IEEE Transactions on Software Engineering, 355–371 (2003)
[36] Fuentes, L., et al.: Feature-Oriented Model-Driven Software Product Lines: The TENTE Approach. In: Proceedings of the Forum of the 21st International Conference on Advanced Information Systems Engineering (CAiSE), Amsterdam, The Netherlands (2009)
[37] Thaker, S., et al.: Safe Composition of Product Lines. In: Proceedings of the 6th International Conference on Generative Programming and Component Engineering (GPCE), Salzburg, Austria, pp. 95–104 (2007)
[38] Janota, M., Botterweck, G.: Formal Approach to Integrating Feature and Architecture Models. In: Fiadeiro, J.L., Inverardi, P. (eds.) FASE 2008. LNCS, vol. 4961, pp. 31–45. Springer, Heidelberg (2008)
[39] MOFScript, http://www.eclipse.org/gmt/mofscript/
Multi-view Composition Language for Software Product Line Requirements

Mauricio Alférez¹, João Santos¹, Ana Moreira¹, Alessandro Garcia², Uirá Kulesza¹, João Araújo¹, and Vasco Amaral¹

¹ New University of Lisbon, Caparica, Portugal
² Pontifical Catholic University of Rio de Janeiro, Brazil
{mauricio.alferez,joao.santos,amm,uira,ja,vasco.amaral}@di.fct.unl.pt
[email protected]
Abstract. Composition of requirements models in Software Product Line (SPL) development enables stakeholders to derive the requirements of target software products and, very importantly, to reason about them. Given the growing complexity of SPL development and the various stakeholders involved, their requirements are often specified from heterogeneous, partial views. However, existing requirements composition languages are very limited in their support for generating specific requirements views for SPL products. They do not provide specialized composition rules for referencing and composing elements in recurring requirements models, such as use cases and activity models. This paper presents a multi-view composition language for SPL requirements, the Variability Modeling Language for Requirements (VML4RE). This language describes how requirements elements expressed in different models should be composed to generate a specific SPL product. The use of VML4RE is illustrated with UML-based requirements models defined for a home automation SPL case study. The language is evaluated with additional case studies from different application domains, such as mobile phones and sales management.

Keywords: Requirements Engineering, Software Product Lines, Variability Management, Composition Languages, Requirements Reuse.
Model-based development methods for SPLs [2, 5, 6] support the construction of different models to provide a better understanding of each SPL feature. However, features, which are modeled separately in partial views, must be composed to show the requirements of the target applications. Composing variable and common requirements is a challenging task. Requirements are the early software artifacts most frequently documented in a multi-view fashion. Their description is typically based on significantly heterogeneous languages, such as use cases [7] (a coarse-grained operational view), interaction diagrams [8] (a fine-grained operational view), goal models [9, 10] (an intentional and quality view), and natural language. This varied list of requirements models is a direct consequence of requirements having to be understood by stakeholders with different backgrounds, from customers of specific products to SPL architects, programmers and testers.

However, initial work on compositional approaches [2, 5, 6, 11] for requirements artifacts is rather limited in language support. These approaches do not offer composition operators for combining common and varying requirements based on different partial views. They are also often of limited scope and expressiveness [11]. Therefore, a key problem in SPL remains to be addressed: how to compose elements defined in separated and heterogeneous requirements models using a simple set of operators?

This paper answers this question by proposing the Variability Modeling Language for Requirements (VML4RE), a requirements composition language for SPLs. VML4RE has two main goals: (i) to support the definition of relationships between SPL features expressed in feature models and requirements expressed in multiple models; and (ii) to specify the composition of requirements models for deriving specific SPL products using a simple set of operators. VML4RE provides a set of specialized operators for referencing and composing requirements elements of specific types, based on recognizable abstractions used in the domain of each requirements modeling notation or technique. Such operators can help SPL engineers to understand and choose the composition rules for requirements models. In contrast with conventional, general-purpose languages for model transformation, such as XTend [12], ATL [13] and AGG [14], VML4RE is tailored to requirements composition in a way that is accessible to requirements engineers. This is an important contribution of our work, as it addresses the problem of abstraction mismatch caused by such general-purpose model transformation languages [15, 16]. VML4RE spares SPL designers the burden of language intricacies that are not part of the abstraction level at which they are used to work.

The remainder of this paper is organized as follows. Section 2 presents a set of criteria used when creating the requirements variability composition language. Section 3 describes a case study that is later used to illustrate the VML4RE composition language and creates an example specification. Section 4 presents VML4RE and Section 5 discusses its application to the case study. Section 6 presents the evaluation of the language and discusses its benefits and limitations. Section 7 examines related work and compares it with ours. Finally, Section 8 concludes the paper and points out directions for future work.
2 Criteria to Design VML4RE

SPL requirements engineering handles both common and variable requirements that enable the derivation of customized products of the family. Feature models are used to specify SPL commonalities and variabilities, and feature model configurations are used as a driver during the process of deriving product-specific requirements models. Requirements variability composition is the ability to customize requirements models for specific family products. The customization of model-based requirements implies a composition process where some elements are added, others are removed, and possibly some are modified from the initial models. This section describes five criteria taken into account for the design of VML4RE. These criteria arose from the needs for requirements model specification and composition in the heterogeneous SPLs proposed by the industrial partners in the AMPLE project [17]:

C1: Support Multi-View Variability Composition. Requirements are the early software artifacts most recurrently documented in a multi-view fashion. In this context, variability manifests itself in different kinds of requirements (e.g., functional requirements and quality attributes) and design constraints (e.g., different databases, network types or operating systems) [2]. Modeling the requirements using multiple views facilitates the understanding of the SPL's variabilities and its specific products. This is particularly important in SPL development as it encompasses a number of stakeholders, from customers and managers of specific products to core SPL architects, programmers and testers.

C2: Provide Requirements-Specific Composition Operators. Requirements descriptions are typically based on significantly heterogeneous languages. Specific composition operators for combining common and varying requirements, based on the elements used in different views or models, facilitate the operators' adoption by SPL developers. General-purpose composition languages, such as XTend [12], ATL [13] and AGG [14], require a deep knowledge of the abstract syntax of the models to describe their composition. This highlights the problem of abstraction mismatch and the need for a composition language that does not require additional developer expertise. Requirements engineers should work at the same level of abstraction they are used to [15].

C3: Support Fine- and Coarse-Grained Composition. Requirements models can represent different levels of detail for a specific product. Coarse-grained modeling helps to define the scope of the system to be built by expressing the goals or the main functions of the product. Each coarse-grained element is often associated with a variety of fine-grained elements. The latter provide detailed requirements for what the system must do, or sub-goals of the different parts of the system. For instance, UML provides coarse-grained model elements, such as packages and use cases, to organize the main subsystems and functions of the system to be built. Other models, such as activity diagrams, then support further refinements of use cases. As a result, both fine-grained and coarse-grained composition is required to address the different levels of abstraction employed in SPL requirements engineering.

C4: Support Positive and Negative Variability. In general, there are three means to derive models for a specific SPL product: positive variability, negative variability
and a combination of both. Negative variability is the removal of optional elements from a given structure, whereas positive variability is the addition of optional parts to a given core [18]. Optional elements are related to optional and alternative features of the SPL, and the core part encompasses features that are common to all the products. Sanchez et al. [15] presented a positive-negative modeling technique for variability management, but its composition operators are specific to architectural models. The flexibility provided by a positive-negative approach to composition is also advisable for requirements models, for example in cases where the addition of a model element requires the removal of other elements, as often happens when modeling mutually exclusive features.

C5: Facilitate Trace Links Generation. Variability specification usually leaves implicit the information governing the relationships between each SPL feature and the respective requirements models. Composition methods could support explicit traceability of varying features through the generation of trace links from variability specifications. Hence, traceability information could be used to analyze system evolution properties, such as change impact analysis or requirements coverage.

The five criteria described above formed the basis for the VML4RE design. The use of the VML4RE language assumes a process workflow, which is described in Figure 1. Domain engineering encompasses the creation of a set of artifacts associated with the SPL. These artifacts are reused in application engineering to produce specific SPL products. VML4RE is useful at the first stage of domain engineering, called domain analysis. Variability identification and SPL requirements modeling are the most important activities, which are performed in parallel during domain analysis. During variability identification (Figure 1-A), a distinction is made between core (common) SPL features and the features of specific products. SPL requirements modeling (Figure 1-B) tackles the detailed specification of features using different requirements modeling techniques and notations (related to C1). Composition specification (Figure 1-C) relies on requirements-specific composition rules to specify how to customize requirements models (related to C2). These rules can be based on operators that address both fine- and coarse-grained compositions (related to C3).

The reusable artifacts created in domain engineering are used in application engineering to derive specific product models through the definition of configurations. Existing product derivation tools like pure::variants [19] and Gears [20] mainly allow deriving the complete or partial source code of a product. The input to this derivation is the existing code artifacts produced for an SPL architecture. However, these tools do not provide language support for the derivation of requirements models for a specific product (related to C2). In a VML4RE-centric process, variability resolution (Figure 1-D) implies selecting the variable features to be included in the product. Finally, model derivation (Figure 1-E) is the actual composition of the different models of a specific product. This supports the addition and removal of elements from the initial models (related to C4). Additionally, when deriving the models, appropriate tool support can generate the trace links (Figure 1-F) between the features chosen for the product and the different parts of the requirements models (related to C5).
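As a minimal illustration of the two variability styles, the following self-contained Java sketch uses a plain list of element names as a stand-in for a requirements model; the element and feature names are borrowed from the Smart Home case study introduced in the next section, and none of this is VML4RE syntax.

    import java.util.ArrayList;
    import java.util.List;

    // Negative variability removes optional elements from a superset model;
    // positive variability adds optional elements to a common core.
    public class VariabilityStyles {
        public static void main(String[] args) {
            // negative: start from a model containing every optional element
            List<String> allElements = new ArrayList<>(List.of(
                    "AdjustHeaterValue", "CameraSurveillance", "InternetUI"));
            allElements.removeIf(e -> e.equals("CameraSurveillance") || e.equals("InternetUI"));

            // positive: start from the common core and add what the selected feature needs
            List<String> core = new ArrayList<>(List.of("AdjustHeaterValue"));
            boolean securitySelected = true;          // taken from the product configuration
            if (securitySelected) {
                core.add("SendSecurityNotification"); // element contributed by the feature
            }

            System.out.println(allElements + " / " + core);
        }
    }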
This paper focuses on the Composition Specification activity highlighted in grey in Figure 1. The next section presents the case study and introduces VML4RE as a way of addressing the five criteria just discussed.
3 Case Study: Home Automation

Smart Home is a software product line for home automation being developed by Siemens AG [21]. For brevity and clarity, we rely on a subset of the Smart Home features. The left-hand side of Figure 2 shows the partial feature model of the product line, while the middle of the figure presents one of its possible configurations, the "Economical Smart Home" (to create the models we use the FMP tool [22]). Some optional features are not included in this Economical edition; for example, camera surveillance and the internet user interface are not part of the final product. Hence, these features are not ticked in the product feature model (middle). Figure 3 presents the use case model of the Economical Home as an exemplar of the set of models that we intend to obtain after the composition process. The elements highlighted in grey are related to variable features selected to be included in the Economical Home, while the rest of the elements are related to common features. Table 2 gives an example of the relationships between features and parts of the models. The following sections provide more details on how this model was composed.

Smart Home inhabitants must be able to adjust the heater of the house to their preferred value (Manual Heating feature). In addition, the Smart Heating feature might be activated in a house. If so, a heater control will adjust itself automatically to save energy. For instance, the daily schedule of each inhabitant is stored in the Smart Home gateway. When the house is empty, the heater is turned off and later turned back on in time to reach the desired temperature when the inhabitants return home. The Smart Home can also open or close windows automatically to regulate the temperature inside the house, as an option to save energy (Electronic Windows feature). As an alternative to the electronic windows, the inhabitants can always open and close the windows manually (Manual Windows feature).

There are different types of graphical user interfaces that allow monitoring and managing the different devices of the smart home, as well as receiving security notifications (GUI feature). The available GUI alternatives are touch screens inside the house (Touch Screen feature), or the internet, through a website and a notifier (Internet feature). As far as the Security feature is concerned, inhabitants can initiate the secure mode by activating the glass break sensors and/or camera surveillance devices (Glass Break Sensors and Cameras features). If an alarm signal is sent by any of these devices, then, according to the security configuration of the house, the Smart Home decides to (i) send a notification to the security company and the inhabitants via internet
and touch screens, (ii) secure the house by activating the alarms (Siren and Lights features), and/or (iii) close windows and doors (Electronic Windows feature). Next we introduce VML4RE and illustrate its use with this case study.
Fig. 2. (left) Smart Home Feature Model; (middle) Feature Model Configuration for the Economic Home; (right) Feature Model Notation
Fig. 3. Smart Home Economical Edition Use Case Model
4 VML4RE

This section outlines the VML4RE process, its main elements and its composition semantics.

4.1 VML4RE Process

The VML4RE process is described by instantiating the requirements composition process outlined in Figure 1. Figure 4 shows the specific artifacts employed in each of the activities. For variability identification (Figure 4-A), we employ a feature model that specifies the common and variable features of the SPL, as well as their dependencies. For requirements modeling, we employ various requirements models. In particular, we chose use cases whose detailed behavior is modeled using activity models. This mimics what often happens in mainstream UML-based methods, such as RUP [23]. The further elaboration of use cases with activity models, in contrast to free-format textual descriptions, facilitates the adoption of model-driven generation tools. This alternative provides models that conform to a metamodel (i.e., the metamodel of UML activity diagrams), thereby reducing the ambiguity in the specifications [2]. The detailed specification of use cases as activity models also enables customizations of use cases realizing specific SPL configurations.

During requirements modeling, other models, such as goal models [9, 10], can be used to specify interactions between functional and non-functional requirements. Such models also allow studying the actors and their dependencies, thus encouraging a deeper understanding of the business process. In addition, goal models can be used as a way to introduce intentionality in the elicitation and analysis of requirements. As a consequence, these goals allow the underlying rationale to be exploited in the selection of variants in the application development process [24].

The VML4RE specification (Figure 4-C) references the requirements models and specifies composition rules (also called actions). The VML4RE interpreter (Figure 4-E and F) receives as input the SPL requirements (RE) models (Figure 4-B), the feature model configuration (Figure 4-D) and the VML4RE specification (Figure 4-C). As output, the interpreter generates: (i) the use cases of a product; (ii) activity models that describe product usage scenarios; (iii) additional requirements models, such as goal models (Figure 4-E); and (iv) the trace links between features and specific elements in the requirements models (Figure 4-F).

4.2 VML4RE Main Elements

Each VML4RE specification is composed of three main kinds of elements:

1. Importing: imports the set of requirements models and the feature model that are used in the VML4RE specification. This is accomplished using import sentences.
2. Commonalities: defines the features that are mandatory in every product of an SPL. It is used to reference the parts of the requirements that are related to SPL common features.
3. Variabilities: defines the variable (optional, variation points and variants) features of the SPL. Optional features are not mandatory and might not be included in some of the products of the SPL. A variation point identifies a particular concept
within the SPL requirements specification as being variable, and it offers a number of variants. A variant describes a particular variability decision, such as a specific choice among alternative variants. The variability blocks are used to: (i) reference (through sentences initiated by the keyword ref) the requirements related to each variable feature, and (ii) enclose the operators used to compose the requirements related to each variable feature.
Fig. 4. Artifacts and Composition Workflow
The VML4RE specification outline (in Figure 4-C) contains separate blocks for import sentences, common features like X, and variable features like Y, Y1, Y2 and Z. Each optional, variationPoint and variant block can have select and unselect sub-blocks. These indicate the set of references and actions that are taken into account depending on whether the feature is selected in the feature model configuration. Thus, given that Y and Y1 are selected in the feature model configuration, the actions and references
inside the select block of feature Y1 are executed. The actions and references inside the unselect blocks of the Y2 and Z features are also executed.

4.3 References and Composition Operators

VML4RE provides references to indicate which elements in the requirements models are related to specific features. It also provides a set of specialized operators for composing requirements model elements like use cases, packages, activities or goals. The upper part of Table 1 summarizes the structure of the elements related to references. In VML4RE specifications, ref statements allow creating references between the different common, optional and alternative features and specific parts of the models. In ref statements, it is possible to use designators (e.g., ".", "equal") and quantification (e.g., "*", which denotes all the elements inside a model element). Logical operators like "and" and "or" can be used to create more complex query expressions over the models. Listing 1 provides examples of references to packages, activities and use cases that are explained in the Smart Home description below.

Table 1 also summarizes the structure of some composition operators. These include operators that are relevant to use case, activity and goal models (in particular, the strategic dependency model of the i* [10] goal-oriented approach). Analogous to the insert operators that add parts to the base model, we have replace and remove operators. The complete metamodel and grammar of the language can be found in [25].

The semantics of each VML4RE composition operator can be defined in terms of a model-to-model transformation. For instance, the "Insert Use Case Links" operator, used with the use case link type associatedWith, connects an actor and a use case through an association link (for example, insert(UCLinks_of_type: associatedWith {from actorD to useCaseModelA.PackageB.useCaseC});). The intended transformation of the use case model can be presented by the left-hand side (LHS) and right-hand side (RHS) graphs in Figure 5, where the inputs are a use case model, a use case, the use case's package, and an actor. If there is already an association between the actor and the use case in the same package, the transformation is not applied, to avoid duplicates. This is expressed by the crossed-out elements in the LHS graph, which act as negative application conditions (NACs): any match of the LHS graph must not contain a packageB with an existing association between actorD and useCaseC. In general, a graph transformation is given by a graph rule r: L → R from an LHS graph L to an RHS graph R. The process of applying r to a graph G involves finding a graph monomorphism, h, from L to G and replacing h(L) in G with h(R) [26]. The notation used to define this graph transformation is similar to the one used in [27], where the LHS and RHS patterns are denoted by a generalized form of object diagrams. However, for visual simplicity we added dashed lines between elements to represent any number of containments (in this case, package containments). We refer readers interested in the details of this notation to [27].

Figure 6 illustrates the replace operator with the example "Replace use case". A replace in this context removes a use case and then inserts a new use case, linked in the place of the old one (for example, replace(useCase useCaseModelA.useCaseB by useCase useCaseC);).
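To make the graph rules just described more concrete, the following Python sketch shows one way an "insert association" rule, including its negative application condition, could be applied programmatically. The model classes and the insert_association function are our own illustrative assumptions and are not part of the VML4RE implementation.

    class Association:
        def __init__(self, actor, use_case):
            self.actor, self.use_case = actor, use_case

    class Package:
        def __init__(self, name):
            self.name = name
            self.use_cases = {}      # use case name -> use case object
            self.associations = []   # Association instances

    def insert_association(package, actor, use_case_name):
        # LHS match: the target use case must exist inside the package.
        use_case = package.use_cases.get(use_case_name)
        if use_case is None:
            return False
        # NAC: do not apply the rule if the association already exists.
        for assoc in package.associations:
            if assoc.actor == actor and assoc.use_case == use_case:
                return False
        # RHS effect: add the new association between actor and use case.
        package.associations.append(Association(actor, use_case))
        return True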
Table 1. Some of the VML4RE elements

Description and structure of some elements related to references:

- Reference: identifies one or more requirements model elements. References are made to specific types of elements in the models, expressed using the designator ofType, which allows querying based on the type of model element (ElementType), e.g., UseCase, Activity, Actor, or Element when the referenced model elements are of different types.
  Reference : "ref" ref_name ofType ElementType "{" (RefExpression | ref_name2) WhereDeclaration? "}";
  RefExpression : elementName (("." RefExpression) | ".*")?;
- Where Declaration: an optional part of a reference expression that allows querying a set of model elements based on their name.
  WhereDeclaration : "Where" "(" Expression ")";
- Expression: some of the possible designators are equal, different, startsBy, finishesWith, and contains. They compare a literal with the names of model elements of a specific type (for instance, startsBy, finishesWith and contains match the first letters, the last letters, or any place in the name, respectively). Expressions can be combined with logical operators like and and or to create more complex queries.
  Expression : BooleanExpression (SubExpression)*;
  SubExpression : Operator BooleanExpression;
  BooleanExpression : "contains" literal | "equal" literal | "different" literal | "startsBy" literal | "finishesWith" literal;

Description and structure of some actions:

- Insert Package: insertion of a package in a use case model, or in another package.
  "package" package_name "into" RefExpression;
- Insert Use Case: insertion of a use case into a use case model or inside a package.
  "useCase" useCase_name "into" RefExpression;
- Insert Use Case Links: insertion of different relationships between elements in a use case model.
  "UClinks_of_type:" UseCaseLinkType "{" UCElementsLinkage+ "}";
- UC Elements Linkage: helps to factorize the insertion of relationships in a use case model (Insert Use Case Links) according to the UseCaseLinkType, for a better organization of the actions.
  "from" RefExpression "to" RefExpression ("," RefExpression)*;
- Use Case Link Type: available relationships between use cases (inherits, extends, includes) and between actors and use cases (associatedWith, and biAssociatedWith for bidirectional relationships).
  ("inherits" | "extends" | "includes" | "associatedWith" | "biAssociatedWith");
- Insert Actor: insertion of an actor into a use case model or package.
  "actor" actorName "into" RefExpression;
- Insert Activity: inserts an activity into an activity model.
  "activity" (newActivityName "into" RefExpression);
- Activity Elements Flow: helps to factorize the insertion of relationships in an activity model (InsertActivityLinks, not shown in this table) and, optionally, to add a guard condition.
  "from" RefExpression "to" RefExpression ("with guard" guardCondition)?;
- Replace Use Case: replaces a use case by a new one.
  "useCase" RefExpression "by" "useCase" newUseCaseName;
- Replace Activity: replaces an activity by a new activity or a complete activity model.
  "activity" RefExpression "by" (("activity" newActivityName) | ("activityModel" RefExpression));
- Insert iGoal: inserts an i* goal (indicated by the i in iGoal) in a strategic dependency model.
  "iGoal" goalName "into" RefExpression;
- Insert iGoal dependencies: insertion of different dependency relationships between elements in a strategic dependency model.
  "IGoalDependencies_of_type:" iGoalDependencyType "{" iGoalElementsLinkage+ "}";
- iGoal Elements Linkage: the links between the nodes in the strategic dependency diagram go from depender to dependee through dependum.
  "from" RefExpression "to" RefExpression "through" dependumName;
- iGoal Dependency Type:
  ("resourceDependency" | "taskDependency" | "goalDependency" | "SoftGoalDependency");
The advantage of specifying model compositions with a pure graph transformation approach is its expressiveness, since it gives access to all the elements of the metamodel. However, software modelers typically do not have the in-depth knowledge of the intricacies of the requirements metamodels required to specify a graph rule [28]. The actions in VML4RE do not require any knowledge about the details of the metamodels; they provide requirements-specific composition operators that facilitate the specification of model compositions.
Fig. 5. Graph Rule to Insert an Association between actorD and useCaseC in PackageB
Fig. 6. Graph Rule to Replace UseCaseB by UseCaseC
5 Applying VML4RE

This section illustrates the use of the references and some VML4RE actions for domain and application engineering.

5.1 VML4RE in Domain Engineering

The Smart Home requirements were modeled with use case and activity models created with the UML2Tools plug-in [29]. The FMP tool [22] was used to build a feature model to specify SPL commonalities and variabilities; this tool supports cardinality-based feature models. The relations between the models are specified with VML4RE. The VML4RE editor is implemented using Xtext [30], a framework for the development of textual DSLs. It is part of the VML4RE tool suite [25], implemented in the Eclipse platform as a set of extensible plug-ins. It is based on openArchitectureWare [12], a model-driven development infrastructure, and the Eclipse Modeling Framework (EMF) [31].

Listing 1 shows a partial view of this specification. Initially, the different requirements and feature models are imported to be used in the specification (lines 2-4). In the VML4RE specification, the modeler can create references to requirements models. For instance, it is possible to reference a specific element in a model, like an actor, as in "ref Heater ofType Actor {uc.Heater}" (line 10); or all the elements (e.g., use cases, packages, actors) inside one container element, e.g., "ref AllHeatingElementsInUCs ofType Element {uc.Heating.*}" (lines 13-15); or elements in different parts of the models according to a search condition, like "ref SurDev ofType
Activity {ams.* Where equal VerifyInstalledSurveillanceDevice}", which searches the set of activity models for activities named "VerifyInstalledSurveillanceDevice" (lines 51-52).
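As a side note, such a reference can be thought of as a query over the model's elements. The Python sketch below is our own illustration of how an ofType reference with a Where equal condition could be evaluated; the flat model representation and the eval_ref function are assumptions, not the VML4RE interpreter's actual data structures.

    # Illustrative evaluation of a reference such as
    #   ref SurDev ofType Activity {ams.* Where equal VerifyInstalledSurveillanceDevice}
    # The model is assumed to be a list of (qualified_name, element_type) pairs.

    def eval_ref(elements, element_type, scope, where_equal=None):
        result = []
        for qualified_name, etype in elements:
            if element_type != "Element" and etype != element_type:
                continue                                  # ofType filter
            if not qualified_name.startswith(scope):
                continue                                  # scope such as "ams."
            simple_name = qualified_name.split(".")[-1]
            if where_equal is not None and simple_name != where_equal:
                continue                                  # Where equal condition
            result.append(qualified_name)
        return result

    # Hypothetical model contents:
    model = [("ams.ActivateSecureMode.VerifyInstalledSurveillanceDevice", "Activity"),
             ("ams.ActivateSecureMode.WaitForAlarmSignal", "Activity")]
    print(eval_ref(model, "Activity", "ams.", "VerifyInstalledSurveillanceDevice"))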
The VML4RE specification also employs actions to specify how variable requirements model elements are composed with common requirements model elements. Listing 1 presents several actions to be applied to activity and use case models, for example the insertion of the Security package into the use case model uc (line 28), the insertion of the SecureTheHouse use case into the Security package (lines 31-32), and the insertion of an association between the GlassbreakSensor actor and the use case SecureTheHouse (lines 49-50).

5.2 VML4RE in Application Engineering

In application engineering, the feature model configuration is used as a driver during the process of deriving product-specific requirements models. Figure 2 (middle) shows
the feature model configuration of an Economical Smart Home. The Economical Smart Home does not have camera surveillance and does not use the internet to send security notifications. The VML4RE interpreter processes the SPL requirements models and the feature model configuration to derive product-specific requirements models. During this process, we can use a positive, a negative, or a mixed positive-negative variability transformation strategy. Our interpreter first includes all the requirements model elements related to mandatory features by processing the respective ref statements specified inside the common feature blocks. These elements are also called the core model in our approach, since they are included in every SPL instance. After that, the interpreter processes the ref statements and actions of the variabilities. In this Smart Home example, we illustrate the use of VML4RE in conjunction with a positive variability approach, since we mostly use actions that add optional parts to the base model. Finally, product-specific requirements models are produced by processing the VML4RE specification (Listing 1). Given the possibility of defining, in a single VML4RE specification, the relationships between a feature model and several requirements models (e.g., use case and activity models), our interpreter produces the different product-specific requirements models in just one step. Our current implementation [25] supports the derivation of use case and activity models, and we are working to address other models (i* [10] and KAOS [32], for example).

During the Economical Smart Home derivation process, the actions and references related to Internet and Cameras were not included; for instance, the reference and action related to the Cameras actor (Listing 1, lines 58-59). The result of the composition of the use case model is shown in Figure 3, where the elements added to the core model are highlighted in grey. In addition to the use case model, other requirements models were transformed according to the execution of VML4RE actions. Figure 7 shows the ActivateSecureMode activity model, related to the use case with the same name. When the Security optional feature is chosen, the actions contained in the select block of the Security variation point are performed (Listing 1, lines 28-38), and the ActivateSecureMode activity model (Figure 7, left) is included in the final product requirements models.

During the derivation of product-specific requirements models, some of the generic activities in an activity model can be replaced with others that are more specific to the product being configured. This happens in the Economical Home, where the GlassBreakSensors are the only surveillance devices selected in the configuration. Hence, we could create a simple replacement of the VerifyInstalledSurveillanceDevices activity by the VerifyInstalledGlassBreakSensors activity, as shown in Listing 1 (lines 47-48). As there are probably other activities for the verification of surveillance devices in the requirements models, we use the Where operator. The result of the replacement is shown in Figure 7 (right). If two (or more) variants of an OR feature are selected, such as the Intrusion Detection, our interpreter produces two (or more) different activity models, one for each instance. This strategy was developed to avoid conflicts in the transformation of the same activity model.
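The derivation order just described, core elements first and then the variable features selected in the configuration, can be pictured with the following Python sketch. The data structures and the derive_product function are simplified assumptions of ours, not the implementation available at [25].

    # Sketch of positive-variability product derivation: mandatory (core) elements
    # are included first, then the actions of the selected variable features run.

    def derive_product(common_blocks, variability_blocks, selected_features):
        # common_blocks: list of (feature, [element]) pairs from ref statements.
        # variability_blocks: list of (feature, select_actions, unselect_actions),
        # where each action is a callable that edits the product model in place.
        product_model = set()
        trace_links = []

        for feature, elements in common_blocks:        # 1) build the core model
            for element in elements:
                product_model.add(element)
                trace_links.append((feature, element))

        for feature, select_actions, unselect_actions in variability_blocks:
            chosen = select_actions if feature in selected_features else unselect_actions
            for action in chosen:                      # 2) insert/replace/remove actions
                action(product_model)

        return product_model, trace_links

    # Hypothetical usage:
    core = [("Heating Management", ["Heating", "Heater"])]
    variable = [("Security", [lambda m: m.add("SecureTheHouse")], [])]
    model, links = derive_product(core, variable, {"Security"})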
Fig. 7. Simplified Smart Home ActivateSecureMode Before and After a Replace Activity Action
VML4RE supports, as part of product derivation, the generation of trace links between features and elements in other models, such as use cases and activity diagrams. This derivation is based on the ref sentences inside each of the common and variable blocks. Each ref in the VML4RE specification can determine several references between model elements from the feature model and the SPL requirements models. There may also be cases where an element in a requirements model is referenced by more than one feature. VML4RE specifications are processed automatically by our tool [25, 33] to generate the full set of links involving SPL requirements models (see Section 2, C5). Table 2 presents a partial set of the trace links relevant to the feature Heating Management. These links are created based on references such as the ones in Listing 1, lines 9-10 and 13-14: lines 9-10 refer to the package Heating and the actor Heater, and lines 13-14 refer to any kind of element inside the Heating use case package, like the use cases Control Temperature Automatically and Adjust Heater Value.

Table 2. Part of the Trace Links Generated by the References in the Heating Management Feature

Feature             | Element Name                       | Type
Heating Management  | Heating                            | Package
                    | Heater                             | Actor
                    | Control Temperature Automatically  | UseCase
                    | Adjust Heater Value                | UseCase
                    | Smart Heating                      | ActivityModel
…                   | …                                  | …
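Trace links like those in Table 2 lend themselves to simple automated checks, anticipating the consistency analysis discussed in Section 6. The Python sketch below is our own illustration, not the traceability framework of [33], and the link format is an assumption.

    # Elementary consistency checks over feature-to-requirements trace links.

    def check_consistency(trace_links, all_features, all_elements):
        # trace_links: set of (feature, element) pairs, as in Table 2.
        linked_features = {feature for feature, _ in trace_links}
        linked_elements = {element for _, element in trace_links}
        features_without_requirements = all_features - linked_features
        elements_without_features = all_elements - linked_elements
        return features_without_requirements, elements_without_features

    # Hypothetical usage with some of the Heating Management links:
    links = {("Heating Management", "Heating"),
             ("Heating Management", "Heater"),
             ("Heating Management", "Adjust Heater Value")}
    print(check_consistency(links,
                            {"Heating Management", "Cameras"},
                            {"Heating", "Heater", "Adjust Heater Value", "Siren"}))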
6 Evaluation and Discussion

This section discusses the benefits and limitations of VML4RE based on our experience from applying the language. We have evaluated the usefulness of VML4RE in three case studies [21], two of them proposed by partners of the European AMPLE project [17]: the Smart Home, proposed by Siemens AG [34], and a slice of a customer relationship management system (the Sales Scenario), developed by SAP AG [35]. The third case study is a product line for handling mobile media [36]. These three product lines are from different domains and exhibit different kinds of variability (e.g., options and variants). All of them encompassed textual requirements. Feature models and UML use cases were available for Mobile Media and the Sales Scenario, and an activity model was also available for the latter. The activity models of Mobile
Media and Smart Home were translated from informal textual use case scenarios. The output models were validated by the original developers of the case studies. The goal models for the Sales Scenario system were produced by two teams of postgraduate students at Universidade Nova de Lisboa, based on the use case scenarios and market requirements provided by SAP AG. We evaluate the usefulness of VML4RE based on the criteria for requirements model composition defined in Section 2, and then discuss additional benefits and limitations of VML4RE.

C1: Support Multi-view Variability Composition: Each feature block in VML4RE concentrates the actions related to that feature, and these actions can transform models in multiple requirements views. VML4RE was initially designed to support the composition of two of the most commonly used requirements modeling techniques, use cases and activity models, which address coarse- and fine-grained operational views of the requirements. We have also started using it with very different kinds of requirements modeling, like the goal-oriented modeling technique i* [10], which addresses a quality and intentionality view of the requirements, as happened in the case of the Sales Scenario.

C2: Provide Requirements-Specific Composition Operators: As presented in Table 1, VML4RE provides specialized operators for composing requirements model elements of specific types, such as use cases, packages, activities or goals. The composition operators are simple and did not require the modeler to have deep knowledge of the relationships between the metamodel's metaclasses. For instance, the UML 2.0 metamodel for use cases has metaclasses like Property, Association and Classifier. These metaclasses are important in the design of the transformations, but they are not needed when writing compositions with VML4RE. The composition description was simple in the three case studies because it was based on a vocabulary used in the domain of each modeling technique (e.g., use case, associatedWith, package, dependency).

C3: Support Fine- and Coarse-Grained Composition: In the three case studies the coarse-grained composition was performed in terms of broadly-scoped elements, such as packages and use cases; the operators "remove package" and "insert ... use case" are examples of such cases. VML4RE also addresses fine-grained composition when the actions are performed within coarse-grained elements; the operators "insert activity" and "insert activity links" are examples of such cases.

C5: Facilitate Trace Links Generation: As explained in Section 5.2, our approach supports the derivation of trace links. These links record relations between features specified in feature models and other requirements model elements pertaining to the SPL or to a specific product. This is accomplished with the reference sentences that are processed by the tool suite. We are currently exploring different traceability scenarios that process these relationships to expose useful information. This information could be exploited in many activities, such as discovering candidates for bad feature interactions and visualizing variations in different requirements models. Many of these traceability functionalities also facilitate the job of SPL architects, as they are valuable for analyzing design change impact when evolving SPL features and requirements.
C4: Support Positive and Negative Variability: VML4RE offers operators to support positive variability (e.g., insert) and negative variability (through the remove and replace operators). Positive variability presents some advantages for modeling and composing requirements models. For example, requirements modeling is characterized by
the incremental acquisition of knowledge about the system. In this sense, starting with a relatively small and easy-to-understand set of models is a good starting point. Then, as the developer learns more about each feature of the SPL, s/he can incrementally specify how each new variable feature modifies the existing models. Positive variability also helps to manage variability in time. If the core model is created using generic requirements, then the requirements models are more flexible and can accommodate future specific requirements that instantiate the generic ones. Take, for example, Figure 7 (left): it specifies that, at some point, it is necessary to verify the installed surveillance devices. This requirement can be instantiated not only by the current surveillance devices, like Glass Break Sensors and Cameras, but also by other surveillance devices that were not initially considered in the SPL.

While modeling with VML4RE we saw additional benefits of the composition of requirements models. For example:

Testing and Understanding the Behavior of Specific Products: The automatic derivation of requirements models for a specific product is useful both to understand which requirements and features are involved in the development of an SPL product, and to support testing and documentation activities. In particular, activity models are an example of requirements artifacts that are well suited to business process modeling and to modeling the logic captured by a single use case or scenario, as happened during the modeling of the Sales Scenario. Activity models can provide a basis for understanding and validating the behavior of parts of an SPL product in the presence or absence of specific features. Also, using goal-based modeling in the Sales Scenario allowed us to understand the dependencies between the actors, thus encouraging a deeper understanding of the business process.

Consistency Checking between Feature Models and other Requirements Models: Producing different models for large systems like SPLs can be a difficult, time-consuming, and highly error-prone task if appropriate supporting tools are not available. During the realization of our three case studies, we noticed that the trace links generated from VML4RE specifications can be processed by our traceability framework [33] to detect inconsistencies between features and requirements in different models. Examples of such inconsistencies are: (i) the absence of features related to specific requirements; (ii) the absence of requirements related to specific features; and (iii) conflicts between features acting over the same requirements (which may or may not be valid). The consistency management of the relationships between features and requirements models is also fundamental to support the traceability functionalities mentioned above, especially in SPL evolution scenarios.

Finally, we came across some issues during the application of the compositions. When creating the composition actions for each variation point against the core model, the modeler may make assumptions regarding the existence, position and names of model elements. However, the models change after each insertion, replacement or deletion of model elements, which can prevent the application of some subsequent actions.
It is therefore necessary to determine the best precedence order for the application of the actions in each variation point, and also for the application of each variation point after model modifications. Existing formal methods and model-checking techniques and tools, like simulation or critical pair analysis as introduced by Jayaraman et al. [16], may be the first solution candidates.
7 Related Work

Most of the work on feature composition is focused on the implementation level, such as AHEAD [37] and pure::variants [19]. A few languages focus on the architecture level, like VML4Architecture [38] and Koala [39]. Recently, some approaches were proposed to support the definition of relationships between SPL features and requirements and the composition of requirements models.

Pohl [2] separates variability from functional information in an orthogonal model. He proposes a variability metamodel that includes two relationships: the Artifact Dependency relationship (which relates variants to development artifacts) and the VP Artifact Dependency relationship (which relates variation points to development artifacts). These elements enable the definition of links between the variability model and other development artifacts. Nevertheless, this work is focused on documenting variability rather than on expressing how to specify the composition of requirements models.

Czarnecki and Antkiewicz [6], and Bragança and Machado [40], create explicit relationships between features and requirements. Czarnecki and Antkiewicz propose a general template-based approach which enables the creation of relationships between the elements (abstractions and relationships) of an existing model and the corresponding features through a set of annotations. The annotations are used mainly to indicate presence conditions of specific model elements or model templates according to feature occurrences. In contrast to VML4RE, which allows positive, negative or combined positive-negative variability, Czarnecki and Antkiewicz [6] employ only negative variability. Bragança and Machado use a simplified feature model based on the one proposed by Czarnecki and Antkiewicz and employ UML notes in use case diagrams to indicate variability. These notes are linked to includes and extends relationships, providing variability data. The main disadvantage of these two approaches [6, 40] is that they fail to fully separate functional and variability information, as they use intrusive graphical elements, such as presence conditions or notes, in their models to indicate variability. Hence, variability information may be scattered across and pollute the models, making them difficult to understand and maintain.

Gomaa [5] extends UML-based modeling methods for single systems to address software product lines. He uses stereotypes to indicate variability, models use case packages as features in a feature model, and manually relates features with other model elements using matrices. Variability stereotypes and other kinds of stereotypes are mixed in the same models, reducing the understandability of the models. Although the previous approaches provide techniques to establish relationships between feature and requirements models, they lack a language to specify the actual composition of the different requirements models. Our work proposes a requirements-specific language and tool support to deal with the composition of requirements models for software product lines.

There are other approaches that provide languages to create reference expressions and composition rules. XWeave [18], for example, supports the composition of different architectural viewpoints. It composes crosscutting concerns encapsulated as aspect models into non-aspect-oriented base models, thus following an asymmetric composition approach (though this could be extended to a symmetric approach with relatively little effort).
XWeave is similar to our approach in that composition is based on matching the names of elements in the aspect and the base model. It
employs an OCL-like expression language [12] that plays the role of VML4RE's references. However, it does not provide requirements-specific composition operators.

MATA [28] is an aspect-oriented approach to model composition based on graph rewriting formalisms that can be used to compose models of different SPL features to create product-specific models [16]. It employs graphical patterns that resemble the concrete syntax of specific kinds of UML models (e.g., state machines). In aspect-oriented terminology, the graphical patterns can be thought of as pointcuts and the composition operators as advices. Similarly, in VML4RE, references can be thought of as pointcuts and actions as advices. In comparison to MATA, VML4RE provides simpler operators that are especially tailored to facilitate writing compositions of requirements models. However, VML4RE can complement MATA by providing concrete language support to express, in the same code block of each feature, the references and composition rules for all the different requirements views. VML4RE, together with similar variability composition languages focused on architecture, like VML4Architecture [38], could be used as an alternative front-end for MATA.

Apel et al. [11] employ superimposition of feature-related model fragments as a general model composition technique. We believe that this technique can be especially useful in requirements engineering to compose coarse-grained models that keep a common structure in a positive-variability setting. However, to be useful for a broader range of requirements models, it requires language support to also express positive-negative variability, and to reference potentially multiple composition points for model fragments during fine-grained composition.
8 Conclusions and Future Work

VML4RE addresses the question of how to compose elements defined in separate and heterogeneous requirements models using a simple set of operators. It was designed taking into account the five fundamental criteria discussed in Section 2, and Section 6 reviewed how these criteria are addressed. VML4RE presents a contribution to the field of language support for composing SPL requirements due to its unique characteristics: (1) each feature block (e.g., common, variant) concentrates a cohesive set of actions that can transform models in multiple requirements views; (2) its composition operators are especially tailored for canonical requirements models and rely on a vocabulary familiar to requirements engineers; (3) there is an explicit separation between the modeling of variability and of requirements, without forcing the intrusive inclusion of variability-related elements in requirements models; (4) its operators can add, remove or replace parts of the models, thus supporting both positive and negative variability; and (5) references facilitate the creation of compositions and the generation of trace links.

Currently, we are investigating the application of model-driven techniques to keep the relationships between SPL variability and requirements models consistent during model evolution. We are also studying an effective way to determine the best precedence order for the application of the actions in each variation point, and for the application of each variation point after model modifications. Finally, we are interested in demonstrating the use of our language with other requirements views and in improving the usability of VML4RE by adding a graphical concrete syntax.
Acknowledgments. This work is supported by the European FP7 STREP project AMPLE [17].
References

1. Clements, P., Northrop, L.M.: Software Product Lines: Practices and Patterns. Addison-Wesley, Boston (2002)
2. Pohl, K., Böckle, G., van der Linden, F.: Software Product Line Engineering: Foundations, Principles and Techniques. Springer, Berlin (2005)
3. Czarnecki, K., Eisenecker, U.W.: Generative Programming: Methods, Tools, and Applications. ACM Press/Addison-Wesley Publishing Co. (2000)
4. Kang, K., Cohen, S., Hess, J., Novak, W., Peterson, A.: Feature-Oriented Domain Analysis (FODA) Feasibility Study. Technical Report CMU/SEI-90-TR-021, Software Engineering Institute, Carnegie Mellon University (1990)
5. Gomaa, H.: Designing Software Product Lines with UML: From Use Cases to Pattern-Based Software Architectures. Addison-Wesley, Reading (2004)
6. Czarnecki, K., Antkiewicz, M.: Mapping Features to Models: A Template Approach Based on Superimposed Variants. In: Glück, R., Lowry, M. (eds.) GPCE 2005. LNCS, vol. 3676, pp. 422–437. Springer, Heidelberg (2005)
7. Alexander, I., Maiden, N.: Scenarios, Stories, Use Cases. Wiley, Chichester (2004)
8. Unified Modeling Language (UML) Superstructure, Version 2.1.2: 2007-11-02
9. Chung, L., Nixon, B., Yu, E., Mylopoulos, J.: Non-Functional Requirements in Software Engineering. Kluwer Academic Publishers, Dordrecht (1999)
10. i*: an Agent-oriented Modelling Framework, http://www.cs.toronto.edu/km/istar/
11. Apel, S., Janda, F., Trujillo, S., Kästner, C.: Model Superimposition in Software Product Lines. In: Paige, R.F. (ed.) ICMT 2009. LNCS, vol. 5563, pp. 4–19. Springer, Heidelberg (2009)
12. openArchitectureWare, http://www.openarchitectureware.org/
13. Jouault, F., Kurtev, I.: Transforming Models with ATL. In: Bruel, J.-M. (ed.) MoDELS 2005. LNCS, vol. 3844, pp. 128–138. Springer, Heidelberg (2006)
14. Taentzer, G.: AGG: A graph transformation environment for modeling and validation of software. In: Pfaltz, J.L., Nagl, M., Böhlen, B. (eds.) AGTIVE 2003. LNCS, vol. 3062, pp. 446–453. Springer, Heidelberg (2004)
15. Sánchez, P., Loughran, N., Fuentes, L., Garcia, A.: Engineering Languages for Specifying Product-derivation Processes in Software Product Lines. In: Gašević, D., Lämmel, R., Van Wyk, E. (eds.) SLE 2008. LNCS, vol. 5452, pp. 188–207. Springer, Heidelberg (2009)
16. Jayaraman, P., Whittle, J., Elkhodary, A., Gomaa, H.: Model Composition in Product Lines and Feature Interaction Detection Using Critical Pair Analysis. In: Engels, G., Opdyke, B., Schmidt, D.C., Weil, F. (eds.) MODELS 2007. LNCS, vol. 4735, pp. 151–165. Springer, Heidelberg (2007)
17. AMPLE Project, http://www.ample-project.net/
18. Groher, I., Völter, M.: XWeave: Models and Aspects in Concert. In: 10th International Workshop on Aspect-Oriented Modeling. ACM, Vancouver (2007)
19. pure::variants, http://www.pure-systems.com/Variant_Management.49.0.html
20. Gears, http://www.biglever.com/
21. Morganho, H., Gomes, C., Pimentão, J.P., Ribeiro, R., Grammel, B., Pohl, C., Rummler, A., Schwanninger, C., Fiege, L., Jaeger, M.: Requirement Specifications for Industrial Case Studies. Technical Report D5.2, AMPLE Project (2008)
22. Antkiewicz, M., Czarnecki, K.: FeaturePlugin: Feature Modeling Plug-in for Eclipse. In: 2004 OOPSLA Workshop on Eclipse Technology eXchange, pp. 67–72. ACM Press, Vancouver (2004)
23. Kruchten, P.: The Rational Unified Process: An Introduction. Addison-Wesley Longman Publishing Co., Inc., Amsterdam (2003)
24. González-Baixauli, B., Laguna, M.A., Leite, J.C.S.d.P.: Using Goal-Models to Analyze Variability. In: Variability Modelling of Software-intensive Systems, Limerick, Ireland (2007)
25. Variability Modeling Language for Requirements, http://ample.di.fct.unl.pt/VML_4_RE/
26. Rozenberg, G. (ed.): Handbook of Graph Grammars and Computing by Graph Transformation, vol. I: Foundations. World Scientific Publishing Co., Inc., River Edge (1997)
27. Markovic, S., Baar, T.: Refactoring OCL Annotated UML Class Diagrams. In: Briand, L.C., Williams, C. (eds.) MoDELS 2005. LNCS, vol. 3713, pp. 280–294. Springer, Heidelberg (2005)
28. Whittle, J., Moreira, A., Araújo, J., Jayaraman, P., Elkhodary, A., Rabbi, R.: An Expressive Aspect Composition Language for UML State Diagrams. In: Engels, G., Opdyke, B., Schmidt, D.C., Weil, F. (eds.) MODELS 2007. LNCS, vol. 4735, pp. 514–528. Springer, Heidelberg (2007)
29. MDT-UML2Tools, http://www.eclipse.org/uml2/
30. Xtext Reference Documentation, http://www.eclipse.org/gmt/oaw/doc/4.1/r80_xtextReference.pdf
31. Eclipse Modeling Framework, http://www.eclipse.org/modeling/emf/?project=emf
32. Goal-Driven Requirements Engineering: the KAOS Approach, http://www.info.ucl.ac.be/~avl/ReqEng.html
33. Sousa, A., Kulesza, U., Rummler, A., Anquetil, N., Mitschke, R., Moreira, A., Amaral, V., Araújo, J.: A Model-Driven Traceability Framework to Software Product Line Development. In: 4th Traceability Workshop, held in conjunction with ECMDA, Berlin, Germany (2008)
34. Siemens AG - Research & Development, http://www.w1.siemens.com/innovation/en/index.php
35. SAP AG, http://www.sap.com/about/company/research/centers/dresden.epx
36. Figueiredo, E., Cacho, N., Sant'Anna, C., Monteiro, M., Kulesza, U., Garcia, A., Soares, S., Ferrari, F.C., Khan, S., Filho, F.C., Dantas, F.: Evolving Software Product Lines with Aspects: An Empirical Study on Design Stability. In: ICSE 2008. ACM, Leipzig (2008)
37. AHEAD Tool Suite, http://www.cs.utexas.edu/users/schwartz/ATS.html
38. Loughran, N., Sánchez, P., Garcia, A., Fuentes, L.: Language Support for Managing Variability in Architectural Models. In: Pautasso, C., Tanter, É. (eds.) SC 2008. LNCS, vol. 4954, pp. 36–51. Springer, Heidelberg (2008)
39. van Ommering, R., van der Linden, F., Kramer, J., Magee, J.: The Koala Component Model for Consumer Electronics Software. Computer 33(3), 78–85 (2000)
40. Bragança, A., Machado, R.J.: Automating Mappings between Use Case Diagrams and Feature Models for Software Product Lines. In: SPLC 2007, pp. 3–12. IEEE Computer Society, Kyoto (2007)
Yet Another Language Extension Scheme

Anya Helene Bagge

Bergen Language Design Laboratory, Dept. of Informatics, University of Bergen, Norway
[email protected]
Abstract. Magnolia is an experimental programming language designed to try out novel language features. For a language to be a flexible basis for new constructs and language extensions, it will need a flexible compiler, one where new features can be prototyped with a minimum of effort. This paper proposes a scheme for compilation by transformation, in which the compilation process can be extended by the program being compiled. We achieve this by making a domain-specific transformation language for processing Magnolia programs, and embedding it into Magnolia itself.
1 Introduction
Implementing a compiler for a new programming language is a challenging but exciting task. As the language design evolves, the compiler must be updated to support the new design or to prototype the design of new features. Magnolia is both an experimental programming language and a language for language experiments. We therefore need a compiler flexible enough to keep up with changes in the language design, and with features that make the implementation of experimental features easy.

Use cases for a language extension facility include experimental features such as data-dependency-based loop statements, embedding of domain-specific languages, restriction to sub-languages with stricter semantics, and language implementation using a simple core language with the rest built as extensions.

In Magnolia, the programmer can express extra knowledge about abstractions as axioms. In the compiler, we would therefore like to preserve abstractions for as long as possible, in order to take advantage of axioms. Language extensions also provide abstractions, with knowledge we may also want to take advantage of. Desugaring extensions to lower-level language constructs at an early stage, as is done with syntax macros, discards any special meaning associated with the constructs, which could have been used for optimisation and extension-specific error checking.

The Magnolia compiler is implemented in Stratego/XT [1], using compilation by transformation, where a sequence of transformation steps transforms code in the source language to a target language (object code, or another programming
language). It is therefore natural to make use of transformation techniques for describing language extension. This paper presents an extension of the Magnolia language with transformation-based meta-programming features, so that extensions to the Magnolia language can be made in Magnolia itself, rather than by extending the Stratego code of the compiler. This gives more independence from the underlying compiler implementation. The rest of this paper is organised as follows. First, we give a brief introduction to the Magnolia language, before we look at how to add language extension to it (Section 3). We have two extension facilities, macro-like operation patterns (Section 3.1) and low-level transforms (Section 3.2). We provide an example of two extensions, before discussing related work and concluding (Section 4).
2 The Magnolia Language
We will start by briefly introducing the parts of Magnolia that are necessary to understand the rest of the paper. Magnolia is designed as a general-purpose language, with an emphasis on abstraction and specification. Abstractions are described by concepts, which consist of abstract types, operations on the types, and axioms specifying the behaviour of the operations algebraically. Multiple implementations may be provided for each concept, and signature morphisms may be used to map between differences in concept and implementation.

Operations can be either procedures or functions. Procedures are allowed to update their parameters, and have no return values. Pure procedures only interact with the world through their parameters (e.g., no I/O or global data). Functions may not change their parameters, and are always pure – the only effect a function has is its return value, and it will always produce the same return value for the same arguments. Function applications form expressions, while procedure calls are statements. In addition, Magnolia has regular control-flow statements like if and while.

A novel feature (detailed in a previous paper [2]) is the special relationship between pure procedures and functions. Procedures may be called as if they were functions – the process of mutification turns expressions with calls to functionalised procedures into procedure call statements. An expression-oriented coding style is encouraged. Procedures are often preferred for performance reasons, while expressions with pure functions are easier to reason about and are also the preferred way of writing axioms.
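To illustrate the idea of mutification described above, here is a small Python sketch of the kind of rewrite involved: an assignment whose right-hand side calls a functionalised procedure is turned into a procedure-call statement with an explicit output parameter. The toy AST encoding and the mutify function are our own assumptions, not the Magnolia compiler's representation.

    # Toy illustration of mutification: rewrite  x = f(a, b)
    # into a procedure call  f(a, b, out x)  for functionalised procedures.
    # Statements are encoded as plain tuples here.

    FUNCTIONALISED = {"default"}   # assumed set of procedures usable as functions

    def mutify(stmt):
        # stmt: ("assign", target, ("call", name, args))
        if stmt[0] == "assign" and stmt[2][0] == "call" and stmt[2][1] in FUNCTIONALISED:
            _, target, (_, name, args) = stmt
            return ("proc_call", name, args + [("out", target)])
        return stmt

    print(mutify(("assign", "name",
                  ("call", "default", ["lookup(db,key)", '""', '"Lucy"']))))
    # -> ('proc_call', 'default', ['lookup(db,key)', '""', '"Lucy"', ('out', 'name')])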
3 Extending Magnolia
At least four types of useful extensions spring to mind:

1. Adding new operation-like constructs that look like normal functions or procedures, but for some reason cannot or should not be implemented that way – for example, because we need to bypass normal argument evaluation, or because some of the computation should be done at compile time. This
type of change has a local effect on the particular expressions or statements where the new constructs are used, and is similar to syntax macros in other systems.

2. Adding new syntax to the language, in order to make it more convenient to work with. We may also consider removing some of the default syntax. In Magnolia, this can be handled by extending the SDF2 grammar of the language.

3. Disabling features or adding extra semantic checks to existing language constructs. This can be used to enforce a particular coding style, to disable general-purpose features when making a DSL embedding, or to ensure that certain assumptions for aggressive optimisation hold.

4. Making non-local changes to the language – features requiring global analysis, or touching a wide selection of code. Cross-cutting concerns in aspect orientation are an example of this. We can implement this by extending the compiler with new transformations and storing context information across transformations.

In a syntax macro system, new constructs are introduced by giving a syntax pattern and a replacement (or expansion). In languages like Lisp or Scheme, the full power of the language itself is available to construct the expansion. For Magnolia, things are a bit more complicated, since the extension may pass through several stages of the compiler before it is replaced by lower-level constructs. We must therefore provide the various compiler stages with a description of how to deal with the language extension.

To provide syntax extensibility of the kind found in languages like Dylan, one could provide Magnolia syntax for syntax definition, then extract and compile the syntax definitions to SDF2, as used in the compiler. We will not consider this here, however. A full treatment of compiler extension in Magnolia is also beyond the scope of this paper; we will therefore focus on the macro-like operation patterns and briefly sketch the transform interface to compiler extension.

3.1 Operation Patterns
An operation pattern is a simple interface to language extension, similar to macros in Lisp or Scheme. Patterns are used in the same way as a normal procedure or function, but are implemented by instantiation with arbitrary code transformation. They are useful for things that need to process arguments differently from the normal semantics. The implementation of an operation pattern looks like a procedure or function definition, except that one or more of its parameters are meta-variables that take expression or statement terms, rather than values or variables. The argument terms and pattern body may be rewritten as desired by applying transforms to them (see the examples below). When the operation pattern is instantiated, meta-variables in the body are substituted, and any transformations are applied. The resulting code is inlined at the call site.

Meta-variables are typed and are distinguished from normal variables through the type system; thus it is not necessary to use anti-quotation to indicate where
meta-variables should be substituted. Operation patterns introduce a local scope, so local variables will not interfere with the call context.

The semantic properties (typing rules, data-flow rules, etc.) of an operation pattern are handled automatically by the compiler, and calls to operation patterns are treated the same as normal operation calls during type checking and overload resolution. This means that they can be overloaded alongside normal operations, and follow normal module scoping and visibility rules. Processing code with operation pattern calls requires some extra care, so that arguments that should be treated as code terms won't get rewritten or lifted out of the call.

Operation patterns can also conveniently serve as implementations of syntax extensions, by desugaring the syntax extension into a call to the pattern. For example, the following operation pattern implements a simple way to substitute a default value when an expression yields some error value:

  forall type T
  procedure default(T e, T f, expr T d, out T ret) {
    ret = e;
    if(ret == f) ret = d;
  }

Here, f is the failure value (null, for example), d is the default replacement, and e is the expression to be tested. Magnolia will automatically provide a function version of it:

  forall type T
  function T default(T e, T f, expr T d);

which we can use like:

  name = default(lookup(db,key), "", "Lucy");

We can describe the behaviour of default by axioms, for example:

  forall type T
  axiom default1(T e, T f, T d) {
    if(e == f) assert default(e, f, d) <-> d;
    if(e != f) assert default(e, f, d) <-> e;
    if(f == d) assert default(e, f, d) <-> e;
    if(f != d) assert default(e, f, d) != f;
  }

3.2 Transforms
For further processing of language extensions, we add a new meta-programming operation to Magnolia – the transform – corresponding to a rule or strategy in Stratego. Transforms work on the term representation of a program, taking at least one term plus possibly other values as arguments, and returning a replacement term. Provided semantic analysis has been done, term pattern matching in transforms is sensitive to typing, overloading and name scoping rules. A transform may call other transforms and operations, and may also manipulate symbol tables and other compiler state. Several transforms can share the same name; when applied, they are tried in arbitrary order until one succeeds. In addition to explicit calling, transforms can also be controlled through
Table 1. Transform classes: Topdown and bottomup traversals can be modified by repeat, once or frontier. The phase classes can be used to apply a transform before, during or after a particular compiler phase, or to trigger application of a compiler phase. Transforms can also be classified by use – for example, simplification transforms may be marked as such and used in many places in the compiler. The ac class can be used to reorder expressions for associative-commutative matching.

Traversals/modifiers:
  repeat     – can be used repeatedly
  once       – in traversal: apply only once
  frontier   – in traversal: stop on success
  topdown    – traversal type
  bottomup   – traversal type
  innermost  – innermost reduction
  outermost  – outermost reduction

Compiler phases:
  during(p)   – apply during p
  before(p)   – apply before p
  after(p)    – apply after p
  requires(p) – run p first
  triggers(p) – run p after

Uses:
  typecheck, simplify, mutify, ac
In addition to explicit calling, transforms can also be controlled through transform classes, which describe how and (possibly) when transforms should be applied. For example, a transform may have the classes innermost and during(desugar), signifying that it should be applied using an innermost strategy during the desugaring phase of the compiler. A sample transform is:

  forall int i1, int i2, int i3
  transform example(expr i1 * i2 + i3 * i2) [simplify,repeat]
    = (i1 + i3) * i2;

This example has a pattern with three meta-variables, i1, i2, i3, all of which will match only integer expressions. The expression pattern in the argument list will be matched against the code the transform is applied to, and will only match the integer versions of + and *. If the match is successful, the code is transformed to (i1 + i3) * i2. The transform classes simplify and repeat tell the compiler that this rule can be applied during program simplification, and that it will terminate if applied repeatedly. Table 1 shows a few different transform classes. Axioms, when used as rewrite rules, can also have classes assigned to them, making them usable as transforms [3]. Transforms can be applied directly in program code (most useful inside operation patterns). For example,

  var x = example(a * b + c * b);

will apply the above transform (the expression to the left is implicitly passed as the first parameter) and rewrite the code to:

  var x = (a + c) * b;

The double-bracket operator [[...]] can be used to apply inline rewrite rules, and to specify traversals – we will see examples of this later.
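As an illustration of what such a rule does at the term level, the rewrite can be sketched in ordinary code over a small tuple-based expression representation (a sketch of our own, not Magnolia's implementation; it ignores the typing and overload awareness that Magnolia's matching provides):

  # terms are ("+", l, r), ("*", l, r) or leaf names
  def example(term):
      """i1 * i2 + i3 * i2  ->  (i1 + i3) * i2, when the pattern matches."""
      if (isinstance(term, tuple) and term[0] == "+"
              and isinstance(term[1], tuple) and term[1][0] == "*"
              and isinstance(term[2], tuple) and term[2][0] == "*"
              and term[1][2] == term[2][2]):               # shared factor i2
          i1, i2, i3 = term[1][1], term[1][2], term[2][1]
          return ("*", ("+", i1, i3), i2)
      return term                                          # no match: leave the term unchanged

  print(example(("+", ("*", "a", "b"), ("*", "c", "b"))))  # ('*', ('+', 'a', 'c'), 'b')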
3.3 Semantic Rules
Semantic analysis rules are described by the typecheck transform, which takes a statement, expression or declaration as argument, and returns a resolved version
of its argument – and its type, in the case of an expression. Resolving means annotating each use of an abstraction with a unique identifier that leads back to its declaration – this is typically taken care of internally in the compiler. Type checking of a declaration will typically involve adding declarations to the symbol table; type checking other constructs is typically a simple case of recursively type checking sub-constructs. A (simplified) typecheck rule for assignment statements is:

  forall name x, expr e
  transform typecheck(stat{x = e;}) = stat{x = e';}
  where {
    var (e', t) = typecheck(e);
    if(!compatible(typeof(x), t))
      call fail("Incompatible types in assignment");
  }

Note that type checking might be better described by more formal semantic rules, which could be used as a basis for reasoning about type checking and programs. This is an option we are exploring. Axioms [3] can describe the abstract semantics of a construct. This is only applicable to expression-like constructs at the moment; we should also have a way of describing other constructs. Implementation rules are used to compile constructs to lower-level code. Instantiation rules are triggered during semantic analysis; they receive the unique id of the abstraction and the use case, and produce an instantiated version. Other implementation rules are free-form and should be tied to a program traversal strategy and compiler phase. No effort is made on the part of the compiler to ensure that implementation rules do not leave behind uncompiled constructs, though we are looking at techniques that can handle this [4]. Other compiler phases may also need rules – for example, data-flow analysis and program slicing require information about which variables are read and written in a statement – and the readset and writeset transforms are used for this purpose. Transforms may also be provided for mapping between statement and expression forms. By keeping track of semantic information, we can make more powerful extensions. For example, with the following extended version of default, a failure value is no longer needed – it is obtained automatically from a function declaration attribute:

  forall type T
  function T default(expr T e, expr T d)
    = default(e, getAttr("fail_value", e), d);
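Returning to the typecheck rule for assignments above, its shape is easy to mirror in ordinary code. The following Python sketch (ours, with a deliberately toy notion of types and environments) resolves the right-hand side, checks compatibility and fails otherwise, just as the rule does:

  def typecheck_expr(e, env):
      # hypothetical resolver: integer literals are ints, names get their declared type
      if isinstance(e, int):
          return e, "int"
      return e, env[e]

  def compatible(t1, t2):
      return t1 == t2

  def typecheck_assign(x, e, env):
      e_resolved, t = typecheck_expr(e, env)      # var (e', t) = typecheck(e)
      if not compatible(env[x], t):               # if(!compatible(typeof(x), t))
          raise TypeError("Incompatible types in assignment")
      return ("assign", x, e_resolved)            # stat{x = e';}

  env = {"x": "int", "s": "string"}
  typecheck_assign("x", 42, env)                  # accepted
  # typecheck_assign("s", 42, env)                # would raise TypeError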
3.4 Module-Level and Global Extensions
Language extension should normally be done at the module level, so that some modules in your program may use the extension, and others won’t. For example, if your extension defines a restricted subset of Magnolia with some DSL features, you probably still want the compiler to process Magnolia libraries as if they were written in normal Magnolia. Therefore, Magnolia extensions have scope:
– The names of transforms and operation patterns are accessible in the module in which they are defined and in modules that import them, just as with other operations.
– Transforms are normally applied to the whole program. Semantically aware term pattern matching ensures that only relevant parts of the code are touched, not code that merely looks similar to what is described by the pattern.
– For syntax extensions and language-changing transforms that should only be applied to certain modules, there is a language declaration in the module header that can be used to import extension modules. Transforms imported via language are only applied to the local module.
3.5 Example Extensions
We will give two example extensions: one which uses transforms to enforce a restriction on the language, and one which uses operation patterns to add a map construct.

Impure procedures are ones that violate the assumption that two calls with equivalent inputs give equivalent results. I/O is typically impure; a random generator that keeps track of its seed would also be impure. Since pure code is easier to reason about, we might want to have a sub-language of Magnolia where calls to impure code are forbidden. We implement this in a module pure, which is used by putting language pure in the module header of pure modules. Our language module contains the following transform:

  transform purity(stat{call p(_*)}) [after(typecheck)]
  where
    if(getAttr("impure", p))
      call error("In call to ", p, " -- impure calls forbidden");

The transform purity will be applied to the code in all language pure modules after type checking is done (since the type checker might be used to infer impurity), and will match procedure calls. If the called procedure has the impure attribute, a compiler error is triggered.

The map operation applies an operation element-wise to the elements of one or more indexable data structures (arrays, for example). Our map works on multiple indexables at the same time (like Lisp's mapcar), without the overhead of dealing with a list of indexables at runtime. For example,

  A = map(@A * @B + @C);  // map *,+ over elements of A, B, C
  A = map(@A * 5);        // multiply all elements of A by 5
  A = map(@A * V + @C);   // V is indexable, but used as-is

While map in Lisp and functional languages traditionally takes a function (or lambda expression) and one or more lists as arguments, we will instead integrate everything as one argument, making it look more like a list comprehension. Indexables marked with an @-sign are those that should be processed element-wise. The @ is just a dummy operator, defined as:

  forall type A, type I, type E where Indexable(A, I, E)
  function E @_(A a);
This function is generic in E (element type), A (indexable/array type) and I (index type) – together, these must satisfy the Indexable concept. Applying the @-operator outside a map operation will lead to a compilation error – this should ideally be checked for and reported in a user-friendly manner. A generic implementation of map is:

  forall type A, type I, type E where Indexable(A, I, E)
  procedure map(expr E e, out A a) {
    // define index space as minimum of input index spaces
    var idxSpace = min(e[[collect,frontier: @x:A -> indexes(x)]]);
    call create(a, idxSpace);   // create output array
    for i in indexes(a) {       // do computation
      a[i] = e[[topdown,frontier: @x:A -> x[i]]];
    }
  }

The implementation accepts an expression e (of the element type) and an output array a. The body of map is the pattern for doing maps, and this will be instantiated for each expression it is called with by substituting meta-variables and optionally performing transformations. Note that the statements in the pattern are not meta-level code, but templates to be instantiated. The [[...]] parts are transformations which are applied to e – the result is integrated into the code, as if it had been written by hand. The first transformation uses a collect traversal, which collects a list of the indexables, rewriting them to expressions which compute their index spaces on the way. This is used in creating the output array. The computation itself is done by iterating over the index space, and computing the expression while indexing the @-marked indexables of type A. The frontier traversal modifier prevents the traversal from recursing into an expression marked with @ – in case we have nested maps. As an example of map, consider the following:

  Z = map(@X * 5 + @Y);

where X and Y are of type array(int). Here map is used as a function – the compiler will mutify the expression, obtaining:

  call map(@X * 5 + @Y, Z);

At this point we can instantiate it and replace the call, giving

  var idxSpace = min([indexes(X), indexes(Y)]);
  call create(Z, idxSpace);
  for i in indexes(Z) {
    Z[i] = X[i] * 5 + Y[i];
  }

which will be inlined directly at the call site. Now that we have gone to the trouble of creating an abstraction for element-wise operations, we would expect there to be some benefit to it, over just writing for-loop code. Apart from the code simplification at the call site, and the fact that we can use map in expressions, we can also give the compiler more information about it. For example, the following axiom neatly sums up the behaviour of map:
  forall type A, type I, type E where Indexable(A, I, E)
  axiom mapidx(expr E e, I i) {
    map(e)[i] <-> e[[topdown,frontier: @x:A -> x[i]]];
  }

That is, applying map and then indexing the result is the same as just indexing the indexables directly and computing the map expression. Furthermore, we can also easily do optimisations like map/map fusion and map/fold fusion, without the analysis needed to perform loop fusion.
4 Conclusion
There is a wealth of existing research in language extension [5,6,7] and extensible compilers [8,9], and little space for a comprehensive discussion here. Lisp dialects like Common Lisp [10] and Scheme [11] come with powerful macro facilities that are used effectively by programmers. The simple syntax gives macros a feel of being part of the language, and avoids issues with syntactic extensions. C++ templates are often used for meta-programming, where techniques such as expression templates [12] allow for features such as the map operation described in Section 3.5 (though the implementation is a lot more complicated). Template Haskell [13] provides meta-programming for Haskell. Code can be turned into an abstract syntax tree using quasi-quotation and processed by Haskell code before being spliced back into the program and compiled normally. Template Haskell also supports querying the compiler's symbol tables. MetaBorg [14] provides syntax extensions based on Stratego/XT. Syntax extension is done with the modular SDF2 system, and the extensions are desugared ("assimilated") into the base language using concrete syntax rules in Stratego. Andersen and Brabrand [4] describe a safe and efficient way of implementing some types of language extensions using catamorphisms that map to simpler language constructs, and an algebra for composing languages. We have started implementing this as a way of desugaring syntax extensions. We aim to deal with semantic extension rather than just the syntactic extension provided by macros. We do this by ensuring that transformations obey overloading and name resolution, by allowing extension of arbitrary compiler phases, and by allowing the abstract semantics of new abstractions to be described by axioms. The language XL [15] provides a typed, macro-like facility with access to static semantic information – somewhat similar to operation patterns in Magnolia. In this paper we have discussed how to describe language extensions and presented extension facilities for the Magnolia language, with support for static semantic checking and scoping. The facilities include macro-like operation patterns and transforms, which can perform arbitrary transformations of code. Transforms can be linked into the compiler at different stages in order to implement extensions by transforming extended code to lower-level code. The static semantics of extensions can be given by hooking transforms into the semantic analysis phase of the compiler.
A natural next step is to try and implement as much of Magnolia as possible as extensions to a simple core language. This will give a good feel for what abstractions are needed to implement full-featured extensions, and also entails building a mature implementation of the extension facility – currently we are more in the prototype stage. There are also many details to be worked out, such as a clearer separation between code patterns, variables and transformation code, name capture / hygiene issues, and so on. The Magnolia compiler is available at http://magnolia-lang.org/. Acknowledgements. Thanks to Magne Haveraaen and Valentin David for input on the Magnolia compiler, and to Karl Trygve Kalleberg and Eelco Visser for inspiration and many discussions in the early phases of this research.
References

1. Bravenboer, M., Kalleberg, K.T., Vermaas, R., Visser, E.: Stratego/XT 0.17. A language and toolset for program transformation. Science of Computer Programming 72(1-2), 52–70 (2008)
2. Bagge, A.H., Haveraaen, M.: Interfacing concepts: Why declaration style shouldn't matter. In: LDTA 2009. ENTCS, York, UK (March 2009)
3. Bagge, A.H., Haveraaen, M.: Axiom-based transformations: Optimisation and testing. In: LDTA 2008, Budapest. ENTCS, vol. 238, pp. 17–33. Elsevier, Amsterdam (2009)
4. Andersen, J., Brabrand, C.: Syntactic language extension via an algebra of languages and transformations. In: LDTA 2009. ENTCS, York, UK (March 2009)
5. Brabrand, C., Schwartzbach, M.I.: Growing languages with metamorphic syntax macros. In: PEPM 2002, pp. 31–40. ACM, New York (2002)
6. Standish, T.A.: Extensibility in programming language design. SIGPLAN Not. 10(7), 18–21 (1975)
7. Wilson, G.V.: Extensible programming for the 21st century. Queue 2(9), 48–57 (2005)
8. Nystrom, N., Clarkson, M.R., Myers, A.C.: Polyglot: An extensible compiler framework for Java. In: Hedin, G. (ed.) CC 2003. LNCS, vol. 2622, pp. 138–152. Springer, Heidelberg (2003)
9. Ekman, T., Hedin, G.: The JastAdd extensible Java compiler. In: OOPSLA 2007, pp. 1–18. ACM, New York (2007)
10. Graham, P.: Common LISP macros. AI Expert 3(3), 42–53 (1987)
11. Dybvig, R.K., Hieb, R., Bruggeman, C.: Syntactic abstraction in Scheme. Lisp Symb. Comput. 5(4), 295–326 (1992)
12. Veldhuizen, T.L.: Expression templates. C++ Report 7(5), 26–31 (1995); Reprinted in C++ Gems, ed. Stanley Lippman
13. Sheard, T., Jones, S.P.: Template meta-programming for Haskell. In: Haskell 2002, pp. 1–16. ACM, New York (2002)
14. Bravenboer, M., Visser, E.: Concrete syntax for objects: domain-specific language embedding and assimilation without restrictions. In: OOPSLA 2004, pp. 365–383. ACM Press, New York (2004)
15. Maddox, W.: Semantically-sensitive macroprocessing. Technical Report UCB/CSD 89/545, Computer Science Division (EECS), University of California, Berkeley, CA (1989)
Model Transformation Languages Relying on Models as ADTs

Jerónimo Irazábal and Claudia Pons
LIFIA, Facultad de Informática, Universidad Nacional de La Plata, Buenos Aires, Argentina
{jirazabal,cpons}@lifia.info.unlp.edu.ar
Abstract. In this paper we describe a simple formal approach that can be used to support the definition and implementation of model-to-model transformations. The approach is based on the idea that models as well as metamodels should be regarded as abstract data types (ADTs), that is to say, as abstract structures equipped with a set of operations. On top of these ADTs we define a minimal, imperative model transformation language with strong formal semantics. This proposal can be used in two different ways: on the one hand, it enables simple transformations to be implemented simply by writing them in any ordinary programming language enriched with the ADTs; on the other hand, it provides a practical way to formally define the semantics of more complex model transformation languages.

Keywords: Model driven software engineering, Model transformation language, Denotational semantics, Abstract data types, ATL.
learning time that cannot be afforded in most projects; on the other hand, considerable investment in new tools and development environments is necessary. And finally, the semantics of these specific languages is not formally defined and thus the user is forced to learn such semantics by running transformation example suites within a given tool. Unfortunately, in many cases the interpretation of a single syntactic construct varies from tool to tool. Additionally, other model engineering instruments, such as mechanisms for transformation analysis and optimization, can only be built on the basis of a formal semantics for the transformation language; therefore, a formal semantics should be provided. To overcome these problems, in this paper we describe a minimal, imperative approach with strong formal semantics that can be used to support the definition and implementation of practical transformations. This approach is based on the idea of using "models as abstract data types" as the basis to support the development of model transformations. Specifically, we formalize models and metamodels as abstract mathematical structures equipped with a set of operations. The use of this approach enables transformations to be implemented in a simpler way by applying any ordinary imperative programming language enriched with the ADTs, thus avoiding the need for a full model transformation platform and/or for learning a new programming paradigm. Additionally, the meaning of the transformation language expressions is formally defined, enabling the validation of transformation programs. Complementarily, this approach offers an intermediate abstraction level which provides a practical way to formally define the semantics of higher-level model transformation languages. The paper is organized as follows. Section 2 provides the formal characterization of models and metamodels as abstract mathematical structures equipped with a set of operations. These mathematical objects are used in Section 3 for defining the semantics of a basic transformation language. Section 4 illustrates the use of the approach to solve a simple transformation problem, while Section 5 shows the application of the approach to support complex transformation languages (in particular ATL). Section 6 compares this approach with related research and Section 7 ends with the conclusions.
2 Model Transformation Languages with ADTs

A model transformation is a program that takes as input a model element and provides as output another model element. Thinking about the development of this kind of program, there are a number of alternative ways to accomplish the task. A very basic approach would be to write an ordinary program containing a mix of loops and if statements that explore the input model and create elements for the output model where appropriate. Such an approach would be widely regarded as a bad solution and it would be very difficult to maintain. An approach situated at the other extreme of the transformation language spectrum would be to rely on a very high-level declarative language specially designed to write model transformations (e.g. QVT Relations [2]). With this kind of language we would write the 'what' of the transformation without writing the 'how'. Thus, neither the concrete mechanism to explore the input model nor the one to create the output model is exposed in the program. Such an approach is very elegant and concise, but the
meaning of the expressions composing these high-level languages becomes less intuitive and consequently hard to understand. In addition, the implementation of a heavyweight supporting framework is required (e.g. the MediniQVT supporting tools [6]). A better solution, from a programming perspective, would be to build an intermediate abstraction level. We can achieve this goal by making use of abstract data types to structure the source and target models. This solution provides a controlled way to traverse a source model, and a reasonable means to structure the code for generating an output model. With this solution we would raise the abstraction level of transformation programs written in an ordinary programming language, while still keeping control of the model manipulation mechanisms. Additionally, we do not need to use a new language for writing model transformations, since any ordinary programming language would be sufficient. Towards the adoption of the latter alternative, in this section we formally define the concepts of model and metamodel as Abstract Data Types (ADTs), that is to say, as abstract structures equipped with a set of operations.

Definition 1: A metamodel is a structure mm = (C, A, R, s, a, r) where C, A and R are the sets of classes, attributes and references respectively; s is an anti-symmetric relation over C interpreted as the superclass relation; a maps each attribute to the class it belongs to; and r maps each reference to its source and target classes.

For example, a simplified version of the Relational Data Base metamodel is defined as MMRDB = (C, A, R, s, a, r), where:

  C = {Table, Column, ForeignKey}
  A = {nameTable, nameColumn}
  R = {columnsTable2Column, primaryKeyTable2Column, foreignKeysTable2ForeignKey, tableForeignKey2Table}
  s = {}
  a = {(nameTable, Table), (nameColumn, Column)}
  r = {(columnsTable2Column, (Table, Column)), (primaryKeyTable2Column, (Table, Column)),
       (foreignKeysTable2ForeignKey, (Table, ForeignKey)), (tableForeignKey2Table, (ForeignKey, Table))}

The usual way to depict a metamodel is by drawing a set of labeled boxes connected by labeled lines; however, the concrete appearance of the metamodel is not relevant for our purposes. Figure 1 shows the simplified metamodel of the Relational Data Base language.
Fig. 1. The metamodel of the Relational Data Base language
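To connect Definition 1 with the claim that an ordinary programming language enriched with the ADTs suffices, here is a minimal Python rendering of the metamodel structure (our own sketch, not part of the proposal), instantiated with the MMRDB example:

  from typing import NamedTuple, Dict, Set, Tuple

  class Metamodel(NamedTuple):
      C: Set[str]                    # classes
      A: Set[str]                    # attributes
      R: Set[str]                    # references
      s: Set[Tuple[str, str]]        # superclass relation over C
      a: Dict[str, str]              # attribute -> owning class
      r: Dict[str, Tuple[str, str]]  # reference -> (source class, target class)

  MMRDB = Metamodel(
      C={"Table", "Column", "ForeignKey"},
      A={"nameTable", "nameColumn"},
      R={"columnsTable2Column", "primaryKeyTable2Column",
         "foreignKeysTable2ForeignKey", "tableForeignKey2Table"},
      s=set(),
      a={"nameTable": "Table", "nameColumn": "Column"},
      r={"columnsTable2Column": ("Table", "Column"),
         "primaryKeyTable2Column": ("Table", "Column"),
         "foreignKeysTable2ForeignKey": ("Table", "ForeignKey"),
         "tableForeignKey2Table": ("ForeignKey", "Table")},
  )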
For the sake of simplicity, we assume single-valued, single-typed attributes and references without cardinality specification. The previous metamodel definition could easily be extended to support multi-valued, multi-typed attributes and to allow the specification of reference cardinality; however, in this paper those features would only complicate our definitions, hindering the understanding of the core concepts.

Definition 2: A model is a structure m = (C, A, R, s, a, r, E, c, va, vr) where mm = (C, A, R, s, a, r) is a metamodel, E is the set of model elements, c maps each element to the class it belongs to, va applied to an attribute and an element returns the value of that attribute (or bottom, if the attribute is undefined), and vr applied to a reference and an element returns the set of elements connected at the opposite end of the reference. In that case, we say that model m is an instance of metamodel mm. When the metamodel is obvious from the context we can omit it in the model structure.

For example, let m = (C, A, R, s, a, r, E, c, va, vr) be an instance of the MMRDB metamodel, where mm = (C, A, R, s, a, r) is the metamodel defined above and:

  E = {Book, Author, nameBook, editorialBook, authorsBook2Author, nameAuthor}
  c = {(Book, Table), (Author, Table), (nameBook, Column), (editorialBook, Column),
       (authorsBook2Author, ForeignKey), (nameAuthor, Column)}
  va = {((nameTable, Book), Book), ((nameTable, Author), Author)}
  vr = {((columnsTable2Column, Book), {nameBook, editorialBook}),
        ((columnsTable2Column, Author), {nameAuthor}),
        ((primaryKeyTable2Column, Book), {nameBook}),
        ((primaryKeyTable2Column, Author), {nameAuthor}),
        ((foreignKeysTable2ForeignKey, Book), {authorsBook2Author}),
        ((tableForeignKey2Table, authorsBook2Author), {Book})}

Figure 2 illustrates this instance of the MMRDB metamodel in a generic graphical way. The concrete syntax of models is not relevant here.
Fig. 2. An instance of the MMRDB metamodel
After defining the abstract structure of models and metamodels we are ready to define a set of operations on such structure. These operations complete the definition of the Abstract Data Type. Let M be the set of models and let MM be the set of metamodels, as defined above. The following functions are defined:

(1) The function metamodel() returns the metamodel of the input model.
  metamodel: M → MM
  metamodel (C, A, R, s, a, r, E, c, va, vr) = (C, A, R, s, a, r)

(2) The function classOf() returns the metaclass of the input model element in the context of a given model.
  classOf: E → M → C
  classOf e (C, A, R, s, a, r, E, c, va, vr) = c(e)
(3) The function elementsOf() returns all the instances of the input class in the context of a given model. Instances are obtained by applying the inverse of the function c.
  elementsOf: C → M → P(E)
  elementsOf c' (C, A, R, s, a, r, E, c, va, vr) = c⁻¹(c')

(4) The function new() creates a new instance of the input class and inserts it into the input model.
  new: C → M → E × M
  new c' (C, A, R, s, a, r, E, c, va, vr) = (e, (C, A, R, s, a, r, E ∪ {e}, c[e ← c'], va, vr)), with e ∉ E

(5) The function delete() eliminates the input element from the input model.
  delete: E → M → M
  delete e (C, A, R, s, a, r, E, c, va, vr) = (C, A, R, s, a, r, E', c', va', vr'), with
    E' = E - {e},
    c' = c - {(e, c(e))},
    va' = va - {(a,(e',n)) | e = e' ∧ (a,(e',n)) ∈ va},
    vr' = vr - {(r,(e',es)) | e = e' ∧ (r,(e',es)) ∈ vr}

(6) The function getAttribute() returns the value of the input attribute in the input element belonging to the input model.
  getAttribute: A → E → M → Z⊥
  getAttribute a e (C, A, R, s, a, r, E, c, va, vr) = va(a)(e)

(7) The function setAttribute() returns an output model resulting from modifying the value of the input attribute in the input element of the input model.
  setAttribute: A → E → Z⊥ → M → M
  setAttribute a e n (C, A, R, s, a, r, E, c, va, vr) =
    (C, A, R, s, a, r, E, c, va[a ← va(a)[e ← n]], vr), if (a, c(e)) ∈ a
    (C, A, R, s, a, r, E, c, va, vr),                    if (a, c(e)) ∉ a

(8) The function getReferences() returns the set of elements connected to the input element by the input reference in the input model.
  getReferences: R → E → M → P(E)
  getReferences r e (C, A, R, s, a, r, E, c, va, vr) = vr(r)(e)

(9) The function addReference() returns an output model resulting from adding a new reference (between the two input elements) to the input model.
  addReference: R → E → E → M → M
  addReference r e e' (C, A, R, s, a, r, E, c, va, vr) =
    (C, A, R, s, a, r, E, c, va, vr ∪ {(r,(e,e'))}), if (r, (c(e), c(e'))) ∈ r
    (C, A, R, s, a, r, E, c, va, vr),                if (r, (c(e), c(e'))) ∉ r

(10) The function removeReference() returns an output model resulting from deleting the input reference between the two input elements from the input model.
  removeReference: R → E → E → M → M
  removeReference r e e' (C, A, R, s, a, r, E, c, va, vr) = (C, A, R, s, a, r, E, c, va, vr - {(r,(e,e'))})

The remaining functions (e.g. similar functions, but at the metamodel level) are omitted in this paper due to space limitations.
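Continuing the Metamodel sketch given after Figure 1, a possible Python rendering of the model ADT and a subset of its ten operations could look as follows (again our own illustration, not the paper's notation; None plays the role of bottom):

  import itertools

  class Model:
      _ids = itertools.count()

      def __init__(self, metamodel):
          self.metamodel = metamodel   # mm = (C, A, R, s, a, r)
          self.c = {}                  # element -> class                  (c)
          self.va = {}                 # (attribute, element) -> value     (va)
          self.vr = {}                 # (reference, element) -> elements  (vr)

      def class_of(self, e):           # (2) classOf
          return self.c[e]

      def elements_of(self, cls):      # (3) elementsOf: inverse image of c
          return [e for e, k in self.c.items() if k == cls]

      def new(self, cls):              # (4) new: fresh element of class cls
          e = "e%d" % next(Model._ids)
          self.c[e] = cls
          return e

      def delete(self, e):             # (5) delete: drop e and its slots
          self.c.pop(e, None)
          self.va = {k: v for k, v in self.va.items() if k[1] != e}
          self.vr = {k: v for k, v in self.vr.items() if k[1] != e}

      def get_attribute(self, a, e):   # (6) getAttribute
          return self.va.get((a, e))

      def set_attribute(self, a, e, n):    # (7) setAttribute, guarded by the metamodel
          if self.metamodel.a.get(a) == self.c[e]:
              self.va[(a, e)] = n

      def get_references(self, r, e):  # (8) getReferences
          return self.vr.get((r, e), set())

      def add_reference(self, r, e, e2):   # (9) addReference, guarded by the metamodel
          if self.metamodel.r.get(r) == (self.c[e], self.c[e2]):
              self.vr.setdefault((r, e), set()).add(e2)

Note that, unlike the mathematical definitions, this sketch updates the model in place; a purely functional version returning fresh structures would match the equations more closely.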
3 A Simple Yet Powerful Imperative Transformation Language

In this section we define SITL, a simple imperative transformation language that supports model manipulation. This language is built on top of a very simple imperative language with assignment commands, sequential composition, conditionals, and finite iterative commands. As a direct consequence, this language has a very intuitive semantics determined by its imperative constructions and by the underlying model ADT. This language is not intended to be used to write model transformation programs; rather, it is proposed as a representation of the minimal set of syntactic constructs that any imperative programming language must provide in order to support model transformations. In practice we will provide several concrete implementations of SITL. Each concrete implementation consists of two elements: an implementation of the ADTs and a mapping from the syntactic constructs of SITL to syntactic constructs of the concrete language.

3.1 Syntax

The abstract syntax of SITL is described by the following abstract grammar:

  <intexp> ::= null | 0 | 1 | 2 | … | <intvar> | - <intexp> | <intexp> + <intexp> | <intexp> - <intexp>
             | <intexp> * <intexp> | <intexp> ÷ <intexp> | <elemexp> . <attrexp> | size <elemlistexp>
  <boolexp> ::= true | false | <intexp> = <intexp> | <intexp> < <intexp> | <intexp> > <intexp>
             | ¬ <boolexp> | <boolexp> ∧ <boolexp> | <boolexp> ∨ <boolexp>
             | <elemexp> = <elemexp> | contains <elemlistexp> <elemexp>
  <modelexp> ::= m1 | m2 | ...
  <classexp> ::= c1 | c2 | … | classof <elemexp>
  <attrexp> ::= a1 | a2 | …
  <refexp> ::= r1 | r2 | …
  <elemexp> ::= <elemvar> | <elemlistexp> (<intexp>)
  <elemlistexp> ::= elementsOfClass <classexp> inModel <modelexp> | <elemlistvar> | <elemexp> . <refexp>
  <comm> ::= <intvar> := <intexp> in <comm> | <elemvar> := <elemexp> in <comm>
           | <elemlistvar> := <elemlistexp> in <comm> | <comm> ; <comm> | skip
           | if <boolexp> then <comm> else <comm> | for <intvar> from <intexp> to <intexp> do <comm>
           | add <elemexp> to <elemlistvar> | remove <elemexp> from <elemlistvar>
           | <elemexp> . <attrexp> := <intexp> | addRef <refexp> <elemexp> <elemexp>
           | removeRef <refexp> <elemexp> <elemexp>
           | forEachElem <elemvar> in <elemlistexp> where <boolexp> do <comm>
           | newElem <elemvar> ofclass <classexp> inModel <modelexp> | deleteElem <elemexp>
  <procD> ::= proc <procname> (<procparams>) beginproc <comm> endproc | <procD> ; <procD>
  <procC> ::= <comm> | call <procname> (actualparams) | <procC> ; <procC>
  <program> ::= <procD> <procC>

Currently, we consider three types of variables: integer variables, element variables, and element-list variables. It is worth remarking that SITL is limited to finite programs; we argue that model-to-model transformations should be finite, so this feature is not restrictive at all.

Denotational Semantics

The semantics of SITL is defined in the standard way [7]; we define semantic functions that map expressions of the language into the meaning that these expressions denote. The usual denotation of a program is a state transformer. In the case of SITL each state holds the current value of each variable and a list of the models that can be manipulated. More formally, a program state is a structure σ = (σM, σEM, σE, σEs, σZ) where σM is a list of models, σEM maps each element to the model it belongs to, σE maps element variables to elements, σEs maps element-list variables to lists of elements, and σZ maps integer variables to their integer value or to bottom. Let Σ denote the set of program states; the semantic functions have the following signatures:

  [[-]]intexp      : <intexp> → Σ → Z⊥
  [[-]]boolexp     : <boolexp> → Σ → B
  [[-]]modelexp    : <modelexp> → Σ → M⊥
  [[-]]classexp    : <classexp> → Σ → C
  [[-]]elemexp     : <elemexp> → Σ → E
  [[-]]elemlistexp : <elemlistexp> → Σ → [E]
  [[-]]attrexp     : <attrexp> → Σ → A
  [[-]]refexp      : <refexp> → Σ → R
  [[-]]comm        : <comm> → Σ → Σ

Then, we define these functions by semantic equations. The semantic equations for integer expressions and Boolean expressions are largely omitted, as are some equations related to well-understood constructs such as conditionals, sequences of commands and procedure calls. For the following equations let σ = (σM, σEM, σE, σEs, σZ) ∈ Σ:

− Equations for integer expressions
  [[null]]intexp σ = ⊥
  [[e . a]]intexp σ = getAttribute ([[a]]attrexp σ) ([[e]]elemexp σ) (σM (σEM ([[e]]elemexp σ)))

− Equations for class expressions
  [[classof e]]classexp σ = classOf ([[e]]elemexp σ) (σM (σEM ([[e]]elemexp σ)))

− Equations for element expressions
  [[ex]]elemexp σ = σE (ex)

− Equations for element list expressions
  [[elementsOfClass c inModel m]]elemlistexp σ = elementsOf ([[c]]classexp σ) (σM ([[m]]modelexp σ))
  [[esx]]elemlistexp σ = σEs (esx)
  [[e . r]]elemlistexp σ = getReferences ([[r]]refexp σ) ([[e]]elemexp σ) (σM (σEM ([[e]]elemexp σ)))

− Equations for commands
  [[x := ie]]comm σ = (σM, σEM, σE, σEs, σZ[x ← ([[ie]]intexp σ)])
  [[ex := ee]]comm σ = (σM, σEM, σE[ex ← ([[ee]]elemexp σ)], σEs, σZ)
  [[e . a := ie]]comm σ = setAttribute ([[a]]attrexp σ) ([[e]]elemexp σ) ([[ie]]intexp σ) (σM (σEM ([[e]]elemexp σ)))
  [[newElem ex ofclass c inModel m]]comm σ = (σM', σEM', σE', σEs, σZ)
    with im = [[m]]modelexp σ, (e, m') = new ([[c]]classexp σ) (σM (im)),
         σM' = σM[im ← m'], σE' = σE[ex ← e], σEM' = σEM[e ← im]
  [[deleteElem e]]comm σ = (σM', σEM', σE, σEs, σZ)
    with e' = [[e]]elemexp σ, im = σEM e', m' = delete e' (σM im),
         σM' = σM[im ← m'], σEM' = σEM[e' ← im]
  [[for x from ie1 to ie2 do c]]comm σ = iSec ([[ie1]]intexp σ) ([[ie2]]intexp σ) x c σ
    iSec n m x c σ = σ, if n > m
    iSec n m x c σ = iSec (n+1) m x c ([[c]]comm (σM, σEM, σE, σEs, σZ[x ← n])), if n ≤ m
  [[forEachElem ex in es where b do c]]comm σ = eSec ([[es]]elemlistexp σ) ex b c σ
    eSec es ex b c σ = σ, if es = ∅
    eSec es ex b c σ = eSec es' ex b c σ'', if es ≠ ∅
      with es = e:es', σ' = (σM, σEM, σE[ex ← e], σEs, σZ),
           σ'' = [[c]]comm σ', if [[b]]boolexp σ', and σ'' = σ', if not [[b]]boolexp σ'

By applying these definitions we are able to prove whether two programs (i.e. transformations) are equivalent.

Definition 3: Two programs t and t' are equivalent if and only if ([[t]]comm σ) σM = ([[t']]comm σ) σM, for all σ ∈ Σ.

Note that this definition does not take the values of variables into consideration, so two programs using different sets of internal variables can still be equivalent. Equivalence is defined considering only the input and output models (observable equivalence).
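The denotational reading – a command denotes a state transformer – maps directly onto ordinary code. The following Python sketch (ours; the state is collapsed into a single dictionary for brevity) mirrors the iSec and eSec helpers above:

  from typing import Any, Callable, Dict, List

  State = Dict[str, Any]              # stands in for σ = (σM, σEM, σE, σEs, σZ)
  Comm = Callable[[State], State]     # [[c]]comm
  BoolExp = Callable[[State], bool]   # [[b]]boolexp

  def for_loop(x: str, n: int, m: int, c: Comm) -> Comm:
      # [[for x from n to m do c]] via iSec
      def run(sigma: State) -> State:
          for i in range(n, m + 1):           # iterate while n <= m
              sigma = c({**sigma, x: i})      # bind the loop variable, run the body
          return sigma
      return run

  def for_each_elem(ex: str, es: List[Any], b: BoolExp, c: Comm) -> Comm:
      # [[forEachElem ex in es where b do c]] via eSec
      def run(sigma: State) -> State:
          for e in es:                        # walk the element list
              sigma_prime = {**sigma, ex: e}  # σ' binds the element variable
              sigma = c(sigma_prime) if b(sigma_prime) else sigma_prime
          return sigma
      return run

A concrete implementation of SITL, as mentioned in Section 3, essentially amounts to providing such state transformers for each construct on top of the model ADT.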
4 A Simple Example

Let mm be the metamodel defined in Section 2; let m1 be an instance of mm and m2 be the empty instance of mm. The following SITL program, when applied to a state containing both the model m1 and the model m2, will populate m2 with the tables in m1; none of the columns, primary keys or foreign keys will be copied to m2.

  forEachElem t in (elementsOfClass Table inModel m1) where true do
    newElem t' ofClass Table inModel m2;
    t'.name = t.name;
The resulting model is m2 = (E, c, va, vr) where E = {Book, Author}, c = {(Book, Table), (Author, Table)}, va = {((nameTable, Book), Book), ((nameTable, Author), Author)}, and vr = ∅. A formal proof of the correctness of this transformation can be written in a straightforward way by using the semantics definition of SITL.
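The same transformation written directly against the model ADT in an ordinary language is equally short. A Python sketch of ours, assuming the Metamodel/Model sketches given in Section 2:

  def copy_tables(m1, m2):
      # forEachElem t in (elementsOfClass Table inModel m1) where true do ...
      for t in m1.elements_of("Table"):
          t2 = m2.new("Table")                # newElem t' ofClass Table inModel m2
          m2.set_attribute("nameTable", t2,   # t'.name = t.name
                           m1.get_attribute("nameTable", t))

This is exactly the mode of use the approach is intended to enable: the ADT supplies the traversal and creation operations, and the host language supplies the control flow.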
5 Encoding ATL in SITL

Because SITL is situated midway between ordinary programming languages and transformation-specific languages, this intermediate abstraction level makes it suitable for defining the semantics of more complex transformation languages. To show an example, in this section we sketch how to encode ATL in SITL. Each ATL rule is encoded into a SITL procedure. Let us consider the following simple rule template in ATL:

  module m
  from m1: MM1
  to m2: MM2
  rule rule_name {
    from in_var1 : in_class1!MM1 (condition1),
         …
         in_varn : in_classn!MM1 (conditionn)
    to   out_var1 : out_class1!MM2 (bindings1),
         …
         out_varm : out_classm!MM2 (bindingsm)
    do { statements }
  }

The equivalent code fragment in SITL would be:

  proc rule_name ()
  beginproc
    forEachElem in_var1 in (elementsOfClass in_class1 inModel m1) where condition1 do
    …
    forEachElem in_varn in (elementsOfClass in_classn inModel m1) where conditionn do
      newElem out_var1 ofclass out_class1 inModel m2;
      …
      newElem out_varm ofclass out_classm inModel m2;
      bindings1;
      …
      bindingsm;
      statements;
  endproc

A more complete encoding of ATL in SITL, taking into account called rules and lazy unique rules, can be found in [8].
6 Related Work

Sitra [9] is a minimal, Java-based library that can be used to support the implementation of simple transformations. With a similar objective, RubyTL [10] is an extensible transformation language embedded in the Ruby programming language. These proposals are related to ours in the sense that they aim at providing a minimal and familiar transformation framework to avoid the cost of learning new concepts and tools. The main difference between these works and the proposal in this paper is that we are not interested in a solution that remains confined to a particular programming language, but rather in a language-independent solution founded on a mathematical description. Barzdins and colleagues [11] define L0, a low-level, procedural, strongly typed, textual model transformation language. This language contains minimal but sufficient constructs for model and metamodel processing and control-flow facilities resembling those found in assembler-like languages, and it is intended to be used for the implementation of higher-level model transformation languages by the bootstrapping method. Unlike our proposal, this language neither has a formal semantics nor is based on the idea of models as ADTs. Rensink proposes in [12] a minimal formal framework for clarifying the concepts of model, metamodel and model transformation. Unlike that work, our formal definitions are more understandable while still ensuring the characterization of all relevant features involved in the model transformation domain. Additionally, the proposal in [12] does not define a particular language for expressing transformations. On the other hand, because SITL is situated midway between ordinary programming languages and transformation-specific languages, this intermediate abstraction level makes it suitable for defining the semantics of complex transformation languages. In contrast to similar approaches – e.g. the translation of QVT to OCL+Alloy presented in [13] or the translation of QVT to Colored Petri Nets described in [14] – our solution offers a significant reduction of the gap between source and target transformation languages.
7 Conclusions

In this paper we have proposed the use of "models as abstract data types" as the basis to support the development of model transformations. Specifically, we have formalized models and metamodels as abstract mathematical structures equipped with a set of operations. This abstract characterization allowed us to define a simple transformation approach that can be used to support the definition and implementation of model-to-model transformations. The core of this approach is a very small and understandable set of programming constructs. The use of this approach enables transformations to be implemented in a simpler way by applying any ordinary imperative programming language enriched with the ADTs; thus we avoid the overhead of having a full model transformation platform and/or learning a new programming paradigm.
Additionally, the meanings of expressions in the transformation language are formally defined, enabling the validation of transformation specifications. Such meaning is abstract and independent of any existing programming language. Finally, we have shown that other well-known model transformation languages, such as ATL, can be encoded into this framework. Thus, this approach provides a practical way to formally define the semantics of complex model transformation languages.
References

[1] Stahl, T., Völter, M.: Model-Driven Software Development. John Wiley & Sons, Ltd., Chichester (2006)
[2] QVT Adopted Specification 2.0 (2005), http://www.omg.org/docs/ptc/05-11-01.pdf
[3] Jouault, F., Kurtev, I.: Transforming Models with ATL. In: Bruel, J.-M. (ed.) MoDELS 2005. LNCS, vol. 3844, pp. 128–138. Springer, Heidelberg (2006)
[4] Lawley, M., Steel, J.: Practical Declarative Model Transformation With Tefkat. In: Bruel, J.-M. (ed.) MoDELS 2005. LNCS, vol. 3844, pp. 139–150. Springer, Heidelberg (2006)
[5] Varro, D., Varro, G., Pataricza, A.: Designing the Automatic Transformation of Visual Languages. Science of Computer Programming 44(2), 205–227 (2002)
[6] Medini QVT. ikv++ technologies ag, http://www.ikv.de (accessed in December 2008)
[7] Hennessy, M.: The Semantics of Programming Languages. Wiley, Chichester (1990)
[8] Irazabal, J.: Encoding ATL into SITL. Technical report (2009), http://sol.info.unlp.edu.ar/eclipse/atl2sitl.pdf
[9] Akehurst, D.H., Bordbar, B., Evans, M.J., Howells, W.G.J., McDonald-Maier, K.D.: SiTra: Simple Transformations in Java. In: Nierstrasz, O., Whittle, J., Harel, D., Reggio, G. (eds.) MoDELS 2006. LNCS, vol. 4199, pp. 351–364. Springer, Heidelberg (2006)
[10] Sánchez Cuadrado, J., García Molina, J., Menarguez Tortosa, M.: RubyTL: A Practical, Extensible Transformation Language. In: Rensink, A., Warmer, J. (eds.) ECMDA-FA 2006. LNCS, vol. 4066, pp. 158–172. Springer, Heidelberg (2006)
[11] Barzdins, J., Kalnins, A., Rencis, E., Rikacovs, S.: Model Transformation Languages and Their Implementation by Bootstrapping Method. In: Avron, A., Dershowitz, N., Rabinovich, A. (eds.) Pillars of Computer Science. LNCS, vol. 4800, pp. 130–145. Springer, Heidelberg (2008)
[12] Rensink, A.: Subjects, Models, Languages, Transformations. In: Dagstuhl Seminar Proceedings 04101 (2005), http://drops.dagstuhl.de/opus/volltexte/2005/24
[13] Garcia, M.: Formalization of QVT-Relations: OCL-based static semantics and Alloy-based validation. In: MDSD today, pp. 21–30. Shaker Verlag (2008)
[14] de Lara, J., Guerra, E.: Formal Support for QVT-Relations with Coloured Petri Nets. In: Schürr, A., Selic, B. (eds.) MODELS 2009. LNCS, vol. 5795, pp. 256–270. Springer, Heidelberg (2009)
Towards Dynamic Evolution of Domain Specific Languages

Paul Laird and Stephen Barrett
Department of Computer Science, Trinity College, Dublin 2, Ireland
{lairdp,stephen.barrett}@cs.tcd.ie
Abstract. We propose the development of a framework for the variable interpretation of Domain Specific Languages (DSLs). Domains often contain abstractions whose interpretation changes in conjunction with global changes in the domain or specific changes in the context in which the program executes. In a scenario where the domain assumptions encoded in the DSL implementation change, programmers must still work with the existing DSL, and must therefore take more effort to describe their programs, or may sometimes fail to specify their intent. In such circumstances DSLs risk becoming less fit for purpose. We seek to develop an approach which makes a DSL less restrictive, maintaining flexibility and adaptability to cope with changing or novel contexts without reducing the expressiveness of the abstractions used.
1 Introduction
In this position paper we propose a model for the dynamic interpretation of Domain Specific Languages (DSLs). We believe that this is an important but as yet largely unexplored way to support changes in a program's execution, which varying context may require. The benefit such an approach would deliver is a capacity to evolve a program's behaviour to adapt to changing context, but without recourse to program redevelopment. A key benefit of this approach would be the ability to simultaneously adapt several applications through a localised change in DSL interpretation. Our research seeks to explore the potential of this form of adaptation as a mechanism for both systemic-scale and context-driven adaptation. Domain specific language constructs are a powerful method of programming primary functionality in a domain. A recent study by Kosar et al. [6] found that the end-user effort required to specify a correct program was reduced by comparison to standard programming practice. However, the development of DSL systems is time-consuming and expensive [11]. Requirements that emerge during development may end up left out, leaving the language release suboptimal, or, if included, may delay the release as the compiler or generator must be updated. Modelling lag [15] results. Domain evolution may also render inappropriate the formulae which roll up complex semantics into simple, accessible and expressive DSL statements. Where variability is high, the resulting DSL constructs can become unwieldy or low-level in response.
Updates of general-purpose languages are overwhelmingly polymorphic in nature in order to ensure backward compatibility. It would generally be inappropriate to change the interpretation of low-level constructs such as byte streams and classes. However, because the underlying semantics of high-level DSL terms may vary over the life cycle of the DSL, we argue that these semantic changes are best implemented in a manner capable of equivalent adaptation. If the intent or purpose of the program is not being changed, neither should the program be. The decoupling of program intent and implementation would allow for a new form of dynamic, post-deployment adaptation, with possibilities for program evolution by means other than those offered by current adaptation techniques. The cost of developing a new domain specific language would be reduced by the use of such a framework for DSL interpretation. If any common features were shared with an existing DSL, their implementations and specifications could be reused in the new language.
2 Proposed Solution
We propose to investigate the feasibility of varying the interpretation of a domain specific program as an adaptation strategy. Our solution is component based, and would involve the dynamic reconfiguration of the interactions of running components that constitute a DSL interpreter. Figure 1 shows the architecture of the proposed solution. The language specification functions as a co-ordination model, specifying the structure and behaviour of the interpreter. The language specification is interpreted by a generic interpreter, which co-ordinates the interactions of executing components, shown as diamonds, to yield the required behaviour. Context is used to switch between variations of the interpretation, for example to deal with network degradation in mobile applications. The interpreter, on reading a statement, would instantiate the elements required to execute the statement, combine them in a configuration which matches the specification statement's terms in the language description, and provide the components with access to the appropriate data as inputs, and locations in which to store the outputs. The effect achieved by using a generic interpreter, relying on language input to determine how it interprets a program, is to support a kind of adaptation based on changing the way in which a program is interpreted, by selective reconstruction of the interpreter. In order for the dynamic adaptation outlined earlier to function, the correct interpreter must be running after the adaptation; there must be a mechanism for replacing the version of the language in play with a newer version in the event of an update by the system. In our model, this amounts to dynamic architectural variation through component recomposition [3]. The architecture we propose for testing the execution of DSLs is service oriented [1], with the interpreter maintaining state and co-ordinating the instantiation and replacement of components as necessary. Some of these components could be stubs communicating with external services.
Fig. 1. System Architecture

Our approach proposes to use the late binding of service oriented computing to allow flexibility in the execution of a program. CoBRA [5] demonstrates the ability to reconfigure the interactions of executing components, including atomic replacements, and compares the method to other means of replacing service implementations. We envisage using an infrastructure of that nature to co-ordinate interactions below the interpreter, but to make the configuration dependent on the DSL. CoBRA uses state store/restore to maintain state between replacement services, but we envisage separating the state from the behavioural implementation of components, with components able to access their relevant state information. Chains of execution are specified by giving a component the outputs of previous components as inputs, while the interpreter need only deal with the outputs directly when intervention or a decision is required. The net effect of input and output variables used in this manner is not unlike connectors in MANIFOLD [14], but with greater flexibility for change.

Interpretation as a Service. Enterprise computing systems have moved from mainframe-based architecture to client-server architecture and are now in some cases moving to a web-based architecture [10]. This is being facilitated by technologies such as virtualisation [9] and Platform as a Service [19]. We posit a DSL platform operating across an organisation, capable of executing an open-ended set of DSL variations. This will allow us to support consistent change in interpretation across large-scale enterprises. Changes at the level of the domain specific language could be used to effect change across an entire organisation, across software systems. Application-specific changes to the language used for interpretation could be used to pilot trial changes to the domain specific language, in order to evaluate their utility for future use across the domain. Applications may also have terms with application-specific meaning, which could be adjusted in the same manner.
Usage of cloud computing in banks is below that in other domains [20]. Some concerns expressed in the financial services industry about using cloud-computing-based services include security, service-provider tie-in and price rises, lack of control, and potential for down-time if using a third-party cloud. The resources required to manage an internal cloud discourage this option, while both options, but particularly third-party clouds, could suffer from failure to respond in a timely manner to time-critical operations. The resources issue is likely to diminish in importance as technology advances and prices fall. An interpreter for a domain specific language, provisioned on a Platform as a Service basis, will require resources to set up and maintain; however, the ease with which programs could thereafter be written to run on the platform may outweigh this outlay. The initial cost is likely to be the inhibiting factor, as this would be an expense which would otherwise not be incurred, while the savings in application maintenance and updating should more than offset the cost of maintaining the platform. Large enterprises may run several different software systems and may want to implement the same change across all of them, following a change in the domain. If they use the traditional model driven development approach and change the transformation of the term in the domain specific language, they must still regenerate the appropriate source code and restart the affected components. This is not an atomic action, and inconsistencies may arise between different software systems in the organisation in terms of how they treat this term. In an environment where an interpreter is provisioned as a Platform as a Service, a single change to the interpretation of that term will affect all software systems running on that platform.
2.1 An Example Domain
We introduce our model by way of an example banking application. Financial services is a domain with well-defined domain constructs. A DSL for financial products can be seen in [2]. DSL and DSML programs are concise, and easier for those who work in the domain to understand and program than low-level code. However, each concise statement encodes complex behaviour. Over time, the precise interpretation of the high-level abstractions may change, but the overall meaning would not. Changes to the language used by banking system developers would normally be required after policy decisions, statutory changes or the introduction of new banking products whose specifications do not match previously available options. An example of a change to the language, which does not require programming specialists, is the introduction of free banking. This means that if a current account matches certain criteria, then standard fees such as maintenance, standing order fees, transaction fees etc. do not apply. If the implementation of a standing order previously charged a fee to carry out the request, then this could be preceded by a conditional checking that the account was not free or did not fulfil the necessary conditions for free banking, which could easily be expressed in the language.
Statutory changes introducing a new concept, such as deposit interest retention tax, would initially require new abstractions, but some of these could be implemented in the high-level language. In the case of the introduction of a new tax, all that is needed is an abstract definition of where the tax will apply, what rate of tax is applicable, and a mechanism to deal with the tax collected. The abstract definition of where the tax will apply will almost certainly be expressible in the domain specific language, the tax rate is a primitive fraction, and while the mechanism to deal with the tax collected may be potentially complex, it will reflect the actual banking process involved, and will therefore also be expressible in the language. The introduction of a deduct tax function would encapsulate the treatment of tax so that a single statement need only be added to a credit interest function to include that functionality. As the entire meaning is contained in one location, only one change needs to be made if the bank decides to change where it holds the funds it is due to pay in tax, or if the government changes the tax rate. The DSL would provide functionality to reliably transfer money from the client account to the tax account, keeping the change succinct. Developers within the context of the banking system are constrained in what they express by what abstractions have been defined in the domain specific language. They in turn constrain what the end users of the system can do. These relationships retain relevance, although the domain developers would have greater freedom to refine the language and develop compositional constructs in order to facilitate their easier programming in future. The following example is a specification of a loan in the RISLA [2] domain specific language for financial products. The language and syntax are much more accessible to financial engineers than an equivalent general-purpose implementation, and certainly by comparison to COBOL. The implementation is achieved by compiling the DSL code to produce COBOL code. New products or changes to products can be defined easily if the changes are at the DSL level, such as specifying a minimum transaction size or a maximum number of transactions in a given time, but changes to the scheme by which interest is calculated, or the addition of fees or taxes, for example in a cross-border context, would require changes to how the terms are interpreted. Changes such as these could happen without any change to the product specification, and therefore it would be inappropriate to change the definition of products at the DSL level to achieve them. If the interpretation of the DSL terms could be changed as we have proposed, this would allow the change to be effected at the appropriate level of abstraction. Figure 2 shows an example of Domain Specific code used to define a loan product in the RISLA language. To change the program so that only transactions of over 1000 euro could proceed is trivially easy, by changing the relevant number. Other changes, such as adding a transaction fee or tax, require changes to the implementation of one or more keywords; in the case of RISLA, this would be in COBOL, to which it is compiled. In the solution which we propose, the relevant interpretation is changed at runtime by reconfiguring the interpreter.
  product LOAN
    declaration
      contract data
        PAMOUNT   : amount
        STARTDATE : date
        MATURDATE : date
        INTRATE   : int-rate
        RDMLIST   := [] : cashflow-list
        ...
      registration
        define RDM as
          error checks
            "Date not in interval" in case of
              (DATUM < STARTDATE) or (DATUM >= MATURDATE)
            "Negative amount" in case of AMOUNT <= 0.0
            "Amount too big" in case of FPA(RDMLIST >> []) > 0.0
          RDMLIST := RDMLIST >> []

Fig. 2. Part of a Domain Specific Program
More significantly, some changes which could be catered for at the Domain Specific Language level are more appropriately handled at the interpretation level. If, for example, there was a taxation primitive in the DSL, and a tax was levied on all financial products, it would not be necessary to redesign the language in order to implement the change, but it would be desirable. Implementing the levy as an inbuilt part of initialisation, or of some other operation on any financial product, would localise the change in an Aspect Oriented way, saving effort on the part of the programmers and guaranteeing the reliability and uniformity of the change. Consider the implementation of a levy at the Domain Specific Program level. Code to handle the deduction of the levy would have to be added to the definition of every product where the levy would apply. If tax relief were subsequently granted on certain classes of products, these would have to be modified once more. In a case where an adaptation affects more than one Domain Specific Program, the atomicity of effecting the change through varying the interpretation may be of great benefit. A change to a transformation in a Model Driven Development scenario would have the same effect on one program, whose execution could be stopped, as a change in interpretation. This would represent another useful application of the concept of Evolving DSLs, as the performance of a transformed model would be faster than an interpreted version, but it would not provide the benefits of atomic runtime adaptation of multiple applications.
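To make the idea of changing an interpretation at runtime concrete – for instance for the deposit interest retention tax discussed above – the following Python sketch of our own illustrates the mechanism in miniature (the term credit_interest, the tax rate and the implementations are all hypothetical): the interpreter resolves each DSL term through a registry, so swapping one entry changes the behaviour of every program using that term, without touching any product specification:

  # registry mapping DSL terms to their current component implementations
  registry = {}

  def interpret(term, *args):
      return registry[term](*args)          # late-bound lookup on every use

  def credit_interest_v1(balance, rate):
      return balance * rate                 # original meaning of 'credit interest'

  registry["credit_interest"] = credit_interest_v1
  interpret("credit_interest", 1000.0, 0.03)        # interest before the change

  # statutory change: deposit interest retention tax is introduced.
  # Only the interpretation is replaced; DSL programs are left unchanged.
  TAX_RATE = 0.25                           # hypothetical rate

  def credit_interest_v2(balance, rate):
      gross = balance * rate
      tax = gross * TAX_RATE                # the 'deduct tax' step folded into the term
      return gross - tax                    # the tax itself would go to a tax account

  registry["credit_interest"] = credit_interest_v2
  interpret("credit_interest", 1000.0, 0.03)        # interest after the change

In a full implementation the registry entries would be executing components or service stubs co-ordinated by the generic interpreter of Figure 1, and the swap would be performed atomically across all applications running on the platform.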
2.2 A Multi-system Programming Paradigm
A solution of this kind produces a programming paradigm where languages can evolve organically to adapt to changing contexts. A potential application for this
is in the management of organisation-wide adaptation in large enterprises. These enterprises generally have many software systems operating on their computers, and many of these may access a common resource or service. This service could be mapped to a term in a domain specific language if the enterprise used a DSL to specify its software. The service may also be used by other clients. It may be desirable to change the meaning of the term such that it is executed by a different service; however, replacing the service at its current location may not be appropriate, as it has other clients. In a typical service-oriented computing setup, the change would have to be specified in each program using the service which was to be changed. This could introduce inconsistency into the way some domain concept is handled by different applications. By requiring the interpretation of the term by a common interpreter, the change need only be implemented once.
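A minimal sketch of such an arrangement (plain Scala, invented names): every program resolves the term through one shared binding, so rebinding it once changes all clients at the same time:

// Hypothetical sketch of a shared term-to-service binding.
object SharedInterpreter {
  private var bindings: Map[String, String => String] =
    Map("lookupCustomer" -> (id => s"legacy-service/$id"))

  // One rebinding here is seen by every program using the interpreter.
  def rebind(term: String, impl: String => String): Unit =
    bindings += (term -> impl)

  def execute(term: String, arg: String): String =
    bindings(term)(arg)
}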
2.3 Programming for Evolving DSLs
When a domain developer defines something in terms of the abstractions provided to him, he is in effect extending the language, as the interpreter refers to the program and to the domain specific language to find a definition for any construct it encounters. This language extension may be specific to the program or context in which it is used, but can be co-opted by future developers as part of a more specific language for related software. Underlying changes could be implemented by replacing a component with a polymorphic variant, or by aspect oriented or reflective interception and wrapping, but this should not concern the domain programmer. The interpreter could deal with more than one level of abstraction above the executing components, in order to represent each abstraction in terms of its component parts, rather than in terms of its atomic low-level components. Thus a transaction is defined in terms of reliable connections and simple instructions, below which issues such as the transaction commit protocol etc. are hidden.
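One way to picture this lookup (an illustrative Scala sketch, not the proposed implementation): the interpreter consults program-specific definitions first and falls back to the definitions of the domain specific language itself:

// Hypothetical sketch of layered construct resolution.
case class Definition(body: String)

class LayeredInterpreter(
    programDefs:  Map[String, Definition],    // extensions made by the domain developer
    languageDefs: Map[String, Definition]) {  // the DSL's own abstractions

  def resolve(construct: String): Option[Definition] =
    programDefs.get(construct).orElse(languageDefs.get(construct))
}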
3 Related Work
As well as languages to support multiple systems in large enterprises, we propose to examine the benefits of this programming paradigm in domains such as Computational Trust [7]. The implementation of terms in a Trust DSL may change rapidly based on changing conditions or in response to other factors. This makes Trust a suitable candidate for dynamic interpretation. Many proposed DSLs are implemented as library packages instead, due to the difficulty of finding language developers with appropriate domain knowledge [11]. Formal domain modelling can only capture a snapshot of the requirements of a domain, causing modelling lag [15]. A dynamic domain specific language is a DSL based not upon such a snapshot, but one which can be updated as necessary. Keyword Based Programming [4], Intentional Programming [17] and Intentional Software [18] allow incremental development of domain specific languages, and support their specialisation through application-specific language extensions. However
these require generation, compilation and/or reduction steps, after which the running application cannot be adapted in this manner. Papadopoulos and Arbab [14] show how MANIFOLD can handle autonomic reconfiguration with regard to the addition or removal of components. Our aim is to automate the changes which would be needed to implement a change in the execution of the system. The Generic Modelling Environment [8] allows the construction of a modelling environment given a specification of a Domain Specific Modelling Language. This could be used to represent the domain program, but the generation of general-purpose language code would not allow later dynamic adaptation. Nakatani et al. [12,13] describe how lightweight Domain Specific Languages, or jargons, can be interpreted using a generic interpreter and language descriptions for the relevant jargons. While there is composition of different jargons to allow them to be used as part of an extended DSL, there is no attempt to modify a program through dynamically varying the interpretation of a term. Platform as a Service is an extension of Software as a Service which sees web-based application development environments hosted by a system provider. The resulting applications are often hosted in a Software as a Service manner by the same provider [19]. Software as a Service [16] is the provision of software through the internet or other network, in a service-oriented way. The end user does not have to worry about hosting, updating or maintaining the software.
4 Conclusions
We have presented an outline framework for the design and maintenance of systems. Systems written in domain specific languages would be implemented through the runtime interpretation of their programs, so as to allow the reinterpretation of terms in the language. The design of domain specific languages from scratch would remain a significant task, as abstractions from the domain need to be captured in a form that can be used to develop programs; maintenance, however, becomes much easier, as parts of the language can be redefined as required. There are several levels at which a program's execution can be changed. Software is written as an application in a programming language, the source code of which can be changed. If the program is interpreted, the virtual machine on which it runs can be altered, or, if it is compiled, changes can be made at compile time. The operating system itself can be changed, affecting program execution. The lower the level at which an adaptation is implemented, the wider the effects of that change will be, but the less expressive the specification of that change and the less program-specific the change will be. We propose to introduce adaptation at a level below the source, but above the executing components. This is an appropriate level at which to implement certain forms of adaptation. Adding another layer naturally introduces an overhead, but we wish to establish whether the benefits to be gained from increased flexibility justify the overhead incurred. The adaptations for which this approach is best suited are functional, non-polymorphic, runtime adaptations. The framework could naturally support
polymorphic or non-functional runtime adaptation as well; however, these alone would not justify the creation of an adaptation framework, as aspect-oriented programming can perform most of these adaptations adequately. Overall code localisation would improve, as any change which is implemented through a change in interpretation removes the need for identical edits throughout the code. Dynamic AOP also requires consideration, during further evolution, of all code previously woven at runtime. The ability to redefine parts of the language in order to provide similar programs in a different context could lead to the budding off of new languages from a developed domain specific language. This would significantly lower the barrier to entry for any domain lacking the economies of scale required to justify DSL development but sharing some high level abstractions with a related domain. Opening the interpretation of a DSL to runtime adaptation would allow the simultaneous adaptation of multiple applications running on a DSL platform. Delivering such a platform would take considerable resources in set-up and maintenance, but would ease the process of organisation-wide adaptation and increase its reliability and consistency.
References

1. Allen, P.: Service orientation: winning strategies and best practices. Cambridge University Press, Cambridge (2006)
2. Arnold, B., van Deursen, A., Res, M.: An algebraic specification of a language describing financial products. In: IEEE Workshop on Formal Methods Application in Software Engineering, pp. 6–13 (1995)
3. Barrett, S.: A software development process. U.S. Patent (2006)
4. Cleenewerck, T.: Component-based DSL development. In: Pfenning, F., Smaragdakis, Y. (eds.) GPCE 2003. LNCS, vol. 2830, pp. 245–264. Springer, Heidelberg (2003)
5. Irmert, F., Fisher, T., Meyer-Wegener, K.: Runtime adaptation in a service-oriented component model. In: Proceedings of the 2008 International Workshop on Software Engineering for Adaptive and Self-Managing Systems (2008)
6. Kosar, T., López, P.E.M., Barrientos, P.A., Mernik, M.: A preliminary study on various implementation approaches of domain-specific language. Information and Software Technology 50(5), 390–405 (2008)
7. Laird, P., Dondio, P., Barrett, S.: Dynamic domain specific languages for trust models. In: Proceedings of the 1st IARIA Workshop on Computational Trust for Self-Adaptive Systems (to appear, 2009)
8. Ledeczi, A., Maroti, M., Bakay, A., Karsai, G., Garrett, J., Thomason, C., Nordstrom, G., Sprinkle, J., Volgyesi, P. (eds.): The Generic Modeling Environment. Workshop on Intelligent Signal Processing, Budapest, Hungary (2001)
9. Marinescu, D., Kroger, R.: State of the art in autonomic computing and virtualization. Technical report, Distributed Systems Lab, Wiesbaden University of Applied Sciences (2007)
10. Markus, M.L., Tanis, C.: The enterprise systems experience – from adoption to success. Framing the domains of IT research: Glimpsing the future through the past, 173–207 (2000)
11. Mernik, M., Sloane, T., Heering, J.: When and how to develop domain-specific languages. ACM Computing Surveys 37(4), 316–344 (2005)
12. Nakatani, L.H., Ardis, M.A., Olsen, R.G., Pontrelli, P.M.: Jargons for domain engineering. SIGPLAN Not. 35(1), 15–24 (2000)
13. Nakatani, L.H., Jones, M.A.: Jargons and infocentrism. In: First ACM SIGPLAN Workshop on Domain-Specific Languages, pp. 59–74. ACM Press, New York (1997)
14. Papadopoulos, G.A., Arbab, F.: Configuration and dynamic reconfiguration of components using the coordination paradigm. Future Generation Computer Systems 17(8), 1023–1038 (2001)
15. Safa, L.: The practice of deploying DSM: report from a Japanese appliance maker trenches. In: Gray, J., Tolvanen, J.-P., Sprinkle, J. (eds.) 6th OOPSLA Workshop on Domain-Specific Modeling (2006)
16. SIIA: Software as a service: Strategic backgrounder. Technical report, Software and Information Industry Association (2001)
17. Simonyi, C.: The death of computer languages. Technical report, Microsoft (1995)
18. Simonyi, C., Christerson, M., Clifford, S.: Intentional software. In: Proceedings of the 21st OOPSLA Conference. ACM, New York (2006)
19. Vaquero, L.M., Rodero-Merino, L., Caceres, J., Lindner, M.: A break in the clouds: towards a cloud definition. SIGCOMM Comput. Commun. Rev. 39(1), 50–55 (2009)
20. Voona, S., Venkataratna, R., Hoshing, D.N.: Cloud computing for banks. In: Finacle Connect (2009)
ScalaQL: Language-Integrated Database Queries for Scala Daniel Spiewak and Tian Zhao University of Wisconsin – Milwaukee {dspiewak,tzhao}@uwm.edu
Abstract. One of the most ubiquitous elements of modern computing is the relational database. Very few modern applications are created without some sort of database backend. Unfortunately, relational database concepts are fundamentally very different from those used in general-purpose programming languages. This creates an impedance mismatch between the application and the database layers. One solution to this problem which has been gaining traction in the .NET family of languages is Language-Integrated Queries (LINQ). That is, the embedding of database queries within application code in a way that is statically checked and type safe. Unfortunately, certain language changes or core design elements were necessary to make this embedding possible. We present a framework which implements this concept of type safe embedded queries in Scala without any modifications to the language itself. The entire framework is implemented by leveraging existing language features (particularly for-comprehensions).
terms. All of the conflict between the dissonant concepts is relegated to a discrete segment of the application. This is by far the simplest approach to application-level database access, but it is also the most error-prone. Generally speaking, this technique is implemented by embedding relational queries within application code in the form of raw character strings. These queries are unparsed and completely unchecked until runtime, at which point they are passed to the database and their results converted using more repetitive and unchecked routines. It is incredibly easy even for experienced developers to make mistakes in the creation of these queries. Even excluding simple typos, it is always possible to confuse identifier names, function arities or even data types. Worse yet, the process of constructing a query in string form can also lead to serious security vulnerabilities — most commonly SQL injection. None of these problems can be found ahead of time without special analysis. The Holy Grail of embedded queries is to find some way to make the host language compiler aware of the query and capable of statically eliminating these runtime issues. As it turns out, this is possible within many of the .NET language family through a framework known as LINQ [8]. Queries are expressed using language-level constructs which can be verified at compile-time. Furthermore, queries specified using LINQ also gain a high degree of composability, meaning that elements common to several queries can often be factored into a single location, improving maintainability and reducing the risk of mistakes. It is very easy to use LINQ to create a trivial database query requesting the names of all people over the age of 18: var Names = from p in Person where p.Age > 18 select p.Name; This will evaluate (at runtime) an SQL query of the following form: SELECT name FROM people WHERE age > 18 Unfortunately, this sort of embedding requires certain language features which are absent from most non-homoiconic [10] languages. Specifically, the LINQ framework needs the ability to directly analyze the structure of the query at runtime. In the query above, we are filtering the query results according to the expression p.Age > 18. C# evaluation uses call-by-value semantics, meaning that this expression should evaluate to a bool. However, we don’t actually want this expression to evaluate. LINQ needs to somehow inspect this expression to determine the equivalent SQL in the query generation step. This is where the added language features come into play. While it is possible for Microsoft to simply extend their language with this particular feature, lowly application developers are not so fortunate. For example, there is no way for anyone (outside of Sun Microsystems) to implement any form of LINQ within Java because of the language modifications which would be required. We faced a similar problem attempting to implement LINQ in Scala.
Fortunately, Scala is actually powerful enough in and of itself to implement a form of LINQ even without adding support for expression trees. Through a combination of operator overloading, implicit conversions, and controlled call-by-name semantics, we have been able to achieve the same effect without making any changes to the language itself. In this paper, we present not only the resulting Scala framework, but also a general technique for implementing other such internal DSLs requiring advanced analysis and inspection prior to evaluation.
Note that throughout this paper, we use the term "internal DSL" [4] to refer to a domain-specific language encoded as an API within a host language (such as Haskell or Scala). We prefer this term over the often-used "embedded DSL" as it forms an obvious counterpoint to "external DSL", a widely-accepted term for a domain-specific language (possibly not even Turing Complete) which is parsed and evaluated just like a general-purpose language, independent of any host language.
In the rest of the paper, Section 2 introduces ScalaQL and shows some examples of its use. Section 3 gives a general overview of the implementation and the way in which arbitrary expression trees may be generated in pure Scala. Finally, Section 4 draws some basic comparisons with LINQ, HaskellDB and similar efforts in Scala and other languages.
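For instance, a by-name parameter lets library code receive an expression without evaluating it immediately; a small, generic illustration (plain Scala, not ScalaQL code):

// cond and body are passed by name, so the library decides if and when
// they are evaluated.
def unless(cond: => Boolean)(body: => Unit): Unit =
  if (!cond) body

unless(System.getenv("SKIP") != null) {
  println("running the body")   // only evaluated when SKIP is unset
}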
2 ScalaQL
The entire ScalaQL DSL is oriented around a single Scala construct: the for-comprehension. This language feature is something of an amalgamation of Haskell's do-notation and its list-comprehensions, rendered within a syntax which looks decidedly like Java's enhanced for-loops. One trivial application of this construct might be to construct a sequence of 2-tuples of all integers between 0 and 5 such that their sum is even:

val tuples = for {
  x <- 0 to 5
  y <- 0 to 5
  if (x + y) % 2 == 0
} yield (x, y)

There are really three separate components to this syntax. The first is the generator (e.g. x <- ...), which sets up the local variable x containing the current element in the comprehension. The second component is the filter (if ...), which defines the conditions under which this comprehension holds. Finally, we have the yield clause, which defines the result in terms of the variables set up by the generator(s). There may be any number of generators and filters, but only one yield. Every for-comprehension is parsed into a corresponding series of calls to the methods flatMap, map and filter (unless the for-comprehension lacks a yield, in which case foreach replaces flatMap and map). The map and filter methods are standard
higher-order utility functions. The flatMap method is effectively Scala's version of Haskell's >>= operator (monadic bind). It is defined for collections as a composition of the map and flatten functions. By rewriting for-comprehensions in terms of other language elements at parse time, Scala empowers third-party frameworks (such as ScalaQL) to exploit the syntax simply by implementing the relevant methods.
Altogether, this syntax provides a way of working with Scala collections in an almost declarative fashion reminiscent of a query language. In fact, it is possible to make use of for-comprehensions to perform SQL-like queries against Scala collections. For example:

// regular Scala collections, not ScalaQL
val people: List[Person] = ...
val companies: List[Company] = ...

val underAge = for {
  p <- people
  c <- companies
  if p.company == c
  if p.age < 14
} yield p

This expression yields a List of all people under the age of 14 who are employed by some company. If we were to formulate this same query in SQL, the result would be something like this:

SELECT p.*
FROM people p JOIN companies c ON p.company_id = c.id
WHERE p.age < 14

Intuitively, for-comprehensions are a natural syntactic device for representing declarative queries against generic collections. ScalaQL makes it possible to use that same syntax to represent database queries. Using ScalaQL, we can take our query example from earlier and slightly adapt it into something that will actually run against a database:

val underAge = for {
  p <- Person
  c <- Company
  if p.company is c
  if p.age < 14
} yield p

Recall that for-comprehensions are translated into a corresponding series of calls to flatMap, map and filter. In this case, the first (outermost) call to flatMap will be targeted on the Person object. This is what allows ScalaQL to "hijack"
the for-comprehension syntax. Person must implement — or inherit from a type which implements — the flatMap and map methods such that an abstract representation of the query is produced (see Section 3).
The primary syntactic difference between this and the same query run against Scala List(s) is the use of the is operator (rather than ==) to test equality. This is necessary because of the way that Scala handles the == method: unlike other symbolic methods, Scala defines == as an alias for equals, and our experiments revealed some bugs in Scala's type checker when either equals or == is defined to return anything other than Boolean (unrelated and well-formed sections of code would arbitrarily fail to type-check). Amazingly enough, it is the only syntactic concession made by the framework. All other String, Int and Boolean operators work exactly as expected. For example, the < operator is used above to compare p.age to the integer literal, 14.
The above expression will produce an instance of Query[Person], one which will produce a sequence of Person entities when evaluated. ScalaQL does not evaluate queries at declaration point. Instead, evaluation is deferred until the query is actually used as a sequence. For example:

underAge foreach { p =>
  println(p.firstName + ' ' + p.lastName)
}

The foreach method is not declared for type Query. When the Scala compiler sees this invocation, it determines that an implicit conversion from Query[Person] to Seq[Person] is required in order to make everything work. This implicit conversion is transparently injected into the bytecode by the Scala compiler and invoked at runtime just prior to the invocation of foreach. It is this implicit conversion, defined by ScalaQL, which handles the query evaluation.
The primary advantage to this deferred evaluation is that it allows queries to be treated compositionally. For example, we might want to construct a query which finds all of the under-age employees working at MegaCorp. Rather than redundantly defining the query constraints for under-age workers, we can simply build our new query by composing with the old:

val megaCorpEmps = for {
  p <- underAge
  if p.company.name is "MegaCorp"
} yield p

If we were to evaluate the megaCorpEmps query, it would execute SQL against the database very similar to the following:

SELECT p.*
FROM people p JOIN companies c ON p.company_id = c.id
WHERE p.age < 14 AND c.name = 'MegaCorp'
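A minimal sketch of how such a Query type and conversion could be arranged (the signatures and bodies here are guesses, not the actual ScalaQL definitions):

// Hypothetical sketch: the relevant methods build query nodes instead of
// traversing a collection, and an implicit conversion performs the deferred
// evaluation when a Seq is needed.
trait Query[A] {
  def map[B](f: A => B): Query[B]
  def flatMap[B](f: A => Query[B]): Query[B]
  def filter(p: A => Any): Query[A]   // the predicate yields an expression node
}

object Query {
  // Invented helper: generate SQL from the query tree and run it.
  implicit def queryToSeq[A](q: Query[A]): Seq[A] = evaluate(q)
  def evaluate[A](q: Query[A]): Seq[A] = sys.error("runs the generated SQL")
}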
2.1 Projection
So far, all of the queries we have expressed using ScalaQL have had a very simple yield statement, producing an instance of Query parameterized against
an entity type. ScalaQL is also capable of projecting on single fields as well as arbitrary record types defined as anonymous classes. This makes it possible to define type safe projections with arbitrary fields. Single-field and single-expression projection works exactly as expected. We define our yield clause in terms of the row locals defined in the generators (e.g. p or c), using fields, operators and values in the same fashion as in the filters. For example: val names = for { p <- Person if p.age > 18 } yield p.lastName This defines an instance of type Query[Varchar] which produces the last names of all of the people in the database over the age of 18. With a few slight modifications, we can actually produce the concatenation of the first and last names in the standard “Last, First” format: val names = for { p <- Person if p.age > 18 } yield p.lastName + ", " + p.firstName When evaluated, this query will execute SQL similar to the following: SELECT CONCAT(CONCAT(p.last_name, ', '), p.first_name) FROM people p WHERE p.age > 18 One particularly thorny aspect of projection which has been a difficult area for similar query DSLs in the past is that of multi-field projection. In SQL, it is possible to construct a query which produces a subset of the resulting fields; not just one field, but several. This is difficult because it requires the ad-hoc definition of new record types corresponding to the fields in question. While classes are technically a form of record type, very few languages sufficiently facilitate the definition of classes on a case-by-case basis. When each query requires a different record type (class) for its projection, query definition becomes a very tedious affair. Fortunately, Scala provides a lightweight syntax for defining Java-style anonymous inner-classes which extend AnyRef. This syntax (which actually comes from C#) makes it easy to define new classes at query-site without becoming syntactically burdensome: val people = for { p <- Person if p.age > 18 } yield new { val firstName = p.firstName val lastName = p.lastName }
This query selects only the first name and last name fields from the people table. The new { ... } syntax defines a new anonymous inner-class containing two fields: firstName and lastName. This type will be used to populate the query results. Thus, the type of the people value is Query[$t], where $t is the type of the anonymous inner-class (this type is hidden by Scala’s type inference, hence the use of the “$t” notation). We can demonstrate this fact by iterating over the query results and accessing fields: people foreach { p => println(p.firstName + ' ' + p.lastName) }
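The $t above is a structural type; a plain-Scala sketch (names invented, not ScalaQL code) of the same effect:

object Demo extends App {
  // The anonymous class has no name, but its fields are part of the inferred
  // (structural) type, so access remains statically checked.
  val row = new { val firstName = "Grace"; val lastName = "Hopper" }

  def fullName(p: { val firstName: String; val lastName: String }): String =
    p.firstName + " " + p.lastName

  println(fullName(row))
}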
3 Implementation
The most important guiding concept of ScalaQL’s implementation is that of the abstract query tree, which is similar in principle to an abstract syntax tree used in the implementation of most programming languages. Unlike most internal DSLs, ScalaQL does not immediately evaluate the invocation syntax into a final result. Instead, it creates an abstract representation of the desired query in an AST-like structure. This structure is what is actually contained by a value of type Query. When the Query is converted to a Seq, the abstract query tree is converted into the corresponding SQL, which is evaluated against the database to produce the final result. The query tree is composed of three elements: views, projections and expressions. Views directly correspond to relations in relational algebra and may be either tables or queries (another abstract query tree). Projections have three different forms, each corresponding to one of the three different projection types supported by ScalaQL: single field, single table and field subset. Projections may also contain expressions in cases where the yield clause is not a simple field or entity: for { p <- Person } yield p.firstName + " " + p.lastName Expressions are where most of the interest lies. The addition of abstract expression trees as first-class values was one of the primary changes in C# 3.0 as required by LINQ. Since Scala does not have this feature, we must find a way to construct expression trees using a different approach. The solution is a combination of implicit conversions and operator overloading. In the above example, we have given the sub-expression p.firstName + " ". While p.firstName may appear to be a field of type String, it actually has type Varchar, which extends the StringExpression class. This class defines a number of methods, including +, an operator which takes another StringExpression as a parameter. We have defined an implicit conversion from type String to StringExpression, allowing literal strings to be concatenated onto abstract StringExpression(s). The result of this + method is an abstract expression node, AddStr, which also extends StringExpression.
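A stripped-down sketch of this arrangement, reusing the names mentioned in the description (StringExpression, Varchar, AddStr) but with invented bodies and an invented StringLiteral node:

// Expression nodes record the operation instead of performing it.
trait StringExpression {
  def +(that: StringExpression): StringExpression = AddStr(this, that)
}

case class Varchar(column: String)                                  extends StringExpression
case class StringLiteral(value: String)                             extends StringExpression
case class AddStr(left: StringExpression, right: StringExpression)  extends StringExpression

object StringExpression {
  // Lets a raw string literal appear on the right-hand side of +.
  implicit def stringToExpr(s: String): StringExpression = StringLiteral(s)
}

// With the implicit in scope, Varchar("first_name") + " " builds
// AddStr(Varchar("first_name"), StringLiteral(" ")) rather than a String.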
Of course, strings are not the only data type manipulated by SQL expressions. For this reason, we have also created implementations for NumericExpression, BooleanExpression and TimeExpression. Each of these classes defines operator methods according to how their respective type is expected to behave. Thus, NumericExpression defines +, *, % and more, while BooleanExpression defines &&, || and so on. Every expression class extends Expression, which defines operator methods common to all expressions: is and !=. All of these operations return abstract expression nodes representing the specific operation in question. These nodes each resolve to a different SQL operation or function, making it possible to effectively compile Scala expressions into SQL at runtime. In a sense, the expression DSL parses code which appears to be conventional String, Int and Boolean expressions into a structure very reminiscent of a compiler’s abstract syntax tree. This tree can then undergo a code generation phase, which produces the corresponding SQL. Type safety is ensured by the fact that the operator methods in each expression class will only accept certain parameter types. Thus, it is impossible to concatenate a StringExpression and a NumericExpression; the + operator method in StringExpression only accepts another StringExpression. Inherited operator methods like is and != are guaranteed type safety through the use of an abstract type declared in the Expression superclass. This type effectively allows the parameters for any operator methods in Expression to vary covariantly with subtyping, ensuring that it is impossible to test a NumericExpression and a BooleanExpression for equality. The other advantage to this approach in general (besides type safety) is that it allows optimizations and other in-depth analysis to be performed against the abstract expression tree prior to resolution (code generation). Normally, a DSL evaluates directly to its final result, making it very difficult to perform any sort of non-trivial processing on the instructions. This is because direct evaluation effectively restricts any processing to a single pass over the instructions. By evaluating to an intermediate form (the expression tree), we make it possible to perform multi-pass analysis (including optimization) against a complete representation of the DSL instructions.
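One possible shape for the abstract-type technique described above (an illustrative sketch; the actual ScalaQL definitions may differ, and the Comparison node is invented):

trait Expression {
  type Compatible <: Expression          // refined in each subclass
  def is(that: Compatible): BooleanExpression = Comparison(this, that)
}

trait NumericExpression extends Expression { type Compatible = NumericExpression }
trait BooleanExpression extends Expression { type Compatible = BooleanExpression }

// Invented node type standing in for ScalaQL's equality node.
case class Comparison(left: Expression, right: Expression) extends BooleanExpression

// Because the parameter type of `is` varies with the subclass, comparing a
// NumericExpression with a BooleanExpression does not type-check.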
4 Related Work
SQLJ [9] embeds SQL into Java and is statically typed. However, dynamic queries are not supported, as every SQLJ query is converted in a pre-compilation step. While not technically a language extension, SQLJ is certainly not "plain-old Java". SchemeQL [12] is similar to SQLJ in that it processes embedded query statements using an external preprocessor, but without providing any static typing. Safe Query Object [1] achieves many of the same goals as SQLJ, all while working within regular Java syntax. Users specify queries using special Java classes which are compiled into JDO queries. Safe Query Object also supports a wide variety of query operations including existential quantification, parameters and dynamic queries. However, like SQLJ, a special compilation step is
required to perform the conversion. As mentioned previously, systems such as LINQ [8] support SQL-like queries through language extensions. Java Language Extender [11] is another framework which operates in this fashion. Of all of the projects in this field, HaskellDB [6] is likely the most similar to our approach in that it functions as an internal domain-specific language. Operations such as filter, join and conditionals are all supported in a statically checked, type safe environment provided by Haskell’s type system. However, Haskell imposes heavier restrictions on function overloading than does Scala. Thus, HaskellDB is forced to use operators like .+. instead of the more familiar + when summing query values. Also, Scala’s implicit conversions are in some ways more powerful than Haskell’s type classes. ScalaQL allows the use of integer literals directly in query expressions, while HaskellDB requires the explicit use of the constant function. Related to HaskellDB is the Pan language [3]. While Pan has very little to do with database queries, it does demonstrate the power of internal DSL construction with an intermediate form. Like ScalaQL, Pan relies on carefully-constructed ADTs to statically ensure well-formedness of DSL expressions. The authors of Pan also discuss ways in which the intermediate form of the DSL may be leveraged in the implementation of advanced optimizations and analyses. The AraRat [5] framework provides similar query functionality in C++ through the use of preprocessor directives, operator overloading and templates. Its focus is primarily on directly representing relational algebra within the syntax of C++, rather than a more “familiar” dialect like SQL. Thus, a join is represented using the * operator, rather than through a more mainstream nomenclature. AraRat does share what is perhaps ScalaQL’s most important feature in that it represents views in their abstract form, allowing queries to be highly compositional and easily optimized. AraRat provides a large amount of type safety in the construction and composition of queries, but it does not extend that safety to the evaluation of those queries and subsequent parsing of the results. Queries are simply converted to char* using the asSQL() function. This differs from ScalaQL, which converts abstract views into properly type safe sequences during evaluation. This limitation is not entirely surprising given the fact that C++ lacks a generic database access framework like JDBC. Various non-academic efforts have also been made to solve this problem of language embedded queries based on real-world requirements. Ambition [2] is a widely-used internal DSL for Ruby which provides a very natural syntax for constructing queries. Notably, its core framework is not restricted to merely database access; it has also been applied to other query domains such as LDAP and XPath. However, as can be expected from a framework designed for a dynamically-typed language like Ruby, Ambition provides no static guarantees regarding query correctness. A project very similar to ScalaQL has been developed independently by Stefan Zeiger [13]. Like ScalaQL, this project aims to provide a framework for type safe queries within Scala using for-comprehensions. However, despite this similarity, there are some important differences. ScalaQL makes use of the pseudo-monadic
filter operation for declaring query conditionals, allowing the use of the if syntax in for-comprehensions. Zeiger’s framework defines a separate series of methods for this (though it can use filter for some conditionals). Projection differs greatly between the frameworks, with ScalaQL relying on anonymous inner-classes while Zeiger’s framework uses field combinators to generate arbitrary views (e.g. firstName - lastName - age).
5 Summary
In this paper, we have given a brief overview of the ScalaQL framework, focusing specifically on static type safety and syntactic intuitiveness. By exploiting the existing for-comprehension construct, ScalaQL blends seamlessly with conventional query-like operations performed on Scala collections. We predict that ScalaQL — or something like it — will become an important part of general-purpose Scala ORM frameworks in the future.
References

1. Cook, W.R., Rai, S.: Safe Query Objects: Statically typed objects as remotely executable queries. In: Proceedings of the International Conference on Software Engineering (ICSE), pp. 97–106 (2005)
2. Defunkt, C.: Ruby's Ambition (2008), http://ambition.rubyforge.org/
3. Elliott, C., Finne, S., De Moor, O.: Compiling Embedded Languages. Journal of Functional Programming 13(03), 455–481 (2003)
4. Fowler, M.: Domain Specific Language (2007), http://www.martinfowler.com/bliki/DomainSpecificLanguage.html
5. Gil, J.Y., Lenz, K.: Simple and safe SQL queries with C++ templates. In: GPCE 2007: Proceedings of the 6th International Conference on Generative Programming and Component Engineering, pp. 13–24 (2007)
6. Leijen, D., Meijer, E.: Domain specific embedded compilers. In: Proceedings of the 2nd Conference on Domain-Specific Languages, pp. 109–122 (1999)
7. Maier, D.: Representing database programs as objects. In: Advances in Database Programming Languages, pp. 377–386. ACM, New York (1990)
8. Meijer, E., Beckman, B., Bierman, G.M.: LINQ: Reconciling object, relations and XML in the .NET framework. In: Proceedings of the ACM Symposium on Principles of Database Systems (2006)
9. Melton, J., Eisenberg, A.: Understanding SQL and Java together: a guide to SQLJ, JDBC, and related technologies. Morgan Kaufmann, San Francisco (2000)
10. Steele Jr., G.L.: Common LISP: the language. Digital Press (1984)
11. Van Wyk, E., Krishnan, L., Bodin, D., Johnson, E.: Adding domain-specific and general purpose language features to Java with the Java Language Extender. In: Companion to the 21st ACM SIGPLAN Symposium on Object-Oriented Programming Systems, Languages, and Applications, pp. 728–729 (2006)
12. Welsh, N., Solsona, F., Glover, I.: SchemeUnit and SchemeQL: Two little languages. In: Third Workshop on Scheme and Functional Programming (2002)
13. Zeiger, S.: A Type-Safe Database Query DSL for Scala (2008), http://szeiger.de/blog/2008/12/21/a-type-safe-database-query-dsl-for-scala/
Integration of Data Validation and User Interface Concerns in a DSL for Web Applications Danny M. Groenewegen and Eelco Visser Software Engineering Research Group, Delft University of Technology, The Netherlands [email protected], [email protected]
Abstract. Data validation rules constitute the constraints that data input and processing must adhere to in addition to the structural constraints imposed by a data model. Web modeling tools do not address data validation concerns explicitly, hampering full code generation and model expressivity. Web application frameworks do not offer a consistent interface for data validation. In this paper, we present a solution for the integration of declarative data validation rules with user interface models in the domain of web applications, unifying syntax, mechanisms for error handling, and semantics of validation checks, and covering value well-formedness, data invariants, input assertions, and action assertions. We have implemented the approach in WebDSL, a domain-specific language for the definition of web applications.
1 Introduction

The engineering of web applications requires catering for a number of different concerns including data models, user interfaces, actions, data validation, and access control. In the mainstream technology for web application development these concerns are supported by loosely coupled languages that require abundant boilerplate code and lack static verification. The domain-specific language engineering challenge for the web application domain [21] is to realize a concise, high-level, declarative language for the definition of web applications in which the various concerns are supported by specialized sub-languages, yet linguistically integrated, and from which implementations can be derived automatically. This requires investigation and understanding of, and the design of appropriate domain-specific languages for each of the sub-domains of the web application domain. Moreover, it requires the seamless linguistic integration of these separate languages that ensures the consistency of models in the different domains and that leverages their combination. This research program is relevant for the discovery of good abstractions for the web engineering domain. It is also relevant as a case study in the systematic development of families of domain-specific languages. In previous work we have studied the domains of data models and user interface definitions [21], access control [6], and workflow [7], the results of which have been implemented as sub-languages of the WebDSL language [22]. In this paper, we address the domain of data validation and its interaction with the user interface.
The core of a data-intensive web application is its data model. The web application must be organized to preserve the consistency of data with respect to the data model during updates, deletes, and insertions. The core consistency properties of a data model are formed by structural constraints, that is, the data members of and relations between
entities. Some consistency properties cannot be expressed as structural constraints. Furthermore, some data integrity constraints do not pertain directly to persistent data. Data validation rules constitute the constraints that data input and processing must adhere to in addition to the structural constraints imposed by the data model. A high-level web engineering solution should provide a uniform and declarative validation model that integrates with the other relevant technical models. In addition to ensuring data consistency by enforcing a validation model, the integration of data validation in a web application requires a mechanism for reporting constraint violations to the user, indicating the origin of the violation in the user interface with a sensible error message and consistent styling. Model-driven methodologies such as OOHDM [18], WebML [4], UWE [10], OOWS [15], and Hera [20] do not make data validation concerns explicit in their models. When generating code from models, as demonstrated for UWE [11], WebML [2], and Hera [5], validating data requires an escape from model to code, hampering full code generation and model expressivity. In this paper, we present a language design that integrates declarative data validation rules with user interface models in the domain of web applications, unifying syntax, mechanisms for error handling, and semantics of validation checks, and that covers value well-formedness, data invariants, input assertions, and action assertions. We have implemented the approach in WebDSL [21], a domain-specific language for the definition of web applications. The main contributions of this paper are (1) the design of abstractions for data validation in web applications for concise and uniform specification of value well-formedness, data invariants, input assertions, and action assertions, (2) the seamless integration of data validation rules and user interface definitions, and (3) an example of the integration of models for multiple technical domains. In the next section we give a brief introduction to WebDSL and the running example used in the rest of the paper. Section 3 discusses validation features necessary for web applications, namely value well-formedness, data invariants, input assertions, and action assertions. Section 4 discusses related and future work, and Section 5 concludes.
2 WebDSL

WebDSL [21] is a domain-specific language for the development of web applications that integrates data models, user interface models, user interface actions, styling, access control [6], and workflow [7]. While these different concerns are supported by separate domain-specific sub-languages, the static semantics of the language enforces the integrity of the different concerns of an application model. What distinguishes WebDSL from web application frameworks in general purpose languages [9,13,16] is static verification and abstraction from accidental complexity (boilerplate code). Compared to web modeling tools [19,11,14,2], WebDSL combines high expressivity with good coverage (customization options). The WebDSL compiler generates a complete implementation in Java or Python. In this section we give an overview of the features of WebDSL needed in this paper and introduce the running example used to discuss data validation. We illustrate the various categories of data validation with a small user management application. The example application consists of two data model entities, namely User
entity User {
  username :: String
  email    :: Email
}

entity UserGroup {
  name    :: String (id)
  members -> Set<User>
}
and UserGroup (Fig. 1). Data model definitions describe the persistent data model in a WebDSL application. Data model entities consist of properties with a name and a type. Types of properties are either value types (indicated by ::) or associations to other entities defined in the data model. Value types are basic data types such as String and Int, but also domain-specific types such as Email that carry additional functionality. Associations are composite (the referer owns the object, indicated by <>) or referential (the object may be shared, indicated by ->). Associations can be to collections such as Set or List, demonstrated by the members property of the UserGroup entity. Page definitions in WebDSL describe the web pages that allow users to view and modify data model entities. Page definitions consist of the name of the page, the names and types of the objects passed as parameters, and a presentation of the data contained in the parameter objects. For example, the editUser(u:User) definition in Fig. 1 creates a page for editing the properties of User entity u. WebDSL provides basic markup operators such as group and label for defining the structure of a page. Navigation is realized using the navigate element, which takes a link text and a page with parameters as arguments. Furthermore, page definitions can be reused by declaring them as template. Templates can be included in page definitions by supplying the associated parameters. In addition to presenting data objects, pages can also modify objects. For example, the content of a User entity can be modified with the editUser page. The page element input(u.username) declares an appropriate form input element based on the type of its argument; in this case a text field. A data modification is finalized by means of an action, which can apply further modifications to the objects involved. For example, in the save action the changes to the User are saved. The return statement of an action is used to realize page flow by specifying the page and its arguments where the browser should be directed after finishing the action.
3 Validation Abstractions

Data validation is required in multiple contexts in web applications. In this section we distinguish four variants, show how these are expressed in WebDSL using declarative data validation rules, and how error messages are integrated in the user interface.
entity User {
  username :: String (id)
  password :: Secret
  email    :: Email
}

extend entity User {
  username(validate(isUnique(), "Username is taken"))
  validate(password.length >= 8, "Password needs to be at least 8 characters")
  validate(/[a-z]/.find(password), "Password must contain a lower-case character")
  validate(/[A-Z]/.find(password), "Password must contain an upper-case character")
  validate(/[0-9]/.find(password), "Password must contain a digit")
}
Fig. 2. Data invariants for User entity validation
3.1 Value Well-Formedness

Value well-formedness checks verify that a provided input value conforms to the value type. In other words, the conversion of the input value from request parameter to an instance of the actual type must succeed. This type of validation is usually provided by libraries or frameworks. However, it has to be declared explicitly, and possibly at each input of a value of the type. In WebDSL, value well-formedness rules are checked automatically. WebDSL supports types specific for the web domain, including Email, URL, WikiText, and Image. Automatic value well-formedness constraints for all value types provide decent input validation by default. Moreover, these built-in type validation checks and messages can be customized in an application.
The editUser page in Fig. 1 consists of a form with labeled inputs for the User entity properties. The save action persists the changes to the database, provided that all validation checks succeed. (Changes to existing entities are automatically stored in WebDSL; new entities need to be saved explicitly using the save() method.) Since well-formedness validation checks are automatically applied to properties, the email property is validated against its well-formedness criteria. The result of entering an invalid email address is shown in the screenshot: a message is presented to the user and the action is not executed.

3.2 Data Invariants

Data invariants are constraints on the data model, i.e. restrictions on the properties of data model entities. These validation rules can check any type of property, such as a reference, a collection, or a value type. By declaring validation in the data model, the validation is reused for any input or operation on that data. In Ruby on Rails [16] data invariants can be defined in a 'validate' method of the active record class, which
entity UserGroup {
  name        :: String (id)
  owner       -> User
  moderators  -> Set<User>
  members     -> Set<User>
  memberLimit :: Int
}

extend entity UserGroup {
  validate(owner in moderators, "Owner must always be a moderator")
  validate(owner in members, "Owner must always be a member")
  validate(members.length <= memberLimit, "Exceeds member limit")
}

define page editUserGroup(ug:UserGroup) {
  form {
    group("User Group") {
      label("Name") { input(ug.name) }
      label("Member Limit") { input(ug.memberLimit) }
      label("Moderators") { input(ug.moderators) }
      label("Members") { input(ug.members) }
      action("Save", save())
    }
  }
  action save() { return userGroup(ug); }
}
Fig. 3. Data invariants for UserGroup entity validation
then gets called by the framework when validation is required. Multiple checks in a validation method tangle validation for different properties. The Seam [9] framework supports the specification of data invariants declaratively through annotations. However, these annotations consist of a limited number of built-in checks and an escape to specify a custom class that handles validation for a property. In the worst case each validation rule needs a separate class, incurring the syntactic overhead of Java class declarations several times. Validation rules in WebDSL are of the form validate(e,s) and consist of a Boolean expression e to be validated, and a String expression s to be displayed as error message. Any globally visible functions or data can be accessed as well as any of the properties and functions in scope of the validation rule context. Validation checks on the data model are performed when a property on which data validation is specified is changed and when the entity is saved or updated. Validation is connected to properties either by adding the validation in the property annotation or by referring to a property in the validation check. More specific validation checks are supported which are only checked when the entity is in a certain state, these are validatesave, which is checked when an entity is saved for the first time, validateupdate, checked on any update, and validatedelete, checked before deleting the entity. The validation mechanism takes care of correctly presenting validation errors originating from the data model. For form inputs causing data invariant violations the message is placed at the input being processed. When data model validation fails during the execution of an action, the error is shown at the corresponding button. Fig. 2 presents an extended User entity with several invariants and a password property. The username property has the id annotation, which indicates the property is
unique and can be used to identify this entity type. The isUnique member function (a generated function that takes into account the existence of an 'id' property) is called to verify this constraint. The password property is annotated with validation rules that express requirements for a stronger password. By declaring validation rules in the entity, explicit checks in the user interface can be avoided. Both the WebDSL page definition and the resulting web application page are shown below the entity definition.
Fig. 3 shows more advanced validation rules, which express dependencies between the properties of an entity. The UserGroup entity is extended with an owner reference, a moderators set, and a memberLimit value. The editUserGroup page allows the owner to edit some of the UserGroup properties. The validation rule on the moderators set expresses that the owner should always be in this set of moderators (similarly, the owner should always be a member). The member set is constrained in size based on the memberLimit value. Validation rules that cover multiple properties, such as the 'owner in moderators' check, are performed for all input components of the properties the validation is specified on. However, the checks can be added to a single property as well, in order to specialize the error message.

3.3 Input Assertions

Input assertions are necessary when the validation rule targets an input that is not directly connected to the persisted data model. These types of constraints are easy to address in the form environment itself. For example, a validation check in XForms [1] verifies properties of the entered form data. The model in XForms, on which validation is specified, is a model of the input data produced by the form. Unfortunately, such form validation solutions are not integrated with validation on the application data model. For example, an input for an entity produces the identifier as form data; in the XForms model it is just a String, but in the application data model it is an entity reference.
Validation checks in WebDSL pages have access to all variables in scope, including page variables and page arguments. The placement and order of validation rules do not influence the results of the checks. Errors resulting from validation in forms are visualized at the location of the validation declaration. Usually such a validation rule is connected to an input, which can be expressed by placing the validation rule as a child element of input. The example in Fig. 4 demonstrates the final addition to the user edit form, an extra password input field in which the user must repeat the entered password. This validation cannot be expressed as a data invariant, since the extra password field is not part of the User entity. Therefore, the rule is expressed in the form directly, where it has access to the page variable p. This variable contains the repeated password, whereas the first password entry is saved in the password field of User entity u. When entering a different value in the second field the validation error is presented, as can be seen in the screenshot.

3.4 Action Assertions

Action assertions are predicate checks at any point in the execution of actions and functions, for verification during the processing of inputs. The action processing needs to be
D.M. Groenewegen and E. Visser define page editUser(u:User) { var p: Secret; form { group("User") { label("Username") { input(u.username) } label("Email") { input(u.email) } label("New Password") { input(u.password) } label("Re-enter Password") { input(p) { validate(u.password == p, "Password does not match") } } action("Save", action{ } ) } } }
Fig. 4. Form validation with input assertions define page createGroup() { var ug := UserGroup {} form { group("User Group") { label("Name") { input(ug.name) } label("Owner") { input(ug.owner) } action("Save", save()) } } action save() { validate(email(newGroupNotify(ug)) ,"Owner could not be notified by email"); return userGroup(ug); } }
aborted, reverting any changes made, and the validation message has to be presented in the user interface. This type of validation is not directly supported in existing solutions, requiring an investment in finding appropriate hooks in the implementation. For example, Ruby on Rails [16] assumes validation is specified in data model classes; errors are passed through those model classes, and the form mechanism is built around that. There is no mechanism for a validation check as part of a controller action; this requires a low-level encoding that passes the check result and error message, or wrapping validation in a data model class. WebDSL supports this type of validation transparently using the same validation rules. The errors resulting from action assertion failures are displayed at the place the execution originated, e.g. above the submit button which triggered the erroneous action.
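As a loose analogy in plain Scala (not WebDSL, and not the generated code), an action assertion behaves like a check that either aborts with a message or lets the action complete:

// Invented helper names; Left carries the message shown at the submit button.
def assertAction(check: Boolean, message: String): Either[String, Unit] =
  if (check) Right(()) else Left(message)

def save(ownerNotified: Boolean): Either[String, String] =
  for {
    _ <- assertAction(ownerNotified, "Owner could not be notified by email")
  } yield "userGroup"   // page to redirect to on success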
Fig. 5 provides an example of an action assertion. On the right is a page definition for a createGroup page which allows creating new UserGroup entities. The constraint expressed in the save action is that creating a new group requires email notification to the specified owner (which might not be the user executing this operation). The newGroupNotify email definition retrieves an email address from its UserGroup argument (through ug.owner.email) and tries to send a notification email to the owner of the new group. When this fails, for instance because there is no mail server responding to the email address, the call returns false and the validation check produces the error. This result is shown on the left in the screenshot. Generic error handling, such as problems with a database commit, can also be expressed using action assertions. The web application can then display an error message in the form instead of redirecting to a default error page.

3.5 Messages

This section has described assertions that report erroneous behavior in actions. Related to such action assertions is a generic messaging mechanism for giving feedback about the correct execution of an action. This requires a place to show messages, for instance by adding a default message template at the top of each page. Furthermore, the message should be declared in the action code. An example of such messaging is shown in Fig. 6. The save action of the editUser page gives a message to the page redirected to, namely user. The result of the executed action is shown on the left.

3.6 Validation Mechanics

A page request in WebDSL is processed in the following five phases:
1. Convert request parameters: check value well-formedness validation rules for page arguments and input parameters, then convert these to the correct types.
2. Update model values: check data invariants for input data, and then insert it in data model entities.
3. Validate forms: check input assertions in page definitions.
4. Handle actions: perform the action, aborting if an action assertion fails (in that case no changes are made to the data model).
5. Render or redirect: show the page, including any produced validation errors. Redirect if an action executed successfully.
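Read as ordinary code, this lifecycle resembles the following Scala sketch (all names invented; WebDSL generates this machinery rather than exposing it to the developer):

case class Request(params: Map[String, String])
sealed trait Response
case class Render(errors: List[String]) extends Response
case class Redirect(page: String)       extends Response

def handle(req: Request,
           convert:      Request => List[String],           // 1. value well-formedness
           update:       Request => List[String],           // 2. data invariants
           validateForm: Request => List[String],           // 3. input assertions
           action:       Request => Either[String, String]  // 4. action assertions
          ): Response = {
  val errors = convert(req) ++ update(req) ++ validateForm(req)
  if (errors.nonEmpty) Render(errors)                       // 5. render with errors
  else action(req) match {
    case Left(message) => Render(List(message))             //    abort, keep data unchanged
    case Right(page)   => Redirect(page)                    //    redirect on success
  }
}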
4 Discussion

Web Modeling Tools. Several model-driven methodologies for creating web applications have been proposed in recent years, including OOHDM [18], SHDM [12], WebML [4], UWE [10], OOWS [15], and Hera [20]. WebDSL goes beyond being a methodology for designing web applications: it provides a path to an actual implementation by leveraging full code generation, so the transformation from problem space to solution space is completely automated. In this paragraph we discuss how these methodologies and their tools relate to WebDSL in general, and to data validation integration in particular. The Hera Presentation Generator [5] allows modeling forms to support editing data in the session; the persisted domain data of the application cannot be changed. Hera-S [19] also incorporates persisting form input data through update queries. The only example in the paper of such an update shows incrementing a view counter, a simple
operation that does not process form input data. Kraus et al. [11] present the generation of partial web applications from UWE models. An application skeleton is generated, including JSP pages and navigation between them. Forms and input data are not discussed, which probably means they are part of the custom code. HyperDe [14] is a tool that allows online creation of web applications designed with the SHDM method. The paper shows an example of an input field for a person's email address. This involves manual construction of data binding (showing the email and reading it from the submitted data) and does not indicate how validation of that input can be performed. WebRatio [2] is a tool for generating web applications based on the WebML method. The conceptual WebML models do not model data validation concerns, while WebRatio does have form validation features. These can be directly mapped to validation features in the underlying Struts [3] framework. Validation that goes beyond the form, such as querying the database, has to be implemented in a Struts validator class. This implementation requires intricate knowledge of the translation process and implementation platform. From our study of the literature we conclude that declarative modeling of data validation is ignored in model-driven web engineering. As a result, validation concerns require an escape from model to code, hampering full code generation and model expressivity.

Future Work. The current validation model focuses on verifying that the data satisfies a set of constraints. Actions that break these constraints are forbidden and result in an error message. An alternative approach would be to solve constraints automatically [8] and repair data so that it complies with the constraints, or to suggest such repairs to the user. Since most inputs in web application forms are strings, the expressivity of validation rules could be increased by incorporating a domain-specific language for string constraints. Scaffidi et al. [17] demonstrate that parsing technology can provide rich string input validation and feedback.
5 Conclusion

The domain-specific language engineering challenge for the web application domain [21] is to realize a concise, high-level, declarative language for the definition of web applications, in which the various concerns are supported by specialized yet linguistically integrated sub-languages, and from which implementations can be derived automatically. This paper presents a solution for the integration of data validation, a vital component of web applications, into a web application DSL that includes data models, user interfaces, and actions. This solution unifies syntax, mechanisms for error handling, and semantics for data validation checks covering value well-formedness, data invariants, input assertions, and action assertions. Our approach improves over current web modeling tools by providing declarative data validation rules from which a complete implementation is generated. Unlike web application frameworks, our solution supports different kinds of data validation uniformly. The integration of data validation rules into WebDSL, a web application DSL that supports data models, user interfaces, and actions, allows web application developers to take a truly model-driven approach to the design of web applications, concentrating on the logical design of an application rather than the accidental complexity of low-level implementation techniques.
References

1. Boyer, J.M. (ed.): XForms 1.0, 3rd edn. W3C Recommendation (2007)
2. Brambilla, M., Comai, S., Fraternali, P., Matera, M.: Designing web applications with WebML and WebRatio. In: Web Engineering: Modelling and Implementing Web Applications, pp. 221–260 (2007)
3. Brown, D., Davis, C., Stanlick, S. (eds.): Struts 2 in Action. Manning Publ. Co. (2008)
4. Ceri, S., Fraternali, P., Bongio, A.: Web Modeling Language (WebML): a modeling language for designing Web sites. Computer Networks 33(1-6), 137–157 (2000)
5. Frasincar, F., Houben, G., Barna, P.: HPG: the Hera Presentation Generator. Journal of Web Engineering 5(2), 175 (2006)
6. Groenewegen, D.M., Visser, E.: Declarative access control for WebDSL: Combining language integration and separation of concerns. In: Schwabe, D., Curbera, F. (eds.) International Conference on Web Engineering (ICWE 2008), July 2008, pp. 175–188 (2008)
7. Hemel, Z., Verhaaf, R., Visser, E.: WebWorkFlow: An object-oriented workflow modeling language for web applications. In: Czarnecki, K., Ober, I., Bruel, J.-M., Uhl, A., Völter, M. (eds.) MODELS 2008. LNCS, vol. 5301, pp. 113–127. Springer, Heidelberg (2008)
8. Järvi, J., Marcus, M., Parent, S., Freeman, J., Smith, J.N.: Property models: from incidental algorithms to reusable components. In: GPCE, pp. 89–98 (2008)
9. Kittoli, S. (ed.): Seam - Contextual Components. A Framework for Enterprise Java. Red Hat Middleware, LLC (2008)
10. Koch, N., Kraus, A., Hennicker, R.: The authoring process of the UML-based web engineering approach. In: Web-Oriented Software Technology (2001)
11. Kraus, A., Knapp, A., Koch, N.: Model-driven generation of web applications in UWE. In: Model-Driven Web Engineering (MDWE 2007), Como, Italy (July 2007)
12. Lima, F., Schwabe, D.: Application modeling for the semantic web. In: Latin American Web Congress (LA-WEB 2003), Washington, DC, USA, p. 93. IEEE Computer Society, Los Alamitos (2003)
13. MacDonald, M., Szpuszta, M.: Pro ASP.NET 3.5 in C# 2008. Apress (2007)
14. Nunes, D., Schwabe, D.: Rapid prototyping of web applications combining domain specific languages and model driven design. In: International Conference on Web Engineering (ICWE 2006), pp. 153–160 (2006)
15. Pastor, O., Fons, J., Pelechano, V.: OOWS: A method to develop web applications from web-oriented conceptual models. In: Web Oriented Software Technology (IWWOST 2003), pp. 65–70 (2003)
16. Ruby, S., Thomas, D., Heinemeier Hansson, D.: Agile Web Development with Rails, 3rd edn. Pragmatic Programmers (2009)
17. Scaffidi, C., Myers, B.A., Shaw, M.: Topes: reusable abstractions for validating data. In: ICSE 2008, pp. 1–10 (2008)
18. Schwabe, D., Rossi, G., Barbosa, S.: Systematic hypermedia application design with OOHDM. In: Proceedings of the Seventh ACM Conference on Hypertext, pp. 116–128. ACM, New York (1996)
19. van der Sluijs, K., Houben, G., Broekstra, J., Casteleyn, S.: Hera-S: web design using Sesame. In: International Conference on Web Engineering (ICWE 2006), pp. 337–344 (2006)
20. Vdovjak, R., Frasincar, F., Houben, G., Barna, P.: Engineering semantic web information systems in Hera. Journal of Web Engineering 2, 3–26 (2003)
21. Visser, E.: WebDSL: A case study in domain-specific language engineering. In: Lämmel, R., Visser, J., Saraiva, J. (eds.) Generative and Transformational Techniques in Software Engineering II. LNCS, vol. 5235, pp. 291–373. Springer, Heidelberg (2008)
22. Visser, E., et al.: WebDSL, 2007–2009, http://webdsl.org
Ontological Metamodeling with Explicit Instantiation Alfons Laarman and Ivan Kurtev Department of Computer Science, University of Twente, the Netherlands {a.w.laarman,kurtev}@ewi.utwente.nl
Abstract. Model Driven Engineering (MDE) is a promising paradigm for software development. It raises the level of abstraction in software development by treating models as primary artifacts. The definition of a metamodel is a recurring task in MDE and requires sound and formal support. The lack of such support causes deficiencies such as conceptual anomalies in the modeling languages. From a philosophical point of view, metamodels can be seen as metaconceptualizations. Metalanguages have to provide constructs for building ontological theories as a base for modeling languages. This paper describes a new metalanguage derived from the study of Formal Ontology. This metalanguage raises the level of abstraction of metamodels from pure abstract syntax to semantic descriptions based on ontologies. Thus, language developers can make conscious choices for their modeling concepts and can explicitly define important relations such as instantiation and generalization. With this metalanguage, we aim at a precise conceptual and formal foundation for metamodeling. Keywords: Metamodeling, ontologies, instantiation semantics.
Guizzardi shows that the ontological meaning of models based on formal ontology cannot be retained when these models are expressed in UML. The UML language has several anomalies that decrease the quality of models. Examples of such anomalies are construct overloading, construct redundancy, and construct incompleteness. The current metamodeling practice demonstrated by the metalanguages from the MOF family does not consider the ontological foundations of (meta-)modeling. Since MOF corresponds to the UML Infrastructure, any modeling language (domain-specific or general-purpose) can potentially suffer from the same anomalies found in UML. The described problems emerge for two reasons: lack of a clear understanding of the metamodeling activity regarding its ontological foundation, and lack of constructs in the current metalanguages to express the required information explicitly. We address these problems by proposing a view on the content of metamodels and a new metalanguage. In our approach, metamodels are lifted from pure abstract syntax definitions to expressions of metaconceptualizations based on a foundational ontology. We retain the structural definition of a language and enhance it with ontological meaning. The philosophical justification of our approach comes from Quine's statement that in every language an ontology can be found [19]. Thus, the metamodeling activity is a task that identifies and specifies the world structures that are of interest for solving a given problem. The metalanguage has to be capable of expressing such structures. We use a simple foundational ontology, the Four-category Ontology, to build a new metalanguage. We propose the Ontology Grounded Metalanguage (OGML) as an experimental language for studying the definition of metamodels based on ontological principles. In OGML, linguistic and ontological instantiations are treated uniformly from a technical perspective. Both are defined on the basis of the explicit instanceOf definition construct in OGML. The paper is organized as follows. Section 2 clarifies the meaning of the concepts used in the paper. Section 3 presents the Four-category Ontology and compares it with existing foundational ontologies. Section 4 describes OGML by examples. Section 5 discusses the main open issues and positions our approach within the existing work. Section 6 concludes the paper.
2 Conceptual Background

The title of this paper refers to terms that are interpreted in different ways in the literature: ontology (ontological), metamodel(-ing), and instantiation. We give a short background on these terms and state our understanding of them. A commonly accepted notion of a metamodel is that it is a model of the models expressed in a given language. Thus, a metamodel defines the constraints for all the admissible models expressed in the language. Often, the metamodel is regarded as a definition of the abstract syntax of the language. The term ontological metamodeling (and ontological instantiation) was introduced by Atkinson and Kühne in [2]. They distinguish between linguistic and ontological metamodeling. Fig. 1 illustrates the distinction between them in the context of the three-level MOF architecture.
Fig. 1. Linguistic and ontological instantiation
Linguistic metamodeling is used to define metamodels of languages. The instances of metamodels are models at M1 obtained by linguistic instantiation. Linguistic metamodeling defines the form that a statement (model) in a language may take. Linguistic instanceOf delimits metalevels (e.g. M1 and M2). Ontological metamodeling allows the type/instance relation to exist within a single metalevel. In Fig. 1 the object Lassie is an instance of the class Collie. The instanceOf relation is called ontological and it is concerned with the content that a statement (model) has by representing a particular domain. The ontological instanceOf partitions models into ontological levels (e.g. O1 and O2) within a single linguistic level. The linguistic instanceOf is defined by the metalanguage used to define metamodels (for example, MOF) and the ontological instanceOf is defined by a particular modeling language (for example, UML). Guizzardi [10, 8] studies the relation between metamodels and ontologies. He recognizes two distinct purposes of metamodels: as a definition of the abstract syntax and as a definition of the world view underlying the language. Assume that we would like to define a language that describes states of affairs in a given domain (Fig. 2).

(Fig. 2 contains the elements Domain Conceptualization, Domain Abstraction, A particular state of affairs, Domain Ontology, Language Metamodel, and Model, connected by "represented by" and instanceOf relations.)
Fig. 2. Domain conceptualization and metamodel
The middle column in Fig. 2 represents domain abstractions and a domain conceptualization. They are conceptual entities in the modeler's mind. In order to communicate them, we define a language to be used to specify models. In Fig. 2, we show the language metamodel. Guizzardi understands the term metamodel as a specification of the world view of the language, that is, a description of what a language can describe in terms of real-world phenomena. The capability of the language to express a certain domain is measured by comparing the elements of the metamodel to the elements of the representation of the domain conceptualization, called the domain ontology. Here the domain ontology is supposed to be the best possible representation of the
domain conceptualization. The smaller the gap between the domain ontology and the metamodel is the more precisely the models can represent the real world phenomenon in the domain. Unfortunately, current practice of metamodeling in MDE mostly treats metamodels as definitions of the abstract syntax. Metamodelers are not aware of the real world meaning of the language constructs. The result is decreased quality of the models due to anomalies in the modeling languages. Metalanguages such as MOF are not expressive enough to articulate the difference between various modeling constructs. Consider for example the model elements Collie and Lassie in Fig. 1. They are instances of MOF classes, that is, they are MOF objects. However, the real world meaning is rather different. Lassie represents an individual, a concrete collie. Collie represents the characteristics of all the dogs of this breed, that is, it captures the universal properties of the collies. In a MOF-like architecture, this difference is not expressible. Furthermore, the metatypes Class and Object classify types and individuals respectively, so they are different. Both are instances of MOF Class and consequently indistinguishable by the MOF-based tools. Finally, the definition of the ontological instanceOf in UML is just a MOF association and is treated as any other association in the UML metamodel. We aim at retaining ontological properties of the metamodels by treating them as representation of the language underlying world view. Therefore, metamodels become more than descriptions of the abstract syntax of a language. They are enriched with explicit knowledge of the ontological nature of their constructs. When we talk about explicit instantiation, we mean that a metamodeling language provides us with a firstclass construct for defining ontological instantiations according to the understanding of Kühne.
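To make the distinction concrete, the following Java sketch (our illustration only, with hypothetical names; it is not MOF, UML, or OGML code) records the linguistic metaclass and the ontological type of a model element separately, which is precisely the information a plain MOF-style representation does not carry.

// Illustrative only: without the ontologicalType link, Collie and Lassie
// both look like generic elements to a MOF-based tool.
import java.util.Optional;

class ModelElement {
    final String name;
    final String linguisticMetaclass;             // e.g. "Class", "Object" (instanceOf in the metalanguage)
    final Optional<ModelElement> ontologicalType; // e.g. Lassie -> Collie (instanceOf in the language)

    ModelElement(String name, String linguisticMetaclass, ModelElement ontologicalType) {
        this.name = name;
        this.linguisticMetaclass = linguisticMetaclass;
        this.ontologicalType = Optional.ofNullable(ontologicalType);
    }

    public static void main(String[] args) {
        ModelElement collie = new ModelElement("Collie", "Class", null);    // a universal
        ModelElement lassie = new ModelElement("Lassie", "Object", collie); // an individual exemplifying it
        System.out.println(lassie.name + " is ontologically an instance of "
                + lassie.ontologicalType.map(t -> t.name).orElse("nothing"));
    }
}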
3 Approach

In MDE, metamodels are expressed in a language called a metalanguage. Current metalanguages are mainly object-oriented for pragmatic reasons such as familiarity to developers and tool support. If we perceive a metamodel as something more than a structural definition, then we need to study the requirements for a suitable metalanguage. Consider the upper layer in Fig. 2. The domain ontology is an artifact expressed in a language. What is the domain conceptualization of this language? What is the "ideal" ontology that captures this conceptualization? According to Guizzardi, we can apply the pattern in Fig. 2 by treating domain conceptualizations as a domain of study. The result of the application is shown in Fig. 3. The set of various domain-specific conceptualizations is conceptualized in a domain-independent metaconceptualization. The representation of this metaconceptualization as an ontology is called a Foundational Ontology. It is derived from the study of Formal Ontology. Several authors provide concrete versions of Fig. 3. Wand [20] uses the Bunge-Wand-Weber (BWW) ontology as a foundational ontology and UML as a language for expressing domain models. Guizzardi performs a similar study on UML by using the Unified Foundational Ontology (UFO) as a foundational ontology. Both approaches study the ontological correctness of the UML metamodel.
Fig. 3. Ontologies and metaconceptualization
We aim at formulating language metamodels by using a vocabulary derived from a foundational ontology. In this way, the constructs of metamodels become instances of the most fundamental and domain-independent ontological categories. For example, UML Class and ER Entity are classified as constructs that are used to represent classifiers (or universals). Although they belong to different languages, they have a similar ontological nature. In this way, metamodels carry additional ontological information that can be used to align and compare metamodels with each other as well as with a given foundational ontology. This approach of treating metamodels as representations of metaconceptualizations leads to the following interpretation of the metalevels:
• M1: models that represent reality. They are expressed in a modeling language;
• M2: metamodels of modeling languages that represent the real-world view embodied in the language;
• M3: a metametamodel of a metalanguage. The metalanguage is used to express various world views. It is derived from a metaconceptualization, which in turn is derived from a foundational ontology.
To proceed with this approach we need to select a Foundational Ontology. We examined several existing foundational ontologies: UFO, DOLCE [5], and BWW. We used the following criteria for selecting one:
• The ontology should be simple;
• Its constructs should be familiar to developers;
• The ontology should allow expressing the metamodels of the major existing programming, data description, and modeling languages, both general-purpose and domain-specific.
Considering these requirements, we opt for a descriptive, minimalistic ontology, as in the approaches of Guizzardi and of Wand et al. Also, because our work can be considered an initial experiment in applying formal ontology theory to metamodeling, we chose a small foundational ontology called the Four-category Ontology (FCO). For the sake of minimality, we did not include the refined concepts of universals such as sortal, role, category, etc. found in UFO. In FCO, the basic distinction is between individuals and universals as the most fundamental entities of being. Figure 4 depicts the concepts in this ontology. Individuals are classified into substantial and moment individuals. A substantial individual, or just substance, is something that can exist by itself without depending on the existence of other individuals. In programming and modeling languages, substantial individuals are usually represented as objects (e.g. Java objects and UML objects).
Fig. 4. The Four-Category Ontology
Moments are individuals that exist in other individuals. Moments cannot exist standalone; they are existentially dependent on at least one individual (called the bearer). The relation between a moment and its bearer(s) is called the inherence relation. Moments may inhere in more than one individual. In programming and modeling languages, moments appear under various names: slot and link in UML, field in Java, etc. Universals are entities that can be instantiated in individuals. The individuals that exemplify a universal have something in common. For example, things that consist of matter have a mass; in this case mass is a universal. Universals are classified into substantial and moment universals. Substantial universals are exemplified by substantial individuals and moment universals are exemplified by moment individuals. The instantiation relation is the relation between an individual and a universal. Universals have their representatives in existing computer languages. UML classes correspond to substantial universals. UML attributes and associations correspond to moment universals.
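For readers more comfortable with code than with ontological vocabulary, the following Java sketch (our illustration; the class names are ours and are not FCO or OGML terminology) renders the four categories and the inherence and instantiation relations as plain data structures.

// A sketch of the four FCO categories with inherence and instantiation as references.
import java.util.ArrayList;
import java.util.List;

abstract class Universal { final String name; Universal(String name) { this.name = name; } }

class SubstantialUniversal extends Universal {        // e.g. the UML class "Crocodile"
    SubstantialUniversal(String name) { super(name); }
}

class MomentUniversal extends Universal {             // e.g. a UML attribute or association
    final List<SubstantialUniversal> characterizes = new ArrayList<>(); // characterization relation
    MomentUniversal(String name) { super(name); }
}

abstract class Individual {                           // every individual instantiates a universal
    final Universal instanceOf;
    Individual(Universal u) { this.instanceOf = u; }
}

class SubstantialIndividual extends Individual {      // e.g. a UML object or a Java object
    SubstantialIndividual(SubstantialUniversal u) { super(u); }
}

class MomentIndividual extends Individual {           // e.g. a UML slot or link, a Java field value
    final List<SubstantialIndividual> inheresIn = new ArrayList<>(); // inherence: at least one bearer
    MomentIndividual(MomentUniversal u) { super(u); }
}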
4 Ontology Grounded Metalanguage

OGML is our experimental metalanguage based on FCO. It helps language developers make conscious choices for their modeling concepts and enforces the definition of important relations such as instantiation and generalization. In the current section, we introduce OGML by defining the metamodel of a tiny subset of UML, called SimpleUML. The metamodel of the language is shown in Fig. 5 (left part) together with an example model (right part) and instanceOf relations. The upper part represents class diagrams and the lower part object diagrams. A metamodel expressed in OGML consists of definitions. Definitions describe how a particular language conceptualizes the world by defining the structure of universals and individuals. In addition to this, a metamodel may explicitly define the instantiation and generalization relations of the language. UML classes, for example the class Crocodile, are substantial universals from an ontological point of view. We instantiate the OGML construct SubstantialDefinition to express that the element Class in the UML metamodel defines the structure of substantial universals (lines 1-2 of the listing below). Classes have attributes, which are in turn moment universals, expressed as instances of the OGML construct MomentDefinition (lines 4-9). The relation between a moment definition and the substantial definition(s) it characterizes is called the characterization relation. The definition of Attribute states the fact that concrete attributes are attached to a single class. Since the characterization relation connects two constructs, it has two roles: the universalDefinitionRole and the momentDefinitionRole. To define UML Association,
Fig. 5. The example language SimpleUML
we could instantiate MomentDefinition with two characterization relations. This expresses the fact that instances of associations (called links in the context of UML) are moments that inhere in two individuals. It should be noted that OGML allows a moment definition to characterize another moment definition. This ultimately allows a moment to inhere in another moment, which is a major difference from the BWW ontology, where properties do not have properties. The definition of how UML represents individuals follows a similar structure. Substantial individuals are defined by instantiating ObjectDefinition (lines 11-12) and moments are defined by a PropertyDefinition (lines 14-17). The fact that UML slots inhere in UML objects is expressed by the dependsOn clause.

1.  SubstantialDefinition Class {
2.  }
3.
4.  MomentDefinition Attribute {
5.    attribution universalDefinition = "Class"
6.      universalDefinitionRole = "owner"
7.      momentDefinitionRole = "attributes"
8.      multiplicity = 1-*;
9.  }
10.
11. ObjectDefinition Object {
12. }
13.
14. PropertyDefinition Slot {
15.   value : String;
16.   dependsOn Object role = "slots" multiplicity = *;
17. }
An important construct in OGML is the definition of instanceOf relations. In the terminology of Fig. 1, an OGML metamodel defines the instantiation relation as a first-class construct. Let us consider the definition of the UML instanceOf. We need to express the facts that (a) classes are instantiated to objects and attributes to slots, and (b) an individual can be queried for the values of its moments, and these values obey certain constraints. The concrete syntax is illustrated in the following listing. Line 2 states that every class is instantiated to an object; in this case, substantial universals are instantiated to substantial individuals.
1.  Relations UMLInstanceOfAssociationsOnLinks {
2.    c : Class -> o : Object {
3.    }
4.    a : Attribute -> s : Slot {
5.      attribution {
6.        naming name <- a.name;
7.        valuing [a.lowerbound .. a.upperbound] s.value;
8.        typing a.type;
9.      }
10.   }
11. }
OGML allows substantial universals to be instantiated to other universals, thus achieving multilevel ontological metamodeling in the sense of Fig. 1. Line 4 states that attribute moment universals are instantiated to slots. If a UML object has a set of slots, then the object may be queried by using the name of a slot, which is obtained as the name of the defining attribute (line 6). The value of the slot is stored in its value property (line 7). Lines 7 and 8 also specify multiplicity and typing constraints. Querying the value of a moment is based on the concept of an attribute function used in the BWW ontology. For each moment, at least one attribute function is defined. In our example, slots are unary moments and only one attribute function is needed. If a moment inheres in more than one individual, then an attribute function is defined per characterization relation. Note that line 5 explicitly names the characterization to which the attribute function is assigned (attribution). OGML explicitly defines its own instanceOf relation following the same idea illustrated in the SimpleUML example. Hence, from the perspective of the tools using models, there is no technical difference between linguistic and ontological instantiation. We built a tool [18] that allows expressing OGML metamodels and conforming models in a concrete syntax. Given two models that are related either by linguistic or by ontological instanceOf, together with the metamodel of their language, the tool is capable of checking the conformance between the models using a single algorithm. The tool provides full OCL support, with an extension for dealing with multiple classifications of a given model element (for example, the crocodile Jena is an instance of Object from the point of view of OGML and an instance of Crocodile from the UML point of view).
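The following Java sketch is our illustration of this idea, not the OGML tool [18] itself; the rule representation and all names are assumptions. It indicates how a single conformance-checking routine can be parameterized by a reified instanceOf definition, so that the same multiplicity and typing checks serve whichever instantiation relation, linguistic or ontological, a metamodel supplies.

// One checking algorithm, driven by whatever instanceOf definition is supplied.
import java.util.List;

record AttributionRule(String momentName, int lower, int upper, String typeName) {}

record InstanceOfDefinition(List<AttributionRule> rules) {}

interface Element {
    String typeNameOfMoment(String momentName);   // type of the stored value(s)
    int countOfMoment(String momentName);         // how many values are present
}

class ConformanceChecker {
    static boolean conforms(Element instance, InstanceOfDefinition def, List<String> errors) {
        boolean ok = true;
        for (AttributionRule r : def.rules()) {
            int n = instance.countOfMoment(r.momentName());
            if (n < r.lower() || n > r.upper()) {                          // multiplicity constraint
                errors.add(r.momentName() + ": expected " + r.lower() + ".." + r.upper()
                        + " values, found " + n);
                ok = false;
            }
            if (n > 0 && !r.typeName().equals(instance.typeNameOfMoment(r.momentName()))) {
                errors.add(r.momentName() + ": wrong type");               // typing constraint
                ok = false;
            }
        }
        return ok;
    }
}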
5 Discussion and Related Work

The design of OGML raises multiple questions. The first question concerns the choice of a foundational ontology. We opted for FCO due to its simplicity and the observation that its constructs are usually represented in some form in many computer languages. However, in the current version of OGML it is not possible to treat primitive data types such as integers, booleans, etc. properly. They are equated with substantial individuals, which is ontologically debatable. OGML needs to be extended with constructs for defining abstract entities, for example mathematical structures. The second question is how to incorporate a full-fledged foundational ontology such as UFO. One possibility is to extend OGML. This would result in a large metametamodel with many constructs needed for conceptual modeling only. Another possibility is to define a foundational ontology as a metamodel. In any case, committing to a certain foundational ontology as a theoretical base for OGML poses an immediate limitation: all the models in the modeling space become more or less
aligned with the world view of one ontology. However, there may be other foundational ontologies that are perfectly viable alternatives. The third question is how OGML relates to existing self-reflective metametamodels. OGML is defined as a self-reflective metametamodel [17]. This definition poses interesting challenges that deserve a separate paper and is intentionally omitted here due to lack of space. We claim that, technically, the linguistic and ontological instantiations are the same, at least because they are all expressed by a single OGML construct. On the other hand, the work by Kühne [14, 15] and Gasevic [6] indicates the opposite. We have to state clearly that we do not claim conceptual equivalence between the two types of instantiation. In [14, 15, 6] they are distinguished mainly on the basis of the nature of the represented systems, by the so-called represents or μ relation. In our work, we do not represent this relation; hence this difference is not apparent. Furthermore, our understanding of instanceOf is a shortcut similar to the conformantTo relation used by Bezivin, Favre, and Gasevic. When we say that an object o is an instance of class C according to a certain definition of the instanceOf relation, we mean the following: o is a member of the extension of C, where membership is checked on the basis of the semantics of the OGML instanceOf definition construct (encoded in the tool) and the intensional representation of C. On the other hand, the intensional representation of C, perceived simply as an expression in a given language, may be a member of the extension of another class. Clearly, Gasevic made this same distinction. It should be noted that the difference between the ontological and linguistic instantiations and the nature of a metamodel are still debatable [12]. A language with at least three levels of ontological instantiation may allow representation of MOF, MOF metamodels, and MOF models in a single level. Then, the linguistic instantiation in the context of MOF becomes ontological. Thus, these two concepts appear to be relative. It is beyond the scope of this paper (and the space does not permit) to discuss this issue further. Atkinson and Kühne [1] propose an approach for multilevel metamodeling in which a modeling construct is assigned a potency that indicates how many times it can be instantiated. Although this seems reasonable from a technical point of view, there is no guidance for the modeler on how to assign the potency value. We believe that considering the ontological nature of a modeling construct is a clearer way to reason about the instantiations.
6 Conclusions

In this paper, we proposed a view on metamodeling that treats metamodels as specifications of the world view embodied in a modeling language. This view is regarded as a metaconceptualization and is expressed in a metalanguage called OGML, built upon a foundational ontology. As such, metamodels are more than just a definition of the abstract syntax of a language. In addition, we provide a construct for the explicit definition of the instantiation relation of a modeling language, and it is applied to OGML itself. This enables support for ontological metamodeling based on formal ontology theory and a uniform treatment of linguistic and ontological instantiation in modeling tools. We envision at least two promising applications of this approach: interoperability in the line of [3] and enhancing the set of transformation scenarios in MDE as
described in [16]. These two applications, together with a proper formalization of OGML, are the main directions for future research.
References

1. Atkinson, C., Kühne, T.: The Essence of Multilevel Metamodeling. In: Gogolla, M., Kobryn, C. (eds.) UML 2001. LNCS, vol. 2185, pp. 19–33. Springer, Heidelberg (2001)
2. Atkinson, C., Kühne, T.: Model-driven development: a metamodeling foundation. IEEE Software 20(5), 36–41 (2003)
3. Atzeni, P., Cappellari, P., Torlone, R., Bernstein, P.A., Gianforme, G.: Model-independent schema translation. VLDB J. 17(6), 1347–1370 (2008)
4. Degen, W., Heller, B., Herre, H., Smith, B.: GOL: toward an axiomatized upper-level ontology. In: FOIS 2001, pp. 34–46 (2001)
5. Gangemi, A., Guarino, N., Masolo, C., Oltramari, A.: Sweetening WORDNET with DOLCE. AI Magazine 24(3), 13–24 (2003)
6. Gasevic, D., Kaviani, N., Hatala, M.: On Metamodeling in Megamodels. In: Engels, G., Opdyke, B., Schmidt, D.C., Weil, F. (eds.) MODELS 2007. LNCS, vol. 4735, pp. 91–105. Springer, Heidelberg (2007)
7. Guarino, N., Welty, C.A.: A Formal Ontology of Properties. In: Dieng, R., Corby, O. (eds.) EKAW 2000. LNCS (LNAI), vol. 1937, pp. 97–112. Springer, Heidelberg (2000)
8. Guizzardi, G.: Ontological Foundations for Structural Conceptual Models. PhD thesis, University of Twente (2005) ISBN 90-75176-81-3
9. Guizzardi, G., Ferreira Pires, L., van Sinderen, M.: An Ontology-Based Approach for Evaluating the Domain Appropriateness and Comprehensibility Appropriateness of Modeling Languages. In: Briand, L.C., Williams, C. (eds.) MoDELS 2005. LNCS, vol. 3713, pp. 691–705. Springer, Heidelberg (2005)
10. Guizzardi, G.: On Ontology, ontologies, Conceptualizations, Modeling Languages, and (Meta)Models. In: DB&IS 2006, pp. 18–39 (2006)
11. Heller, B., Herre, H.: Ontological Categories in GOL. Axiomathes 14, 71–90 (2004)
12. Hesse, W.: More matters on (meta-)modelling: remarks on Thomas Kühne's "matters". Software and System Modeling 5(4), 387–394 (2006)
13. Jouault, F., Bézivin, J.: KM3: a DSL for Metamodel Specification. In: Gorrieri, R., Wehrheim, H. (eds.) FMOODS 2006. LNCS, vol. 4037, pp. 171–185. Springer, Heidelberg (2006)
14. Kühne, T.: Matters of (Meta-)Modeling. Software and System Modeling 5(4), 369–385 (2006)
15. Kühne, T.: Clarifying matters of (meta-)modeling: an author's reply. Software and System Modeling 5(4), 395–401 (2006)
16. Kurtev, I., van den Berg, K.: MISTRAL: A Language for Model Transformations in the MOF Meta-modeling Architecture. In: MDAFA 2004, pp. 139–158 (2004)
17. Laarman, A.W.: An Ontology Based Metalanguage with Explicit Instantiation. Master's thesis, University of Twente (2009)
18. OGML website, http://wwwhome.cs.utwente.nl/~laarman/ogml/ (retrieved 15-9-2009)
19. Quine, W.V.O.: Ontological Relativity and Other Essays. Columbia University Press, New York (1969)
20. Wand, Y., Storey, V., Weber, R.: An Ontological Analysis of the Relationship Construct in Conceptual Modeling. ACM Trans. DB Syst. 24(4), 494–528 (1999)
Verifiable Parse Table Composition for Deterministic Parsing August Schwerdfeger and Eric Van Wyk Department of Computer Science and Engineering University of Minnesota, Minneapolis, MN {schwerdf,evw}@cs.umn.edu
Abstract. One obstacle to the implementation of modular extensions to programming languages lies in the problem of parsing extended languages. Specifically, the parse tables at the heart of traditional LALR(1) parsers are so monolithic and tightly constructed that, in the general case, it is impossible to extend them without regenerating them from the source grammar. Current extensible frameworks employ a variety of solutions, ranging from a full regeneration to using pluggable binary modules for each different extension. But recompilation is time-consuming, while the pluggable modules in many cases either cannot support the addition of more than one extension or rely on backtracking or non-deterministic parsing techniques. We present here a middle-ground approach that allows an extension, if it meets certain restrictions, to be compiled into a parse table fragment. The host language parse table and fragments from multiple extensions can then always be efficiently composed to produce a conflict-free parse table for the extended language. This allows for the distribution of deterministic parsers for extensible languages in a pre-compiled format, eliminating the need for the "source code" grammar to be distributed. In practice, we have found these restrictions to be reasonable, admitting many useful language extensions.
1 Introduction
In parsing programming languages, the usual practice is to generate a single parser for the language to be parsed. A well-known and often-used approach is LR parsing [1], which relies on a process, sometimes referred to as grammar compilation, to generate a monolithic parse table representing the grammar being parsed. The LR algorithm is a generic parsing algorithm that uses this table to drive the parsing task. However, there are cases in which it is desirable to generate different portions of a parser separately and then put them together without any further monolithic analysis. An example can be found in the case of extensible programming languages, wherein a host language such as C or Java is composed with several extensions, each possibly written by a different party. The
This work was partially funded by the National Science Foundation grants #0347860 and #0429640.
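For readers unfamiliar with table-driven parsing, the following Java sketch (a textbook-style illustration, not the authors' implementation; the table representation and all names are our assumptions) shows the generic LR driver loop that consults the action and goto tables produced by grammar compilation. The point is that all language-specific knowledge lives in the tables, which is why composing tables, rather than regenerating them, is attractive.

// Generic LR driver: the algorithm is fixed; the tables encode the grammar.
import java.util.ArrayDeque;
import java.util.Deque;

class LrDriver {
    enum Kind { SHIFT, REDUCE, ACCEPT, ERROR }
    record Action(Kind kind, int target, int lhs, int rhsLength) {}

    interface Tables {
        Action action(int state, int terminal);  // shift/reduce/accept/error entry
        int goTo(int state, int nonterminal);    // goto entry consulted after a reduction
    }

    static boolean parse(Tables t, int[] input) {    // input: terminal codes, last one is end-of-input
        Deque<Integer> states = new ArrayDeque<>();
        states.push(0);                              // start state
        int pos = 0;
        while (true) {
            Action a = t.action(states.peek(), input[pos]);
            switch (a.kind()) {
                case SHIFT -> { states.push(a.target()); pos++; }
                case REDUCE -> {                     // pop the handle, then goto on the lhs nonterminal
                    for (int i = 0; i < a.rhsLength(); i++) states.pop();
                    states.push(t.goTo(states.peek(), a.lhs()));
                }
                case ACCEPT -> { return true; }
                case ERROR -> { return false; }
            }
        }
    }
}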
connection tripdb with table trip_log ;
class TripLogData {
  boolean examine_trips ( ) {
    rs = using tripdb query {
      SELECT dist, time FROM trips WHERE time > 600 } ;
    boolean res = false ;
    foreach (int dist, int time) in rs {
Unit