Performance and Dependability in Service Computing: Concepts, Techniques and Research Directions

Valeria Cardellini, Università di Roma “Tor Vergata,” Italy
Emiliano Casalicchio, Università di Roma “Tor Vergata,” Italy
Kalinka Regina Lucas Jaquie Castelo Branco, Universidade de São Paulo, Brazil
Júlio Cezar Estrella, Universidade de São Paulo, Brazil
Francisco José Monaco, Universidade de São Paulo, Brazil
Senior Editorial Director: Kristin Klinger
Director of Book Publications: Julia Mosemann
Editorial Director: Lindsay Johnston
Acquisitions Editor: Erika Carter
Development Editor: Mike Killian
Production Editor: Sean Woznicki
Typesetters: Jennifer Romanchak, Mike Brehm
Print Coordinator: Jamie Snavely
Cover Design: Nick Newcomer
Published in the United States of America by Information Science Reference (an imprint of IGI Global) 701 E. Chocolate Avenue Hershey PA 17033 Tel: 717-533-8845 Fax: 717-533-8661 E-mail:
[email protected] Web site: http://www.igi-global.com Copyright © 2012 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Performance and dependability in service computing: concepts, techniques and research directions / Valeria Cardellini ... [et al.], editors.
p. cm.
Includes bibliographical references and index.
Summary: “This book focuses on performance and dependability issues associated with service computing and these two complementary aspects, which include concerns of quality of service (QoS), real-time constraints, security, reliability and other important requirements when it comes to integrating services into real-world business processes and critical applications”--Provided by publisher.
ISBN 978-1-60960-794-4 (hbk.) -- ISBN 978-1-60960-795-1 (ebook) -- ISBN 978-1-60960-796-8 (print & perpetual access)
1. Service-oriented architecture (Computer science) 2. Web services. 3. Management information systems. I. Cardellini, Valeria, 1973-
TK5105.5828.P47 2012
004.6’54--dc23
2011017832
British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library. All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.
Editorial Advisory Board Salima Benbernou, Université Paris Descartes, France Gerson G.H. Cavalheiro, Universidade Federal de Pelotas, Brazil Chi-Hung Chi, Tsinghua University, China Bruno Ciciani, Università di Roma “La Sapienza”, Italy Michele Colajanni, Università di Modena e Reggio Emilia, Italy Erol Gelenbe, Imperial College London, UK Ricardo Jiménez-Peris, Universidad Politécnica de Madrid, Spain Antonio A.F. Loureiro, Universidade Federal de Minas Gerais, Brazil Junichi Suzuki, University of Massachusetts Boston, USA Ramin Yahyapour, Technische Universität Dortmund, Germany
List of Reviewers Mauro Andreolini, University of Modena and Reggio Emilia, Italy Danilo Ardagna, Politecnico di Milano, Italy Ivona Brandic, Vienna University of Technology, Austria Giuliano Casale, Imperial College, UK Gerson G. H. Cavalheiro, Universidade Federal de Pelotas, Brazil Bruno Ciciani, University of Roma “La Sapienza”, Italy Michele Colajanni, University of Modena and Reggio Emilia, Italy Marco Cova, University of California Santa Barbara, USA Márcio Eduardo Delamaro, Universidade de São Paulo, Brazil Vittoria De Nitto Personè, University of Roma “Tor Vergata”, Italy Simone do Rocio Senger de Souza, Universidade de São Paulo, Brazil Salvatore Distefano, University of Messina, Italy Abdelkarim Erradi, Qatar University, Qatar Luis Ferreira Pires, University of Twente, The Netherlands Mauricio Figueiredo, FUCAPI, Brazil Lorenz Froihofer, Vienna University of Technology, Austria Saurabh Garg, University of Melbourne, Australia Alfredo Goldman vel Lejbman, Universidade de São Paulo, Brazil Vincenzo Grassi, University of Roma “Tor Vergata”, Italy
Stefano Iannucci, University of Roma “Tor Vergata”, Italy Ricardo Jiménez-Peris, Universidad Politécnica de Madrid, Spain Marek Kowalkiewicz, SAP Research Brisbane, Australia Riccardo Lancellotti, University of Modena and Reggio Emilia, Italy Yinsheng Li, Fudan University, China Francesco Lo Presti, University of Roma “Tor Vergata”, Italy Huiye Ma, Eindhoven University of Technology, The Netherlands Michael Menzel, Hasso Plattner Institute for Software Systems Engineering, Germany Raffaela Mirandola, Politecnico di Milano, Italy Jose Monteiro, Universidade de São Paulo, Brazil Regina Lúcia de Oliveira Moraes, Universidade Estadual de Campinas, Brazil Paolo Romano, INESC-ID, Lisbon, Portugal Florian Rosenberg, CSIRO, Australia Romain Rouvoy, INRIA, France Jacques Sauvé, Universidade Federal do Rio Grande do Sul, Brazil Valerio Senni, INRIA, France Francisco Silva, Universidade Federal do Maranhão, Brazil Luca Silvestri, University of Roma “Tor Vergata”, Italy Edmundo Sérgio Spoto, Universidade Federal de Goiás, Brazil Mário Antônio Meireles Teixeira, Universidade Federal do Maranhão, Brazil Taisy Silva Weber, Universidade Federal do Rio Grande do Sul, Brazil Xiaohui Zhao, Eindhoven University of Technology, The Netherlands
Table of Contents
Preface................................................................................................................................................. xvii Acknowledgment................................................................................................................................ xxii Section 1 Foundations Chapter 1 Service Level Agreement (SLA) in Utility Computing Systems............................................................. 1 Linlin Wu, The University of Melbourne, Australia Rajkumar Buyya, The University of Melbourne, Australia Chapter 2 SLA-Aware Enterprise Service Computing........................................................................................... 26 Longji Tang, University of Texas at Dallas, USA Jing Dong, University of Texas at Dallas, USA Yajing Zhao, University of Texas at Dallas, USA Chapter 3 Dependability Modeling........................................................................................................................ 53 Paulo R. M. Maciel, Federal University of Pernambuco, Brazil Kishor S. Trivedi, Duke University, USA Rivalino Matias Jr., Federal University of Uberlândia, Brazil Dong Seong Kim, Duke University, USA Chapter 4 Trends and Research Issues in SOA Validation..................................................................................... 98 Antonia Bertolino, Consiglio Nazionale delle Ricerche, Italy Guglielmo De Angelis, Consiglio Nazionale delle Ricerche, Italy Antonino Sabetta, Consiglio Nazionale delle Ricerche, Italy Andrea Polini, University of Camerino, Italy
Chapter 5 Service-Oriented Collaborative Business Processes............................................................................ 116 Lai Xu, Bournemouth University, UK Paul de Vrieze, Bournemouth University, UK Athman Bouguettaya, CSIRO ICT Centre, Australia Peng Liang, Wuhan University, China Keith Phalp, Bournemouth University, UK Sherry Jeary, Bournemouth University, UK Section 2 Performance Chapter 6 Performance Management of Composite Applications in Service Oriented Architectures................. 134 Vinod K. Dubey, Booz Allen Hamilton, USA Daniel A. Menascé, George Mason University, USA Chapter 7 High-Quality Business Processes Based on Multi-Dimensional QoS................................................. 152 Qianhui Liang, HP Labs, Singapore Michael Parkin, Tilburg University, The Netherlands Chapter 8 A Game Theoretic Solution for the Optimal Selection of Services..................................................... 172 Salah Merad, Office for National Statistics, UK Rogério de Lemos, University of Kent, UK Tom Anderson, Newcastle University, UK Chapter 9 A Tool Chain for Constructing QoS-Aware Web Services.................................................................. 189 Bernhard Hollunder, Furtwangen University of Applied Sciences, Germany Ahmed Al-Moayed, Furtwangen University of Applied Sciences, Germany Alexander Wahl, Furtwangen University of Applied Sciences, Germany Chapter 10 Performance, Availability and Cost of Self-Adaptive Internet Services.............................................. 212 Jean Arnaud, INRIA – Grenoble, France Sara Bouchenak, University of Grenoble & INRIA, France
Section 3 Dependability Chapter 11 Performability Evaluation of Web-Based Services.............................................................................. 243 Magnos Martinello, Federal University of Espírito Santo (UFES), Brazil Mohamed Kaâniche, CNRS; LAAS & Université de Toulouse, France Karama Kanoun, CNRS; LAAS & Université de Toulouse, France Chapter 12 Measuring and Dealing with the Uncertainty of the SOA Solutions................................................... 265 Yuhui Chen, University of Oxford, UK Anatoliy Gorbenko, National Aerospace University, Ukraine Vyacheslav Kharchenko, National Aerospace University, Ukraine Alexander Romanovsky, Newcastle University, UK Chapter 13 Achieving Dependable Composite Services Through Two-Level Redundancy.................................. 295 Hailong Sun, Beihang University, China Jin Zeng, China Software Testing Center, China Huipeng Guo, Beihang University, China Xudong Liu, Beihang University, China Jinpeng Huai, Beihang University, China Chapter 14 Building Web Services with Time Requirements................................................................................ 317 Nuno Laranjeiro, University of Coimbra, Portugal Marco Vieira, University of Coimbra, Portugal Henrique Madeira, University of Coimbra, Portugal Chapter 15 Dependability and Security on Wireless Self-Organized Networks: Properties, Requirements, Approaches and Future Directions....................................................................................................... 340 Michele Nogueira, Universidade Federal do Paraná, Brazil Aldri Santos, Universidade Federal do Paraná, Brazil Guy Pujolle, Sorbonne Universités, France
Section 4 Security Chapter 16 Engineering Secure Web Services....................................................................................................... 360 Douglas Rodrigues, Universidade de São Paulo, Brazil Julio Cezar Estrella, Universidade de São Paulo, Brazil Francisco José Monaco, Universidade de São Paulo, Brazil Kalinka Regina Lucas Jaquie Castelo Branco, Universidade de São Paulo, Brazil Nuno Antunes, Universidade de Coimbra, Portugal Marco Vieira, Universidade de Coimbra, Portugal Chapter 17 Approaches to Functional, Structural and Security SOA Testing........................................................ 381 Cesare Bartolini, Consiglio Nazionale delle Ricerche, Italy Antonia Bertolino, Consiglio Nazionale delle Ricerche, Italy Francesca Lonetti, Consiglio Nazionale delle Ricerche, Italy Eda Marchetti, Consiglio Nazionale delle Ricerche, Italy Chapter 18 Detecting Vulnerabilities in Web Services: Can Developers Rely on Existing Tools?....................... 402 Nuno Antunes, University of Coimbra, Portugal Marco Vieira, University of Coimbra, Portugal Compilation of References ............................................................................................................... 427 About the Contributors .................................................................................................................... 459 Index.................................................................................................................................................... 472
Detailed Table of Contents
Preface................................................................................................................................................. xvii Acknowledgment................................................................................................................................ xxii Section 1 Foundations Chapter 1 Service Level Agreement (SLA) in Utility Computing Systems............................................................. 1 Linlin Wu, The University of Melbourne, Australia Rajkumar Buyya, The University of Melbourne, Australia In recent years, extensive research has been conducted in the area of Service Level Agreement (SLA) for utility computing systems. An SLA is a formal contract used to guarantee that consumers’ service quality expectations can be achieved. In utility computing systems, the level of customer satisfaction is crucial, making SLAs significantly important in these environments. A fundamental issue is the management of SLAs, including SLA autonomy management and trade-offs among multiple Quality of Service (QoS) parameters. Many SLA languages and frameworks have been developed as solutions; however, there is no overall classification for these extensive works. Therefore, the aim of this chapter is to present a comprehensive survey of how SLAs are created, managed and used in utility computing environments. The authors discuss existing use cases from Grid and Cloud computing systems to identify the level of SLA realization in state-of-the-art systems and emerging challenges for future research. Chapter 2 SLA-Aware Enterprise Service Computing........................................................................................... 26 Longji Tang, University of Texas at Dallas, USA Jing Dong, University of Texas at Dallas, USA Yajing Zhao, University of Texas at Dallas, USA There is a growing trend towards enterprise system integration across organizational and enterprise boundaries on the global Internet platform. The Enterprise Service Computing (ESC) has been adopted by more and more corporations to meet the growing demand from businesses and the global economy. However, the ESC as a new distributed computing paradigm poses many challenges and issues of
quality of services. For example, how is the ESC compliant with quality of service (QoS) requirements? How do service providers guarantee services which meet service consumers’ needs as well as wants? How do both service consumers and service providers agree on QoS at runtime? In this chapter, SLA-Aware enterprise service computing is first introduced as a solution to the challenges and issues of ESC. Then, SLA-Aware ESC is defined as new architectural styles which include SLA-Aware Enterprise Service-Oriented Architecture (ESOA-SLA) and SLA-Aware Enterprise Cloud Service Architecture (ECSA-SLA). In addition, the enterprise architectural styles are specified through our extended ESOA and ECSA models. The ECSA-SLA styles include SLA-Aware cloud services, SLA-Aware cloud service consumers, SLA-Aware cloud SOA infrastructure, SLA-Aware cloud SOA management, SLA-Aware cloud SOA process and SLA-Aware SOA quality attributes. The main advantages of viewing and defining SLA-Aware ESC as an architectural style are (1) abstracting the common structure, constraints and behaviors of a family of ESC systems, such as ECSA-SLA style systems, and (2) defining general design principles for the family of enterprise architectures. The design principles of ECSA-SLA systems are proposed based on the model of ECSA-SLA. Finally, the authors discuss the challenges of SLA-Aware ESC and suggest that autonomic service computing, automated service computing, adaptive service computing, real-time SOA, and event-driven architecture can help to address the challenges. Chapter 3 Dependability Modeling........................................................................................................................ 53 Paulo Romero Martins Maciel, Universidade Federal de Pernambuco, Brazil Kishor S. Trivedi, Duke University, USA Rivalino Matias Jr., Federal University of Uberlândia, Brazil Dong Seong Kim, Duke University, USA This chapter presents modeling methods and evaluation techniques for computing dependability metrics of systems. The chapter begins by providing a summary of seminal works. After presenting the background, the most prominent model types are presented, together with the respective methods for computing exact values and bounds. This chapter focuses particularly on combinatorial models, although state space models such as Markov models and hierarchical models are also presented. Case studies are then presented at the end of the chapter. Chapter 4 Trends and Research Issues in SOA Validation..................................................................................... 98 Antonia Bertolino, Consiglio Nazionale delle Ricerche, Italy Guglielmo De Angelis, Consiglio Nazionale delle Ricerche, Italy Antonino Sabetta, Consiglio Nazionale delle Ricerche, Italy Andrea Polini, University of Camerino, Italy Service Oriented Architecture (SOA) is changing the way in which software applications are designed, deployed and maintained. A service-oriented application consists of the runtime composition of autonomous services that are typically owned and controlled by different organizations. This decentralization impacts on the dependability of applications that consist of dynamic service agglomerates, and challenges their validation. Different techniques can be used or combined for the verification of dependability aspects, spanning from traditional off-line testing approaches to monitoring and on-line testing. In this chapter
the authors discuss issues and opportunities of SOA validation, identify three different stages for validation along the service life-cycle model, and overview some proposed research approaches and tools. The emphasis is on on-line testing, which to us is the most peculiar stage in the SOA validation process. Finally, the authors claim that on-line testing is only possible within an agreed governance framework. Chapter 5 Service-Oriented Collaborative Business Processes............................................................................ 116 Lai Xu, Bournemouth University, UK Paul de Vrieze, Bournemouth University, UK Athman Bouguettaya, CSIRO ICT Centre, Australia Peng Liang, Wuhan University, China Keith Phalp, Bournemouth University, UK Sherry Jeary, Bournemouth University, UK The ability to rapidly find potential business partners as well as rapidly set up a collaborative business process is desirable in the face of market turbulence. Traditional linking of business processes has a large ad hoc character. Implementing service-oriented business process mashup in an appropriate way will give the collaborative business process more flexibility, adaptability and agility. In this chapter, the authors describe a new landscape for supporting collaborative business processes. The different solutions and tools for collaborative business process applications are presented. A new approach for supporting situational collaborative business processes, process-oriented mashup, is introduced. The authors highlight the security and scalability challenges of process-oriented mashups. Further, benefits of using process-oriented mashup are discussed. Section 2 Performance Chapter 6 Performance Management of Composite Applications in Service Oriented Architectures................. 134 Vinod K. Dubey, Booz Allen Hamilton, USA Daniel A. Menascé, George Mason University, USA The use of Service Oriented Architectures (SOA) enables the existence of a market of service providers delivering functionally equivalent services at different Quality of Service (QoS) and cost levels. The QoS of composite applications can typically be described in terms of metrics such as response time, availability, and throughput of the services that compose the application. A global utility function of the various QoS metrics is the objective function used to determine a near-optimal selection of service providers that support the composite application. This chapter describes the architecture of a QoS Broker that manages the performance of composite applications. The broker continually monitors the utility of the applications and triggers a new service selection when the utility falls below a pre-established threshold or when a service provider fails. A proof-of-concept prototype of the QoS broker demonstrates how it maintains the average utility of the composite application above the threshold in spite of service provider failures and performance degradation.
Chapter 7 High-Quality Business Processes based on Multi-Dimensional QoS.................................................. 152 Qianhui Liang, HP Labs, Singapore Michael Parkin, Tilburg University, The Netherlands An important area of services research gathering momentum is the ability to take a generic business process and instantiate it by selecting services that meet both the functional and non-functional requirements of the process owner. These non-functional or quality-of-service (QoS) requirements may describe essential performance and dependability requirements and apply across different logical layers of the application, from business-related details to system infrastructure; i.e., they are cross-cutting and considered multidimensional. Configuring an abstract business process with the “best” services to meet the process owner’s multidimensional end-to-end QoS requirements is a challenging task, as there may be many services that match the functional requirements but provide differentiated QoS characteristics. In this chapter the authors explore an approach to discover services, differentiated by their QoS attributes, to configure an abstract business process by selecting an optimal configuration of the “best” QoS combinations. The approach considered takes into account the optimal choice of multi-dimensional QoS variables. The authors present and compare two solutions based on heuristic algorithms to illustrate how this approach would work practically. Chapter 8 A Game Theoretic Solution for the Optimal Selection of Services..................................................... 172 Salah Merad, Office for National Statistics, UK Rogério de Lemos, University of Kent, UK Tom Anderson, Newcastle University, UK This chapter considers the problem of optimally selecting services during run-time with respect to their non-functional attributes and costs. Commercial pressures for reducing the cost of managing complex software systems are changing the way in which systems are designed and built. The reason behind this shift is the need for dealing with changes efficiently and effectively, which may include removing the human operator from the process of decision-making. In service-oriented computing, in particular, the run-time selection and integration of services may soon become a reality since services are readily available. Assuming that each component service has a specific functional and non-functional profile, the challenge now is to define a decision maker that is able to select services that satisfy the system requirements and optimise the quality of services under cost constraints. The approach presented in this chapter describes a game theoretic solution by formulating the problem as a bargaining game. Chapter 9 A Tool Chain for Constructing QoS-aware Web Services................................................................... 189 Bernhard Hollunder, Furtwangen University of Applied Sciences, Germany Ahmed Al-Moayed, Furtwangen University of Applied Sciences, Germany Alexander Wahl, Furtwangen University of Applied Sciences, Germany Web services play a dominant role in service computing and in realizing service-oriented architectures (SOA), which define the architectural foundation for various kinds of distributed applications. In many business domains, Web services must exhibit quality attributes such as robustness, security, dependability,
performance, scalability and accounting. As a consequence, there is a high demand to develop, deploy and consume Web services equipped with well-defined quality of service (QoS) attributes – so-called QoS-aware Web services. Currently, there is only limited development support for the creation of QoS-aware Web services, though. In this work the authors present a tool chain that facilitates development, deployment and testing of QoS-aware Web services. The tool chain has the following features: i) integration of standard components such as widely used IDEs, ii) usage of standards and specifications, and iii) support for various application servers and Web services infrastructures. Chapter 10 Performance, Availability and Cost of Self-Adaptive Internet Services.............................................. 212 Jean Arnaud, INRIA – Grenoble, France Sara Bouchenak, University of Grenoble & INRIA, France Although distributed services provide a means for supporting scalable Internet applications, their ad-hoc provisioning and configuration pose a difficult tradeoff between service performance and availability. This is made harder as Internet service workloads tend to be heterogeneous, and vary over time in the amount of concurrent clients and in the mixture of client interactions. This chapter presents an approach for building self-adaptive Internet services through utility-aware capacity planning and provisioning. First, an analytic model is presented to predict Internet service performance, availability and cost. Second, a utility function is defined and a utility-aware capacity planning method is proposed to calculate the optimal service configuration which guarantees SLA performance and availability objectives while minimizing functioning costs. Third, an adaptive control method is proposed to automatically apply the optimal configuration to the Internet service. Finally, the proposed model, capacity planning and control methods are implemented and applied to an online bookstore. The experiments show that the service successfully self-adapts to both workload mix and workload amount variations, and presents significant benefits in terms of performance and availability, while saving resources underlying the Internet service. Section 3 Dependability Chapter 11 Performability Evaluation of Web-Based Services.............................................................................. 243 Magnos Martinello, Federal University of Espírito Santo (UFES), Brazil Mohamed Kaâniche, CNRS; LAAS & Université de Toulouse, France Karama Kanoun, CNRS; LAAS & Université de Toulouse, France The joint evaluation of performance and dependability in a unified approach leads to the notion of performability, which usually combines different analytical modeling formalisms (Markov chains, queueing models, etc.) for assessing systems behaviors in the presence of faults. This chapter presents a systematic modeling approach allowing designers of web-based services to evaluate the performability of the service provided to the users. We have developed a multi-level modeling framework for analyzing the user-perceived performability. Multiple sources of service unavailability are taken into account, particularly i) hardware and software failures affecting the servers, and ii) performance degradation due to, e.g., overload of servers and probability of loss. The main concepts and the feasibility of the proposed
framework are illustrated using a web-based travel agency. Various analytical models and sensitivity studies are presented considering different assumptions with respect to user profiles, architecture, faults, recovery strategies, and traffic characteristics. Chapter 12 Measuring and Dealing with the Uncertainty of the SOA Solutions................................................... 265 Yuhui Chen, University of Oxford, UK Anatoliy Gorbenko, National Aerospace University, Ukraine Vyacheslav Kharchenko, National Aerospace University, Ukraine Alexander Romanovsky, Newcastle University, UK The chapter investigates the uncertainty of Web Services performance and the instability of their communication medium (the Internet), and shows the influence of these two factors on the overall dependability of SOA. The authors present their practical experience in benchmarking and measuring the behaviour of a number of existing Web Services used in e-science and bio-informatics, provide the results of statistical data analysis and discuss the probability distribution of delays contributing to the Web Services response time. The ratio between delay standard deviation and its average value is introduced to measure the performance uncertainty of a Web Service. Finally, the authors present the results of error and fault injection into Web Services. The authors summarise their experiments with SOA-specific exception handling features provided by two web service development kits and analyse exception propagation and performance as the major factors affecting fault tolerance (in particular, error handling and fault diagnosis) in Web Services. Chapter 13 Achieving Dependable Composite Services Through Two-Level Redundancy.................................. 295 Hailong Sun, Beihang University, China Jin Zeng, China Software Testing Center, China Huipeng Guo, Beihang University, China Xudong Liu, Beihang University, China Jinpeng Huai, Beihang University, China Service composition is a widely accepted method to build service-oriented applications. However, due to the uncertainty of infrastructure environments, service performance and user requests, service composition faces a great challenge to guarantee the dependability of the corresponding composite services. In this chapter, the authors provide an insightful analysis of the dependability issue of composite services, and they present a solution based on two-level redundancy: component service redundancy and structural redundancy. With component service redundancy, the authors study how to determine the number of backup services and how to guarantee consistent dependability of a composite service. In addition, structural redundancy aims at further improving dependability at the business process level through setting up backup execution paths. Chapter 14 Building Web Services with Time Requirements................................................................................ 317 Nuno Laranjeiro, University of Coimbra, Portugal Marco Vieira, University of Coimbra, Portugal Henrique Madeira, University of Coimbra, Portugal
Developing web services with timing requirements is a difficult task, as existing technology does not provide standard mechanisms to support real-time execution, or even to detect and predict timing violations. However, in business-critical environments, an operation that does not conclude on time may be completely useless, and may result in service abandonment, reputation, or monetary losses. This chapter presents a framework that allows deploying web services with temporal failure detection and prediction capabilities. Detection is based on timing restrictions defined at execution time, and historical data is used for failure prediction according to prediction modules. Additional modules can be added to the framework to provide more advanced failure detection and prediction capabilities. The framework enables providers to easily develop and deploy time-aware web services, with the failure detection code decoupled from the application logic, and allows consumers to express their timeliness requirements. Chapter 15 Dependability and Security on Wireless Self-Organized Networks: Properties, Requirements, Approaches and Future Directions....................................................................................................... 340 Michele Nogueira, Universidade Federal do Paraná, Brazil Aldri Santos, Universidade Federal do Paraná, Brazil Guy Pujolle, Sorbonne Universités, France Wireless communication technologies have been improving every day, increasing the dependence of people on distributed systems. Such dependence increases the necessity of guaranteeing dependable and secure services, particularly for applications related to commercial, financial and medical domains. However, in the wireless self-organized network context, simultaneously providing reliability and security is a demanding task due to the network characteristics. This chapter provides an overview of survivability concepts, reviews security threats in wireless self-organized networks (WSONs) and describes existing solutions for survivable service computing in the wireless network context. Finally, this chapter presents conclusions and future directions. Section 4 Security Chapter 16 Engineering Secure Web Services....................................................................................................... 360 Douglas Rodrigues, Universidade de São Paulo, Brazil Julio Cezar Estrella, Universidade de São Paulo, Brazil Francisco José Monaco, Universidade de São Paulo, Brazil Kalinka Regina Lucas Jaquie Castelo Branco, Universidade de São Paulo, Brazil Nuno Antunes, Universidade de Coimbra, Portugal Marco Vieira, Universidade de Coimbra, Portugal Web services are key components in the implementation of Service Oriented Architectures (SOA), which must satisfy proper security requirements in order to be able to support critical business processes. Research works show that a large number of web services are deployed with significant security flaws, ranging from code vulnerabilities to the incorrect use of security standards and protocols. This chapter discusses state-of-the-art techniques and tools for the deployment of secure web services, including standards and
protocols for the deployment of secure services, and security assessment approaches. The chapter also discusses how relevant security aspects can be correlated into practical engineering approaches. Chapter 17 Approaches to Functional, Structural and Security SOA Testing........................................................ 381 Cesare Bartolini, Consiglio Nazionale delle Ricerche, Italy Antonia Bertolino, Consiglio Nazionale delle Ricerche, Italy Francesca Lonetti, Consiglio Nazionale delle Ricerche, Italy Eda Marchetti, Consiglio Nazionale delle Ricerche, Italy In this chapter, the authors provide an overview of recently proposed approaches and tools for functional and structural testing of SOA services. Typically, these two classes of approaches have been considered separately. However, since they focus on different perspectives, they are generally non-conflicting and could be used in a complementary way. Accordingly, the authors make an attempt at such a combination, briefly showing the approach and some preliminary results of the experimentation. The combined approach provides encouraging results from the point of view of the achievements and the degree of automation obtained. A very important concern in designing and developing web services is security. In the chapter the authors also discuss the security testing challenges and the currently proposed solutions. Chapter 18 Detecting Vulnerabilities in Web Services: Can Developers Rely on Existing Tools?........................ 402 Nuno Antunes, University of Coimbra, Portugal Marco Vieira, University of Coimbra, Portugal Although web services are becoming business-critical components, they are often deployed with software bugs that can be maliciously exploited. Numerous developers are not specialized in security, and the common time-to-market constraints limit in-depth testing for vulnerabilities. In this context, vulnerability detection tools play a very important role in helping developers to produce less vulnerable code. However, developers usually select a tool to use and rely on its results without knowing its real effectiveness. This chapter presents two case studies on the effectiveness of several well-known vulnerability detection tools and discusses their strengths and limitations. Based on lessons learned, the chapter also proposes a benchmarking technique that can be used to select the tool that best fits a specific scenario. The main goal is to provide web service developers with information on how much they can rely on widely used vulnerability detection tools and on how to select the most adequate tool. Compilation of References ............................................................................................................... 427 About the Contributors .................................................................................................................... 459 Index.................................................................................................................................................... 472
Preface
Service-oriented computing has emerged as a computing paradigm capable of changing the way systems are designed, architected, deployed, and used. It decomposes computation into a set of loosely-coupled and possibly cooperating services that can be arranged and integrated in a new manner to create flexible and dynamic applications. When deployed as infrastructural components of business processes, service-oriented systems require that performance and dependability issues be properly addressed. While recent developments in service computing have come a long way in many aspects, ranging from semantics and ontologies to frameworks and design processes, performance and dependability issues remain a demanding research field. Since the activities of our daily lives are increasingly dependent on service-oriented systems, the analysis and assessment of the Quality of Service (QoS) delivered by these systems, in terms of their performance, dependability, and security, is crucially important. Furthermore, with the spread of cloud computing as a service delivery platform and the envisioned “Internet of Services”, where everything that is needed to use software applications is available as a service on the Internet, performance and dependability issues are expected to become even more critical. In these scenarios, a number of factors, including the nondeterministic dynamics of the network environment, the diversity of resources and requirements, and the loose coupling architectural principle of service-oriented computing, pose various challenges to the methodologies and techniques that aim at meeting both functional and non-functional requirements of service-oriented systems, and demand a joint effort from diverse but related research communities. The book, entitled “Performance and Dependability in Service Computing: Concepts, Techniques and Research Directions”, is conceived under this perspective and offers some state-of-the-art contributions covering concepts, principles, methodologies, modeling, engineering approaches, applications, and recent technological developments that can be adopted to improve the performance and dependability of service-oriented systems. The book builds on academic and industrial research efforts that have been carried out at many different institutions around the world. In addition, the book identifies potential research directions that can drive future innovations. We expect the book to serve as a valuable and useful reference for researchers, practitioners, and graduate-level students. It deals with performance and dependability issues in service computing that are certainly of interest to practitioners in understanding practical problems that may be faced during the management of real-world service-oriented systems. At the same time, the book is of interest to researchers, as it contains a set of contributions that were consciously selected for their novelty and for their research value. In addition, the book contains some fundamental methods, concepts, and principles that can be valuable for graduate-level students who wish to learn and spot the opportunities for their studies in this emerging research and development area. Researchers and students can also find useful references for further study.
ORGANIZATION OF THE BOOK

All the contributions have been reviewed, edited, processed and placed in the appropriate order to maintain consistency so that any reader would get the most out of the book. The organization we describe below ensures the smooth flow of material, as successive chapters build on previous ones. However, each chapter is self-contained to provide the greatest reading flexibility. The book is organized in four sections. Section 1, namely Foundations, introduces the concepts of Service Level Agreement (SLA) in utility computing and enterprise service computing, the theoretical foundation of dependability modeling, and the concepts and research issues in SOA validation as well as in collaborative business processes. Section 2, namely Performance, encompasses research works presenting issues and solutions related to the modeling and performance-oriented design of service-oriented systems and, more generally, Internet services. This section of the book is mainly related to QoS aspects. Section 3, namely Dependability, embraces chapters discussing approaches for the modeling, evaluation, and enforcement of dependability and performability (that is, the integration of performance and dependability in a unified approach) in service-oriented systems. The third section concludes with a chapter discussing dependability and security issues in pervasive service-oriented systems. Section 4, namely Security, is devoted to research works on security engineering, testing, and vulnerability detection in service-oriented systems. The topics discussed in the book are the following:

• SLA definition, management and use
• SLA models for SOA and Cloud
• QoS-aware SOA and Web services
• Models and methodologies for service composition and selection
• Models and methodologies for dependability evaluation of service-oriented systems
• Capacity planning in service-oriented systems
• Performance and dependability issues for pervasive service-oriented computing
• Self-adaptive SOA and Internet services
• Performance, dependability, and security assessment of service-oriented systems
• Methodologies and tools for SOA monitoring, testing, and validation
• Security in service-oriented systems and Web services
• Real-time issues in service-oriented computing
• Engineering of service-oriented systems
Section 1: Foundations

In Chapter 1, Wu and Buyya present a comprehensive survey of how SLAs are created, managed and used in utility computing environments. The chapter introduces the foundation of SLA and utility computing architecture. The authors discuss existing use cases from Grid and Cloud computing systems to identify the level of SLA realization in state-of-the-art systems and the emerging challenges for future research. In Chapter 2, Tang et al. propose SLA-aware Enterprise Service Computing (ESC) as a solution to the challenges and issues of ESC. SLA-aware ESC is defined as a new architectural style and it is specified through the extended enterprise SOA and enterprise cloud service architecture models previously proposed by the authors. The chapter also discusses the challenges of SLA-aware ESC and
suggests that autonomic, automated, and adaptive service computing, as well as real-time SOA and event-driven architectures, can help to address the identified challenges. In Chapter 3, Maciel et al. present modeling methods and evaluation techniques for computing dependability metrics of systems. The chapter provides an extensive summary of seminal works; the most prominent model types are presented, as well as the respective methods for computing exact values and bounds. Moreover, the chapter is enriched with case studies related to the dependability evaluation of multiprocessor subsystems and virtualized subsystems, which are the basic architectural elements of platforms running service-oriented systems. In Chapter 4, Bertolino et al. discuss issues and opportunities of SOA validation, identify three different stages for validation along the service life-cycle model, and provide an overview of related research approaches and tools. The emphasis is on on-line testing, which is the most peculiar stage in the SOA validation process and turns out to be possible only within an agreed governance framework. In Chapter 5, Xu et al. describe a new landscape for supporting collaborative business processes. The authors present different solutions and tools for collaborative business process applications, as well as propose a new approach for supporting situational collaborative business processes. The chapter also discusses the benefits of using process-oriented mashups and highlights their security and scalability challenges.
Section 2: Performance

In Chapter 6, Dubey and Menascé describe the architecture of a QoS broker that manages the performance of composite applications. The proposed solution relies on continuous monitoring of the utility of the applications and threshold-based triggering of a new service selection when the utility falls below a pre-established threshold or when a service provider fails. A proof-of-concept prototype of the QoS broker demonstrates how it maintains the average utility of the composite application above the threshold in spite of service provider failures and performance degradation. In Chapter 7, Liang and Parkin explore an approach to discover services, differentiated by their QoS attributes, and to configure an abstract business process by selecting an optimal configuration of the “best” QoS combinations. The proposed approach takes into account the optimal choice of multi-dimensional QoS attributes. The authors present and compare two solutions based on heuristic algorithms to illustrate how this approach would work practically. In Chapter 8, Merad et al. consider the problem of optimally selecting services at run-time with respect to their non-functional attributes and costs. This problem, which has been widely investigated in the literature, is addressed by the authors in a new way through a game theoretic solution. After having introduced some useful background on game theory and bargaining games, the authors present their problem formulation as a bargaining game. In Chapter 9, Hollunder et al. present a tool chain that facilitates development, deployment, and testing of QoS-aware Web services. The tool chain features the integration of standard components such as widely used IDEs, the usage of standards and specifications, and support for various application servers and Web services infrastructures. Moreover, the authors exemplify the usage of the tool chain by means of three examples: robustness with respect to erroneous input, application-specific QoS attributes, and accounting. In Chapter 10, Arnaud and Bouchenak present an approach for building self-adaptive Internet services. Their solution is based on an analytic model to predict Internet service performance, availability and
cost, and on a utility-aware capacity planning method to calculate the optimal service configuration that guarantees SLA objectives while minimizing functioning costs. An adaptive control method is proposed to automatically apply the optimal configuration to Internet services. The chapter is enriched with an extensive experimental evaluation of the proposed solution applied to an online bookstore.
Section 3: Dependability

In Chapter 11, Martinello et al. present a systematic modeling approach that allows the designers of a web-based service to evaluate its performability. The authors develop a multi-level modeling framework to analyze the user-perceived performability. The main concepts and the feasibility of the proposed framework are illustrated using a web-based travel agency. Various analytical models and sensitivity studies are presented, considering different assumptions with respect to user profiles, architecture, faults, recovery strategies, and traffic characteristics. In Chapter 12, Chen et al. investigate the uncertainty of Web services performance and the instability of their communication medium and show the influence of these two factors on the overall dependability of SOA. Benchmarking and measuring the behavior of a number of existing Web services used in e-science and bio-informatics, the authors provide the results of statistical data analysis and characterize the distribution of the delays that contribute to the Web services response time. The chapter also introduces a new metric to measure the performance uncertainty of a Web service and presents experimental results of error and fault injection into Web services. In Chapter 13, Sun et al. provide an insightful analysis of the dependability issue of composite services. The authors present a solution based on two-level redundancy: component service redundancy and structural redundancy. Component service redundancy is used to determine the number of backup services and to guarantee consistent dependability of a composite service, while structural redundancy aims at further improving the dependability at the business process level by setting up backup execution paths. In Chapter 14, Laranjeiro et al. present a framework to deploy Web services with temporal failure detection and prediction capabilities. The failure detection is based on timing restrictions defined at execution time, and historical data are used for failure prediction. The framework enables providers to easily develop and deploy time-aware Web services, with the failure detection code decoupled from the application logic, and allows consumers to express their timeliness requirements. In Chapter 15, Nogueira et al. present a general discussion about survivability in wireless self-organized networks (WSONs), its concepts and properties. The authors emphasize open issues and survivability requirements for WSONs and their effects on the network characteristics. Further, this work surveys the main solutions that have applied survivability concepts to WSONs, such as network management architectures, routing protocols and key management systems.
Section 4: Security

In Chapter 16, Rodrigues et al. discuss state-of-the-art techniques and tools for the deployment of secure web services, including standards and protocols for the deployment of secure services, and security assessment approaches. The chapter also discusses how relevant security aspects can be correlated into practical engineering approaches. In Chapter 17, Bartolini et al. provide an overview of recently proposed approaches and tools for functional and structural testing of SOA services. Although these two classes of approaches have been
considered separately, they focus on different perspectives and are thus generally non-conflicting and could be used in a complementary way. Therefore, the authors propose their combination, briefly showing the approach and some preliminary results of the experimentation. In addition, the authors also discuss security testing challenges and the proposed solutions to address them. Finally, in the last chapter of the book, Antunes and Vieira present two case studies on the effectiveness of several well-known vulnerability detection tools and discuss their strengths and limitations. Based on the lessons learned, the chapter also proposes a benchmarking technique that can be used to select the tool that best fits a specific scenario. The main goal of the authors is to provide Web service developers with information on how much they can rely on widely used vulnerability detection tools and on how to select the most adequate tool. Besides the main organization introduced above, we also provide alternative reading paths for readers interested in more specific topics. The reader interested in the modeling of performance and dependability in service-oriented computing can directly go through Chapters 3, 6, 8, 10, 11, and 13. Engineering aspects of performance and dependability are mainly discussed in Chapters 4, 5, 9, 12, 14, 15, 16, 17 and 18. Finally, the reader interested in QoS- and SLA-specific matters can leaf through Chapters 1, 2, 6, 7 and 9. We hope that this book will serve as a useful text for graduate students and a valuable reference for researchers and practitioners that address performance and dependability issues in service computing.

Valeria Cardellini
Università di Roma “Tor Vergata,” Italy

Emiliano Casalicchio
Università di Roma “Tor Vergata,” Italy

Kalinka Regina Lucas Jaquie Castelo Branco
Universidade de São Paulo, Brazil

Júlio Cezar Estrella
Universidade de São Paulo, Brazil

Francisco José Monaco
Universidade de São Paulo, Brazil
Acknowledgment
First of all, we thank all the authors who submitted their work to the book. We are very grateful to the members of the editorial advisory board and the reviewers who agreed to help us in the book preparation. They helped us to select the book chapters and provided numerous comments that significantly influenced their final form. The book also came to light thanks to the indirect involvement of many researchers, developers, and industry practitioners. Therefore, we thank the contributing authors, research institutions, and companies whose papers, reports, articles, notes, and Web sites have been referred to in this book. We offer our special appreciation to IGI Global and its editorial assistant, Michael Killian, for helping us during the book development. Acknowledgments are also due to the research funding institutions that supported the activities involved in this process, particularly FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) under grants no. 09/06355-0 and 06/55207-5, and the Italian PRIN project D-ASAP. The support of INCT-SEC under grants CNPq 573963/2008-8 and FAPESP 08/57870-926 is also acknowledged. Finally, we would like to thank our institutions (Università di Roma “Tor Vergata,” Italy, and Universidade de São Paulo, Brazil) for having supported our work, which has been carried out in the framework of a cooperation agreement between our two universities.

Valeria Cardellini
Università di Roma “Tor Vergata,” Italy

Emiliano Casalicchio
Università di Roma “Tor Vergata,” Italy

Kalinka Regina Lucas Jaquie Castelo Branco
Universidade de São Paulo, Brazil

Júlio Cezar Estrella
Universidade de São Paulo, Brazil

Francisco José Monaco
Universidade de São Paulo, Brazil
Section 1
Foundations
Chapter 1
Service Level Agreement (SLA) in Utility Computing Systems

Linlin Wu, The University of Melbourne, Australia
Rajkumar Buyya, The University of Melbourne, Australia
ABSTRACT

In recent years, extensive research has been conducted in the area of Service Level Agreement (SLA) for utility computing systems. An SLA is a formal contract used to guarantee that consumers’ service quality expectations can be achieved. In utility computing systems, the level of customer satisfaction is crucial, making SLAs significantly important in these environments. A fundamental issue is the management of SLAs, including SLA autonomy management and trade-offs among multiple Quality of Service (QoS) parameters. Many SLA languages and frameworks have been developed as solutions; however, there is no overall classification for these extensive works. Therefore, the aim of this chapter is to present a comprehensive survey of how SLAs are created, managed and used in utility computing environments. We discuss existing use cases from Grid and Cloud computing systems to identify the level of SLA realization in state-of-the-art systems and emerging challenges for future research.
INTRODUCTION

Utility computing (Yeo and Buyya 2006) delivers subscription-oriented computing services on demand, similar to other utilities such as water, electricity, gas, and telephony. With this new service model, users no longer have to invest
heavily in or maintain their own computing infrastructures, and they are not constrained to any specific computing service provider. Instead, they can outsource jobs to service providers and just pay for what they use. Utility computing has been increasingly adopted in many fields including science, engineering, and business (Youseff et al. 2008). Grid, Cloud, and Service-oriented computing are some of the paradigms that have
Copyright © 2012, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Service Level Agreement (SLA) in Utility Computing Systems
Figure 1. A typical architectural view of utility computing system
made delivery of computing as a utility. In these computing systems, different Quality of Service (QoS) parameters have to be guaranteed to satisfy user’s request. A Service Level Agreement (SLA) is used as a formal contract between service provider and consumer to ensure service quality (Buco et. al. 2004). Figure 1 shows typical utility computing system architecture: User/Broker, SLA Management, Service Request Examiner, and Resource/Service Provider. User or Broker submits its requests via applications to the utility computing system, which includes bottom three layers. Service Request Examiner is responsible for Admission Control. SLA Management layer manages Resource Allocation. Resource or Service Provider offers resources or services. In the above architecture, SLAs are used to identify parties who engage in the electronic business, computation, and outsourcing processes and to specify the minimum expectations and obligations that exist between parties (Buco et. al. 2004). The most concise SLA includes both general and
technical specifications, including business parties, pricing policy, and properties of the resources required to process the service (Yeo et al. 2006). According to Sun Microsystems Internet Data Center Group’s report (2002), a good SLA sets boundaries and expectations of service provisioning and provides the following benefits:

• Enhanced customer satisfaction level: A clearly and concisely defined SLA increases the customer satisfaction level, as it helps providers to focus on the customer requirements and ensures that effort is put in the right direction.
• Improved service quality: Each item in an SLA corresponds to a Key Performance Indicator (KPI) that specifies the customer service within an internal organisation.
• Improved relationship between two parties: A clear SLA indicates the reward and penalty policies of a service provision. The consumer can monitor services according to the Service Level Objectives (SLOs) specified in the SLA. Moreover, the precise contract helps parties to resolve conflicts more easily.

A clearly defined lifecycle is essential for the effective realisation of an SLA. Ron et al. (2001) define the SLA lifecycle in three high-level phases: the creation phase, the operation phase, and the removal phase. Sun Microsystems Internet Data Center Group (2002) defines a practical SLA lifecycle in six steps: ‘discover service providers’, ‘define SLA’, ‘establish agreement’, ‘monitor SLA violation’, ‘terminate SLA’, and ‘enforce penalties for violation’. The realization of SLAs can be traced back to the 1980s in telecommunication companies. Furthermore, the advent of Grid computing reinforced the necessity of using SLAs (Yeo and Buyya 2006). Specifically, in service-oriented commercial Grid computing (Buyya et al. 2001), resources are advertised and traded as services based on an SLA after users specify the various levels of service required for processing their jobs (Rashid et al. 2004). However, SLAs have to be monitored and assured properly (Sahai et al. 2003). These works identified some challenges in SLA management, such as SLA violation control, which have been partially addressed by frameworks such as WS-Agreement (Andrieux et al. 2007) and WSLA (Keller et al. 2003). Still, in dynamic environments such as Clouds, several challenges have to be addressed: automatic negotiation and dynamic SLA management according to environmental changes are the most important examples. Recently, Cloud computing has emerged as a new platform for delivering utility computing services. In Clouds, infrastructure, platform, and application services are available on demand, and companies are able to access their business services and applications anywhere in the world whenever they need. In this environment, massively scalable systems are made available to end users as
a service (Brandic 2009). In this scenario, where both the request arrival rate and resource availability vary continuously, SLAs are used to ensure that service quality is kept at acceptable levels despite such dynamicity. This chapter reveals key design factors and issues that are still significant in utility computing platforms such as Grids and Clouds. It provides insights for extending and reusing components of existing SLA management frameworks, and it aims to be a guide in designing and implementing enhanced SLA-oriented management systems. The use cases selected for the chapter have been proposed recently (since 2004) and reflect the latest technological advances. The design concepts and architectures of these works are well documented in publications, facilitating comprehensive investigation. The rest of the chapter is organised as follows: utility architecture and SLA foundational concepts are summarized in the second section. In the third section, the key challenges and solutions for SLA management are discussed. SLA use cases are presented in the fourth section. Ongoing works addressing some of the issues in current systems are presented in the fifth section. Finally, the chapter concludes with the open challenges in SLA management.
UTILITY ARCHITECTURE AND SLA FOUNDATIONS In this section, a typical utility computing architecture is first presented. SLA definitions from different areas are then summarized in Section “SLA Definitions”, SLA components are described in Section “SLA Components”, and in Section “SLA Lifecycle” two types of SLA lifecycle are presented and compared.
Figure 2. SLA-oriented utility computing system architecture
Utility Architecture The layered architecture of a typical utility computing system is shown in Figure 2. From top to bottom, four layers can be identified: a User or Broker submits requests using various applications to the utility computing system, the Service Request Examiner is responsible for admission control, SLA Management balances workloads, and a Resource or Service Provider offers resources or services. Users, or Brokers acting on their behalf, submit their service requests and applications from anywhere in the world to be processed by utility computing systems. When a service request is submitted, the Service Request Examiner uses an Admission Control mechanism to interpret its QoS requirements before determining whether to accept or reject it. Thus, it ensures that resources are not overloaded, a situation in which many service requests could not be fulfilled successfully due to the limited availability of resources/services.
Then, the Service Request Examiner interacts with the SLA Management layer to decide whether to accept or reject the request. The SLA Management component is responsible for resource allocation and consists of several components: Discovery, Negotiation/Renegotiation, Pricing, Scheduling, Monitoring, SLA Enforcement, Dispatching, and Accounting. The Discovery component is responsible for discovering service providers that can satisfy user requirements. In order to define mutually agreed terms between parties, it is common to put in place price negotiation mechanisms or to rely on quality metrics. The Pricing mechanism decides how service requests are charged. Pricing serves as a basis for managing the supply and demand of computing resources within the utility computing system and facilitates the prioritization of resource allocations. Once the negotiation process is completed, the Scheduling mechanism uses algorithms or policies to decide how to map requests to resource providers.
Table 1. Summary of SLA definitions classified by area

• Web Services: “SLA is an agreement used to guarantee web service delivery. It defines the understanding and expectations from service provider and service consumer.” Source: HP Lab (Jin et al. 2002).
• Networking: “An SLA is a contract between a network service provider and a customer that specifies, usually in measurable terms, what services the network service provider will supply and what penalties will be assessed if the service provider cannot meet the established goals.” Source: research project.
• Internet: “SLA constructed the legal foundation for the service delivery. All parties involved are users of SLA. Service consumer uses SLA as a legally binding description of what provider promised to provide. The service provider uses it to have a definite, binding record of what is to be delivered.” Source: Internet NG (Ron et al. 2001).
• Data Center Management: “SLA is a formal agreement to promise what is possible to provide and provide what is promised.” Source: Sun Microsystems Internet Data Center Group (2002).
Then the Dispatching mechanism starts the execution of accepted service requests on the allocated resources. The Monitoring component consists of a Resource Monitoring mechanism and a Service Request Monitoring mechanism. The Resource Monitoring mechanism keeps track of the availability of Resource Providers and their resource entitlements, while the Service Request Monitoring mechanism keeps track of the execution progress of service requests. The SLA Enforcement mechanism manages violations of contract terms during execution. After an SLA violation, Renegotiation is sometimes needed in order to keep the trading relationship ongoing. The Accounting mechanism maintains the actual usage of resources by requests, so that the final cost can be computed and charged to the users. At the bottom of the architecture, a Resource/Service Provider offers multiple services, such as computing, storage, and software services, in order to meet service demands.
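To make the flow described above concrete, the following minimal Python sketch models how a service request might pass through admission control and the SLA Management components. All class and function names here are our own illustrative inventions, not part of any framework discussed in this chapter:

```python
# Hypothetical sketch of the admission-control and SLA Management flow
# described above; real systems (e.g., Gridbus Broker, Aneka) implement
# far richer negotiation, monitoring, and accounting logic.

class Provider:
    def __init__(self, name, capacity, unit_price):
        self.name, self.capacity, self.unit_price = name, capacity, unit_price

    def can_satisfy(self, qos):
        # Admission control: reject if the requested capacity is unavailable.
        return qos.get("cpus", 1) <= self.capacity

    def quote(self, qos):
        # Pricing: charge per requested CPU (a deliberately simple policy).
        return qos.get("cpus", 1) * self.unit_price

    def execute(self, request):
        # Dispatching: a real system would schedule and then monitor the job.
        return f"{request.user}'s request running on provider {self.name}"

class ServiceRequest:
    def __init__(self, user, qos):
        self.user, self.qos = user, qos

class SLAManagement:
    def __init__(self, providers):
        self.providers = providers

    def handle(self, request):
        # Discovery: keep only providers able to satisfy the QoS requirements.
        candidates = [p for p in self.providers if p.can_satisfy(request.qos)]
        if not candidates:
            return None  # Service Request Examiner rejects: no resources
        # Negotiation/Pricing: choose the cheapest offer within the budget.
        best = min(candidates, key=lambda p: p.quote(request.qos))
        if best.quote(request.qos) > request.qos.get("budget", float("inf")):
            return None  # no agreement reached
        return best.execute(request)

slam = SLAManagement([Provider("A", 8, 0.5), Provider("B", 4, 0.2)])
print(slam.handle(ServiceRequest("alice", {"cpus": 4, "budget": 2.0})))
```

A real broker would, of course, interleave negotiation rounds, monitoring, SLA enforcement, and accounting with this flow rather than running it once.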
SLA Definitions Dinesh et al. (2004) define an SLA as: “An explicit statement of expectations and obligations that exist in a business relationship between two organizations: the service provider and customer”. Since SLAs have been used since the 1980s in a variety of areas, most of the available definitions are contextual and vary from area to area. Some of the main SLA definitions in Information Technology related areas are summarised in Table 1.

SLA Components An SLA defines the delivery ability of a provider, the performance target of consumers’ requirements, the scope of guaranteed availability, and the measurement and reporting mechanisms (Rick, 2002). Jin et al. (2002) provided a comprehensive description of the SLA components (Figure 3):

• Purpose: Objectives to achieve by using an SLA.
• Restrictions: Necessary steps or actions that need to be taken to ensure that the requested level of services is provided.
• Validity period: The time period in which the SLA is in force.
• Scope: Services that will be delivered to the consumers, and services that will not be covered by the SLA.
• Parties: The organizations or individuals involved and their roles (e.g., provider and consumer).
• Service-level objectives (SLO): The levels of service on which both parties agree, expressed through service level indicators such as availability, performance, and reliability.
Figure 3. SLA components
• Penalties: If the delivered service does not achieve the SLOs or falls below the performance measurement, some penalties will occur.
• Optional services: Services that are not mandatory but might be required.
• Administration: Processes used to guarantee the achievement of the SLOs and the related organizational responsibilities for controlling these processes.
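As a compact illustration, the sketch below captures the components listed above as a Python data structure. This is a hypothetical in-memory representation of ours, not a standard SLA schema; languages such as WSLA express the same concepts in XML:

```python
from dataclasses import dataclass, field

# Hypothetical representation of the SLA components listed above.

@dataclass
class ServiceLevelObjective:
    indicator: str   # e.g., "availability", "response_time"
    target: float    # agreed level, e.g., 99.9 (percent) or 200 (ms)

@dataclass
class SLA:
    purpose: str
    parties: dict            # role -> organization
    validity_period: tuple   # (start, end) of the SLA working time
    scope: list              # services covered by the agreement
    restrictions: list       # steps required to deliver the requested level
    slos: list = field(default_factory=list)
    penalties: dict = field(default_factory=dict)   # SLO indicator -> clause
    optional_services: list = field(default_factory=list)
    administration: str = ""  # processes guaranteeing SLO achievement

sla = SLA(
    purpose="Guarantee web service delivery",
    parties={"provider": "ProviderCo", "consumer": "ConsumerCo"},
    validity_period=("2011-01-01", "2011-12-31"),
    scope=["compute service"],
    restrictions=["requests must be submitted via the broker"],
    slos=[ServiceLevelObjective("availability", 99.9)],
    penalties={"availability": "10% service credit"},
)
```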
SLA Lifecycle Ron et al. (2001) define the SLA life cycle in three phases (Figure 4). Firstly, the creation phase, in which the customer finds a service provider who matches his or her service requirements. Secondly, the operation phase, in which the customer has read-only access to the SLA. Thirdly, the removal phase, in which the SLA is terminated and all associated configuration information is removed from the service systems.
A more detailed life cycle has been characterized by the Sun Microsystems Internet Data Center Group (2002), which identifies six steps in the SLA life cycle. The first step is ‘discover service providers’, in which service providers are located according to the consumer’s requirements. The second step is ‘define SLA’, which includes the definition of services, parties, penalty policies, and QoS parameters; in this step the parties can negotiate to reach a mutual agreement. The third step is ‘establish agreement’, in which an SLA template is established and filled in with the specific agreement, and the parties start to commit to it. The fourth step is ‘monitor SLA violation’, in which the provider’s delivery performance is measured against the contract. The fifth step is ‘terminate SLA’, in which the SLA terminates due to a timeout or to a party’s violation. The sixth step is ‘enforce penalties for SLA violation’, in which, if any party violates the contract terms, the corresponding penalty clauses are invoked and executed. These steps are illustrated in Figure 5. The mapping between the three high-level phases and the six steps of the SLA lifecycle is shown in Table 2.
Figure 4. SLA high-level lifecycle phases, according to the description of Ron et al. (2001)
Figure 5. The six steps of the SLA life cycle, as defined by the Sun Microsystems Internet Data Center Group (2002)
The ‘creation’ phase of the three-phase lifecycle maps to the first three steps of the six-step lifecycle, and the ‘operation’ phase corresponds to the fourth step; the remaining phases and steps map to each other. The six-step SLA lifecycle is more comprehensive and provides fine-grained detail, because it includes important processes such as re/negotiation and violation control. During service negotiation or renegotiation, a consumer exchanges a number of contract messages with a provider in order to reach a mutual agreement; the result of these processes is a new SLA (Youseff et al. 2008). In the six-step lifecycle, steps 2 and 3 map to these processes, whereas the three-phase lifecycle does not include them. Furthermore, the ‘enforce penalties for SLA violation’ step is important because it motivates the parties to adhere to the contract.
Table 2. Mapping between the two types of SLA lifecycle

• Creation Phase: (1) Discover Service Provider, (2) Define SLA, (3) Establish Agreement
• Operation Phase: (4) Monitor SLA Violation
• Removal Phase: (5) Terminate SLA, (6) Enforce Penalties for SLA Violation
We believe that the six-step formalization of the SLA life cycle provides a better characterization of the phenomenon, and from here onwards we will refer to it as the SLA life cycle.
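The six-step lifecycle can also be read as a simple state machine. The following sketch, with step names of our own choosing, encodes the allowed progression, in which the penalty step is reached only after a termination:

```python
# Hypothetical encoding of the six-step SLA lifecycle as allowed transitions.
TRANSITIONS = {
    "discover_provider":   ["define_sla"],
    "define_sla":          ["establish_agreement"],
    "establish_agreement": ["monitor_violation"],
    "monitor_violation":   ["monitor_violation",   # keep monitoring
                            "terminate_sla"],      # timeout or violation
    "terminate_sla":       ["enforce_penalties"],  # only after a violation
    "enforce_penalties":   [],
}

def is_valid(path):
    """Check that a sequence of lifecycle steps follows the allowed order."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))

print(is_valid(["discover_provider", "define_sla", "establish_agreement",
                "monitor_violation", "terminate_sla", "enforce_penalties"]))
# -> True
```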
SLA IN UTILITY COMPUTING SYSTEMS As highlighted by Patterson (2008), there are many challenges involved in developing software that a million users can use as a service via a data center, as compared to distributing software for a million users to run on their individual personal computers. When SLAs are used to define the service parameters required by users, the service provider knows how users value their service requests, and hence can provide feedback mechanisms to encourage and discourage service request submissions. In particular, utility models are essential to balance the supply and the demand of computing resources by selectively accepting and fulfilling a limited number of service requests out of the many competing ones submitted. However, when service providers make a commercial offer to enable crucial business operations of companies, there are other critical QoS parameters to be considered in a service request, such as reliability and trust/
security. In particular, QoS requirements cannot be static and need to be dynamically updated over time due to continuing changes in business operations and operating environments. In short, greater importance should be placed on customers, since they pay for accessing the services. Therefore, the emphasis of this section is on SLA management in utility computing systems.
SLA Management in Utility Computing Systems SLA management presents several challenges, which we discuss in this section by following the steps of the SLA life cycle.
Discover - Service Provider In current utility computing environments, especially Grids and Clouds, it is important to locate resources that can satisfy consumers’ requirements efficiently and optimally (Gong et al. 2003). Such computing environments contain a large collection of different types of resources, which are distributed worldwide. These resources are owned and operated by various providers with heterogeneous administrative policies. Resources or services can join and leave a computing environment at any time, so their status changes dynamically and unpredictably. Solutions to the service provider discovery problem must efficiently deal with scalability, dynamic changes, heterogeneity, and autonomous administration.
Define - SLA Once service providers have been discovered, it is necessary to identify the various elements of the SLA to be signed and to agree on the metrics used to measure them. These elements are called service terms and include the QoS parameters, the delivery ability of the provider, the performance targets of the diverse components of the user’s workloads, the bounds of guaranteed availability and performance, the
measurement and reporting mechanisms, the cost of the service, the data set for renegotiation, and the penalty terms for SLA violation. In this stage of the SLA lifecycle, the measurement metrics and the definition of each of these elements are settled through a negotiation process between the parties (Blythe et al. 2004) (Chu et al. 2002). Other challenges are related to the negotiation process. Firstly, the parties may use different negotiation protocols, or they may not have a common definition of the same service (Brandic et al. 2008). Secondly, service descriptions in an SLA must be defined unambiguously and be contextually specified by means of their domain and actor. Therefore, an SLA language must allow the parameterisation of service descriptions (Loyall et al. 1998). Moreover, it should allow a high degree of flexibility and enable a precise formalisation of what a service guarantee means. Another aspect is how to keep the SLA definition consistent throughout the entire SLA lifecycle.
Establish - Agreement In this step an SLA template is constructed. A template has to include all aspects of the SLA components. To facilitate dynamic, versatile, and adaptive IT infrastructures, utility computing systems have to react promptly to environmental changes, software failures, and other events which may influence the system’s behavior. Therefore, how to manage SLA-oriented adaptive systems, which exploit self-renegotiation after system failures, becomes an open issue (Brandic et al. 2009). Although most works recognise SLA negotiation as a key aspect of SLA management, recent works provide only little insight into how negotiation (especially automated negotiation) can be realised. In addition, it is difficult to reflect the quality aspects of the SLA components in a template.
Monitor - SLA Violation SLA violation monitoring begins once an agreement has been established. It plays a critical role in determining whether SLOs are achieved or violated. There are three main concerns: firstly, which party should be in charge of this process; secondly, how fairness can be assured between the parties; and thirdly, how the boundaries of an SLA violation are defined. An SLA violation means the ‘un-fulfillment’ of the service agreement. According to the Principles of European Contract Law, the term ‘un-fulfillment’ covers defective performance (a parameter monitored at a lower level than agreed), late performance (the service delivered at the appropriate level but with unjustified delays), and no performance (the service not provided at all). There are three broad provisioning categories based on the above definition (Rana et al. 2008). ‘All-or-Nothing’ provisioning characterizes the case in which all SLOs must be satisfied or delivered by the provider. ‘Partial’ provisioning identifies some SLOs as mandatory, which must be met by both parties for successful service delivery. ‘Weighted Partial’ provisioning is the case in which the “provision of a service meets SLO if it has a weight greater than a threshold (defined by the client)” (Rana et al. 2008). ‘All-or-Nothing’ provisioning is used in most cases of SLA violation monitoring, because a violation leads to complete failure and to negotiation to create a new SLA. An SLA contains mandatory SLOs that must be delivered by the provider; hence, in ‘Partial’ provisioning, all parties assign these SLOs the highest priority to reduce the violation risk. How much an SLO affects the business is captured by its ‘Business Value’, a measure of the importance of a particular SLO term. The more important the violated SLO, the more difficult it is to renegotiate the SLA, because no party wants to lose its competitive advantage in the market.
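The three provisioning categories can be expressed as simple predicates over the set of SLOs. The sketch below is our own illustration of the categories of Rana et al. (2008); the SLO records and weights are invented for the example:

```python
# Illustrative evaluation of the three provisioning categories of
# Rana et al. (2008); SLO records and weights are hypothetical.

def all_or_nothing(slos):
    # Every SLO must be satisfied for the provision to count as fulfilled.
    return all(slo["met"] for slo in slos)

def partial(slos):
    # Only the SLOs marked as mandatory must be satisfied.
    return all(slo["met"] for slo in slos if slo["mandatory"])

def weighted_partial(slos, threshold):
    # Fulfilled if the total weight of satisfied SLOs exceeds a
    # client-defined threshold.
    return sum(slo["weight"] for slo in slos if slo["met"]) > threshold

slos = [
    {"name": "availability",  "met": True,  "mandatory": True,  "weight": 0.6},
    {"name": "response_time", "met": False, "mandatory": False, "weight": 0.4},
]
print(all_or_nothing(slos), partial(slos), weighted_partial(slos, 0.5))
# -> False True True
```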
Terminate - SLA In terminating an SLA, a key aspect is to decide when it should be terminated; once decided, all associated configuration information is removed from the service systems. If the termination is due to an SLA violation, two questions need to be answered: which party triggered the termination, and what are its consequences?
Enforce Penalties for SLA Violation In order to enforce penalties for SLA violations, penalty clauses need to be defined. In utility computing systems, where consumers and providers are globally distributed, penalty clauses work differently in various countries. This leads to two problems: which particular clause should be used, and whether it is fair for both sides. Moreover, due to the different types of violations, the penalty clauses need to be comprehensive. Recently, some works have used a linear model for penalty enforcement of SLA violations in simple contexts (Lee et al., 2010) (Yeo et al., 2008). The linear model, however, exhibits poor performance, so the selection of the best model for enforcing SLA violation penalty clauses is still an open problem.
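As an illustration of the linear model mentioned above, the following sketch charges the provider proportionally to the size of the violation; the parameter names and the optional cap are our own assumptions, not taken from the cited works:

```python
# Minimal sketch of a linear SLA penalty model: the penalty grows
# proportionally with the magnitude of the violation. Names are
# illustrative, not taken from a specific framework.

def linear_penalty(agreed, delivered, rate, cap=None):
    """Penalty = rate * shortfall, optionally capped.

    agreed / delivered: agreed and actually delivered service level
    rate: monetary penalty per unit of shortfall
    cap: optional maximum penalty (common in real SLAs)
    """
    shortfall = max(0.0, agreed - delivered)
    penalty = rate * shortfall
    return min(penalty, cap) if cap is not None else penalty

# E.g., 99.95% uptime agreed, 99.5% delivered, $200 per percentage point:
print(linear_penalty(99.95, 99.5, rate=200.0, cap=100.0))  # -> 90.0
```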
Solutions for SLA Management in Utility Computing Systems This section introduces solutions to the problems presented in the previous section. Six SLA management languages and frameworks are analyzed, because they can be used as solutions in multiple steps of the SLA lifecycle.
SLA Management Frameworks and Languages SLAs can be represented by specialized languages that ease SLA preparation, automate SLA negotiation, adapt services automatically ac-
cording to SLA terms, and support reasoning about their composition. In this section we introduce six languages for SLA specification and management. Among them, WS-Agreement and the Web Service Level Agreement (WSLA) language are the most popular and the most widely used in research and industry. A comparison among all of these languages is shown in Table 3. Bilateral Protocol: Srikumar et al. (2008) presented a negotiation mechanism for advanced resource reservation. It is a protocol for negotiating SLAs based on Rubinstein’s Alternating Offers protocol for bargaining between parties. Any party is allowed to modify the proposal in order to reach a mutually agreed contract. The authors implemented this protocol by using the Gridbus Broker on the customer’s side and Aneka on the provider’s side. Web services enable platform independence and are therefore used for communication between consumers and providers, because the Gridbus Broker is implemented in Java while Aneka is a .NET-based enterprise Grid. The advantage of these high-level languages is that they are object oriented, and web services enable semantic definition. Thus, this protocol supports SLA component reuse, and type and semantic definition. WS-Agreement: The Open Grid Forum (OGF) has defined a standard for the creation and specification of SLAs called the Web Services Agreement Specification (WS-Agreement) (Andrieux et al. 2007). It is a language and a protocol for establishing, negotiating, and managing agreements on the usage of services at runtime between providers and consumers. It uses an Extensible Markup Language (XML) based language for specifying the nature of an agreement template, which facilitates the discovery of compatible providers. Its interaction is based on request and response. Moreover, it helps the parties expose their status, so SLA violations can be dynamically managed and verified. Originally the language did not support negotiation; it has since been complemented by WS-Agreement Negotiation, which lies on top of WS-Agreement and describes the re/negotiation of SLAs. Its main feature is a robust signaling protocol for the negotiation.
Table 3. Comparison of SLA management frameworks and languages

• Bilateral Protocol. Type: Java, .NET, and Web Service based protocol. Domain: originally for resource reservation in Grids. Dynamic establishment/management: yes. Negotiation: yes. Metrics: yes. Defines management actions: yes. Supports reuse: yes. Provides a type system: yes. Semantics: supported through Web Services. SLA lifecycle coverage: steps 1 to 4.
• WS-Agreement. Type: XML language, framework, and protocol. Domain: any. Dynamic establishment/management: establishes and manages agreements dynamically. Negotiation: originally not supported, now provided by WS-Agreement Negotiation. Metrics: does not define the specification of metrics associated with agreement parameters. Defines management actions: yes. Supports reuse: yes. Provides a type system: yes. Semantics: no. SLA lifecycle coverage: steps 1 to 6.
• WSLA. Type: language, framework, and runtime architecture. Domain: originally for Web Services. Dynamic establishment/management: establishes and manages agreements dynamically. Negotiation: re/negotiation. Metrics: allows the creation of new metrics. Defines management actions: yes. Supports reuse: yes. Provides a type system: yes. Semantics: not formally defined. SLA lifecycle coverage: steps 1 to 6.
• WSOL. Type: XML. Domain: originally for Web services. Dynamic establishment/management: yes. Negotiation: yes. Metrics: allows the creation of new metrics. Defines management actions: yes. Supports reuse: yes. Provides a type system: yes. Semantics: no. SLA lifecycle coverage: steps 1 to 4.
• SLAng. Type: XML language. Domain: originally for Internet distributed-system environments. Dynamic establishment/management: yes. Negotiation: no. Metrics: no, based on the behavior of the SLA parties. Defines management actions: NA. Supports reuse: yes. Provides a type system: yes. Semantics: yes. SLA lifecycle coverage: steps 1 to 4.
• QML. Type: language. Domain: any. Dynamic establishment/management: yes. Negotiation: NA. Metrics: NA. Defines management actions: yes. Supports reuse: yes. Provides a type system: yes, allows the definition of new type systems. Semantics: yes. SLA lifecycle coverage: steps 1 to 4.
• QuO. Type: CORBA-specific framework. Domain: any. Dynamic establishment/management: NA. Negotiation: NA. Metrics: NA. Defines management actions: yes. Supports reuse: yes. Provides a type system: NA. Semantics: not formally defined. SLA lifecycle coverage: steps 1 to 4.
Web Service Level Agreement (WSLA): WSLA (Keller et al. 2003) is a framework developed by IBM to specify and monitor SLAs for Web Services. It provides a formal XML schema based language to express SLAs, and an architecture to interpret this language at runtime. It can measure and monitor QoS parameters and report violations to the parties. It separates monitoring clauses from contractual terms for outsourcing purposes. It provides the capability to create new metrics over existing ones in order to express multiple QoS parameters (Keller et al. 2003). However, the semantics of the metrics are not formally defined; hence, there are limitations on the creation of new terms based on existing ones. WSOL: The Web Service Offerings Language (WSOL) defines a syntax for the interaction of service offers (Sakellariou et al. 2005). It provides template instantiation and the reuse of definitions (Buyya et al. 2009). WSOL and WSLA support the definition of management information and actions, such as violation notifications; however, they are not defined by a formal semantics. WSOL and QML (Quality of Service Management Language) support type systems, allowing the same SLA to be described either in abstract or in specific values to create a new SLA. The generalisation relationships between SLAs facilitate the definition of SLA types. SLAng: Skene et al. (2004) propose the Service Level Agreement Language (SLAng), which uses XML to define SLAs. It is motivated by the fact that federated distributed systems must manage the quality of all aspects of their deployment. SLAng differs from the other languages and frameworks. Firstly, it defines an SLA vocabulary for internet services. Secondly, its structure is based on specific industry requirements, aiming to provide usable terms. Thirdly, it is modeled using the Unified
Modeling Language (UML) and is defined according to the behavior of the services and consumers involved in service usage, unlike other languages, such as WSLA and WSOL, where the QoS definition is based on metrics. Moreover, it supports third-party monitoring schemes. However, it lacks the ability to define management information, such as associated financial terms; thus, it is not suitable for commercial computing environments. QML: QML (Frolund et al. 1998) defines a type system for SLAs, allowing users to define their own dimension types. However, it does not support the extension of individually defined metrics, because the exchange of SLAs between parties requires a common understanding of the metrics. QML defines semantics both for its type system and for its notion of SLA conformance. QuO: Quality Objects (QuO) is a CORBA-specific framework for QoS adaptation based on proxies (Loyall et al. 1998). It includes a quality description language used for describing QoS parameters, adaptations, and notifications. QuO properties are the responses obtained by invoking instrumentation methods on remote objects. As in WSLA, no formal constraints are placed on the implementation of these methods.
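WSLA’s ability to create new metrics over existing ones can be illustrated with a small composition example. The following is a hypothetical Python analogue of what WSLA expresses in XML; the metric names and formulas are invented for the example:

```python
# Hypothetical Python analogue of WSLA-style metric composition:
# new (composite) metrics are defined as functions over existing
# (measured) metrics. WSLA itself expresses this in XML.

measured = {
    "uptime_minutes": 43170.0,                  # resource-level measurement
    "total_minutes":  43200.0,                  # length of the billing period
    "response_times": [120.0, 180.0, 240.0],    # ms samples
}

composite = {
    # Availability is derived from two measured metrics.
    "availability_pct": lambda m: 100.0 * m["uptime_minutes"] / m["total_minutes"],
    # Average response time is derived from a series of samples.
    "avg_response_ms":  lambda m: sum(m["response_times"]) / len(m["response_times"]),
}

for name, fn in composite.items():
    print(name, "=", round(fn(measured), 3))
# availability_pct = 99.931, avg_response_ms = 180.0
```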
Discover - Service Provider In the Grid computing community, Fitzgerald (1997) introduced the Monitoring and Discovery System, Gong et al. (2003) proposed the VEGA Grid Project, and the work of Iamnitchi et al. (2001) is also relevant. The Monitoring and Discovery System (MDS) is the information service described in the Globus project (Fitzgerald 1997). In its architecture, the Lightweight Directory Access Protocol (LDAP) is used as the directory service, and the information stored in the information servers is organised in a tree topology. In utility computing systems, resource availability and capability are dynamic in nature. However, in MDS, the relationship
between information and information servers is static. In addition, service provider information is frequently updated in these dynamically changing environments, whilst LDAP is not designed for writing and updating information. The VEGA Infrastructure for Resource Discovery (VIRD) follows a three-level hierarchical architecture. The top level is a backbone, which is responsible for inter-domain resource discovery and consists of Border Grid Resource Name Servers (BGRNS). The second level consists of several domains, and each domain consists of Grid Resource Name Servers (GRNS). The third level includes all clients and resource providers. There is no central control in this architecture: resource providers register themselves with a GRNS server within a domain, and when clients submit requests, the GRNS servers respond with the requested resources. The limitation of this architecture is that it only addresses scalability and dynamic environmental changes, not heterogeneity and autonomous administration. Iamnitchi et al. (2001) propose a resource discovery framework using peer-to-peer (P2P) technologies in Grids. A P2P architecture is fully distributed, and all the nodes are equivalent. However, one major limitation of their work is that every node has little knowledge about resource distribution and status. Specifically, when there is a large number of resource types or the working set is very large, the chance of inaccurate results increases, because the framework is not able to use historical data to accurately discover resources.
Define - SLA and Establish - Agreement ‘Define - SLA’ and ‘Establish - Agreement’ are two interdependent steps, and SLA languages facilitate their development; for example, WSLA and WS-Agreement are the most widely used languages in these steps. Creation and Monitoring of Agreements (CREMONA) is a WS-Agreement
framework implemented by IBM (Dan et al. 2004). It proposes a Commitment Agreement and an architecture for WS-Agreement; all of these agreements are normal WS-Agreements following a certain naming convention. This protocol basically aims at solving problems related to the creation of agreements across multiple sites. However, it cannot overcome the limitations that arise when service providers and consumers use different standards, policies, and languages during negotiations. For example, if a consumer uses WSLA but a provider uses WS-Agreement, the interaction is simply not possible. In order to solve this, Brandic et al. (2008) proposed a Meta-Negotiation Architecture for SLA-Aware Grid Services based on meta-negotiation documents. These documents record the supported protocols, the document languages, and the prerequisites for starting negotiations and establishing agreements for all participants. SLA-oriented Resource Management Systems (RMS) have been developed for addressing negotiation problems in Grids; for example, Wurman et al. (1998) describe a set of auction parameters and a price-based negotiation platform, which serves as an auction server for humans and software agents. Nevertheless, their solution only supports one-dimensional auctions (focusing only on price), not the multi-dimensional auctions which are important in utility computing environments.
Monitor - SLA Violation Monitoring infrastructures are used to measure the difference between the pre-agreed and the actual service provision between parties (Rana et al. 2008). There are three types of monitoring infrastructure: a trusted third party (TTP), a trusted module on the provider side, and a trusted module on the client side. Nowadays, TTPs provide most of the functionality needed to monitor for SLA violations in the most typical situations.
Terminate - SLA There are two scenarios in which an SLA may be terminated: termination due to a normal timeout, and termination because a party violated its contract terms. Normally, in Clouds, this step is conducted by customers, and termination is typically caused by a normal timeout or by the provider’s SLA violation. Sometimes providers also terminate SLAs depending on task priorities. If the reason for the SLA termination is a violation, then the ‘Enforce Penalties for SLA Violation’ step of the SLA lifecycle has to be applied. Usually this step is performed manually.
Enforce Penalties for SLA Violation A penalty clause can be applied to the party who violates the SLA terms. The first option is a direct financial compensation, negotiated and agreed between the parties. The second is a decrease in price along with extra compensation for any subsequent interaction; in other words, this option is proportional to the value of the loss caused by the violation. In this case, a TTP is usually used as a mediator. The workflow for this option is that clients transfer their deposit, bond, and any other fees into the third party’s account; then, if the SLOs have been met, the money is paid to the provider via the TTP. Otherwise, the TTP returns the fees to the consumer as compensation for the SLA violation. An SLA violation also has two indirect side effects on providers: first, consumers will use less of the provider’s service in the future; second, the provider’s reputation decreases, which affects other clients’ willingness to choose this provider subsequently. The major indirect influence on the consumer is that future requests may be rejected due to a bad credit record. A major issue in the above discussion is the variety of laws enforced in different countries. This problem can be solved by a ‘choice of law clause’, which indicates explicitly which country’s laws apply when a conflict occurs between
parties. ‘Legal templates’ (Dinesh, 2004) can be used to refine these clauses (Rana et al. 2008).
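The TTP-mediated compensation workflow described above can be sketched as a simple escrow settlement; the function and field names below are our own illustration of the idea, not an existing API:

```python
# Illustrative escrow settlement via a trusted third party (TTP):
# the consumer deposits the fee up front; after monitoring, the TTP
# pays the provider if the SLOs were met, otherwise refunds the consumer.
# All names are hypothetical.

def ttp_settle(deposit, slos_met, compensation_rate=1.0):
    """Return (amount_paid_to_provider, amount_refunded_to_consumer)."""
    if slos_met:
        return deposit, 0.0                   # provider is paid in full
    refund = deposit * compensation_rate      # compensation for the violation
    return deposit - refund, refund

print(ttp_settle(100.0, slos_met=True))    # -> (100.0, 0.0)
print(ttp_settle(100.0, slos_met=False))   # -> (0.0, 100.0)
```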
SLA USE CASES IN UTILITY COMPUTING SYSTEMS Utility computing provides on-demand access to IT capabilities according to cost-effective pricing schemes. Typically, a resource in a data center is idle 85% of the time (Yeo et al. 2008). Utility computing gives enterprises a way to lease this idle capacity, or to use outsourcing so as to pay for resources according to their usage. Two approaches to utility computing that achieve the above goals are Grids and Clouds. In the remaining part of this section, we present use cases in Grid and Cloud computing environments.
SLA in Grid Computing Systems In this section we introduce the definition of Grid computing and some recent significant Grid computing projects that have focused on SLAs and enabled them in their frameworks. According to Buyya et al. (2009), “A Grid is a type of parallel and distributed system that enables the sharing, selection, and aggregation of geographically distributed ‘autonomous’ resources dynamically at runtime depending on their availability, capability, performance, cost, and users’ quality-of-service requirements.” Grid computing is a paradigm of utility computing, typically used for access to scientific resources, although it has also been used in industry. SLAs have been adopted in Grid computing, and many Grid projects are SLA oriented. We classify them into three categories: SLA for business collaboration, SLA for risk assessment, and SLA renegotiation supporting dynamic changes. SLA for Business Collaboration: GRIA (The GRIA Project) is a service-oriented infrastructure designed to support B2B collaborations across
organizational boundaries by providing services. The framework includes a service manager with the ability to identify the available resources (e.g., CPUs and applications), assign portions of the resources to consumers through SLAs, and charge for resource usage. Furthermore, a monitoring service is responsible for monitoring the activity of the services with respect to the agreed SLOs. The BREIN consortium (The BREIN Project, 2006-2009) defines a business framework prototype for electronic business collaborations. Some capabilities of this prototype include service discovery with respect to SLA capabilities, SLA negotiation in a single-round phase, system monitoring and evaluation, and SLA evaluation with respect to the agreed SLA. The WSLA/WS-Agreement specifications are suggested for SLA management. The project focuses on dynamic SLAs, and this initiative shows that industry is demonstrating its interest in SLA management. In the work of Joita et al. (2005), the WS-Agreement specification is used as a basis to conduct negotiation between two parties. An agent-based infrastructure takes care of the agreement offer made by the requesting party; in this scenario, many one-to-one negotiations are considered in order to find the service that best matches the offer. Risk Assessment: The AssessGrid project (Battré et al. 2007) focuses on risk management and assessment in Grids. It aims at providing service providers with risk assessment tools, which help them decide on a suitable SLA offer by assigning, mapping, and associating the risk of failure to penalty fees. Similarly, end users learn about the risk of an SLA violation by a resource provider, which helps them make appropriate decisions regarding acceptable costs and penalty fees. A broker is the matchmaker between end users and providers. The WS-Agreement Negotiation protocol is responsible for negotiating SLAs with external contractors. SLA renegotiation supporting dynamic changes: Frankova et al. (2006) propose an
extension of WS-Agreement towards runtime SLA renegotiation. Some modifications are proposed in the ’GuaranteeTerm’ section of the agreement schema, and a new section is added to define the possible negotiations, to be agreed upon by the parties before the offer is submitted. The limitation is that it does not support run-time renegotiation to adapt to dynamic operational and environmental changes, because after the agreement’s acceptance there is no interaction between the provider and the consumer. Sakellariou et al. (2005) specify the guarantee terms of an agreement as variable values rather than fixed values. This work aims at minimizing the number of renegotiations needed to reach consensus on the agreement terms. BabelNet, a Protocol Description Language for automated SLA negotiation, has been proposed (Hudert et al. 2009) to handle multi-dimensional auctions.
SLA in Cloud Computing Cloud computing is a paradigm of service-oriented utility computing. In this section we introduce a definition of Cloud computing and SLA use cases in industry and academia. Finally, we compare the differences in SLA usage between Cloud computing and traditional web services.
Cloud Computing Based on the observation of the essence of what Clouds are promising to be, Buyya et al. (2009) propose the following definition: “A Cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resource(s) based on service-level agreements established through negotiation between the service provider and consumer”. Hence, Clouds fit well into the definition of utility computing. Figure 6 shows the layered design of the Cloud computing architecture. Physical Cloud resources along with core middleware capabilities form
Figure 6. Layered Cloud computing architecture (Buyya et al. 2009)
the bottom layer needed for delivering IaaS. The user-level middleware aims at providing PaaS capabilities. The top layer focuses on application services (SaaS) by making use of the services provided by the lower layers. PaaS/SaaS services are often provided by third-party service providers, who are different from the IaaS providers (Buyya et al. 2009). User-Level Applications: this layer includes software applications, such as social computing applications and enterprise applications, which are deployed by PaaS providers renting resources from IaaS providers. User-Level Middleware: this layer includes Cloud programming environments and tools, which facilitate the creation of applications and their mapping to resources using Core Middleware layer services. Core Middleware: this layer provides the runtime environment enabling capabilities to application services built using the User-Level Middleware. Dynamic SLA management, accounting, monitoring, and billing are examples of core services in this layer. Commercial examples for this layer are Google App Engine and Aneka.
System Level: the physical resources, including physical machines and virtual machines, sit in this layer. These resources are transparently managed by higher-level virtualization services and toolkits that allow sharing of their capacity among virtual instances of servers.
Use Cases In this section, we present industry and academic use cases in Cloud computing environments. Industry Use Cases: here we present how Cloud providers implement SLAs; the important parameters are summarised in Table 4. All elements in Table 4 are obtained from the formally published SLA documents of Amazon EC2 and S3 (IaaS providers) and of Windows Azure Compute and Storage (IaaS/PaaS provider). A characterization of the studied systems following the six-step SLA lifecycle model is summarized in Table 5. From the users’ perspective, the process of going through the SLA lifecycle with Amazon and Microsoft is simple, because the SLA has been pre-defined by the provider. According to the SLA lifecycle, the first step is to find the service providers according to the users’ requirements.
Table 4. SLA use cases of well-known Cloud providers and related SLA characteristics

• Amazon AWS EC2. Service commitment: “AWS use commercially reasonable efforts to make Amazon EC2 available with an Annual Uptime Percentage of at least 99.95% during the Service Year. In the event Amazon EC2 does not meet the Annual Uptime Percentage commitment, you will be eligible to receive a Service Credit” (AWS EC2 Service Level Agreement). Effective date: October 23, 2008. Service credits: 10% for a Monthly Uptime Percentage (MUP) below 99.95%.
• Amazon AWS S3. Service commitment: “AWS use commercially reasonable efforts to make Amazon S3 available with a Monthly Uptime Percentage (defined below) of at least 99.9% during any monthly billing cycle (the “Service Commitment”). In the event Amazon S3 does not meet the Service Commitment, you will be eligible to receive a Service Credit” (AWS S3 Service Level Agreement). Effective date: October 1, 2007. Service credits: 10% for 99% ≤ MUP < 99.9%; 25% for MUP < 99%.
• Windows Azure Compute. Service commitment: “Windows Azure has separate SLA’s for compute and storage. For compute, we guarantee that when you deploy two or more role instances in different fault and upgrade domains your Internet facing roles will have external connectivity at least 99.95% of the time. Additionally, we will monitor all of your individual role instances and guarantee that 99.9% of the time we will detect within two minutes when a role instance’s process is not running and initiate corrective action.” (Windows Azure Service Level Agreement). Effective date: NA. Service credits: 10% for MUP < 99.95%; 25% for MUP < 99%.
• Windows Azure Storage. Effective date: NA. Service credits: 10% for MUP < 99.9%; 25% for MUP < 99.5%.
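The uptime-to-credit mapping in Table 4 can be reproduced programmatically. The sketch below uses the Amazon S3 thresholds from the table; the function itself is our own illustration:

```python
# Mapping a measured Monthly Uptime Percentage (MUP) to a service-credit
# percentage, using the Amazon S3 thresholds from Table 4. The function
# name and structure are our own illustration.

def s3_service_credit(mup):
    if mup < 99.0:
        return 25   # MUP < 99%          -> 25% service credit
    if mup < 99.9:
        return 10   # 99% <= MUP < 99.9% -> 10% service credit
    return 0        # commitment met: no credit

for mup in (99.95, 99.5, 98.0):
    print(f"MUP={mup}% -> {s3_service_credit(mup)}% credit")
# MUP=99.95% -> 0% credit; MUP=99.5% -> 10% credit; MUP=98.0% -> 25% credit
```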
Table 5. SLA use cases of Cloud providers from the users’ perspective, following the six-step SLA lifecycle

• Amazon EC2 (IaaS, computing). Step 1, discover service provider: manually (e.g., via the web site). Step 2, define SLA: pre-defined SLA terms and QoS parameters. Step 3, establish agreement: SLA document pre-defined by the provider. Step 4, monitor SLA violation: third-party monitoring systems can be used (e.g., CloudWatch). Step 5, terminate SLA: by user or provider, programmatically or manually. Step 6, enforce penalties for SLA violation: service credit given by the provider.
• Amazon S3 (IaaS, storage). Step 1: manually. Step 2: pre-defined SLA terms and QoS parameters. Step 3: SLA document pre-defined by the provider. Step 4: third-party monitoring systems can be used (e.g., CloudStatus). Step 5: by user or provider, programmatically or manually. Step 6: service credit given by the provider.
• Microsoft Azure Compute (PaaS). Step 1: manually (e.g., via the web site). Step 2: pre-defined SLA terms and QoS parameters. Step 3: SLA document pre-defined by the provider. Step 4: third-party monitoring systems can be used (e.g., Monitis). Step 5: by user or provider, programmatically or manually. Step 6: service credit given by the provider.
• Microsoft Azure Storage (PaaS). Step 1: manually. Step 2: pre-defined SLA terms and QoS parameters. Step 3: SLA document pre-defined by the provider. Step 4: third-party monitoring systems can be used (e.g., Monitis). Step 5: by user or provider, programmatically or manually. Step 6: service credit given by the provider.
For example, users find the provider by searching on the Internet and then explore the provider’s web site to collect further information. Most Cloud service providers offer pre-defined SLA documents; in this case, the second and third steps are pre-defined and always entwined together. SLA violation monitoring can be done with third-party tools, such as CloudWatch, CloudStatus, Monitis, and Nimsoft, and developers are able to build their own monitoring systems on top of these tools. Concerning the termination of an SLA, we can consider IaaS services as a reference example, where three scenarios may occur. The normal termination of an SLA is constituted by the release of the Cloud resources by the user. An SLA can also be actively terminated by a provider if the resource usage lasts beyond the predefined expiry time. A termination with penalty may occur in case the provider is unable to provide resources according to the expected quality of service. The last step of the SLA lifecycle is invoked if any party violates the contract terms; currently, most service providers give service credits to customers if they violate the SLA. Academic Use Cases: In this section, we present SLA-oriented projects and algorithms as academic use cases. SLA-Oriented Resource Allocation for Data Centers and Cloud Computing Systems: The Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne has proposed the use of market-based resource management to support utility-based resource management for cluster computing (Yeo et al. 2005) (Yeo et al. 2007). The initial work successfully demonstrated that market-based resource allocation strategies are able to deliver better utility for users than traditional system-centric strategies. However, this early research focused on satisfying only two static Quality of Service (QoS) parameters: the deadline
for completing a service request and the budget that the consumer is willing to pay for completing the request before the deadline. In commercial computing environments, there are other critical QoS parameters to consider in a service request, such as reliability and trust/security. In particular, QoS requirements cannot be static and need to be dynamically updated over time due to continuing changes in business operations and operating environments. SLA@SOI: A European Union funded Framework 7 research project, SLA@SOI (SLA@SOI project), is researching aspects of multi-level, multi-provider SLAs within service-oriented infrastructures and Cloud computing. Currently, this project aims to build an ad-hoc architecture and integration approach for a basic SLA management framework. It provides a major milestone for the further evolution towards a service-oriented economy, where IT-based services can be flexibly traded as economic goods, i.e., under well-defined and dependable conditions and with clearly associated costs. SLA@SOI provides two major benefits to the provisioning of services. First, service predictability and dependability: the quality characteristics of a service can be predicted and enforced at run-time. Second, automation: the whole process of negotiating SLAs and provisioning, delivering, and monitoring services can be automated, allowing highly dynamic and scalable service consumption. SLA-based Management and Scheduling: Lee et al. (2010) propose profit-driven, SLA-based scheduling algorithms in Clouds to maximize the profit for service providers. The application model used in this work can be classified as SaaS and PaaS. The service types supported by their algorithm are dependent services, meaning that one sub-service cannot start until the prerequisite services are completed. However, their work does not support multiple providers, and a full simulation configuration is not available. A possible future research direction is SLA management with multiple providers, since it is required
for emerging research in InterCloud. We define InterCloud as multiple Cloud providers with peer agreements to support collaborative activities.
SLA-Related Differences Between Cloud and Web Services In this section we compare the differences between SLAs applied in Cloud computing and in traditional web services, as follows. QoS Parameters: Most web services focus on parameters such as response time, the SLA violation rate for the task, reliability, availability, levels of user differentiation, and the cost of the service. In Cloud computing, more QoS parameters than in traditional web services need to be considered, for example, energy-related, security-related, privacy-related, and trust-related QoS. More than 20 QoS parameters are defined by the SMI (Service Management Index) consortium to be used in industry and academia as a standard benchmark. Automation: The whole process of SLA negotiation and provisioning, service delivery, and monitoring needs to be automated for highly dynamic and scalable service consumption. Researchers in traditional web services explored this topic; for example, Jin et al. (2002) proposed a model for the SLA analysis of Web Services. Nevertheless, SLA automation is a rapidly growing area in Cloud computing, and some research projects are starting to focus on it, such as the CLOUDS Lab at the University of Melbourne and SLA@SOI. Resource Allocation: SLA-oriented resource allocation in Cloud computing is possibly different from allocation in traditional web services, because web services have Universal Description, Discovery and Integration (UDDI) for advertising and discovering services. In Clouds, however, resources are allocated and distributed globally without a central directory, so the strategy and architecture for SLA-based resource allocation in such an environment differ from those of traditional web services.
ONGOING WORKS SLA management must provide ways for the reliable provisioning of services, the monitoring of SLA violations, and the detection of any potential performance decrease during service execution (Kuo et al. 2006) (Marilly et al. 2002). The goal of SLA management is to establish a scalable and automatic SLA management framework that can adapt to dynamic environmental changes while considering multiple QoS parameters. In addition, an SLA has to be suitable for multiple domains with heterogeneous resources. Some research works are moving in this direction. The VIRD architecture is a three-level hierarchy focused on scalability. Wurman et al. (1998) describe a set of auction parameters and a price-based negotiation platform; nevertheless, this solution only supports one-dimensional auctions and thus cannot handle the multi-dimensional auctions which are important in utility computing environments. Recently, BabelNet has addressed multi-dimensional auctions; nevertheless, consumers still need to be involved in the management process to a certain extent. Moreover, multiple QoS parameters have been investigated in the CLOUDS Lab’s initial work. Whilst that work only focused on the most common QoS parameters (price and deadline), there are other critical QoS parameters that should be considered in a service request, such as reliability and trust/security. In particular, QoS parameters must be updated dynamically over time due to continuing changes in business operating environments. Thus, multiple QoS parameters should be investigated in future research work. More specifically, there are some open challenges for SLA management. First and foremost, different SLA negotiation protocols and processes constrain the negotiation for establishing SLAs, the modification of an implemented SLA, and SLA negotiation between distinct administrative domains. Second, the SLA has to be established between providers and consumers from an end-to-end viewpoint. For example, if the system
service has been outsourced from one provider to another, there should be an SLA agreement between them as well. Third, admission control policies need to be defined, because the decision on which user requests to accept affects the performance, profit, and reputation of the resource provider. Moreover, resource allocation management has to be considered carefully, because it determines which resource is best suited to the currently admitted requests from both parties’ points of view. Some termination-related problems are the management of QoS metrics and the fact that different parties use different parameters; failure management becomes an issue especially for automatic handling, such as cause analysis and automatic problem resolution. Finally, performance forecast management is another open question in utility computing environments, because it enables recommendations for performance improvement.
SUMMARY This chapter presented a literature survey of the issues and solutions of SLA management in utility computing systems and of how SLAs have been used in these systems. An SLA is a formal contract between service providers and consumers to guarantee that the delivered service quality satisfies the pre-agreed consumers’ expectations. SLA management is important in utility computing systems because it helps to improve the customer satisfaction level and to define a clear relationship between the parties. In this chapter, we have summarised the main fundamental concepts of SLAs and analyzed two types of SLA lifecycle. One is the high-level three-phase lifecycle, which includes the creation, operation, and removal phases; the other is the more specific six-step lifecycle, comprising ‘discover service provider’, ‘define SLA’, ‘establish agreement’, ‘monitor SLA violation’, ‘terminate SLA’, and ‘enforce penalties for SLA violation’. The second type of lifecycle is more comprehensive and introduces the characteriza-
tion of SLA violations, which is foundational in utility computing environments where services are consumed on a pay-as-you-go basis. The analysis carried out in this book chapter has identified four major goals of SLA-oriented utility computing. First, supporting customer-driven service management based on customer profiles and requested service requirements. Second, defining computational risk management tactics to identify and manage the risks involved in the execution of applications with regard to service requirements and customer needs. Third, deriving appropriate market-based resource management strategies that encompass both customer-driven service management and computational risk management to sustain SLA-oriented resource allocation. Fourth, incorporating autonomic resource management models that self-manage changes in service requirements, to satisfy both new service demands and existing service obligations. To achieve these goals, we discussed the main challenges and solutions of SLA implementation and management in utility computing environments by following the steps of the SLA lifecycle. In the ‘discover service provider’ step, the main issues are scalability, dynamic changes, heterogeneity, and autonomous administration; some architectures and algorithms, such as the MDS and VIRD architectures, have been proposed to cope with them. Effective negotiation protocols and processes are the main challenges for the ‘define SLA’ and ‘establish agreement’ steps, because the two parties need to negotiate before they agree on the terms to be included in the SLA. SLA frameworks and languages are used as solutions; currently the most widely used languages are WSLA and WS-Agreement. However, there are not many effective solutions for automatic negotiation, which thus remains an open issue. Regarding the ‘monitor SLA violation’ step, which party should be responsible for the monitoring process is a debated issue. The most popular solution is to use a Trusted Third
Party (TTP), which provides most of the functionality for monitoring a service in typical situations to detect SLA violations. The main issues for the last two steps, 'terminate-SLA' and 'enforce penalties for SLA violation', are automatic failure management tasks such as cause analysis, invocation of penalty clauses, and automatic failure resolution. Some penalty strategies have been presented; however, automatic problem resolution and cause analysis are still open challenges, and more investigation is needed in the future. In conclusion, SLA in utility computing systems is a rapidly moving target, although some works have explored it in the past. There are still open challenges, such as scalability, dynamic environmental changes, heterogeneity, SLA management automation, multiple QoS parameters, and SLAs suitable for cross-domain use, that need to be explored in future research.
ACKNOWLEDGMENT

The authors would like to acknowledge all researchers whose works are described in this book chapter and thank them for their outstanding work. We also thank Yoganathan Sivaram, Christian Vecchiola, Saurabh Kumar Garg, Rodrigo Calheiros, William Voorsluys, Tong Zou, Shanshan Wu and Daryl de Penha for their comments to improve the quality of this book chapter.
REFERENCES

Andrieux, A., Czajkowski, K., Dan, A., Keahey, K., Ludwig, H., Nakata, T., Pruyne, J., Rofrano, J., Tuecke, S., & Xu, M. (2007). Web Services Agreement Specification (WS-Agreement). OGF proposed recommendation (GFD.107).

AWS. EC2 Service Level Agreement. Retrieved March 28, 2010, from AWS: http://aws.amazon.com/ec2-sla/
AWS S3. Service Level Agreement. Retrieved March 28, 2010, from AWS: http://aws.amazon.com/s3-sla/

Battré, D., Hovestadt, M., Kao, O., Keller, A., & Voss, K. (2007). Planning-based scheduling for SLA-awareness and grid integration. PlanSIG, (pp. 1).

Blythe, J., Deelman, E., & Gil, Y. (2004). Automatically Composed Workflows for Grid Environments. IEEE Intelligent Systems, 16–23. doi:10.1109/MIS.2004.24

Bonell, M. (1996). The UNIDROIT Principles of International Commercial Contracts and the Principles of European Contract Law: Similar Rules for the Same Purpose (pp. 229–246). Uniform Law Review.

Boniface, M., Phillips, S., Sanchez-Macian, A., & Surridge, M. (2009). Dynamic service provisioning using GRIA SLAs. Service-Oriented Computing - ICSOC 2007 Workshops, (pp. 56-67). Vienna, Austria.

Brandic, I., Music, D., & Dustdar, S. (2009). Service Mediation and Negotiation Bootstrapping as First Achievements Towards Self-adaptable Grid and Cloud Services. In P. Wieder, R. Yahyapour, & W. Ziegler (Eds.), Grids and Service-Oriented Architectures for Service Level Agreements. New York, USA: Springer.

Brandic, I., Venugopal, S., Mattess, M., & Buyya, R. (2008). Towards a Meta-negotiation Architecture for SLA-Aware Grid Services. International Workshop on Service-Oriented Engineering and Optimization, (pp. 17). Bangalore, India.

Buco, M. J., Chang, R. N., Luan, L. Z., Ward, C., Wolf, J. L., & Yu, P. S. (2004). Utility computing SLA management based upon business objectives. IBM Systems Journal, 43(1), 159–178. doi:10.1147/sj.431.0159
Buyya, R., & Abramson, D. (2001). A case for economy Grid architecture for service oriented Grid computing. In Proceedings of the 10th International Heterogeneous Computing Workshop (HCW). San Francisco, CA.

Buyya, R., Pandey, S., & Vecchiola, C. (2009). Cloudbus Toolkit for Market-Oriented Cloud Computing. In Proceedings of the 1st International Conference on Cloud Computing (CloudCom 2009), Springer, Germany. Beijing, China.

Buyya, R., Ranjan, R., & Calheiros, R. N. (2009). Modeling and Simulation of Scalable Cloud Computing Environments and the CloudSim Toolkit: Challenges and Opportunities. In Proceedings of the 7th High Performance Computing and Simulation Conference (HPCS 2009), ISBN: 978-1-4244-4907-1, IEEE Press, New York, USA. Leipzig, Germany.

Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J., & Brandic, I. (2009, June). Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6), 599–616. doi:10.1016/j.future.2008.12.001

Chu, X., Nadiminti, K., Jin, C., Venugopal, S., & Buyya, R. (2007). Aneka: Next-Generation Enterprise Grid Platform for e-Science and e-Business Applications. In Proceedings of the 3rd IEEE International Conference on e-Science and Grid Computing, (pp. 10-13). Bangalore, India.

Dan, A., Ludwig, H., & Kearney, R. (2004). CREMONA: An architecture and library for creation and monitoring of WS-Agreements. In Proceedings of the Second International Conference on Service-Oriented Computing, (pp. 65-74). NY, USA.

Dinesh, V. (2004). Supporting Service Level Agreements on IP Networks. In Proceedings of the IEEE/IFIP Network Operations and Management Symposium, 92(9), (pp. 1382-1388). NY, USA.

Fitzgerald, S., Foster, I., & Kesselman, C. (1997). A directory service for configuring high-performance distributed computations. In Proceedings of the 6th IEEE Symposium on High-Performance Distributed Computing, (pp. 365-375).

Foster, I., & Kesselman, C. (2003). The Grid 2: Blueprint for a New Computing Infrastructure. San Francisco, CA: Morgan Kaufmann.

Frey, N. (2000). A Guide to Successful SLA Development and Management. Stamford, CT: Gartner Group Research, Strategic Analysis Report.

Frolund, S., & Koistinen, J. (1998). A language for quality of service specification. HP Labs Technical Report. California, USA.

Gong, Y. L., Dong, F. P., Li, W., & Xu, Zh. W. (2003). VEGA Infrastructure for Resource Discovery in Grids. Journal of Computer Science and Technology, 18(4), 413–422. doi:10.1007/BF02948915

Hiles, A. (1999/2000). The Complete IT Guide to Service Level Agreements: Matching Service Quality to Business Needs. Oxford, UK: Elsevier Advanced Technology.

Hudert, S., Wirtz, G., & Eymann, T. (2009). BabelNeg - A Protocol Description Language for Automated SLA Negotiations. In Proceedings of the IEEE Conference on Commerce and Enterprise Computing, (pp. 162-169). Shanghai, China.

Iamnitchi, A., & Foster, I. (2001). On fully decentralized resource discovery in grid environments. In Proceedings of the 2nd International Workshop on Grid Computing, (pp. 51-62). Denver, Colorado.

Jin, L. J., & Machiraju, V. A. (2002). Analysis on Service Level Agreement of Web Services. Technical Report HPL-2002-180, Software Technology Laboratories, HP Laboratories.
Joita, L., Rana, O. F., Chacín, P., Chao, I., & Ardaiz, O. (2005). Application deployment using catallactic grid middleware. In Proceedings of the 3rd International Workshop on Middleware for Grid Computing, (pp. 1-6). Grenoble, France.

Karaenke, P., & Kirn, S. (2010). Towards Model Checking & Simulation of a Multi-tier Negotiation Protocol for Service Chains. In Proceedings of the 9th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2010), Toronto, Canada, May 10-14, 2010.

Keller, A., Kar, G., Ludwig, H., Dan, A., & Hellerstein, J. L. (2002). Managing dynamic services: A contract based approach to a conceptual architecture. In Proceedings of the 8th IEEE/IFIP Network Operations and Management Symposium, (pp. 513-528). Florence, Italy, April 15-19, 2002.

Keller, A., & Ludwig, H. (2003). The WSLA framework: Specifying and monitoring service level agreements for Web services. Journal of Network and Systems Management, Special Issue on E-Business Management, 11(1), 57-81. USA.

Kuo, D., Parkin, M., & Brooke, J. (2006). A framework & negotiation protocol for service contracts. In Proceedings of the 2006 IEEE International Conference on Services Computing (SCC 2006), (pp. 253-256). Chicago, USA.

Lee, Y. C., Wang, C., Zomaya, A. Y., & Zhou, B. B. (2010). Profit-driven Service Request Scheduling in Clouds. In Proceedings of the International Symposium on Cluster Computing and the Grid (CCGRID). Melbourne, Australia.

Loyall, J. P., Schantz, R. E., Zinky, J. A., & Bakken, D. E. (1998). Specifying and measuring quality of service in distributed object systems. In Proceedings of the 1st International Symposium on Object-Oriented Real-Time Distributed Computing, (pp. 43-54). Kyoto, Japan.
Ludwig, A., & Franczyk, B. (2006). SLA Lifecycle Management in Services Grid - Requirements and Current Efforts Analysis. In Proceedings of the 4th International Conference on Grid Services Engineering and Management (GSEM), (pp. 219-246). Leipzig, Germany.

Marilly, E., Martinot, O., Papini, H., & Goderis, D. (2002). Service Level Agreements: A Main Challenge For Next Generation Networks. In Proceedings of the 2nd European Conference on Universal Multiservice Networks, (pp. 297-304). Toulouse, France.

Mobach, D. G. A., Overeinder, B. J., & Brazier, F. M. T. (2006, March). A WS-Agreement based resource negotiation framework for mobile agents. Scalable Computing: Practice and Experience, 7(1), (pp. 23-26).

Patterson, D. A. (2008). The data center is the computer. Communications of the ACM, 51(1), 105. doi:10.1145/1327452.1327491

Rana, O. F., Warnier, M., Quillinan, T. B., Brazier, F., & Cojocarasu, D. (2008). Managing Violations in Service Level Agreements. In Proceedings of the 5th International Workshop on Grid Economics and Business Models (GECON), (pp. 349-358). Gran Canaria, Spain.

Rashid, A. A., Hafid, A., Rana, A., & Walker, D. (2004). An approach for quality of service adaptation in service-oriented Grids. Concurrency and Computation, 16, 401–412. doi:10.1002/cpe.819

Rick, L. (2002). IT Services Management: A Description of Service Level Agreements. RL Consulting.
Ron, S., & Aliko, P. (2001). Service level agreements. Internet NG project (1999-2001). http://ing.ctit.utwente.nl/WU2/

Rosenberg, I., & Juan, A. (2009). The BEinGRID SLA framework. Report available at http://www.gridipedia.eu/slawhitepaper.html

Sahai, A., Graupner, S., Machiraju, V., & Van Moorsel, A. (2003). Specifying and Monitoring Guarantees in Commercial Grids through SLA. In Proceedings of the Third IEEE International Symposium on Cluster Computing and the Grid, (pp. 292). Tokyo, Japan.

Sakellariou, R., & Yarmolenko, V. (2005). On the flexibility of WS-Agreement for job submission. In Proceedings of the 3rd International Workshop on Middleware for Grid Computing (MGC05), (pp. 1-6). Grenoble, France.

Service Level Agreement in the Data Center. (April 2002). Retrieved March 28, 2010, from Sun Microsystems: http://www.sun.com/blueprints

Skene, J., Lamanna, D. D., & Emmerich, W. (2004). Precise Service Level Agreements. In Proceedings of the 26th International Conference on Software Engineering (ICSE'04), (pp. 179-188).

Tosic, V., Pagurek, B., Patel, K., Esfandiari, B., & Ma, W. (2005). Management applications of the Web Service Offerings Language (WSOL) (pp. 564–586). Galway, Ireland: Web Services, E-Business, and the Semantic Web.

Venugopal, S., Chu, X., & Buyya, R. (2008). A Negotiation Mechanism for Advance Resource Reservation using the Alternate Offers Protocol. In Proceedings of the 16th International Workshop on Quality of Service (IWQoS 2008), IEEE Communications Society Press, New York, USA. Twente, NL.

Wieder, P., Seidel, J., Yahyapour, R., Waldrich, O., & Ziegler, W. (2008). Using SLA for Resource Management and Scheduling - A Survey. GRID Middleware and Services, 4, 335–347. doi:10.1007/978-0-387-78446-5_22

Windows Azure Service Level Agreement. Retrieved March 28, 2010, from http://www.microsoft.com/windowsazure/sla/

Wurman, P. R., Wellman, M. P., & Walsh, W. E. (1998). The Michigan Internet AuctionBot: A configurable auction server for human and software agents. In Proceedings of the 2nd International Conference on Autonomous Agents, (pp. 301-308). Irsee, Germany.
Yeo, C. S., & Buyya, R. (2005). Service Level Agreement based Allocation of Cluster Resources: Handling Penalty to Enhance Utility. In Proceedings of the 7th IEEE International Conference on Cluster Computing (Cluster 2005), (pp. 1-10). MA, USA.

Yeo, C. S., & Buyya, R. (2006). A Taxonomy of Market-based Resource Management Systems for Utility-driven Cluster Computing. Software: Practice and Experience (SPE), 36(13), 1381-1419.

Yeo, C. S., & Buyya, R. (2007, November). Pricing for Utility-driven Resource Management and Allocation in Clusters. International Journal of High Performance Computing Applications, 21(4), 405–418. doi:10.1177/1094342007083776

Yeo, C. S., & Buyya, R. (2007). Integrated Risk Analysis for a Commercial Computing Service. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), (pp. 1-10). CA, USA.

Yeo, C. S., De Assunção, M. D., Yu, J., Sulistio, A., Venugopal, S., Placek, M., & Buyya, R. (2006). Utility computing on Global Grids. In Bidgoli, H. (Ed.), Handbook of Computer Networks. New York, USA: John Wiley & Sons.

Youseff, L., Butrico, M., & Da Silva, D. (2008). Toward a unified ontology of cloud computing. Grid Computing Environments Workshop, (pp. 1-10). Austin, Texas.
ADDITIONAL READING

Broberg, J., Venugopal, S., & Buyya, R. (2008). Market-oriented Grids and Utility Computing: The state-of-the-art and future directions. Journal of Grid Computing, 6(3), 255-276. ISSN: 1570-7873, Springer Verlag, Germany.

Buyya, R., & Venugopal, S. (2004). The Gridbus Toolkit for Service Oriented Grid and Utility Computing: An Overview and Status Report. In Proceedings of the 1st IEEE International Workshop on Grid Economics and Business Models (GECON 2004), (pp. 19-36), ISBN 0-7803-8525-X, IEEE Press, New Jersey, USA.

Buyya, R., Venugopal, S., Ranjan, R., & Yeo, C. S. (2009). The Gridbus Middleware for Market-Oriented Computing. In Buyya, R., & Bubendorfer, K. (Eds.), Market Oriented Grid and Utility Computing. Hoboken, New Jersey, USA: Wiley Press. doi:10.1002/9780470455432.ch26

Guitart, J., Macías, M., Rana, O., Wieder, P., Yahyapour, R., & Ziegler, W. (2009). SLA-based Resource Management and Allocation. In Buyya, R., & Bubendorfer, K. (Eds.), Market Oriented Grid and Utility Computing. Hoboken, New Jersey, USA: Wiley Press. doi:10.1002/9780470455432.ch12
Koller, B., Oliveros, E., & Sánchez-Macián, A. (2009). Service Level Agreements in the Grid Environment. In R. Buyya & K. Bubendorfer (Eds.), Market Oriented Grid and Utility Computing. ISBN: 978-0470287682, Wiley Press, Hoboken, New Jersey, USA.

McKee, P., Taylor, S., Surridge, M., & Lowe, R. (2009). SLAs, Negotiation and Potential Problems. In Buyya, R., & Bubendorfer, K. (Eds.), Market Oriented Grid and Utility Computing. Hoboken, New Jersey, USA: Wiley Press.

Netto, M. A. S., Bubendorfer, K., & Buyya, R. (2007). SLA-based Advance Reservations with Flexible and Adaptive Time QoS Parameters. In Proceedings of the 5th International Conference on Service-Oriented Computing (ICSOC 2007), LNCS Volume 4749, Springer-Verlag Press, Berlin, Germany.

Ranjan, R., Harwood, A., & Buyya, R. (2006). SLA-Based Coordinated Superscheduling Scheme for Computational Grids. In Proceedings of the 8th IEEE International Conference on Cluster Computing (Cluster 2006), IEEE CS Press, Los Alamitos, CA, USA.
Chapter 2
SLA-Aware Enterprise Service Computing

Longji Tang, University of Texas at Dallas, USA
Jing Dong, University of Texas at Dallas, USA
Yajing Zhao, University of Texas at Dallas, USA
ABSTRACT

There is a growing trend towards enterprise system integration across organizational and enterprise boundaries on the global Internet platform. Enterprise Service Computing (ESC) has been adopted by more and more corporations to meet the growing demand from businesses and the global economy. However, ESC, as a new distributed computing paradigm, poses many challenges and issues regarding quality of service. For example, how is ESC compliant with the quality of service (QoS)? How do service providers guarantee services which meet service consumers' needs as well as wants? How do both service consumers and service providers agree on QoS at runtime? In this chapter, SLA-Aware enterprise service computing is first introduced as a solution to the challenges and issues of ESC. Then, SLA-Aware ESC is defined as new architectural styles which include SLA-Aware Enterprise Service-Oriented Architecture (ESOA-SLA) and SLA-Aware Enterprise Cloud Service Architecture (ECSA-SLA). In addition, the enterprise architectural styles are specified through our extended ESOA and ECSA models. The ECSA-SLA styles include SLA-Aware cloud services, SLA-Aware cloud service consumers, SLA-Aware cloud SOA infrastructure, SLA-Aware cloud SOA management, SLA-Aware cloud SOA processes and SLA-Aware SOA quality attributes. The main advantages of viewing and defining SLA-Aware ESC as an architectural style are (1) abstracting the common structure, constraints and behaviors of a family of ESC systems, such as ECSA-SLA style systems, and (2) defining general design principles for the family of enterprise architectures. The design principles of ECSA-SLA systems are proposed based on the model of ECSA-SLA. Finally, we discuss the challenges of SLA-Aware ESC and suggest that autonomic service computing, automated service computing, adaptive service computing, real-time SOA, and event-driven architecture can help to address the challenges. DOI: 10.4018/978-1-60960-794-4.ch002
INTRODUCTION

Enterprise Service Computing (ESC) is a new distributed computing and architectural style that has been adopted by more and more enterprises. ESC primarily includes Enterprise Service-Oriented Architecture (ESOA) (Tang, L., Dong, J., & Peng, T., 2008) (Tang, L. et al., SOSE 2010) (Tang, L. et al., SOCA 2010) and Enterprise Cloud Service Architecture (ECSA) (Tang, L., Dong, J., Zhao, Y., & Zhang, L.-J., 2010). Because of complicated business requirements and high customer demands, ESC poses many challenges and issues, such as performance (latency, loss, and jitter) and dependability (security, trust). The Quality of Service (QoS) becomes crucial for ESC to achieve its vision and meet business requirements and customer demands. Nowadays, most enterprises will only invest in IT when there is a clear return on investment, lower total cost of ownership, and a clear demonstration of cost savings. Investments made in services, web services and cloud service initiatives offer the opportunity to realize these requirements, but these investments need to be deployed in a consistent, repeatable, and manageable fashion. Traditional operation management is incapable of offering the unique management functionality that can help achieve these requirements, as compared to service-oriented management, which is based on QoS. Service Level Management (SLM) is one of the most important and fundamental forms of service-oriented management. SLM provides mechanisms and tools for managing individual services and the SOA processes composed of sets of services designed to meet enterprises' and their customers' QoS requirements and demands. The Service Level Agreement (SLA) is a specification of service or service process functional provisioning and non-functional goals (QoS) which is agreed to by both service providers and service consumers. The Service Level Objectives (SLO) are key elements of the SLA, which are specific and
measurable quality attributes in the SLA, such as availability, throughput, frequency, performance (response time), and other quality attributes. SLA has been employed in industries such as networking and telecommunications for several decades. However, the adoption of dynamic SLA in ESOA systems is relatively immature and suffers from a lack of standards. Recently, cloud computing and ECSA have become the next generation of enterprise service computing. The SLA and SLM have become more and more important because of the dynamic service computing environment and infrastructure. Dynamic and automated SLM provides a SLA-Aware approach in ESOA or ECSA architecture. An architectural style is a coordinating set of architectural constraints. The SOA quality attributes are the architectural constraints of ESOA and ECSA. The QoS and SLA can be part of the architectural constraints and contracts at the service level in ESOA and ECSA. Therefore, at the architectural style level, adding SLA-Awareness to ESOA or ECSA generates a kind of specific architectural style, which is called SLA-Aware ESOA or SLA-Aware ECSA. At the ESOA and ECSA system (instance) level, the approach allows the SLA to play a QoS role between each service consumer and service provider, which greatly improves service visibility. It also brings service quality control intelligence and capacity into ESOA or ECSA systems, so that it greatly enhances SOA management capabilities. Therefore, ESC can meet service or service process functional provisioning and non-functional (QoS) goals, so that service providers satisfy service consumers with specific services. In addition, enterprises gain revenue from the services and avoid troubles caused by disputed services. In this chapter, we first discuss the challenges and issues of ESC. Second, we discuss general QoS and SLA concepts, their ontology, standards (such as WS-Agreement), languages (such as WSLA), and classification in enterprise service computing. Third, we define the SLA-Aware ESOA and ECSA architectural styles. The styles include:
• SLA-Aware SOA Quality Attributes: The SLA-Aware quality attributes are fundamental to the design of SLO and SLA for ESC.
• SLA-Aware Services: The measurable SLA quality attributes are the service constraints of which the service provider is aware in the service at runtime.
• SLA-Aware Service Consumers: The service consumer is aware of the SLA and can visit it through a client-side self-management portal.
• SLA-Aware Service Process: The SLA-Aware SOA process consists of a set of SLA-Aware services for executing business processes. The SOA process itself is also aware of a process-wide SLA.
• SLA-Aware SOA Infrastructure: We define a SLA-Aware SOA infrastructure as a set of SLA-Aware infrastructure services, such as SLA-Aware (or QoS-Aware) network services and SLA-Aware storage services.
• SLA-Aware SOA Management: SLA-Aware SOA management is defined as a set of SLA-Aware management services which provide SOA system services, including SLA management services, SLA monitoring/measuring services, SLA negotiation services, and SLA reporting services.
• SLA-Aware Cloud Service Provision and Subscription: SLA-Aware cloud service provisioning and subscription will be discussed. The end-to-end SLA-Aware cloud service architectural style is also described.
Finally, we discuss the challenges of the SLA-Aware approach in both research and practice, including automatic service computing and self-adaptive service computing. In this chapter, we assume all services are web services unless otherwise stated.
MOTIVATION

Web services are increasingly adopted by enterprises with the rapid growth of e-commerce. Web services can be differentiated by the following standard and dynamic characteristics:

• They can be accessed on the web with a Uniform Resource Locator (URL) and the message/document exchange protocol SOAP.
• They are discoverable through the service registry by using standard UDDI.
• They are composable in a standard way. The web services composition can be either static or dynamic.
• They have formal interfaces with their consumers, which are described by a standard service language, such as the XML-based WSDL.
• They follow basic agreements on listed protocols and standards for communicating and interoperating with each other.
Recently, cross-enterprise dynamic services and web service compositions have become reality, such as Amazon's EC2 web service cloud (Amazon Web Services, 2010) (Amazon, EC2 SLA, 2010). Agreements on standard languages and message exchange protocols are not enough for a dynamic environment and dynamic service demand; therefore, some issues have emerged:

• How will web service providers agree upon what to provide to their service consumers?
• How will web services agree on how good the service is (Quality of Service - performance, availability, security, etc.)?
• Who will complete the required tasks, and who will be responsible for failures to execute the tasks?
• How will web services trust each other?
Service Level Agreement (SLA) is a way to address these issues in web-service-based enterprise architecture. The traditional SLAs between organizations and/or enterprises define the agreements on QoS, including cost and penalty. However, they are mostly static and not machine-processable, so the static SLA restricts the dynamic nature of web services in a cross-domain and cross-enterprise environment. Let us consider the following scenario. A travel reservation service company named TravelRes provides an online airline ticket reservation service for travel agents, using web services running in its data center on a SOA infrastructure consisting of multi-tiered clusters with web servers, application servers and databases. The ticket reservation web applications of travel agents are clients of the web services provided by the TravelRes data center. Clearly, the performance and availability of the web services are critical for their clients. We assume the QoS guarantees (along with the pricing and penalties specified in a static SLA) include an absolute maximum ticket-process response time, such as 40 seconds, and availability, defined as up-time greater than or equal to 99.5% of the web services. Moreover, different clients have different guarantee requirements based on their QoS. Since better QoS guarantees require more resources for implementing web services and infrastructure, QoS guarantees are also associated with a number of requests per minute on the client side, such as 1000 requests per minute. If the number of requests per minute is greater than 1000, then performance guarantees are not given. Finally, client request demands vary daily and seasonally. To satisfy clients' different QoS guarantees, different endpoints are given to different clients. TravelRes builds an enterprise service-oriented data center. The web service cluster connects to a storage area network (SAN) where data is managed. It uses an off-site data center as data backup through a VPN network. A monitoring system watches the web service execution and
transactions, and checks compliance with the QoS guarantees defined in the SLA. If any of the QoS guarantees is not satisfied, the client's monthly bill is reduced according to the rules defined in the SLA. Moreover, to minimize resource consumption, all clients' requests are routed to workload managers which prioritize requests according to the QoS level. If demand exceeds cluster capacity, requests with lower penalty are delayed. Therefore, cluster capacity and networks are adjusted for profit maximization and not for serving clients' peak demand. However, if clients want to increase their web service capacity, they have to call the relevant department of TravelRes and make a request, and then the company needs to purchase the necessary hardware and software to meet the increased capacity demand. As a result, TravelRes needs to schedule a configuration change in order to take the additional workload into account. Therefore, increasing capacity demand may take a long time and impact the business of both clients and service providers. To satisfy clients' planned demand, TravelRes needs to build a standard interface for its clients in order to automate requests for additional web service capacity. Because of increasing market activities or various travel seasons, unpredicted traffic increases are sometimes beyond the current capacity. TravelRes needs to be able to manage a sudden onset of demand at runtime, such that its SOA system should be fully automated in order to reconcile the unplanned demand increase in close to real-time fashion. From this simple case study, we can see several requirements for both service providers and service consumers. First, performance parameters (response time and throughput) in QoS change with the web services' client workload, given a fixed number of allocated resources. If the service provider wants to guarantee a QoS level, it has to foresee its clients' workload and increase resources dynamically. A viable SLA in a cross-organizational scenario should provide a mechanism for managing clients' workload requests on
demand. Second, service consumers may want to establish SLAs ahead of time in order to ensure that they can get their desired QoS in an SLA. Third, if service consumers require more web service capacity at runtime, they will search multiple service providers to get the best price. Thus, they need to have a mechanism to select better service providers and to reach agreements with them. Fourth, to serve short-term capacity requests as shown in the previous example, the service provider needs to support fully automated resource management based on SLA. Finally, service consumers have to monitor their web service activities in order to identify the real service requests, and must be capable of delivering their requests to service providers based on the capacity contracted in the SLA. To meet the requirements of dynamically managing service capacity from both service providers and service consumers, we need to establish dynamic SLA mechanisms in a standard and automated way, integrated with traditional ESOA and ECSA. The mechanism must be SLA-Aware. In this chapter, we define SLA-Awareness as a capacity and a design principle combining machine-processable SLA with dynamically automated SLA management (SLM). Adding SLA-Awareness to ESOA and ECSA extends the ESOA and ECSA architectural styles; it is a refinement of ESOA and ECSA. We will define them as SLA-Aware ESOA and SLA-Aware ECSA in the fourth section. Therefore, treating the SLA-Aware ESC architectural style as a refinement is helpful for analyzing and designing higher-quality, dynamic ESOA or ECSA systems.
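To make the scenario concrete, the following minimal Python sketch checks a month of observations against the TravelRes guarantees and derives a bill reduction. Only the 40-second response-time bound, the 99.5% up-time target, and the 1000 requests-per-minute cap come from the scenario above; the penalty schedule and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class QoSGuarantee:
    # Thresholds taken from the TravelRes scenario
    max_response_time_s: float = 40.0    # absolute maximum ticket-process response time
    min_uptime_pct: float = 99.5         # availability guarantee
    max_requests_per_min: int = 1000     # above this rate, guarantees are void

def monthly_bill_reduction_pct(g: QoSGuarantee, uptime_pct: float,
                               response_times_s: list[float],
                               peak_requests_per_min: int) -> float:
    """Return the percentage reduction of the client's monthly bill.

    The penalty schedule is invented for illustration; a real SLA would
    spell out the exact rules.
    """
    if peak_requests_per_min > g.max_requests_per_min:
        return 0.0  # client exceeded the agreed rate, so no guarantees apply
    reduction = 0.0
    if uptime_pct < g.min_uptime_pct:
        reduction += 10.0
    violations = sum(1 for t in response_times_s if t > g.max_response_time_s)
    reduction += min(0.5 * violations, 15.0)  # 0.5% per slow request, capped
    return reduction
```

A static SLA would leave this logic to manual auditing; a machine-processable SLA lets the monitoring system apply it automatically at the end of each billing cycle.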
RELATED WORK

A body of research exists related to our work, which can be categorized as follows: (1) SLA standards and languages; (2) Modeling SLA and QoS; (3) SLA-Aware SOI; (4) SLA Management and SLM; and (5) Adaptive and Automated Computing.
SLA Frameworks, Standards and Languages

There are several SLA frameworks, standards and languages for SOA systems based on web services. This section introduces SLA frameworks, standards and languages as well as some related research work.

The Web Service Level Agreement (WSLA) (Dan, A., Ludwig, H., & Pacifici, G., 2003) (Kephart, J.O. & Chess, D., 2010) (Ludwig, H. et al., 2003) is a specification and reference implementation proposed by IBM. WSLA provides a framework for specifying and monitoring SLAs for web services, which includes:

• a runtime WSLA architecture, and
• an XML-based WSLA language.

The WS-Agreement (Ludwig, H., 2009) is a specification from the Open Grid Forum (OGF) which provides an agreement protocol between service consumers and service providers. It uses an extensible XML language for specifying the agreement, which includes a negotiation constraint. The specification mainly includes three parts:

• a schema for specifying an agreement;
• a schema for specifying agreement templates, to facilitate discovery of compatible agreement parties;
• a set of port types and operations for managing the agreement life-cycle, which includes creation, expiration and monitoring of agreement states.
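As a rough illustration of the life-cycle part of the specification, the sketch below models agreement states and factory operations in Python. The state names are only loosely based on the WS-Agreement state vocabulary, and the method names are illustrative rather than the normative WSDL operation names.

```python
from enum import Enum
from typing import Protocol

class AgreementState(Enum):
    # Loosely modeled on WS-Agreement's agreement states
    PENDING = "Pending"        # offer received, decision outstanding
    OBSERVED = "Observed"      # offer accepted, agreement in force
    REJECTED = "Rejected"      # offer declined
    COMPLETE = "Complete"      # all obligations fulfilled
    TERMINATED = "Terminated"  # ended early via the termination operation

class AgreementFactory(Protocol):
    """Hypothetical analogue of the WS-Agreement factory/agreement port types."""
    def get_templates(self) -> list[str]: ...               # advertise templates
    def create_agreement(self, offer_xml: str) -> str: ...  # returns an agreement id
    def get_state(self, agreement_id: str) -> AgreementState: ...
    def terminate(self, agreement_id: str) -> None: ...
```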
The WS-Policy and WS-Policy Attachment (Liu, Y., Ngu, A.H., & Zeng, L.Z., 2004) are specifications of service qualities, which are part of SLA, developed by the World Wide Web Consortium (W3C). WS-Policy is often used in conjunction with other web service specifications such as the WS-SecurityPolicy, WS-ReliableMessaging policy, and WS-Transaction
policy. The specification is not based on agreements but on service quality requirements. SLAng (Skene, J., Lamanna, D.D., & Emmerich, W., 2004) is an XML language for defining SLAs as part of the contracts between web service clients and web services. It was developed by the TAPAS project at UCL. The Web Service Offering Language (WSOL) (Tosic, V., Patel, K., & Pagurek, B., 2002) (Tosic, V. et al., 2005) is a formal XML language compatible with the Web Services Description Language (WSDL). While WSDL is used for describing the operations provided by web services, WSOL provides a formal specification of multiple classes of service for one web service. The classes of service for a web service are distinguished by different combinations of functional provisions and QoS constraints (non-functional requirements (Chung, L., et al., 2000)), such as response time, simple access rights and cost/performance. It allows service consumers to select different classes of service in depth, or based on cost; therefore, it can be applied to enable service providers' provisioning models and consumers' pay-as-you-go business models.
Modeling and Formalizing SLA and QoS

Modeling and formalizing SLA and QoS receive much attention in the enterprise service computing research community. A traditional SLA is typically specified as a plain-text document, such as Amazon's EC2 Service Level Agreement (Amazon, EC2 SLA, 2010). Such a machine-unreadable format cannot be used for QoS management and automated negotiation in today's dynamic and on-demand service computing environment. Enterprise cloud service computing provides a pay-as-you-use business model: consumers pay for the services and their QoS. Without machine-processable SLAs, the service billing system cannot automatically calculate charges while users are using the cloud service. Moreover, the service
billing system cannot automatically reduce the customers' charges when the system fails or exhibits slower performance. Therefore, much research focuses on specifying SLA and QoS in machine-readable and processable languages. Moreover, service-oriented enterprises are hard to manage, and it is difficult to monitor the quality of their systems, to satisfy their customers, and to reduce service cost. WSLA (Ludwig, H. et al., 2003), WS-Agreement (Andrieux, A. et al., 2004), SLAng (Skene, J., Lamanna, D.D., & Emmerich, W., 2004) and WSOL (Tosic, V., Patel, K., & Pagurek, B., 2002), introduced in Section “SLA Frameworks, Standards and Languages”, not only make SLA and QoS machine readable and processable, but also provide formal specifications for system modeling and management. Keller and Ludwig describe a novel WSLA framework for specifying and monitoring SLAs for Web services (Keller, A. & Ludwig, H., 2003). In addition, Tosic and colleagues developed a management infrastructure to show how WSOL manages web service applications (Tosic, V. et al., 2005). There is also ontology-based SLA and QoS modeling research. Dobson and Sánchez-Macián proposed a unified QoS and SLA ontology (Dobson, G., & Sanchez-Macian, A., 2006). Zhou et al. developed the DAML-QoS ontology (Zhou, C., Chia, L.-T., and Lee, B.-S., 2004) to provide better QoS metric models, and proposed a semantic modeling framework for QoS specification (Zhou, L., Pung, H.K. & Ngoh, L.H., 2006). Zhou and Niemela (Zhou, J. & Niemela, E., 2006) extended OWL-S by including a QoS specification ontology; in addition, they proposed a novel matchmaking algorithm based on the concept of QoS profile compatibility. Kritikos and Plexousakis developed a semantic QoS-based framework for web service description and discovery using OWL-Q (Kritikos, K., & Plexousakis, D., 2008). Rigorous modeling is helpful for reasoning about the structure and behavior of SLA- and QoS-based systems and for investigating the issue of the description of SLA. Meng proposed the QCCS
(Meng, S., 2007) formal model to enforce QoS requirements in service composition based on Milner's CCS (Tang, L., Dong, J., & Peng, T., 2008). De Nicola et al. defined a process calculus for QoS-Aware applications (De Nicola, R., et al., 2005). Chothia and Kleijn introduced Q-Automata (Chothia, T. & Kleijn, J., 2007) for modeling QoS on trust and other quality attributes, such as availability and response time.
SLA-Aware Enterprise Service Computing

SLA-Aware enterprise service computing is receiving attention from many researchers, since SLA-Awareness brings software quality management and QoS into enterprise service computing and implements enterprise non-functional requirements. Zeng et al. proposed a QoS-Aware middleware, AgFlow (Zhang, Z., Dey, D., & Tan, Y., 2006), for supporting web service composition based on the QoS model they developed. McGough et al. defined an end-to-end workflow pipeline, the Workflow Management Service (WfMS) (McGough, A.S. et al., 2009), which is a real-time QoS-Aware workflow management system based on both strict and loose QoS guarantees. The guarantee requirements are defined in an XPath document, which is connected to a BPEL engine. Wada et al. proposed a multi-objective optimization framework, E3, for SLA-Aware service composition. The SLA-Aware or QoS-Aware approach is also applied to web service selection (Liu, Y., Ngu, A.H., & Zeng, L.Z., 2004). The aforementioned work does not include SLA negotiation and dynamic resource scheduling. Brandic et al. presented a novel meta-negotiation architecture for SLA-Aware grid services (Brandic, I. et al., 2009). Song et al. proposed a framework which supports resource scheduling in a virtualization environment for achieving QoS (Song, Y. et al., 2008).
SLA Management and SLM

SLA management and Service Level Management (SLM) play important roles in SLA-Aware enterprise service computing. While some research focuses on aspects such as SLA-Aware service composition and workflow, SLA modeling, and specification, other research emphasizes SLA management, which addresses end-to-end scenarios across all layers, including internal and external service interfaces, in an enterprise service computing stack. The SLA@SOI consortium published a series of research works (SLA@SOI, 2010) about SLA-Aware Service Oriented Infrastructure (SOI), empowering the service economy in a flexible and dependable way. Their research includes general as well as multi-level SLA management for SOI (SLA@SOI, 2010) and SLA-Aware resource management (Comuzzi, M. et al., 2009). The Open Group published the SLA Management Handbook: Enterprise Perspective (The Open Group, 2004) as Volume 4 of a series of SLA management handbooks edited by the TeleManagement Forum. The book is based on extensive research and practice in SLA management and aims at true end-to-end SLA. Yeom et al. proposed a contract-based web service QoS management system architecture (Yeom, G. et al., 2009). Badidi et al. presented a broker-based architecture for web service QoS management (WS-QoSM) (Badidi, E. et al., 2006), a QoS-Aware web service management architecture based on the common concept of a brokerage service to mediate between web service providers and consumers; the management operations are executed by the QoS broker. Bhoj et al. described an SLA management architecture for federated environments which share selective management information across administrative boundaries (Bhoj, P., Singhal, S., and Chutani, S., 2001). SLM focuses on managing SLA commitments at the service level according to the SLA. Figure 1 describes the relationship of Key Quality Indicators (KQI), Key Performance
Indicators (KPI), SLA and SLA monitoring in SLM (The Open Group, 2004).

Figure 1. Relationship of KQI, KPI, SLA in SLM

Traditional SLM architectures fail to cope with the dynamic runtime nature of enterprise service-oriented architecture (ESOA). Schmid and Kroeger (Schmid, M. & Kroeger, R., 2008) proposed a decentralized QoS-management architecture in SOA based on the self-management framework of the Service Component Architecture (SCA). Nurmela (Nurmela, T. & Kutvonen, L., 2007) developed an evaluation framework for SLM in the federated service management context. SLM not only provides service management for achieving the QoS required by service consumers (enterprise business customers), but also differentiates services (Dan, A., Ludwig, H., & Pacifici, G., 2003) (Gibbens, R., Mason, R., & Steinberg, R., 2000) (Zhang, Z., Dey, D., & Tan, Y., 2006). For instance, a web service can be differentiated into Gold, Silver and Bronze service classes based on KQI and KPI, as defined in the SLO and SLA, with the price of the service being associated with each of the service classes. This approach provides a dynamic service provisioning framework and is playing an important role in enterprise cloud service computing.
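A minimal sketch of such class differentiation follows. The thresholds and prices are invented for illustration; in practice they would be derived from the KQI/KPI targets recorded in the SLO and SLA.

```python
# class name -> (max response time in seconds, min availability %, price per 1000 calls)
SERVICE_CLASSES = {
    "Gold":   (1.0, 99.9, 5.00),
    "Silver": (2.0, 99.5, 2.50),
    "Bronze": (5.0, 99.0, 1.00),
}

def meets_class(cls: str, response_time_s: float, availability_pct: float) -> bool:
    """Check a measured KPI sample against the KQI targets of a service class."""
    max_rt, min_avail, _price = SERVICE_CLASSES[cls]
    return response_time_s <= max_rt and availability_pct >= min_avail

# e.g. a 1.5 s response at 99.95% availability satisfies Silver but not Gold
assert not meets_class("Gold", 1.5, 99.95) and meets_class("Silver", 1.5, 99.95)
```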
Adaptive and Automated Computing

SLA-Aware enterprise service computing provides a way for enterprises to achieve higher quality assurance and cost-effectiveness in their service-oriented architecture systems. However, it also brings challenges to distributed service computing in enterprises, including higher adaptability and automation of enterprise service computing. There is a body of research around these challenges. Yau and An discussed the challenges of adaptive resource allocation for service-based systems (Yau, S.S. and An, H., 2009). Gao and colleagues presented a QoS analysis technology for adaptive SOA based on a dynamic reconfiguration approach (Gao, T. et al., 2005). Wang and colleagues proposed an SLM framework using QoS monitoring, diagnostics and adaptation for networked enterprise service-oriented systems (Wang, G., et al., 2005) (Wang, H., Wang, G. & Wang, C., 2007). Self-management (Kephart, J.O. & Chess, D., 2010) and self-adaptive automatic computing (Chung, L., & Subramanian, N., 2003) (Gao, T. et al., 2005) (Yau, S.S. & An, H., 2009) are new challenges for today's SLA-Aware enterprise cloud service computing, such as ECSA (Tang, L. et al., 2010).
SLA-AWARE ESC ARCHITECTURAL STYLES: EXTENDING THE ESOA AND ECSA

Unlike existing research in the area, we specify SLA-Aware enterprise service computing as new architectural styles which extend the architectural styles ESOA and ECSA that we proposed previously (Tang, L., Dong, J., Peng, T., & Tsai, W. T., SOCA 2010) (Tang, L., Dong, J., Zhao, Y., & Zhang, L.-J., 2010). In this section, we first define the SLA-Aware ESOA and SLA-Aware ECSA architectural styles, and then discuss each of their main parts. Throughout this section, we assume the business core services are web services and the web services' support and management services.

The Concept of SLA and SLA-Awareness

The existence of a quality service level agreement (frequently abbreviated as SLA) is of fundamental importance for any service delivery. It essentially defines the formal relationship between the service consumer and the service provider. We define SLA for SLA-Aware enterprise service computing as follows.

Definition 1: A Service Level Agreement is a negotiable QoS contract between a service consumer (SC) and a service provider (SP) on the service guarantees for service consumers. The guarantees include the operations that need to be executed and the promised QoS that should be provided. Formally, we define SLA as

SLA = SLA(SC, SP, C(QoS)), (4.0)

in which SC is a service consumer or a service provided by another service provider, and C(QoS) is the negotiable QoS contract. Formula (4.0) can be simplified as SLA = SLA(SC, SP), where the SP can be a web service or a cloud service, such as IaaS (Tang, L., Dong, J., Zhao, Y., & Zhang, L.-J., 2010).

There are two types of SLA according to their nature, as shown in Table 1:

Table 1. Dynamic SLA vs. Static SLA

Dynamic SLA:
• Description: defined by formal languages, such as WSLA and WS-Agreement
• Machine processing: yes
• Measurement & monitoring: measured by SLA metrics and an automatic measurement system; monitored by an SLA monitor; dynamic reporting
• Execution & negotiation: dynamic SLM controls execution and negotiation between service provider and consumer automatically
• Changing: executed by dynamic SLM automatically
• Termination: executed by dynamic SLM automatically

Static SLA:
• Description: specified in a document
• Machine processing: no
• Measurement & monitoring: measured by SLA metrics; monitored by a monitor
• Execution & negotiation: traditional SLM lacks automatic control and negotiation
• Changing: executed by traditional SLM manually
• Termination: executed by traditional SLM manually

Moreover, there are two types of dynamic SLA deployment, as shown in Table 2:
Table 2. Vertical SLA vs. Horizontal SLA

Vertical SLA:
• From the network layer perspective: an SLA between two SPs, or between an SC and an SP, on different OSI layers, such as an SLA between a VoD service and its ISP.
• From the enterprise architecture layer perspective: an SLA between two SPs, or between an SC and an SP, on different enterprise layers, such as an SLA between a web application in the web server layer and web services in the application server layer.

Horizontal SLA:
• From the network layer perspective: an SLA between two SPs, or between an SC and an SP, on the same OSI layer, such as an SLA between two IP domains.
• From the enterprise architecture layer perspective: an SLA between two SPs on the same enterprise architecture layer, such as an SLA between two web services in a workflow process.

Definition 2: SLA-Awareness is a capacity and design principle to guarantee the QoS provided by services. It uses dynamic SLA binding in a service computing system environment to achieve its goal. The capacity and quality of a SLA-Aware service computing system is controlled by dynamic SLAs and managed by dynamic SLM.
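The following Python sketch renders Definition 1 and Table 1 as a small data model. The field and class names are our own illustration; the chapter defines the concepts, not a concrete schema.

```python
from dataclasses import dataclass, field

@dataclass
class SLO:
    """One Service Level Objective: a measurable quality attribute and its target."""
    metric: str       # e.g. "availability_pct", "response_time_s"
    comparator: str   # e.g. ">=", "<="
    target: float

@dataclass
class ServiceLevelAgreement:
    """SLA = SLA(SC, SP, C(QoS)) from formula (4.0)."""
    consumer: str                 # SC: an end consumer, or a service acting as one
    provider: str                 # SP: a web service or a cloud service (e.g. IaaS)
    qos_contract: list[SLO] = field(default_factory=list)  # C(QoS)
    dynamic: bool = True          # Table 1: dynamic (machine-processable) vs. static

# e.g. the TravelRes guarantees of the Motivation section as a dynamic SLA
sla = ServiceLevelAgreement(
    consumer="travel-agent-app", provider="TravelRes-ticket-service",
    qos_contract=[SLO("availability_pct", ">=", 99.5),
                  SLO("response_time_s", "<=", 40.0)])
```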
SLA-Aware ESOA and SLA-Aware ECSA

A software architectural style is an abstraction of a family of systems as a pattern of structural organization. An architectural style is a coordinating set of architectural constraints that restrict the roles/features of architectural elements and the allowed relationships among those elements within any architecture that conforms to that style. Therefore, an architectural style is a kind of roadmap and guidance for analyzing and designing concrete architectures. We previously proposed a model of enterprise service-oriented architecture (ESOA) (Tang, L., Dong, J., Peng, T., and Tsai, W. T., SOCA 2010). In this work we extend the ESOA style to the following SLA-Aware ESOA style:

ESOA-SLA = ⟨SSLA, CSLA, DSLA, SSLAI, SSLAM, SSLAP, SSLAQ⟩, (4.1)

in which

SSLA = {si | si is a SLA-Aware web service}, (4.2)
CSLA = {ci | ci is a SLA-Aware service consumer}, (4.3)
DSLA = {di | di is a SLA-Aware SOA data element}, (4.4)
SSLAI = {ri | ri is a SLA-Aware SOA infrastructure}, (4.5)
SSLAM = {mi | mi is a SLA-Aware SOA management}, (4.6)
SSLAP = {pi | pi is a SLA-Aware SOA process}, (4.7)
SSLAQ = {qi | qi is a SLA-Aware SOA quality attribute}. (4.8)

Using the notation "⊲" to indicate the style extension relationship, we have ESOA ⊲ ESOA-SLA. The new SLA constraints are added to the parent style ESOA, and they apply consistently to the new elements, such as dynamic SLM and machine-processable SLA. The style extension is part of architectural style refinement (Pahl, C., Giesecke, S., & Hasselbring, W., 2009). We will explore the style, style refinement analysis, and evaluation in our future work.

In (Tang, L., Dong, J., Zhao, Y., & Zhang, L.-J., 2010), we presented a new enterprise service architectural style, called Enterprise Cloud Service Architecture (ECSA), which is a hybrid style of ESOA and cloud computing. Here we extend this style to the following SLA-Aware style:

ECSA-SLA = ⟨SSLA, CSLA, DSLA, SSLAI, SSLAM, SSLAP, SSLAQ, SSLAD⟩, (4.9)

in which
SSLA = {si | si is a SLA-Aware cloud service}, (4.10)
CSLA = {ci | ci is a SLA-Aware cloud service consumer}, (4.11)
DSLA = {di | di is a SLA-Aware SOA cloud data element}, (4.12)
SSLAI = {ri | ri is a SLA-Aware SOA cloud infrastructure}, (4.13)
SSLAM = {mi | mi is a SLA-Aware SOA cloud management}, (4.14)
SSLAP = {pi | pi is a SLA-Aware SOA cloud process}, (4.15)
SSLAQ = {qi | qi is a SLA-Aware SOA cloud quality attribute}, (4.16)
SSLAD = SSLADI ∪ SSLADII ∪ SSLADIII, (4.17)

where

SSLADI = {d | d is a building element of development}, (4.18)
SSLADII = {d | d is a service deployment type}, (4.19)
SSLADIII = {d | d is a SLA-Aware service delivery model}. (4.20)

Similarly, we have ECSA ⊲ ECSA-SLA. Since the ESOA architecture can be regarded as a part of the ECSA architecture in the private cloud, we will focus on specifying the SLA-Aware ECSA in the rest of this section.
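Read as a data model, the ECSA-SLA tuple of equations (4.9)-(4.17) can be sketched as below; elements are reduced to plain names purely for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ECSASLAStyle:
    """The ECSA-SLA tuple of equation (4.9); each component is a set of elements."""
    services: set[str] = field(default_factory=set)        # S_SLA,  eq. (4.10)
    consumers: set[str] = field(default_factory=set)       # C_SLA,  eq. (4.11)
    data_elements: set[str] = field(default_factory=set)   # D_SLA,  eq. (4.12)
    infrastructure: set[str] = field(default_factory=set)  # S_SLAI, eq. (4.13)
    management: set[str] = field(default_factory=set)      # S_SLAM, eq. (4.14)
    processes: set[str] = field(default_factory=set)       # S_SLAP, eq. (4.15)
    quality_attrs: set[str] = field(default_factory=set)   # S_SLAQ, eq. (4.16)
    deployment: set[str] = field(default_factory=set)      # S_SLAD, eq. (4.17)
```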
SLA-Aware SOA Quality Attributes

We have defined the SOA quality attributes SQ as constraints of ESOA and ECSA. The SLA-Aware SOA quality attributes SSLAQ in equations (4.8) and (4.16) are subsets of SQ; they are both important to the services and measurable. They can be measured by monitoring tools and calculated by the service level management service. The core service level quality attributes are classified into several QoS classes, shown in Figure 2. The quality attributes are fundamental to the design of KQI, KPI, SLO and SLA.

Figure 2. SLA-Aware QoS Taxonomy
SLA-Aware Web Service

A traditional web service is a self-contained software abstraction of business or technical functionality, or of infrastructure management, characterized by a well-defined interface that normally focuses on descriptions of functional aspects such as input, output, preconditions and effects, known as IOPE (Dong, J., Paul, R., & Zhang, L.-J., 2008). The interface of a web service is defined by the WSDL language. However, a SLA-Aware web service not only focuses on its functional aspects, but also emphasizes its QoS through the dynamic SLA. We define the SLA-Aware web service as follows:

Definition 3: A SLA-Aware web service in equations (4.2) or (4.10) is a web service described by both WSDL and a formal SLA language. It is managed by the SLM based on the dynamic SLA with its consumers.
Figure 3 describes the SLA-Aware web service ontology.

Figure 3. SLA-Aware Web Service Ontology

As discussed in the third section, there are different languages, such as WSLA (Ludwig, H. et al., 2003) and WS-Agreement (Ludwig, H., 2009), which can be used for specifying the SLA.
4.5 SLA-Aware Service Consumer

First, we need to extend the concept of service consumer, defined in equations (4.3) and (4.11), as follows:

CSLA = CEnd ∪ CS, (4.21)

where CEnd is a set of end service consumers, in which an element can be any web service client or cloud web application, such as SaaS (Tang, L., Dong, J., Zhao, Y., & Zhang, L.-J., 2010); and CS is a subset of SSLA, in which each element is a service which consumes other services.
Unlike a traditional service consumer, a SLA-Aware service consumer is not only the service requestor, but also a SLA negotiator which sends a SLA negotiation request to the SLM of a service provider, either directly or through a negotiation broker (Hasselmeyer, P. et al., 2006), before sending the service request. We define the SLA-Aware service consumer as follows:

Definition 4: A SLA-Aware enterprise service consumer is a business application or another service which requests service from service provider(s), can initialize a SLA negotiation with its SP, and makes decisions regarding service class and service requests based on both functional and non-functional (QoS, such as performance, availability, security, pricing as well as penalty) requirements. The SLA-Aware service consumer should be self-managed through a self-service portal with a set of dashboards.
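A minimal sketch of this negotiate-then-request behavior follows. The SLM interface, the pricing rule, and all names are invented for illustration; real systems would exchange WSLA or WS-Agreement documents.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Offer:
    price: float
    slos: dict  # e.g. {"response_time_ms": 500, "availability_pct": 99.9}

class ProviderSLM:
    """Stub provider-side SLM that prices an offer from the requested SLOs."""
    def negotiate(self, slos: dict) -> Optional[Offer]:
        # Toy pricing rule: a tighter response-time target costs more
        price = 100.0 / max(slos.get("response_time_ms", 1000), 1)
        return Offer(price=price, slos=slos)

    def accept(self, offer: Offer) -> str:
        return "agreement-001"  # would persist the dynamic SLA and return its id

class SLAAwareConsumer:
    """Consumer per Definition 4: negotiates a SLA before requesting the service."""
    def __init__(self, slm: ProviderSLM):
        self.slm = slm

    def establish_sla(self, slos: dict, max_price: float) -> Optional[str]:
        offer = self.slm.negotiate(slos)
        if offer is None or offer.price > max_price:
            return None  # no agreement; the consumer may try another provider
        return self.slm.accept(offer)
```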
Figure 4 describes a model of the interaction between the SLA-Aware cloud service consumer (CSC) and the SLA-Aware cloud service provider (CSP).

Figure 4. Model of the Interaction between SLA-Aware CSC and SLA-Aware CSP

Moreover, we assume the CSP, such as the Amazon S3 web service, is in the public cloud (Tang, L., Dong, J., Zhao, Y., & Zhang, L.-J., 2010), which is based on a pay-as-you-go business model for its CSC. Therefore, there is service pricing in the service negotiation, a billing service for handling CSC payments, and billing justification based on usage and the agreement. For example, the SLA of the Amazon web service S3 (Amazon Web Services, 2010) defines the following Service Credit as its billing justification: Service Credits are calculated as a percentage of the total charges paid by you for Amazon S3 for the billing cycle in which the error occurred, in accordance with the schedule shown in Table 3.
Table 3. Service Credits of Amazon Web Service S3

• Monthly uptime percentage equal to or greater than 99% but less than 99.9%: 10% service credit
• Monthly uptime percentage less than 99%: 25% service credit
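Because the schedule in Table 3 is machine-checkable, a billing system can apply it directly; a minimal sketch:

```python
def s3_service_credit_pct(monthly_uptime_pct: float) -> float:
    """Service credit schedule transcribed from Table 3."""
    if monthly_uptime_pct < 99.0:
        return 25.0
    if monthly_uptime_pct < 99.9:
        return 10.0
    return 0.0

# e.g. a billing cycle at 99.2% uptime earns a credit of 10% of its charges
assert s3_service_credit_pct(99.2) == 10.0
```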
In Figure 4, the Service Delivery can be middleware with web service containers, such as the WebLogic and WebSphere application servers.
4.6 SLA-Aware SOA Infrastructure

The traditional SOA infrastructure is the heart of ESOA. It is the bridge of the transformation between business and services. However, the traditional enterprise SOA infrastructure is built in a kind of static data center (without adopting virtualization and other server consolidation technologies, or agility and alternate sourcing such as cloud computing), in which (1) pre-provisioned resources are used (rigid server silos and dedicated servers per application); (2) server CPU utilization is often in single digits; (3) scaling is achieved by adding hardware; and (4) resources are shared only within enterprise firewalls. Therefore, it is not adaptable to today's on-demand business workload and real-time B2B requirements. It also costs more resources and power within the enterprise's data center. The SLA-Aware SOA cloud infrastructure is a kind of SLA-driven service-oriented infrastructure which aims at improving the traditional SOA data center, reducing cost, and adapting to the on-demand requirements of business and customers.

Definition 5: The SLA-Aware SOA cloud infrastructure in equation (4.13) is a SLA-driven service-oriented infrastructure with the following main characteristics:

• It is managed by SLA-Aware SOA management.
• It supports elasticity and dynamism - automatic scalability, load-balancing, and failover based on SLA and in terms of virtualization (Tang, L., Dong, J., Zhao, Y., & Zhang, L.-J., 2010) or other technologies (Tang, L., Dong, J., Zhao, Y., & Zhang, L.-J., 2010).
• It supports global resource sharing through the Internet.
• It supports resource usage accountability - the utility model (Tang, L., Dong, J., Zhao, Y., & Zhang, L.-J., 2010).
• It can be a part of a cloud service, such as PaaS type services (Google App Engine (Tang, L., Dong, J., Zhao, Y., & Zhang, L.-J., 2010)), or can be a cloud service, such as an IaaS type service (Amazon EC2).
Figure 5 is a high-level view of the SLA-Aware dynamic data center in a cloud service-oriented enterprise. The SLA-Aware SOA cloud infrastructure consists of:

• an enterprise SOA and cloud service delivery network (SDN);
• a provisioning service;
• a dynamic virtualized infrastructure: Virtualization Infrastructure as a Service (VIaaS), such as VMware;
• a physical resource infrastructure: Physical Infrastructure as a Service (PIaaS);
• a business applications layer;
• SOA infrastructure management, which includes SLA management as well as other management systems; and
• monitoring systems.

Figure 5. SLA-aware SOA cloud infrastructure
For a SLA-Aware SOA cloud infrastructure, the VIaaS and PIaaS should be able to manage resources such as CPU, OS, networking and storage allocation, and should be able to tune and re-purpose resources as the environment changes. The SLA management should (1) guarantee that resources are allocated dynamically based on demand; (2)
guarantee the QoS (such as availability, performance, and security) defined in the SLA; and (3) guarantee the pricing and billing agreement. The monitoring system should (1) monitor SLA and heartbeats; (2) monitor the capacity of VIaaS and PIaaS; (3) monitor usage; (4) monitor utilization of resources as well as services; and (5) provide analysis and calculation results to SLA management and the provisioning service as well as the billing service. In Figure 5, the other management components in SOA management may include service discovery, policy enforcement, etc. (Tang, L., Dong, J., Peng, T., and Tsai, W. T., SOCA 2010).
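One pass of those monitoring duties can be sketched as a simple fan-out loop; the monitor, SLA manager, provisioner, and billing objects below are hypothetical stand-ins for the systems in Figure 5.

```python
def monitoring_cycle(monitor, sla_manager, provisioner, billing):
    """Collect the five kinds of observations and fan the analysis out."""
    sample = {
        "sla_and_heartbeats": monitor.check_sla_and_heartbeats(),  # duty (1)
        "capacity": monitor.capacity_of_viaas_and_piaas(),         # duty (2)
        "usage": monitor.usage(),                                  # duty (3)
        "utilization": monitor.utilization(),                      # duty (4)
    }
    analysis = monitor.analyze(sample)                             # duty (5)
    sla_manager.receive(analysis)   # drives dynamic allocation and QoS guarantees
    provisioner.receive(analysis)   # drives tuning/re-purposing of resources
    billing.receive(analysis)       # drives pricing and billing per the SLA
```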
SLA-Aware SOA Management

The architectural styles ESOA-SLA and ECSA-SLA we have proposed are SLA-aware and service oriented. The SLA-Aware SOA management in equations (4.6) and (4.14) is one of the key parts of (4.1) and (4.9). It differs from the general concepts and approaches for SOA management of ESOA and ECSA that we have discussed, since it is SLA-aware and dynamic. ESOA-SLA and ECSA-SLA emphasize end-to-end SLA management.

Definition 6: The end-to-end SOA management SLAM can be defined as a set of SLMs:

SLAM = {SLM_i | SLM_i is an SLM with SLA_i for service s_i},  (4.22)

in which s_i includes functional services; infrastructure services such as VIaaS, PIaaS, and IaaS; and other SOA management services, such as security services and logging services. Figure 6 depicts the end-to-end SLA management in a service-oriented enterprise architecture.

Figure 6. End-to-End SLA management in service-oriented enterprise architecture

Figure 4 shows that the SLM plays a service manager role. We highlight an SLA-Aware SLM for a cloud service, such as the airline ticket reservation service in Figure 8. The SLM architecture can be implemented with the WSLA framework (Keller, A. & Ludwig, H., 2003), the WSOL framework (Tosic, V. et al., 2005), or the WS-Agreement standard (Andrieux, A. et al., 2007). For instance, the SLA negotiation and offer between service consumer and service provider can be implemented with WS-Agreement. Figure 7 shows the agreement offer document defined by WS-Agreement for the ticket search service. A functional service like the travel service, under the management of an SLM, differs from a traditional service. It must (as sketched after this list):

•	query the SLM when it is going to execute an action/operation (e.g., search tickets);
•	notify the SLM of resource usage in a timely manner; and
•	obey the SLM's instructions to destroy activities.
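The sketch below shows how a managed functional service could honor these three obligations. The SLM interface and service names are hypothetical stand-ins, not the WSLA, WSOL, or WS-Agreement APIs.

```python
class SimpleSLM:
    """Toy service level manager, just enough to make the sketch runnable."""
    def authorize(self, operation, request):
        return True          # a real SLM would check the active SLA here
    def record_usage(self, operation, units):
        print(f"usage recorded: {operation} x{units}")

class TicketSearchService:
    def __init__(self, slm):
        self.slm = slm       # the SLM this managed service reports to

    def search(self, query):
        # 1. Query the SLM before executing the operation.
        if not self.slm.authorize("search", query):
            raise PermissionError("operation rejected under the current SLA")
        results = [f"ticket for {query}"]   # placeholder business logic
        # 2. Notify the SLM of resource usage in a timely manner.
        self.slm.record_usage("search", units=1)
        return results

    def destroy_activities(self):
        # 3. Obey the SLM's instruction to destroy running activities.
        pass

print(TicketSearchService(SimpleSLM()).search("NYC -> SFO"))
```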
The user account service is one of the core parts of SLA management, since there is no way to handle users' credit and service payments without it. When the user proposes a new SLA, the SLM needs to verify the user's credit with the account system. When the user uses the travel service, the SLM needs to record the usage in the account service. At the end of each SLA billing cycle, the SLM records the total usage in the user's account for billing the user. Moreover, when billing an account, if the SLM finds that the account is suspended or closed, then the SLA will be suspended or closed as well.
Figure 7. Agreement Offer of WS-Agreement for the Search Ticket Service
SLA-Aware SOA Process

One of the important parts of the ESOA style is its set of SOA processes. The SOA process, or workflow, is an abstraction of Business Process Management (BPM). Each process is composed of multiple services in orchestration and/or choreography for completing a whole or partial business process or task. The traditional SOA process can be executed by using an ESOA infrastructure with a process engine in the internal network of an enterprise. However, traditional SOA processes face many challenges and issues: real-time high performance (such as automated trading), on-demand scalability, large payloads (10+ MB), memory constraints, and high availability and reliability. The SOA process of the ECSA style resolves the issues of the traditional SOA process. Some complex transaction processes and workflows in an enterprise may need to compose multiple services in the cloud to complete their tasks. However, traditional approaches lack end-to-end QoS guarantees for processes. The question is: how can the cloud process service provider guarantee the quality requirements of the service consumers? In this section, we specify the SLA-Aware SOA process S_SLAP in ECSA-SLA.

Figure 8. SLM for SLA-Aware Cloud Travel Service

Definition 7: Let p_SLA ∈ S_SLAP; then an end-to-end SLA-Aware SOA cloud process can be defined as
p_SLA = {c ∈ C_SLA} ∪ {s_i | s_i ∈ S_SLA, i = 1, 2, …, n} ∪ {IaaS_k | IaaS_k ∈ S_SLAI, k = 1, 2, …, m}.

We define the end-to-end SLA chain for the process p_SLA as

SLA_p = SLA(p) ∪ SLA(IaaS),

SLA(p) = {SLA_ij | SLA_ij = SLA(s_i, s_j), i ≠ j, s_i is the service consumer, s_j is the service provider, SLA_ij ≠ ∅},

in which i = 0, 1, 2, …, n, and s_0 = c is the service consumer which initiates the process. Suppose s_1 is the first service called by c, and

SLA(IaaS) = {SLA_i,IaaSk | SLA_i,IaaSk = SLA(s_i, IaaS_k), i ≤ n, k = 1, 2, …, m, IaaS_k is a service provider},

where n ≥ m ≥ 1, SLA_01 ≠ ∅, SLA(p) can be empty, and SLA(IaaS) ≠ ∅. This means that there are at least two SLAs: one between the process service consumer and the process service, and the other between the process service and its infrastructure. If n > 1 and s_i, i = j_1, j_2, …, j_k, are external services in k different clouds in the process p_SLA, then m = k. The structure of SLA(p) depends on the process patterns and the way the SLA_ij are specified. For instance, Figure 9 describes an SLA-Aware sequential travel reservation workflow with two cloud services. Therefore p_SLA = {c ∈ C_SLA} ∪ {s_i | s_i ∈ S_SLA, i = 1, 2, 3, 4} ∪ {IaaS_k | IaaS_k ∈ S_SLAI, k = 1, 2, 3}, SLA(p) = {SLA_12, SLA_13}, and SLA(IaaS) = {SLA_2,IaaS1, SLA_3,IaaS2, SLA_4,IaaS3}.

Figure 9. An SLA-Aware Sequence Travel Reservation Workflow

The SLA-Aware SOA cloud processes, such as service composition, workflow, orchestration and choreography, are very important for improving customer experience and satisfaction with enterprises; this topic has therefore attracted much research interest, including the works listed in the second section and (Beauche, S. & Poizat, P., 2008)(Zeng, L. et al., 2004).
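To make the Figure 9 example concrete, the sketch below encodes its SLA chain as plain Python sets. The identifiers mirror the text (c, s1..s4, IaaS1..IaaS3); keeping SLA01 as a separate set is a choice made here for readability, since the worked example lists SLA(p) = {SLA12, SLA13}.

```python
# End-to-end SLA chain of the Figure 9 example, written as Python sets
# of (consumer, provider) pairs.
sla_01 = {("c", "s1")}                                    # consumer to process service
sla_p = {("s1", "s2"), ("s1", "s3")}                      # SLA12 and SLA13
sla_iaas = {("s2", "IaaS1"), ("s3", "IaaS2"), ("s4", "IaaS3")}

sla_chain = sla_01 | sla_p | sla_iaas

# The definition requires at least two SLAs: the consumer's SLA and
# at least one SLA between a service and its infrastructure.
assert sla_01 and sla_iaas and len(sla_chain) >= 2
print(sorted(sla_chain))
```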
SLA-Aware Cloud Service Provisioning and Subscription

We previously defined the enterprise cloud service delivery model in (Tang, L., Dong, J., Zhao, Y., & Zhang, L.-J., 2010). Its extension, the SLA-Aware cloud service delivery model defined in equation (4.20), is specified in Table 4. All the SLA-Aware cloud services in the different models are actually delivered through a set of SLA-Aware cloud service provisioning services (Zhang, L.-J. & Zhou, Q., 2009) by service providers, which are part of the enterprise cloud SOA infrastructure. The SLA-Aware service provisioning in Figure 5 has four interfaces:

•	An interface with SLA management (SLM), which accepts SLM control and reports service usage to the SLM.
•	An interface with resource management, which allocates resources for services based on demand.
•	An interface with the service scheduling system, to provide scheduled services to clients based on the SLA.
•	An interface with service consumers, which delivers services to consumers.
In an SLA-Aware SOA cloud service environment, the cloud service subscription (Zhang, L.-J. & Zhou, Q., 2009) from clients (service consumers) is managed by a set of the service provider's SLA-Aware service subscription services, which process the subscriptions of service consumers together with SLA information. Zhang and Zhou pointed out that cloud provisioning and subscription services should be extendable to support different types of resource sharing (Zhang, L.-J. & Zhou, Q., 2009) and service subscription. SLA-Aware service provisioning and subscription is a principle of designing ECSA-SLA style architectures, as well as a challenge for both researchers and practitioners in enterprise service computing.
Table 4. SLA-Aware delivery modes of cloud services

Delivery Mode | Description | Resource Sharing
SaaS-SLA | SLA-Aware Software as a Service | Sharing software under dynamic SLA
PaaS-SLA | SLA-Aware Platform as a Service | Sharing platform under dynamic SLA
IaaS-SLA | SLA-Aware Infrastructure as a Service | Sharing infrastructure under dynamic SLA
IMaaS-SLA | SLA-Aware Information as a Service | Sharing information under dynamic SLA
IRaaS-SLA | SLA-Aware Integration as a Service | Sharing integration under dynamic SLA
XaaS-SLA | SLA-Aware other cloud service delivery models | Sharing other resources under dynamic SLA
In summary, the fourth section primarily specifies the new architectural style (ECSA-SLA) and its ontology. The style emphasizes dependability within enterprise service computing through dynamic SLA mechanisms and SLA management as first-class architecture design considerations. However, we still face many challenges in many aspects, especially in research and practice. We discuss those challenges in the next section.
CHALLENGES OF SLA-AWARE ENTERPRISE SERVICE COMPUTING

SLA-Aware Enterprise Service Computing is a new enterprise architectural style. Higher automation, performance and adaptation are required for designing architectures in this style. Therefore, researchers and practitioners face a number of challenges. The challenges include:

•	General challenges:
	◦	Theoretical foundation of SLA-Aware enterprise service computing
	◦	Formalizing complicated service-oriented enterprise architectural styles
	◦	Verifying complex architectural styles
	◦	Autonomic self-service on the client side, which can monitor and manage the SLA execution on the server side
	◦	Automated service provisioning and subscription
	◦	Automated service discovery and selection
•	New challenges:
	◦	Automated service level management
	◦	Automated SLA monitoring, which can monitor SLA execution dynamically
	◦	Adaptive resource management based on SLA and demand
	◦	Adaptive SLA-Aware service execution in the SP environment, such as adaptive service performance and scalability management, change management, dynamic reconfiguration, exception management, and fault tolerance
	◦	Adaptive system optimization
	◦	Real-time (RT) or close-to-RT SLA management, and dynamic SOA infrastructure and management
Autonomic computing, automated and adaptive service computing, and event-driven and real time service computing have been researched and adopted for tackling some of the challenges of SLA-Aware ESC.
Autonomic Service Computing

Enterprise services run in a flexible, constantly changing, and complicated distributed SOA environment. Thus, self-management is necessary for reducing the management complexity (Kephart, J.O. & Chess, D., 2003). Kephart and Chess discussed the vision of autonomic computing (Kephart, J.O. & Chess, D., 2003), which defines self-management as comprising the concepts of self-configuration, self-optimization, self-healing and self-protection. These self-services are very useful in cloud services, client-side management, and automated SLA-Aware cloud workflow management.
Automated and Adaptive Service Computing

SLA-Aware ESC requires automated SLA negotiation between SC and SP as part of automated service provisioning. A dynamic ESOA/ECSA infrastructure requires dynamic resource management, or an elastic cloud, based on demand. This requires the system to have some degree of automation and adaptation (Sahai, A., Durante, A., & Machiraju, V., 2001)(Chung, L., & Subramanian, N., 2003)(Wang, G. et al., 2005). The dynamic SLM needs to use automated SLA monitoring, diagnostics, and configurable and reconfigurable adaptation for managing SLA-Aware enterprise service systems (Beauche, S. & Poizat, P., 2008)(Buyya, R. et al., 2009)(Gao, T. et al., 2005)(Sahai, A., et al. 2002). SLA-Aware ESC also differentiates service classes, which allows clients to choose a proper class of service based on the pricing model; therefore, automated service discovery and selection (Liu, Y., Ngu, A.H., & Zeng, L.Z., 2004) becomes an important design requirement for some service systems.
Event-Driven and Real Time Enterprise Service Computing

Enterprises need automated SLM to make sure they meet SLAs and optimize service delivery in order to improve business outcomes. The SLM for SLA-Aware ESC requires real-time or close-to-RT visibility, dynamic SLA negotiation, dynamic system reconfiguration, and continuous refinement. However, this level of management is not easy to accomplish with today's distributed and interconnected applications, because they execute on heterogeneous systems in different locations. As a result, getting end-to-end visibility to track real-time processes and to assure that individual business transactions meet SLAs is a challenging task. Event-Driven Architecture (EDA) (Taylor, H., et al., 2009)(Tang, L., Dong, J., Peng, T., & Tsai, W. T., SOCA 2010) and RTSOA (Tsai, W.T., Sun, X., & Balasooriya, J., 2010)(Tsai, W.T., Shao, Q., Sun, X., & Elston, J., 2010) are solutions to this challenge.
CONCLUSION AND FUTURE RESEARCH

We have introduced SLA-Aware enterprise service computing and specified two new architectural styles: SLA-Aware ESOA and SLA-Aware ECSA. SLA-Aware architectural styles have two unique characteristics: (1) SLA-Aware SOA applications require a set of SLM capabilities from both service consumers and service providers; (2) the processing of the non-functional requests (SLAs covering performance, dynamic scalability, availability, etc.) of services is considered a first-class capability and is executed before the functional operations of a service. In this way, service providers are required to provide not only functional services but also the QoS to service consumers. This capability is the key requirement for a family of systems, for example, real-time online trading systems and online travel reservation systems. Examples include cloud services such as Amazon Web Services EC2 and S3, which require higher performance, availability, and dynamic scalability to satisfy the service consumers (business customers or their applications). Customers can get services and the corresponding QoS, such as performance, availability and price, based on the SLA.
Table 5. Challenges and research directions

Challenge | Research Directions
Theoretical foundation of SLA-Aware enterprise service computing | SLA-Aware enterprise service computing is a new paradigm of distributed computing. Its theory and formalization is a hot research topic. There are several research directions: ontology of SLA-Aware enterprise service computing, such as SLA and QoS ontology (Dobson, G., & Sanchez-Macian, A., 2006); formal calculus for programmable QoS, such as Kaos (De Nicola, R. et al., 2003); and event calculus for WS-Agreement (Mahbub, K. & Spanoudakis, G., 2007).
Modeling SLA-Aware enterprise service computing styles | SLA-Aware enterprise service computing can be viewed as an architectural style. Modeling the style and its refinement, such as its substyles, is an interesting research topic. Recent trends are: ontology-based modeling methodology (Pahl, C., Giesecke, S., & Hasselbring, W., 2009); Architectural Description Language (ADL) based modeling, such as ACME (Garlan, D. & Schmerl, B., 2006) and Alloy (Kim, J.S. & Garlan, D., 2006); and graph-based modeling (Baresi, L. et al., 2006).
Automated and adaptive SLA-Aware enterprise service computing | SLA-Aware enterprise service computing requires automated and adaptive service level management, automated QoS-pricing computing, SLA-based adaptive optimization, an elastic infrastructure, and dynamic system reconfiguration. These requirements introduce many challenging research topics, outlined in the section "Automated and Adaptive Service Computing".
Real-time or close-to-RT SLA-Aware enterprise service computing | To guarantee delivering services and the end-to-end transaction process on SLA in a highly dynamic environment, such as the cloud, SLA-Aware ESC needs to support real-time or close-to-RT monitoring as well as measurement and management. Event-Driven Architecture (EDA) and RTSOA provide ideas and technology. How to plug them into SLA-Aware ESC is another interesting research direction. This topic is briefly introduced in the section "Event-Driven and Real Time Enterprise Service Computing".
Automated end-to-end and chain SLA in transaction processes or workflows | There is a lot of research on SOA processes and workflows. However, how to meet SLAs for each service node in an end-to-end transaction process or workflow is a challenge. Modeling the SLA-Aware SOA process and its architecture is worthy of further research.
SLA-Aware application servers, enterprise service bus (ESB) and other service process engines | SOA-enabled application servers, such as WebLogic and WebSphere, ESBs and process engines play an important role, the role of service mediator, in enterprise service computing. Researching the next-generation SLA-Aware, adaptive, highly intelligent service mediator is also an exciting project.
To enable the dynamic SLA and SLM in a traditional ESOA stack, representing the SLA in a standard way is important. We have introduced several standard ways of defining the SLA in machine-processable languages, such as WS-Agreement, the WSLA language, and WSOL. Most of the SLA languages are built on XML. They support the SLA lifecycle in that SLAs are built, negotiated, executed, and terminated through SLA-Aware SOA management such as a dynamic SLM, SLA-Aware middleware, and brokers.

We defined SLA-Aware ESC as an architectural style in this chapter. The primary advantages of viewing and defining SLA-Aware ESC as an architectural style are the abstraction of the common structure, constraints, and behavior of a family of ESC systems, such as ECSA-SLA style systems, and the definition of general design principles for this family of enterprise architectures. The design principles of SLA-Aware ESC systems are discussed through the specification of our SLA-Aware ESOA and ECSA formulas. The principles include:

•	Make SLA management and QoS the first-class consideration.
•	Represent the SLA in a standard machine-processable language.
•	Manage the SLA between service consumer and service provider through a dynamic SLM.
•	Enable and execute SLA-based QoS operations ahead of service functional operations.
•	Manage SLAs between enterprise services and ESOA/ECSA infrastructure providers through SLA-Aware SOA management, which includes SLA monitoring, SLA control, SLA execution, dynamic reconfiguration, and SLA lifecycle management. The SLA-Aware SOA management also supports SLA-based dynamic resource management, service provisioning, subscription and classification (rating or pricing).
•	Build SLA-Aware SOA processes and workflows with end-to-end SLA management.
•	Adopt autonomic service computing: self-management, self-service, self-configuration, and self-error handling and recovery.
•	Adopt automated and adaptive service computing.
Finally, we point out that there are many challenges and opportunities for both researchers and practitioners in this emerging area. We summarize these challenges (or issues) and future research directions in Table 5.
REFERENCES

Amazon. Auto Scaling and load balance. Retrieved October 24, 2010, from http://aws.amazon.com/autoscaling/

Amazon. EC2 SLA. Retrieved October 24, 2010, from http://aws.amazon.com/ec2-sla/

Amazon Web Services. Retrieved October 24, 2010, from http://aws.amazon.com/about-aws/

Andrieux, A., Czajkowski, K., Dan, A., Ludwig, H., Nakata, T., Pruyne, J., et al. (2007). Web Service Agreement Specification (WS-Agreement). Retrieved from http://www.ogf.org/documents/GFD.107.pdf

Badidi, E., Esmahi, L., Adel Serhani, M., & Elkoutbi, M. (2006). WS-QoSM: A Broker-based Architecture for Web Services QoS Management (pp. 1-5). Innovations in Information Technology.

Baresi, L., Heckel, R., Thöne, S., & Varró, D. (2003). Modeling and Validation of Service-Oriented Architectures: Application vs. Style. The Fourth Joint Meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering.

Beauche, S., & Poizat, P. (2008). Automated Service Composition with Adaptive Planning. Lecture Notes in Computer Science, 5364, 530-537. doi:10.1007/978-3-540-89652-4_42

Berbner, R., Spahn, M., Repp, N., Heckmann, O., & Steinmetz, R. (2010). WSQoSX - A QoS Architecture for Web Service Workflows. Lecture Notes in Computer Science, 4749, 623-624. doi:10.1007/978-3-540-74974-5_59

Bhoj, P., Singhal, S., & Chutani, S. (2001). SLA Management in federated environments. Computer Networks, 35, 5-24. doi:10.1016/S1389-1286(00)00149-3

Bouchenak, S. (2010). Automated Control for SLA-Aware Elastic Clouds. Proceedings of the Fifth International Workshop on Feedback Control Implementation and Design in Computing Systems and Networks, Paris, France (pp. 27-28).

Brandic, I., Venugopal, S., Mattess, M., & Buyya, R. (2008). Towards a Meta-Negotiation Architecture for SLA-Aware Grid Services. Technical Report GRIDS-TR-2008-10.

Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J., & Brandic, I. (2009). Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6), 599-616. doi:10.1016/j.future.2008.12.001
Chothia, T., & Kleijn, J. (2007). Q-Automata: Modeling the Resource Usage of Concurrent Components. Electronic Notes in Theoretical Computer Science, 175, 153-167. doi:10.1016/j.entcs.2007.03.009

Chung, L., Nixon, B. A., Yu, E., & Mylopoulos, J. (2000). Non-functional requirements in software engineering. Springer.

Chung, L., & Subramanian, N. (2003). Adaptive System/Software Architecture. Journal of Systems Architecture.

Comuzzi, M., Theilmann, W., Zacco, G., Rathfelder, C., Kotsokalis, C., & Winkler, U. (2009). A Framework for Multi-level SLA Management. The Eighth International Conference on Service Oriented Computing (ICSOC).

Dan, A., Ludwig, H., & Pacifici, G. (2003). Web Services Differentiation with Service Level Agreement. Retrieved from http://www.ibm.com/developerworks/library/ws-slafram/

De Nicola, R., Ferrari, G., Montanari, U., Pugliese, R., & Tuosto, E. (2003). A Formal Basis for Reasoning on Programmable QoS. Lecture Notes in Computer Science, 2772, 436-479.

De Nicola, R., Ferrari, G., Montanari, U., Pugliese, R., & Tuosto, E. (2005). A Process Calculus for QoS-Aware Applications. Lecture Notes in Computer Science, 3454, 33-48. doi:10.1007/11417019_3

Dobson, G., & Sanchez-Macian, A. (2006). Towards unified QoS/SLA Ontologies. Proceedings of the IEEE Services Computing Workshops (pp. 169-174).

Dong, J., Paul, R., & Zhang, L.-J. (2008). High Assurance Service-Oriented Architecture. IEEE Computer, 41(8), 22-23.

Dong, J., Paul, R., & Zhang, L.-J. (2009). High Assurance Services Computing. Springer.

Gao, T., Ma, H., Yen, I.-L., Bastani, F., & Tsai, W.-T. (2005). Toward QoS Analysis of Adaptive Service-Oriented Architecture. IEEE International Symposium on Service-Oriented System Engineering (SOSE) (pp. 219-226).

Garlan, D., & Schmerl, B. (2006). Architecture-driven modeling and analysis. In Tony Cant (Ed.), Proceedings of the 11th Australian Workshop on Safety Related Programmable Systems.

Gibbens, R., Mason, R., & Steinberg, R. (2000). Internet service classes under competition. IEEE Journal on Selected Areas in Communications, 18(12), 2490-2498. doi:10.1109/49.898732

Hasselmeyer, P., Qu, C., Schubert, L., Koller, B., & Wieder, P. (2006). Towards Autonomous Brokered SLA Negotiation. In Exploiting the Knowledge Economy: Issues, Applications, Case Studies. Amsterdam: IOS Press.

Keller, A., & Ludwig, H. (2003). The WSLA Framework: Specifying and Monitoring Service Level Agreements for Web Services. Journal of Network and Systems Management, 11(1), 57-81. doi:10.1023/A:1022445108617

Kephart, J. O., & Chess, D. (2003). The Vision of Autonomic Computing. IEEE Computer, 36(1), 41-50.

Kim, J. S., & Garlan, D. (2006). Analyzing Architectural Styles with Alloy. Proceedings of the Workshop on the Role of Software Architecture for Testing and Analysis.

Kritikos, K., & Plexousakis, D. (2008). QoS-Based Web Service Description and Discovery. Retrieved from http://ercim-news.ercim.eu/qos-based-web-service-description-and-discovery

Liu, Y., Ngu, A. H., & Zeng, L. Z. (2004). QoS computation and policing in dynamic web service selection. Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters (pp. 66-73).
Ludwig, H. (2009). WS-Agreement Concepts and Use: Agreement-Based, Service-Oriented Architecture. In Service-Oriented Computing (pp. 199-228). The MIT Press.

Ludwig, H., Keller, A., Dan, A., & King, R. (2003). A Service Agreement Language for Dynamic Electronic Services. Electronic Commerce Research, 3, 43-59. doi:10.1023/A:1021525310424

Mahbub, K., & Spanoudakis, G. (2007). Monitoring WS-Agreements: An Event Calculus-Based Approach. In Test and Analysis of Web Services.

McGough, A. S., Akram, A., Colling, D., Guo, L., Kotsokalis, C., & Krznaric, M. (2009). Enabling Scientists Through Workflow and Quality of Service. In Grid Enabled Remote Instrumentation (pp. 345-359). Springer.

Meng, S. (2007). QCCS: A Formal Model to Enforce QoS Requirements in Service Composition. Proceedings of the First Joint IEEE/IFIP Symposium on Theoretical Aspects of Software Engineering (pp. 389-400).

Nurmela, T., & Kutvonen, L. (2007). Service Level Agreement Management in Federated Virtual Organizations. Lecture Notes in Computer Science, 4531, 62-75. doi:10.1007/978-3-540-72883-2_5

Overton, C. (2002). On the theory and practice of Internet SLAs. Computer Measurement Group, Journal of Computer Resource Measurement, 106, 32-45.

Padgett, J., Djemame, K., & Dew, P. (2005). Grid-Based SLA Management. Lecture Notes in Computer Science, 3470, 1076-1085. doi:10.1007/11508380_110

Pahl, C., Giesecke, S., & Hasselbring, W. (2009). An Ontology-Based Approach for Modeling Architectural Styles. Lecture Notes in Computer Science (Vol. 4758, pp. 60-75).

Qian, L., Luo, Z., Du, Y., & Guo, L. (2009). Cloud Computing: An Overview. Lecture Notes in Computer Science, 5931, 626-631. doi:10.1007/978-3-642-10665-1_63

Sahai, A., Durante, A., & Machiraju, V. (2001). Towards Automated SLA Management for Web Services. HP Technical Report HPL-2001-310(R.1).

Sahai, A., Machiraju, V., Sayal, M., van Moorsel, A. P. A., & Casati, F. (2002). Automated SLA Monitoring for Web Services. Lecture Notes in Computer Science, 2506, 28-41. doi:10.1007/3-540-36110-3_6

Schmid, M., & Kroeger, R. (2008). Decentralised QoS-Management in Service Oriented Architecture. Lecture Notes in Computer Science, 5053, 44-57. doi:10.1007/978-3-540-68642-2_4

Shaw, M. (1995). Comparing architectural design styles. IEEE Software, 12(6). doi:10.1109/52.469758

Skene, J., Lamanna, D. D., & Emmerich, W. (2004). Precise Service Level Agreements. In Proceedings of the 26th International Conference on Software Engineering (ICSE '04) (pp. 179-188). Washington, DC: IEEE Computer Society.

SLA@SOI. Empowering the service industry with SLA-aware infrastructures. Retrieved October 24, 2010, from http://sla-at-soi.eu/research/

Song, Y., Li, Y., Wang, H., Zhang, Y., Feng, B., Zang, H., & Sun, Y. (2008). A Service-Oriented Priority-Based Resource Scheduling Scheme for Virtualized Utility Computing. Lecture Notes in Computer Science, 5374, 220-231. doi:10.1007/978-3-540-89894-8_22

Tang, L., & Dong, J. (2007). A Survey of Formal Methods for Software Architecture. Proceedings of the International Conference on Software Engineering Theory and Practice (pp. 221-227).
Tang, L., Dong, J., & Peng, T. (2008). A Generic Model of Enterprise Service-Oriented Architecture. 4th IEEE International Symposium on Service-Oriented System Engineering (SOSE) (pp. 1-7).

Tang, L., Dong, J., Peng, T., & Tsai, W. T. (2010). A Classification of Enterprise Service-Oriented Architecture. 5th IEEE International Symposium on Service-Oriented System Engineering (SOSE) (pp. 74-81).

Tang, L., Dong, J., Peng, T., & Tsai, W. T. (2010). Modeling Enterprise Service-Oriented Architectural Styles. Service Oriented Computing and Applications (SOCA), 4(2), 81-107. doi:10.1007/s11761-010-0059-2

Tang, L., Dong, J., Zhao, Y., & Zhang, L.-J. (2010). Enterprise Cloud Service Architecture. The 3rd IEEE International Conference on Cloud Computing (pp. 27-34).

Tang, L., Zhao, Y., & Dong, J. (2009). Specifying Enterprise Web-Oriented Architecture. In High Assurance Services Computing (pp. 241-260). Springer.

Taylor, H., Yochem, A., Phillips, L., & Martinez, F. (2009). Event-Driven Architecture. Addison-Wesley.

The Open Group. (2004). SLA Management Handbook.

Tosic, V., Pagurek, B., Patel, K., Esfandiari, B., & Ma, W. (2005). Management applications of the Web Service Offerings Language (WSOL). Information Systems, 30(7), 564-586. doi:10.1016/j.is.2004.11.005

Tosic, V., Patel, K., & Pagurek, B. (2002). WSOL - Web Service Offerings Language. Lecture Notes in Computer Science, 2612, 57-67. doi:10.1007/3-540-36189-8_5

Tsai, W. T., Shao, Q., Sun, X., & Elston, J. (2010). Real-Time Service-Oriented Cloud Computing. The 6th World Congress on Services (pp. 473-478).

Tsai, W. T., Sun, X., & Balasooriya, J. (2010). Service-Oriented Cloud Computing Architecture. The Seventh International Conference on Information Technology (pp. 684-689).

Wada, H., Champrasert, P., Suzuki, J., & Oba, K. (2008). Multiobjective Optimization of SLA-Aware Service Composition. The IEEE Congress on Services - Part I (pp. 368-375).

Wang, G., Wang, C., Chen, A., Wang, H., Fung, C., Uczekaj, S., et al. (2005). Service Level Management using QoS Monitoring, Diagnostics, and Adaptation for Network Enterprise Systems. Proceedings of the Ninth IEEE International EDOC Enterprise Computing Conference (pp. 239-250).

Wang, H., Wang, G., Wang, C., Chen, A., & Santiago, R. (2007). Service Level Management in Global Enterprise Services: from QoS Monitoring and Diagnostics to Adaptation, a Case Study. Proceedings of the Eleventh International IEEE EDOC Conference Workshop (pp. 44-51).

Yau, S. S., & An, H. (2009). Adaptive resource allocation for service-based systems. Proceedings of the First Asia-Pacific Symposium on Internetware.

Yeom, G., Tsai, W.-T., Bai, X., & Min, D. (2009). Design of a Contract-Based Web Services QoS Management System. Proceedings of the 29th IEEE International Conference on Distributed Computing Systems Workshops (pp. 306-311).

Zeng, L., Benatallah, B., Ngu, A. H. H., Dumas, M., Kalagnanam, J., & Chang, H. (2004). QoS-Aware Middleware for Web Services Composition. IEEE Transactions on Software Engineering, 30(5), 311-327. doi:10.1109/TSE.2004.11

Zhang, L.-J., Zhang, J., & Cai, H. (2007). Services Computing. Springer.

Zhang, L.-J., & Zhou, Q. (2009). CCOA: Cloud Computing Open Architecture. IEEE International Conference on Web Services (pp. 607-616).
Zhang, Z., Dey, D., & Tan, Y. (2006). Price and QoS competition in communication services. European Journal of Operational Research, 186(2), 681-693.

Zhou, C., Chia, L.-T., & Lee, B.-S. (2004). DAML-QoS Ontology for Web Services. Proceedings of the IEEE International Conference on Web Services (ICWS'04) (p. 472).
Zhou, J., & Niemela, E. (2006). Toward Semantic QoS Aware Web Services: Issues, Related Studies and Experience. Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (pp. 553-557).

Zhou, L., Pung, H. K., & Ngoh, L. H. (2006). Towards Semantics for QoS Specification. Proceedings of the 31st IEEE International Conference on Local Computer Networks (LCN).
Chapter 3
Dependability Modeling Paulo R. M. Maciel Federal University of Pernambuco, Brazil Kishor S. Trivedi Duke University, USA Rivalino Matias Jr. Federal University of Uberlândia, Brazil Dong Seong Kim Duke University, USA
ABSTRACT This chapter presents modeling methods and evaluation techniques for computing dependability metrics of systems. The chapter begins by providing a summary of seminal works. After presenting the background, the most prominent model types are presented, together with the respective methods for computing exact values and bounds. This chapter focuses particularly on non-state space models, although state-space models, such as Markov models and hierarchical models, are also presented. Case studies are then presented at the end of the chapter.
INTRODUCTION Due to the ubiquitous provision of services on the Internet, dependability has become an attribute of prime concern in hardware/software development, deployment, and operation. Providing fault-tolerant services is inherently related to the adoption of redundancy. Redundancy can be exploited either in time or in space. Replication of services is usually provided through distributed hosts across the world, so that whenever the service, the underlying host or the network fails, another service is ready to take over (Gorbenko, A., Kharchenko, A., Romanovsky, A. 2007). The dependability of a system can be understood as the ability to deliver a specified functionality that can be justifiably trusted (Laprie, J. C. 1992). Functionality might be a set of roles or services (functions) observed by an outside agent (a human being, another system, etc.) that interacts with the system at its interfaces; the specified functionality of a system is what the system is intended for. This chapter aims to provide an overview of dependability modeling. It starts by briefly describing some early and seminal work, their motivations and the succeeding advances. Afterwards, a set of fundamental concepts and definitions is introduced. Subsequently, the modeling techniques are classified, defined and introduced, and a representative set of evaluation methods is presented. Later on, case studies are discussed, modeled and evaluated.
A Brief History

This section provides a summary of early work related to dependability and briefly describes some seminal efforts, as well as their relations with currently prevalent methods. This account is certainly incomplete; nonetheless, we hope it covers the fundamental events, people and research related to what is now called dependability modeling. Dependability is related to disciplines such as fault tolerance and reliability. The concept of dependable computing first appeared in the 1820s, when Charles Babbage undertook the enterprise of conceiving and constructing a mechanical calculating engine to eliminate the risk of human error (Laprie, J. C. 1985)(Schaffer, S. 1994). In his book "On the Economy of Machinery and Manufacture", he writes: "The first objective of every person who attempts to make any article of consumption is, or ought to be, to produce it in perfect form" (Blischke, W. R. & Murthy, D. N. P. (Ed.) 2003). In the nineteenth century, reliability theory evolved from probability and statistics as a way to support the computation of maritime and life insurance rates. In the early twentieth century, these methods were applied to estimate the survivorship of railroad equipment (Stott, H. G. 1905)(Stuart, H. R. 1905). The first IEEE (formerly AIEE and IRE) public document to mention reliability is "Answers to Questions Relative to High Tension Transmission", which summarizes the meeting of the Board of Directors of the American Institute of Electrical Engineers held on September 26, 1902 (Answers 1904). In 1905, H. G. Stott and H. R. Stuart discussed "Time-Limit Relays and Duplication of Electrical Apparatus to Secure Reliability of Services" at New York (Stott, H. G. 1905) and at Pittsburg (Stuart, H. R. 1905). In these works the concept of reliability was primarily qualitative. In 1907, A. A. Markov began the study of an important new type of chance process, in which the outcome of a given experiment can affect the outcome of the next experiment. This type of process is now called a Markov chain (Ushakov, I., 2007). In the 1910s, A. K. Erlang studied telephone traffic planning problems for reliable service provisioning (Erlang, A. K. 1909). Later, in the 1930s, extreme value theory was applied to model the fatigue life of materials by W. Weibull and Gumbel (Kotz, S., Nadarajah, S. 2000). In 1931, Kolmogorov, in his famous paper "Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung" (Analytical methods in probability theory), laid the foundations of the modern theory of Markov processes (Basharin, G. P., Langville, A. N., Naumov, V. A. 2004)(Kolmogoroff, A. 1931). In the 1940s, quantitative analysis of reliability was applied to many operational and strategic problems in World War II (Blischke, W. R. & Murthy, D. N. P. (Ed.) 2003)(Cox, D. R. 1989). The first generation of electronic computers was quite undependable, hence many techniques were investigated for improving their reliability. Among such techniques, many researchers investigated design strategies and evaluation methods. Many methods were proposed for improving system dependability, such as error control codes, replication of components, comparison monitoring and diagnostic routines. The most prominent researchers during that period were Shannon (Shannon, C. E. 1948), Von Neumann (Neumann, J. V. 1956) and Moore (Moore, E. F. 1958), who proposed and developed theories for building reliable systems from redundant and less reliable components. These were the predecessors of the statistical and probabilistic techniques that form the foundation of modern dependability theory (Avizienis, A. 1997).
In the 1950s, reliability became a subject of great engineering interest as a result of the Cold War efforts, failures of American and Soviet rockets, and failures of the first commercial jet aircraft, the British de Havilland Comet (Barlow, R. E. & Proschan, F. 1967)(Barlow, R. E. 2002). Epstein and Sobel's 1953 paper studying the exponential distribution was a landmark contribution (Epstein, B. & Sobel, M. 1953). In 1954, the Symposium on Reliability and Quality Control (its proceedings are now the IEEE Transactions on Reliability) was held for the first time in the United States, and in 1958 the First All-Union Conference on Reliability took place in Moscow (Gnedenko, B. V., Ushakov, I. A. 1995)(Ushakov, I. 2007). In 1957, S. J. Einhorn and F. B. Thiess adopted Markov chains for modeling system intermittence (Einhorn, S. J. & Thiess, F. B. 1957), and in 1960, P. M. Anselone employed Markov chains for evaluating the availability of radar systems (Anselone, P. M. 1960). In 1961, Birnbaum, Esary and Saunders published a milestone paper introducing coherent structures (Birnbaum, Z. W., J. D. Esary, S. C. Saunders. 1961).

Reliability/availability models may be classified as combinatorial (non-state space) models and state-space models. Reliability Block Diagrams (RBD) and Fault Trees (FT) are non-state space models and the most widely adopted models in reliability/availability evaluation. RBD is probably the oldest non-state space technique for reliability analysis. Fault Tree Analysis (FTA) was originally developed in 1962 at Bell Laboratories by H. A. Watson to evaluate the Minuteman I Intercontinental Ballistic Missile launch control system. Afterwards, Boeing and AVCO expanded the use of FTA to the entire Minuteman II (Ericson, C. A. II. 1999). In 1965, W. H. Pierce unified Shannon, Von Neumann and Moore's theories of masking and redundancy into the concept of failure tolerance (Pierce, W. H. 1965). In 1967, A. Avizienis integrated masking methods with practical techniques for error detection, fault diagnosis, and recovery into the concept of fault-tolerant systems (Avizienis, A., Laprie, J.-C., Randell, B. 2001). The formation of the IEEE Computer Society Technical Committee on Fault-Tolerant Computing (now the Technical Committee on Dependable Computing and Fault Tolerance) in 1970, and of the IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance in 1980, were important means for defining a consistent set of concepts and terminology. In the early 1980s, Laprie coined the term dependability to encompass concepts such as reliability, availability, safety, confidentiality, maintainability, security and integrity (Laprie, J. C. 1992)(Laprie, J. C. 1985). In the late 1970s, several works proposed mapping Petri nets to Markov chains (Misra, K. B. (Ed.). 2008)(Natkin, S. 1980)(Symons, F. J. W. 1978). These models have been widely adopted as high-level models for the automatic generation of Markov chains, as well as for discrete event simulation. Natkin was the first to apply what are now generally called stochastic Petri nets to the dependability evaluation of systems (Marsan, M. A., Balbo, G., Conte, G., Donatelli S., Franceschinis, G. 1995).
Basic Concepts

This section introduces and defines several fundamental concepts, taxonomy and quantitative measures for dependability. As mentioned in the beginning of the chapter, the dependability of a system is its capability of delivering a set of trustable services that are observed by outside agents. A service is trustworthy when it implements the system's specified functionality. A system failure occurs when the system fails to provide its specified functionality. A fault can be defined as the failure of a component of the system, a subsystem of the system, or another system which interacts with the considered system. Hence, every fault is a failure from some point of view. A fault can cause other faults, a system failure, or neither. A system with faults that delivers its specified functionality is said to be fault tolerant; that is, the system does not fail even when there are faulty components. Distinguishing faults from failures is fundamental for understanding the fault tolerance concept. The observable outcome of a fault at the system interface is called a symptom, and the most extreme symptom of a fault is a failure. Therefore, an analyst evaluating the inner part of a system might detect faulty components or subsystems. From that point of view, a faulty component (or subsystem) has failed, since the level of detail analyzed is lower.

Consider an indicator random variable X(t) that represents the system state at time t: X(t) = 1 for the operational state and X(t) = 0 for the faulty state (see Figure 1). More formally,

$$X_S(t) = \begin{cases} 0, & \text{if } S \text{ has failed} \\ 1, & \text{if } S \text{ is operational} \end{cases} \quad (1)$$

Figure 1. States of X_S(t)

Now, consider a random variable T as the time to reach the state X(t) = 0, given that the system started in state X(t) = 1 at time t = 0. Therefore, the random variable T represents the time to failure of the system S, F_T(t) its cumulative distribution function (see Figure 2), and f_T(t) the respective density function (see Figure 3), where:

$$F_T(0) = 0 \text{ and } \lim_{t \to \infty} F_T(t) = 1, \quad (2)$$

$$f_T(t) = \frac{dF_T(t)}{dt}, \quad (3)$$

$$f_T(t) \geq 0 \text{ and } \int_0^\infty f_T(t)\, dt = 1.$$

The probability that the system S does not fail up to time t (reliability; see Figure 4) is

$$P\{T \geq t\} = R(t) = 1 - F_T(t), \quad R(0) = 1 \text{ and } \lim_{t \to \infty} R(t) = 0.$$

Figure 2. F_T(t): cumulative distribution function

Figure 3. f_T(t): density function

Figure 4. R(t): reliability function

The probability of the system S failing within the interval [t, t + Δt] may be calculated by:

$$P\{t \leq T \leq t + \Delta t\} = F_T(t + \Delta t) - F_T(t) = R(t) - R(t + \Delta t) = \int_t^{t+\Delta t} f_T(t)\, dt.$$

The probability of the system S failing during the interval [t, t + Δt], given that it has survived to time t (the conditional probability of failure), is

$$P\{t \leq T \leq t + \Delta t \mid T > t\} = \frac{R(t) - R(t + \Delta t)}{R(t)}.$$

P{t ≤ T ≤ t + Δt | T > t}/Δt is the conditional probability of failure per time unit. When Δt → 0, then

$$\lim_{\Delta t \to 0} \frac{R(t) - R(t + \Delta t)}{R(t) \times \Delta t} = -\frac{dR(t)}{dt} \times \frac{1}{R(t)} = \frac{dF_T(t)}{dt} \times \frac{1}{R(t)} = \frac{f_T(t)}{R(t)} = \lambda(t), \quad (4)$$
where λ(t) is named the hazard function. Hazard rates may be characterized as decreasing failure rate (DFR), constant failure rate (CFR) or increasing failure rate (IFR) according to λ(t) (Gorbenko, A., Kharchenko, A., Romanovsky, A. 2007)(Rausand M. & Hoyland, A. 2004)(Smith, D. J. 2009). Since

$$\lambda(t) = -\frac{dR(t)}{dt} \times \frac{1}{R(t)}, \quad (5)$$

$$\lambda(t)\, dt = -\frac{dR(t)}{R(t)},$$

thus,

$$\int_0^t \lambda(t)\, dt = -\int_0^t \frac{dR(t)}{R(t)} \;\Rightarrow\; -\int_0^t \lambda(t)\, dt = \ln R(t) \;\Rightarrow\; R(t) = e^{-\int_0^t \lambda(t)\, dt} = e^{-H(t)}, \quad (6)$$

where H(t) is the cumulative hazard rate function (cumulative failure rate function).

Consider the hazard rate of an entire population of products over time (λ(t)), where some products will fail in early life ("infant mortality"), others will last until wear-out ("end of life"), and others will fail during their useful life period ("normal life"). Infant mortality failures are usually caused by material, design and manufacturing problems, whereas wear-out failures are related to fatigue or exhaustion. Normal life failures are considered to be random. Infant mortality is commonly represented by a decreasing hazard rate (see Figure 5.a), wear-out failures are typically represented by an increasing hazard rate (Figure 5.c), and normal life failures are usually depicted by a constant hazard rate (see Figure 5.b). The overlapping of these three separate hazard rate functions forms the so-called bathtub curve (Figure 5.d) (Ebeling, C. E. 2005).

The mean time to fail (MTTF) is defined by:

$$MTTF = E[T] = \int_0^\infty t \times f_T(t)\, dt. \quad (7)$$

Since

$$f_T(t) = \frac{dF_T(t)}{dt} = -\frac{dR(t)}{dt},$$

thus,

$$MTTF = E[T] = -\int_0^\infty \frac{dR(t)}{dt} \times t\, dt.$$

Let u = t and dv = (dR(t)/dt) dt; applying integration by parts (∫u dv = uv − ∫v du), then du = dt and v = R(t), hence:

$$MTTF = -\int_0^\infty \frac{dR(t)}{dt} \times t\, dt = -\left( t \times R(t)\Big|_0^\infty - \int_0^\infty R(t)\, dt \right) = \int_0^\infty R(t)\, dt,$$

$$MTTF = \int_0^\infty R(t)\, dt, \quad (8)$$

which is often easier to compute than (7). Another central tendency reliability measure is the median time to failure (MedTTF), defined by:

$$F_T(\text{MedTTF}) = R(\text{MedTTF}) = 0.5. \quad (9)$$
Figure 5. Hazard rate: (a) Decreasing, (b) Constant, (c) Increasing, (d) Bathtub curve
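Equations (7) and (8) can be cross-checked numerically. The sketch below does so for a Weibull lifetime with assumed parameters, using a simple trapezoidal integration; it is an illustration, not part of the chapter's formal development.

```python
import math

alpha, lam = 1.5, 0.01        # assumed Weibull shape and scale parameters

def R(t):                     # reliability of the Weibull lifetime
    return math.exp(-((lam * t) ** alpha))

def f(t, h=1e-4):             # density f_T(t) = -dR/dt, by finite difference
    return (R(t) - R(t + h)) / h

def trapz(g, a, b, n=20_000): # simple trapezoidal integration
    w = (b - a) / n
    return w * (g(a) / 2 + sum(g(a + i * w) for i in range(1, n)) + g(b) / 2)

mttf_eq7 = trapz(lambda t: t * f(t), 0.0, 2000.0)   # equation (7)
mttf_eq8 = trapz(R, 0.0, 2000.0)                    # equation (8)
mttf_exact = (1 / lam) * math.gamma(1 + 1 / alpha)  # closed form (see Table 1)
print(mttf_eq7, mttf_eq8, mttf_exact)               # all three ~ 90.3 hours
```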
The median time to failure divides the time-to-failure distribution into two halves: 50% of the failures occur before MedTTF and the other 50% after.

Consider a continuous-time random variable Y_S(t) that represents the system state: Y_S(t) = 0 when S is failed, and Y_S(t) = 1 when S has been repaired (see Figure 6). More formally,

$$Y_S(t) = \begin{cases} 0, & \text{if } S \text{ is failed} \\ 1, & \text{if } S \text{ has been repaired} \end{cases} \quad (10)$$

Now, consider the random variable D that represents the time to reach the state Y_S(t) = 1, given that the system started in state Y_S(t) = 0 at time t = 0. Therefore, the random variable D represents the system time to repair, F_D(t) its cumulative distribution function, and f_D(t) the respective density function, where:

$$F_D(0) = 0 \text{ and } \lim_{t \to \infty} F_D(t) = 1,$$

$$f_D(t) = \frac{dF_D(t)}{dt}, \quad f_D(t) \geq 0, \text{ and } \int_0^\infty f_D(t)\, dt = 1.$$

The probability that the system S will be repaired by time t, considering a specified resource, is defined as maintainability:

$$P\{D \leq t\} = F_D(t) = \int_0^t f_D(t)\, dt,$$

and its complement is

$$M(t) = 1 - F_D(t). \quad (11)$$

The mean time to repair (MTTR) is defined by:

$$MTTR = E[D] = \int_0^\infty t \times f_D(t)\, dt.$$

An alternative that is often easier to compute is

$$MTTR = \int_0^\infty M(t)\, dt. \quad (12)$$
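For instance, with an exponentially distributed repair time of rate µ, the complement in (11) is M(t) = e^(−µt) and equation (12) gives MTTR = 1/µ. A small numerical sketch with an assumed rate:

```python
import math

mu = 0.25   # assumed repair rate (repairs per hour), for illustration only

def M(t):   # M(t) = 1 - F_D(t) = e^(-mu*t) for an exponential repair time
    return math.exp(-mu * t)

def trapz(g, a, b, n=20_000):
    w = (b - a) / n
    return w * (g(a) / 2 + sum(g(a + i * w) for i in range(1, n)) + g(b) / 2)

mttr = trapz(M, 0.0, 100.0)    # equation (12), truncated upper limit
print(mttr, 1 / mu)            # both ~ 4.0 hours
```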
Figure 6. States of Y_S(t)

Figure 7. States of a repairable system

Consider a repairable system S that is either operational (Up) or faulty (Down). Figure 7 shows the system state transition model: Z(t) = 1 when up and Z(t) = 0 when down. Whenever the system fails, a set of activities is conducted in order to allow the restoring process. These activities might encompass administrative time, transportation time, logistic times, etc. When the maintenance team arrives at the system site, the actual repairing process may start. Further, this time may also be divided into diagnosis time, actual repair time, checking time, etc. However, for the sake of simplicity, we group these times such that the downtime equals the time to restore (TR), which is composed of the non-repair time (NRT) (grouping transportation times, order times, delivery times, etc.) and the time to repair (TTR) (see Figure 8). Thus,

$$\text{Downtime} = TR = NRT + TTR. \quad (13)$$

The simplest definition of availability is expressed as the ratio of the expected system uptime to the expected system up and down times:

$$A = \frac{E[\text{Uptime}]}{E[\text{Uptime}] + E[\text{Downtime}]}. \quad (14)$$

Consider that the system started operating at time t = t′ and fails at t = t″; thus Δt = t″ − t′ = Uptime (see Figure 7). Therefore, the system availability may also be expressed by:

$$A = \frac{MTTF}{MTTF + MTR}, \quad (15)$$

where MTR is the mean time to restore, defined by MTR = MNRT + MTTR (MNRT, mean non-repair time; MTTR, mean time to repair), so:

$$A = \frac{MTTF}{MTTF + MNRT + MTTR}.$$

If MNRT ≅ 0,

$$A = \frac{MTTF}{MTTF + MTTR}.$$

As MTBF = MTTF + MTR = MTTF + MNRT + MTTR, and if MNRT ≅ 0, then MTBF = MTTF + MTTR. Since MTTF ≫ MTTR, MTBF ≅ MTTF, and therefore:

$$A = \frac{MTBF}{MTBF + MTTR}. \quad (16)$$

Figure 8. Downtime and uptime

The instantaneous availability is the probability that the system is operational at time t, that is,

$$A(t) = P\{Z(t) = 1\} = E[Z(t)], \quad t \geq 0.$$

If repair is not possible, the instantaneous availability, A(t), is equivalent to the reliability, R(t). If the system approaches stationary states as time increases, it is possible to quantify the steady-state availability, so that one can estimate the long-term fraction of time the system is available:

$$A = \lim_{t \to \infty} A(t), \quad t \geq 0. \quad (17)$$
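The short sketch below evaluates equations (15) and (16) for assumed MTTF, MNRT, and MTTR figures and converts the result into expected yearly downtime; all numbers are illustrative.

```python
mttf = 2000.0   # mean time to failure (hours), assumed for illustration
mnrt = 2.0      # mean non-repair time (logistics, transport), assumed
mttr = 6.0      # mean time to repair (hours), assumed

mtr = mnrt + mttr                       # mean time to restore
a_15 = mttf / (mttf + mtr)              # equation (15)

mtbf = mttf + mttr                      # MTBF when MNRT ~ 0
a_16 = mtbf / (mtbf + mttr)             # equation (16)

for name, a in (("eq. (15)", a_15), ("eq. (16)", a_16)):
    downtime_h = (1 - a) * 8760         # expected downtime per year (hours)
    print(f"{name}: A = {a:.5f}, yearly downtime ~ {downtime_h:.1f} h")
```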
Commonly Used Distributions

The time to failure of a component is a nonnegative continuous random variable. This section briefly summarizes the continuous distributions that have been widely adopted in dependability evaluation. The most adopted distributions are: the exponential distribution; expolynomial distributions, such as the Erlang and hyper-exponential distributions; and the Weibull, normal and lognormal distributions (Bolch, G., Greiner, S., Meer, H., Trivedi, K. S. 2006)(Dhillon, B. S. 2007)(Trivedi, K. S. 2002).

A random variable T representing a component's lifetime has an exponential distribution if its probability density function is given by

$$f(t) = \lambda e^{-\lambda t}, \quad t \geq 0,$$

where λ > 0 is a parameter of this distribution. The respective reliability function, cumulative distribution function, hazard function (failure rate), mean (mean time to failure) and variance are, respectively:

$$R(t) = e^{-\lambda t}, \quad t \geq 0,$$

$$F(t) = 1 - e^{-\lambda t}, \quad t \geq 0,$$

$$h(t) = \lambda,$$

$$E[T] = MTTF = \frac{1}{\lambda},$$

$$Var[T] = \sigma^2 = \frac{1}{\lambda^2}.$$

Table 1 summarizes the density, reliability, cumulative distribution and hazard functions, and the mean and variance, of the above-mentioned distributions.
Table 1. Distribution summary (Φ and φ denote the standard normal cumulative distribution and density functions)

Distribution | f(t) | R(t) | F(t) | h(t) | E[T] | Var[T] | Parameters
Exponential | λe^(−λt), t ≥ 0 | e^(−λt) | 1 − e^(−λt) | λ | 1/λ | 1/λ² | λ > 0 (rate)
Erlang | kλ(kλt)^(k−1) e^(−kλt)/(k−1)!, t ≥ 0 | e^(−kλt) Σ_{i=0}^{k−1} (kλt)^i/i! | 1 − e^(−kλt) Σ_{i=0}^{k−1} (kλt)^i/i! | f(t)/R(t) | 1/λ | 1/(kλ²) | k ∈ ℕ⁺ (number of phases); each phase's rate is kλ
Hyper-exponential | Σ_{i=1}^{k} q_i λ_i e^(−λ_i t), t ≥ 0 | Σ_{i=1}^{k} q_i e^(−λ_i t) | Σ_{i=1}^{k} q_i (1 − e^(−λ_i t)) | f(t)/R(t) | Σ_{i=1}^{k} q_i/λ_i = 1/λ | 2 Σ_{i=1}^{k} q_i/λ_i² − (1/λ)² | k phases; rates λ_i > 0; probabilities q_i > 0 with Σ q_i = 1
Weibull | αλ(λt)^(α−1) e^(−(λt)^α), t ≥ 0 | e^(−(λt)^α) | 1 − e^(−(λt)^α) | αλ(λt)^(α−1) | (1/λ) Γ(1 + 1/α) | (1/λ²) [Γ(1 + 2/α) − Γ²(1 + 1/α)] | λ > 0 (scale), α > 0 (shape)
Normal | (1/√(2πσ²)) e^(−(t−µ)²/(2σ²)), −∞ < t < ∞ (left-truncated at t = 0 for lifetimes) | 1 − Φ((t−µ)/σ) | Φ((t−µ)/σ) | φ((t−µ)/σ) / (σ [1 − Φ((t−µ)/σ)]) | µ | σ² | µ (mean), σ² (variance)
Standard normal | φ(t) = (1/√(2π)) e^(−t²/2) | 1 − Φ(t) | Φ(t) | φ(t)/(1 − Φ(t)) | 0 | 1 | µ = 0, σ = 1
Lognormal | (1/(σt√(2π))) e^(−(ln t − µ)²/(2σ²)), t ≥ 0 | 1 − Φ((ln t − µ)/σ) | Φ((ln t − µ)/σ) | f(t)/R(t) | e^(µ + σ²/2) | e^(2µ) (e^(2σ²) − e^(σ²)) | µ, σ (mean and standard deviation of ln T)
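A few rows of Table 1 can be spot-checked numerically. The sketch below assumes the widely used scipy.stats package (not a dependency referenced by the chapter) and illustrative parameter values.

```python
from scipy import stats

lam = 0.2
expo = stats.expon(scale=1 / lam)              # exponential with rate lambda
print(expo.mean(), expo.var())                 # 1/lam = 5.0, 1/lam^2 = 25.0

k = 3
erlang = stats.erlang(k, scale=1 / (k * lam))  # k phases, each with rate k*lam
print(erlang.mean(), erlang.var())             # 1/lam = 5.0, 1/(k*lam^2) ~ 8.33

alpha = 1.5
weib = stats.weibull_min(alpha, scale=1 / lam) # Weibull: shape alpha, scale 1/lam
print(weib.mean())                             # (1/lam) * Gamma(1 + 1/alpha)
```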
Specific Failure Terminology for Software

A software failure may be defined as the occurrence of an out-of-specification result produced by the software system for the respective specified input value. Since this definition is consistent with the system failure definition given previously, the reader might ask: why should we pay special attention to software dependability issues? The reader should bear in mind, however, the pervasive nature of software systems and the many scientific communities involved in subjects related to making software systems more dependable. These communities, nonetheless, have specific backgrounds, many of which are not rooted in system reliability, and have developed jargons that do not necessarily match the prevailing system dependability terminology.

Software research communities have long pursued dependable software systems. Correction of codification problems and software testing began with the very origin of software development itself. Since then, the term software bug has been broadly applied to refer to mistakes, failures and faults in a software system, whereas debugging refers to the methodical process of finding bugs. The formal methods community has produced massive contributions on models, methods and strategies for checking software validity and correctness. These communities, along with dependability researchers, have proposed and applied redundancy mechanisms and failure avoidance techniques as means for achieving highly dependable software systems.

In 1985 Jim Gray proposed a classification of software failures (Gray, J. 1985). He classified the failures (bugs) as Bohrbugs and Heisenbugs. Heisenbugs are transient or intermittent failures (Shetti, N. M. 2003): if the program state is reinitialized and the failed operation retried, the operation will usually not fail the second time. The term Heisenbug derives from Heisenberg's uncertainty principle, which states that it is impossible to simultaneously determine the position and momentum of a particle. On the other hand, if a Bohrbug is present in the software, there will always be a failure on retrying the operation which caused the failure. The word Bohrbug comes from the deterministic representation of the atom proposed by Niels Bohr in 1913. Bohrbugs are "easily" detectable by standard debugging techniques. This terminology is modified and extended in (Trivedi, K. S. & Grottke, M., 2007).
Coherent Systems

Consider a system S composed of a set of components, C = {c_i | 1 ≤ i ≤ n}, where the state of the system S and of its components could be either operational or failed. Let the discrete random variable x_i indicate the state of component i; thus:

$$x_i = \begin{cases} 0, & \text{if component } i \text{ has failed} \\ 1, & \text{if component } i \text{ is operational} \end{cases} \quad (18)$$

The vector x = (x_1, x_2, …, x_i, …, x_n) represents the state of each component of the system (wherever needed, and where the context is clear, x may also be referred to as a set), and it is named the state vector. The system state may be represented by a discrete random variable φ(x) = φ(x_1, x_2, …, x_i, …, x_n), such that

$$\varphi(x) = \begin{cases} 0, & \text{if the system has failed} \\ 1, & \text{if the system is operational} \end{cases} \quad (19)$$

φ(x) is called the structure function of the system. If one is interested in representing the system state at a specific time t, the components' state variables should be interpreted as random variables at time t; hence φ(x(t)), where x(t) = (x_1(t), x_2(t), …, x_i(t), …, x_n(t)). For any component c_i,

$$\varphi(x) = x_i \varphi(1_i, x) + (1 - x_i) \varphi(0_i, x), \quad (20)$$

where φ(1_i, x) = φ(x_1, x_2, …, 1_i, …, x_n) and φ(0_i, x) = φ(x_1, x_2, …, 0_i, …, x_n). Equation 20 expresses the system structure function in terms of two conditions. The first term (x_i φ(1_i, x)) represents the case where the component c_i is operational and the states of the other components are random variables (φ(x_1, x_2, …, 1_i, …, x_n)). The second term ((1 − x_i) φ(0_i, x)), on the other hand, states the condition where the component c_i has failed and the states of the other components are random variables (φ(x_1, x_2, …, 0_i, …, x_n)). Equation 20 is known as the factoring of the structure function and is very useful for studying complex system structures, since through its repeated application one can eventually reach a subsystem whose structure function is simple to deal with.

A component of a system is irrelevant to the dependability of the system if the state of the system is not affected by the state of the component. In mathematical terms, a component c_i is said to be irrelevant to the structure function if φ(1_i, x) = φ(0_i, x). A system with structure function φ(x) is said to be coherent if and only if φ(x) is non-decreasing in each x_i and every component c_i is relevant (Kuo, W & Zuo, M. J. 2003). A function φ(x) is non-decreasing if for every two state vectors x and y, such that x < y, φ(x) ≤ φ(y). Another aspect of coherence that should be highlighted is that replacing a failed component in a working system does not make the system fail. It does not follow, however, that a failed system will work if a failed component is substituted by an operational one.

Example 1: Consider a coherent system (C, φ) composed of three blocks, C = {a, b, c} (see Figure 9). Using Equation 20 and first factoring on component a, we have:
φ(xa, xb, xc) = xa φ(1a, xb, xc) + (1 − xa) φ(0a, xb, xc) = xa φ(1a, xb, xc),

since ϕ(0a, xb, xc) = 0. Now, factoring ϕ(1a, xb, xc) on component b using Equation 20,

φ(1a, xb, xc) = xb φ(1a, 1b, xc) + (1 − xb) φ(1a, 0b, xc).

As ϕ(1a, 1b, xc) = 1, thus:

φ(1a, xb, xc) = xb + (1 − xb) φ(1a, 0b, xc).

Therefore:

φ(xa, xb, xc) = xa φ(1a, xb, xc) = xa × [xb + (1 − xb) φ(1a, 0b, xc)].

Factor ϕ(1a, 0b, xc) on component c to get:

φ(1a, 0b, xc) = xc φ(1a, 0b, 1c) + (1 − xc) φ(1a, 0b, 0c).

Since ϕ(1a, 0b, 1c) = 1 and ϕ(1a, 0b, 0c) = 0, thus: φ(1a, 0b, xc) = xc.
So

φ(xa, xb, xc) = xa × [xb + (1 − xb) φ(1a, 0b, xc)] = xa × [xb + (1 − xb) xc] = xa xb + xa xc (1 − xb) = xa × [1 − (1 − xb)(1 − xc)].

In some cases, simplifying the structure function may not be an easy task. The logic function of a coherent system may be adopted to simplify the system's function through Boolean algebra. As described earlier, assume a system S composed of a set of components C = {ci | 1 ≤ i ≤ n}. The state of the system S and of its components may be either operational or faulty. Let si denote that component ci is operational, and let s̄i (its complement) denote that ci has failed. The Boolean state vector, bs = (s1, s2, …, si, …, sn), represents the Boolean state of each component of the system. The operational system state is represented by φ(bs), whereas φ̄(bs) denotes a faulty system. bs = (s1, s2, …, Ti, …, sn) and bs = (s1, s2, …, Fi, …, sn) represent a system in which component ci is either working (Ti) or failed (Fi). In this notation, si is equivalent to xi, s̄i corresponds to 1 − xi, φ(bs) is the counterpart of ϕ(x) = 1, φ̄(bs) of ϕ(x) = 0, ∧ corresponds to ×, and ∨ is the counterpart of +. Consider a series system composed of a set of components C = {ci | 1 ≤ i ≤ n}; the logic function of a series system is

φ(s1, s2, …, si, …, sn) = s1 ∧ s2 ∧ … ∧ si ∧ … ∧ sn = ∧_{i=1}^{n} si. (21)

For a parallel system with n components, the logic function is

φ(s1, s2, …, si, …, sn) = s1 ∨ s2 ∨ … ∨ si ∨ … ∨ sn = ∨_{i=1}^{n} si. (22)

For any component ci, the logic function may be represented by

φ(bs) = (si ∧ φ(s1, s2, …, Ti, …, sn)) ∨ (s̄i ∧ φ(s1, s2, …, Fi, …, sn)). (23)

Adopting the system presented in Example 1, one may observe that the system is functioning if components a and b are working, or if a and c are. More formally, the respective system logic function is

φ(sa, sb, sc) = (sa ∧ sb) ∨ (sa ∧ sc) = sa ∧ (sb ∨ sc).

Thus, by De Morgan's law, φ̄(sa, sb, sc) = s̄a ∨ (s̄b ∧ s̄c). The structure function may be obtained from the logic function using the respective counterparts. Hence, since si is represented by xi, s̄i by 1 − xi, and ∧ corresponds to ×, then:

φ(x) = xa × [1 − (1 − xb) × (1 − xc)],

which is the same result obtained in Example 1.
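Since the factoring identity of Equation 20 underpins much of what follows, a small executable check can make it concrete. The Python sketch below is ours, not part of the chapter; it encodes the structure function of Example 1 and verifies Equation 20, pivoting on component a, over all state vectors.

from itertools import product

def phi(x):
    # structure function of Example 1: a in series with (b parallel c)
    xa, xb, xc = x
    return xa * (1 - (1 - xb) * (1 - xc))

# verify the factoring identity (Equation 20) on every state vector
for x in product([0, 1], repeat=3):
    x_up = (1,) + x[1:]     # state vector with component a forced operational
    x_down = (0,) + x[1:]   # state vector with component a forced failed
    assert phi(x) == x[0] * phi(x_up) + (1 - x[0]) * phi(x_down)
print("Equation 20 verified on all 8 state vectors")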
MODELING TECHNIQUES

The aim of this section is to introduce a set of important model types for dependability evaluation, as well as to offer the reader a summary view of key methods. The section begins with a classification of models; the main non-state-space and state-space models are then described, along with the respective analysis methods.
Classification of Modeling Techniques

This section presents a classification of dependability models. These models may be broadly classified into non-state-space (or combinatorial) and state-space models. State-space models may also be referred to as non-combinatorial models, and combinatorial models may be identified as non-state-space models. Non-state-space models capture the conditions that make a system fail (or work) in terms of structural relationships between the system components. These relations observe the set of components (and sub-systems) of the system that should be either properly working or faulty for the system as a whole to be working properly. State-space models represent the system behavior (failure and repair activities) by its states, with event occurrences expressed as labeled state transitions. Labels can be probabilities, rates or distribution functions. These models allow representing more complex relations between components of the system, such as dependencies involving sub-systems and resource constraints. Some state-space models may also be evaluated by discrete-event simulation in the case of intractably large state spaces, or when a combination of non-exponential distributions prohibits an analytic solution. In some special cases state-space analytic models can be solved to derive a closed-form answer, but generally a numerical solution of the underlying equations is necessary, using software packages. The most prominent non-state-space model types
are Reliability Block Diagrams, Fault Trees and Reliability Graphs; Markov Chains, Stochastic Petri Nets and Stochastic Process Algebras are the most widely used state-space models. Next we introduce these model types and their respective evaluation methods.
Non-State Space Models

Reliability Block Diagrams

This section describes the two most relevant non-state-space model types for dependability evaluation (Sahner, R. A., Trivedi, K. S., & Puliafito, A., 1996), namely Reliability Block Diagrams (RBD) and Fault Trees (FT), and their respective evaluation methods. The first two sections define each model type, its syntax, semantics, modeling power and constraints. Each model type is introduced and then explained by examples, so as to help the reader not only master the related math but also acquire practical modeling skills. The subsequent sections concern the analysis methods applied to the previously presented models. First, a basic set of standard methods is presented. These methods apply in particular to models of systems in which components are arranged in series, in parallel, or as a combination of series and parallel compositions. Afterwards, a set of methods that applies to non-series-parallel configurations is presented. These methods are more general than the basic ones, since they can be applied to evaluate sophisticated component compositions, but their analysis complexity is of concern. The methods described are series-parallel reductions, minimal cut and path computation methods, decomposition, sum of disjoint products (SDP), and delta-star and star-delta transformations. In addition, dependability bound computations are presented. Finally, measures of component importance are presented. RBDs have a source and a target vertex, a set of blocks (usually rectangles), where each block represents a component, and arcs
Figure 10. Reliability block diagram
Figure 11. RBD of series structure
connecting the blocks and the vertices (see Figure 10). The source node is usually placed at the left-hand side of the diagram, whereas the target vertex is positioned at the right. Graphically, when a component ci is working, the block bi can be substituted by an arc; otherwise the rectangle is removed. The system is properly working when there is at least one path from the source node to the target node. RBDs have been adopted to evaluate series-parallel and more generic structures, such as bridges, stars and delta arrangements. The simplest and most common RBDs support series-parallel structures only. Consider the series structure composed of n independent components presented in Figure 11, where pi = P{xi = 1} are the functioning probabilities of blocks bi. These probabilities could be, for instance, reliabilities or availabilities. The probability that the system is operational is

P{φ(x) = 1} = P{φ(x1, x2, …, xi, …, xn) = 1} = ∏_{i=1}^{n} P{xi = 1} = ∏_{i=1}^{n} pi. (24)
Therefore, the system reliability is

RS(t) = P{φ(x, t) = 1} = ∏_{i=1}^{n} P{xi(t) = 1} = ∏_{i=1}^{n} Ri(t), (25)

where Ri(t) is the reliability of block bi. Likewise, the system instantaneous availability is

AS(t) = P{φ(x, t) = 1} = ∏_{i=1}^{n} P{xi(t) = 1} = ∏_{i=1}^{n} Ai(t), (26)

where Ai(t) is the instantaneous availability of block bi. The steady-state availability is
AS = P{φ(x) = 1} = ∏_{i=1}^{n} P{xi = 1} = ∏_{i=1}^{n} Ai, (27)

Figure 12. RBD of parallel structure
where Ai is the steady-state availability of block bi.

Now consider the parallel structure composed of n independent components presented in Figure 12, where pi = P{xi = 1} are the functioning probabilities of blocks bi. The probability that the system is operational is

P{φ(x) = 1} = P{φ(x1, x2, …, xi, …, xn) = 1} = 1 − ∏_{i=1}^{n} P{xi = 0} = 1 − ∏_{i=1}^{n} (1 − P{xi = 1}) = 1 − ∏_{i=1}^{n} (1 − pi). (28)

If the components are identical, pi = p for all i; thus

P{φ(x) = 1} = 1 − (1 − p)ⁿ. (29)

The system reliability is then

RP(t) = 1 − ∏_{i=1}^{n} P{xi(t) = 0} = 1 − ∏_{i=1}^{n} Qi(t) = 1 − ∏_{i=1}^{n} (1 − Ri(t)), (30)

such that

Qi(t) = P{xi(t) = 0} = 1 − P{xi(t) = 1} = 1 − Ri(t), (31)

where Ri(t) and Qi(t) are the reliability and the unreliability of block bi, respectively. Similarly, the system instantaneous availability is

AP(t) = P{φ(x, t) = 1} = 1 − ∏_{i=1}^{n} P{xi(t) = 0} = 1 − ∏_{i=1}^{n} UAi(t) = 1 − ∏_{i=1}^{n} (1 − Ai(t)), (32)

such that UAi(t) = P{xi(t) = 0} = 1 − P{xi(t) = 1} = 1 − Ai(t), where Ai(t) and UAi(t) are the instantaneous availability and unavailability of block bi, respectively. The steady-state availability is

AP = P{φ(x) = 1} = 1 − ∏_{i=1}^{n} UAi = 1 − ∏_{i=1}^{n} (1 − Ai), (33)
Figure 13. RBD of system S1
where Ai and UAi are the steady-state availability and unavailability of block bi, respectively. Due to the importance of the parallel structure, the following simplifying notation is adopted:

P{φ(x) = 1} = 1 − ∏_{i=1}^{n} (1 − P{xi = 1}) = ∐_{i=1}^{n} P{xi = 1} = ∐_{i=1}^{n} pi = 1 − ∏_{i=1}^{n} (1 − pi).
Example 2: Consider a system S1 represented by the RBD in Figure 13. This model is composed of four blocks (b1, b2, b3, b4), whose respective reliabilities are r1, r2, r3 and r4. The reliability of system S1 is

RS1 = r1 × [1 − (1 − r2 × r4) × (1 − r3)].

In a coherent system (C, ϕ), a state vector x is called a path vector if ϕ(x) = 1, and C1(x) is the respective path set. Hence, a path set is a set of components such that, if every one of its components is operational, the system is also operational. C1(x) is a minimal path set if ϕ(y) = 0 for any y < x; that is, C1(x) comprises a minimal set of components that should be operational for the system to be operational. In a series system of n components (see Figure 11), there is only one minimal path set, and it is composed of every component of the system. On the other hand, if we consider a parallel system with n components, as depicted in Figure 12, then the system has n minimal path sets, where each set is composed of only one component. A state vector x is called a cut vector if ϕ(x) = 0, and C0(x) is the respective cut set. C0(x) is a minimal cut set if ϕ(y) = 1 for any y > x. In the series system, there are n minimal cut sets, where each cut set is composed of one component only. The minimal cut set of the parallel system is composed of all the components of the system. System S1 of Figure 13 has two minimal path sets, C1¹(x) = {b1, b2, b4} and C1²(x) = {b1, b3}, and three minimal cut sets, C0¹(x) = {b1}, C0²(x) = {b2, b3} and C0³(x) = {b3, b4}.
Structures like k out of n, bridges, and delta and star arrangements have customarily been represented by RBDs; nevertheless, such structures can only be represented if components are replicated in the model. Consider a system composed of three identical and independent components (b1, b2, b3) that is operational if at least 2 out of its 3 components are working properly. The success probability of each of those blocks is p. This system can be considered as a single block (see Figure 14) whose success probability (reliability, availability or maintainability) is given by
Figure 14. A 2 out of 3 system
∑_{i=k}^{n} (n i) pⁱ (1 − p)ⁿ⁻ⁱ = ∑_{i=2}^{3} (3 i) pⁱ (1 − p)³⁻ⁱ = 3p² − 2p³, (34)

where (n i) denotes the binomial coefficient.
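As a quick sanity check, the short sketch below (ours, for illustration) evaluates the k-out-of-n expression of Equation 34 for arbitrary k and n and compares it with the 2-out-of-3 closed form.

from math import comb

def k_out_of_n(k, n, p):
    # Equation 34: success probability of a k-out-of-n block of identical,
    # independent components, each with success probability p
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.9
print(k_out_of_n(2, 3, p))      # 0.972
print(3 * p**2 - 2 * p**3)      # same value from the 2-out-of-3 closed form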
The bridge structure may also be considered as a single block with its respective failure probability, or it can be transformed into an equivalent series-parallel RBD and then evaluated. Consider the bridge system depicted in Figure 15, composed of blocks b1, b2, b3, b4 and b5. The equivalent series-parallel RBD model is also presented in Figure 15, and its structure function is

φ(x) = 1 − (1 − x1x2)(1 − x4x5)(1 − x1x3x5)(1 − x2x3x4). (35)

The reader should observe that the series-parallel equivalent model replicates every component of the bridge (every component appears twice in the model).
Fault Tree

This section presents the Fault Tree (FT) model. The fault tree was first proposed at Bell Telephone Laboratories in 1962 by H. A. Watson to evaluate the Minuteman I missile. Differently from RBDs, the FT is a failure-oriented model; as with RBDs, it was initially proposed for calculating reliability. Nevertheless, FTs have also been extensively applied to evaluate other dependability metrics.
In an FT, the system failure is represented by the TOP event (the undesirable state). The TOP event is caused by lower-level events (faults, component failures, etc.) that alone or combined may lead to the TOP event. The combination of events is described by logic gates. Events that are not represented by combinations of other events are named basic events. The term event is somewhat misleading, since it actually represents a state reached by event occurrences. In FTs, the system state may be described by a Boolean function that is evaluated as true whenever at least one minimal cut is evaluated as true. The system state may also be represented by a structure function, which, in contrast to RBDs, represents the system failure. If the system has more than one undesirable state, a Boolean function (or a structure function) should be defined for each failure mode; that is, one function should be constructed to describe the combination of events that causes each undesirable state. The most common FT elements are the TOP event, AND and OR gates, and basic events. Many extensions have been proposed which adopt other gates, such as XOR, transfer and priority gates. This chapter, however, does not cover these extensions. As in RBDs, the system state may also be described by the FT structure function, which is evaluated to 1 whenever at least one structure function of a minimal cut is evaluated to 1. Consider a system S composed of a set of components, C = {ci | 1 ≤ i ≤ n}. Let the discrete random variable yi(t) indicate the state of component i:

yi(t) = 1 if component i is faulty at time t, and yi(t) = 0 if component i is operational at time t.
The vector y(t) = (y1(t), y2(t), …, yi(t), …, yn(t)) represents the state of each component of the system, and it is named state vector. The system state may be represented by a discrete random
Figure 15. A bridge system
Table 2. Basic symbols and their descriptions

TOP event: represents the system failure.
Basic event: an event that may cause a system failure.
Basic repeated event: a basic event that appears in more than one place of the tree.
AND gate: generates an event (A) if all events Bi have occurred.
OR gate: generates an event (A) if at least one event Bi has occurred.
KOFN gate: generates an event (A) if at least K events Bi out of N have occurred.
Comment rectangle: a rectangle used to annotate the tree.
variable ψ(y(t)) = ψ(y1(t), y2(t), …, yi(t), …, yn(t)), such that ψ(y(t)) = 1 if the system is faulty at time t, and ψ(y(t)) = 0 if the system is operational at time t.
ψ(y(t)) is named the fault tree structure function of the system. As ψ(y(t)) is a Bernoulli random variable, its expected value is equal to the probability of occurrence of the respective TOP event. In other words, E[ψ(y(t))] = P{ψ(y(t)) = 1} is the system failure probability, which is denoted by Q(t).

Example 3: Consider a system in which software applications read, write and modify the content of the storage device D1 (the source). The system periodically replicates the production data (generated by the software applications) of one storage device (D1) onto two storage replicas (targets), so as to allow recovering data in the event of data loss or data corruption. The system is composed of three storage devices (D1, D2, D3), one server, and a hub that connects the disks D2 and D3 to the server (see Figure 16.a). The system is considered to have failed if the hardware infrastructure does not allow the software applications to read, write or modify data on D1, or if no data replica is available, that is, if both disks D2 and D3 have failed. Hence, if D1, the server, the hub, or both replica storages (D2, D3) are faulty, the system fails. The respective FT is presented in Figure 16.b. For the sake of conciseness, the Boolean variables representing
Figure 16. Data replication
the events (faults) of each device are named with the respective device names; hence bs = {Server, Hub, D1, D2, D3}. Denote by Ψ the FT logic function counterpart of the FT structure function (ψ). In the present context, the events of interest are malfunction events (faults, failures, human errors, etc.). Let the Boolean variables s0, s1, s2, s3 and s4 denote the occurrence of the events ev0, ev1, ev2, ev3 and ev4 (the failures of the server, the hub, D1, D2 and D3, respectively). According to the notation previously introduced, si (a Boolean variable) is equivalent to yi and s̄i represents 1 − yi. Ψ(bs) (the logic function that describes the conditions that cause a system failure) is the counterpart of ψ(y(t)) = 1 (the FT structure function, which represents system failure), Ψ̄(bs) depicts ψ(y(t)) = 0, ∧ corresponds to ×, and ∨ is the respective counterpart of +. In this example, the FT logic function is

Ψ(bs) = s0 ∨ s1 ∨ s2 ∨ (s3 ∧ s4),

whose complement is Ψ̄(bs) = s̄0 ∧ s̄1 ∧ s̄2 ∧ (s̄3 ∨ s̄4). The respective FT structure function may be expressed as

ψ(y(t)) = 1 − (1 − y0(t)) × (1 − y1(t)) × (1 − y2(t)) × (1 − y3(t) × y4(t)).
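For independent components, the TOP-event probability Q is obtained by substituting each yi in the structure function by its failure probability. The sketch below does so for Example 3; the numeric failure probabilities are assumed purely for illustration.

# assumed component failure probabilities (illustrative values only)
q_server, q_hub, q_d1, q_d2, q_d3 = 0.01, 0.005, 0.02, 0.02, 0.02

# Q = E[psi(y)]: replace each independent y_i by its failure probability
Q = 1 - (1 - q_server) * (1 - q_hub) * (1 - q_d1) * (1 - q_d2 * q_d3)
print(Q)   # system failure probability (TOP event) of Example 3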
The reader may note, from the expression for ψ(y(t)) above, that if y0(t) = 1, y1(t) = 1, y2(t) = 1, or y3(t) = y4(t) = 1, then ψ(y(t)) = 1, which denotes a system failure.

Example 4: Now consider a system composed of two processors (P1 and P2), two memory systems local to each processor (M1 and M2), and three storage devices (D1, D2 and D3); see Figure 17.a. D1 and D2 are accessed only by software applications running on P1 and P2, respectively. If either of these devices (D1, D2) fails, the storage device D3 takes over its functions, so as to store the data generated by software applications running on either processor. The system is considered to have failed if both of its subsystems (S1 = {P1, M1, D1, D3} and S2 = {P2, M2, D2, D3}) fail. Each subsystem fails if its processor fails, if its local memory fails, or if both of its storage devices fail. The FT structure function is easily derived. Naming the respective state variables y(t) = (yP1(t), yM1(t), yD1(t), yP2(t), yM2(t), yD2(t), yD3(t)),

ψ(y(t)) = [1 − (1 − yP1(t)) × (1 − yM1(t)) × (1 − yD1(t) × yD3(t))] × [1 − (1 − yP2(t)) × (1 − yM2(t)) × (1 − yD2(t) × yD3(t))].

As ψ(y(t)) = yi(t) × ψ(1i, y(t)) + (1 − yi(t)) × ψ(0i, y(t)), then, pivoting on yD3(t):

ψ(y(t)) = yD3(t) × ψ(1D3, y(t)) + (1 − yD3(t)) × ψ(0D3, y(t)) = yD3(t) × [1 − (1 − yP1(t)) × (1 − yM1(t)) × (1 − yD1(t))] × [1 − (1 − yP2(t)) × (1 − yM2(t)) × (1 − yD2(t))] + (1 − yD3(t)) × [1 − (1 − yP1(t)) × (1 − yM1(t))] × [1 − (1 − yP2(t)) × (1 − yM2(t))].

One should observe that when yD3(t) = 1 the FT structure function reduces to

ψ(y(t)) = [1 − (1 − yP1(t)) × (1 − yM1(t)) × (1 − yD1(t))] × [1 − (1 − yP2(t)) × (1 − yM2(t)) × (1 − yD2(t))],

and when yD3(t) = 0 it reduces to

ψ(y(t)) = [1 − (1 − yP1(t)) × (1 − yM1(t))] × [1 − (1 − yP2(t)) × (1 − yM2(t))].

Therefore, the original FT in Figure 17.b may be factored into two FTs, one considering yD3(t) = 1 and the other yD3(t) = 0, as shown in Figure 17.c.
Analysis Methods
This section introduces some important methods adopted in non-state-space models for calculating the system probability of failure when components are independent. To simplify the notation, the reliability (ri(t), R(t)) and the steady-state and instantaneous availability (ai, A, ai(t), A(t)) of components and system may replace pi and P.
Expected Value of the Structure Function

The most straightforward strategy for computing the system reliability (availability or maintainability) of a system composed of independent components is through the respective definition. Hence, consider a system S and its respective structure function ϕ(x). The system reliability is defined by

RS = P{φ(x) = 1}.
Figure 17. System with shared device
Since ϕ(x) is a Bernoulli random variable, P{ϕ(x) = 1} = E[ϕ(x)]; therefore, RS = E[ϕ(x)]. As each xi is a binary variable, xiᵏ = xi for any i and k; hence ϕ(x) reduces to a polynomial function in which each variable xi has degree 1. Summarizing, the main steps for computing the system success (or failure) probability by this method are:

1. Obtain the system structure function;
2. Remove the powers of each variable xi; and
3. Replace each variable xi by the respective pi.
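The three steps above can also be carried out numerically. The sketch below (ours; exponential in n, so only for small systems) computes E[φ(x)] by enumerating all state vectors, which is equivalent to reducing the powers and substituting the pi.

from itertools import product

def system_prob(phi, p):
    # P{phi(x) = 1} = E[phi(x)] for independent components (brute force)
    total = 0.0
    for x in product([0, 1], repeat=len(p)):
        weight = 1.0
        for xi, pi in zip(x, p):
            weight *= pi if xi else (1 - pi)
        total += phi(x) * weight
    return total

# 2-out-of-3 structure function of Figure 14 (see Example 5)
phi_2of3 = lambda x: 1 - (1 - x[0]*x[1]) * (1 - x[0]*x[2]) * (1 - x[1]*x[2])
p = 0.9
print(system_prob(phi_2of3, [p, p, p]))   # 3p^2 - 2p^3 = 0.972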
Example 5: Consider the 2-out-of-3 system represented by the RBD in Figure 14. Its structure function is

φ(x) = 1 − (1 − x1x2)(1 − x1x3)(1 − x2x3). (36)

Considering that each xi is a binary variable, xiᵏ = xi for any i and k; hence, after simplification,

φ(x) = x1x2 + x1x3 + x2x3 − 2x1x2x3. (37)
Since ϕ(x) is a Bernoulli random variable, its expected value is equal to P{ϕ(x) = 1}, that is, E[ϕ(x)] = P{ϕ(x) = 1}; thus

P{φ(x) = 1} = E[φ(x)] = E[x1x2 + x1x3 + x2x3 − 2x1x2x3] = E[x1x2] + E[x1x3] + E[x2x3] − 2 × E[x1x2x3] = E[x1]E[x2] + E[x1]E[x3] + E[x2]E[x3] − 2 × E[x1]E[x2]E[x3].

Therefore

P{φ(x) = 1} = p1p2 + p1p3 + p2p3 − 2 × p1p2p3. (38)

As p1 = p2 = p3 = p,

P{φ(x) = 1} = 3p² − 2p³, (39)

which is equal to Equation 34.

PIVOTAL DECOMPOSITION OR FACTORING

This method is based on the conditional probability of the system state given the states of certain components. Consider the system structure function as depicted in Equation 20 and identify the pivot component i; then

P{φ(x) = 1} = E[xiφ(1i, x) + (1 − xi)φ(0i, x)] = E[xi] × E[φ(1i, x)] + E[1 − xi] × E[φ(0i, x)].

As xi is a Bernoulli random variable, thus:

P{φ(x) = 1} = pi × E[φ(1i, x)] + (1 − pi) × E[φ(0i, x)]. (40)

Since E[φ(1i, x)] = P{φ(1i, x) = 1} and E[φ(0i, x)] = P{φ(0i, x) = 1}, it may also be represented by

P{φ(x) = 1} = pi × P{φ(1i, x) = 1} + (1 − pi) × P{φ(0i, x) = 1}. (41)
Example 6: Consider the system composed of three components a, b and c depicted in Figure 9, where ϕ(xa, xb, xc) denotes the system structure function. As P{ϕ(x) = 1} = E[xiϕ(1i, x) + (1 − xi)ϕ(0i, x)], then:

P{φ(xa, xb, xc) = 1} = pa × E[φ(1a, xb, xc)] + (1 − pa) × E[φ(0a, xb, xc)].

But as E[φ(0a, xb, xc)] = 0:

P{φ(xa, xb, xc) = 1} = pa × E[φ(1a, xb, xc)].

Since E[φ(1a, xb, xc)] = P{φ(1a, xb, xc) = 1}, now factoring on component b,

P{φ(1a, xb, xc) = 1} = pb × E[φ(1a, 1b, xc)] + (1 − pb) × E[φ(1a, 0b, xc)],

then

P{φ(xa, xb, xc) = 1} = pa × [pb × E[φ(1a, 1b, xc)] + (1 − pb) × E[φ(1a, 0b, xc)]].

As E[φ(1a, 1b, xc)] = 1, thus:

P{φ(xa, xb, xc) = 1} = pa × [pb + (1 − pb) × E[φ(1a, 0b, xc)]].

Now, as we know that E[φ(1a, 0b, xc)] = P{φ(1a, 0b, xc) = 1} and

P{φ(1a, 0b, xc) = 1} = E[xcφ(1a, 0b, 1c) + (1 − xc)φ(1a, 0b, 0c)],

then

E[φ(1a, 0b, xc)] = E[xc] × E[φ(1a, 0b, 1c)] + E[1 − xc] × E[φ(1a, 0b, 0c)],

thus

E[φ(1a, 0b, xc)] = pc × E[φ(1a, 0b, 1c)] + (1 − pc) × E[φ(1a, 0b, 0c)].

As E[φ(1a, 0b, 1c)] = P{φ(1a, 0b, 1c) = 1} = 1 and E[φ(1a, 0b, 0c)] = P{φ(1a, 0b, 0c) = 1} = 0, then E[φ(1a, 0b, xc)] = pc. Therefore:

P{φ(xa, xb, xc) = 1} = pa × [pb + (1 − pb) × pc] = papb + papc(1 − pb),

which is

P{φ(xa, xb, xc) = 1} = pa × [1 − (1 − pb)(1 − pc)].

Figure 18. Series reduction
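Pivotal decomposition also yields a simple recursive evaluation algorithm: condition on one component at a time, exactly as in Equation 41. A minimal sketch (ours, not from the chapter):

def prob_by_factoring(phi, p, partial=()):
    # Equation 41 applied recursively, pivoting on component i = len(partial)
    i = len(partial)
    if i == len(p):
        return phi(partial)          # all components fixed: phi is 0 or 1
    return (p[i] * prob_by_factoring(phi, p, partial + (1,))
            + (1 - p[i]) * prob_by_factoring(phi, p, partial + (0,)))

# structure function of Example 6: a in series with (b parallel c)
phi = lambda x: x[0] * (1 - (1 - x[1]) * (1 - x[2]))
print(prob_by_factoring(phi, (0.9, 0.9, 0.9)))  # pa[1-(1-pb)(1-pc)] = 0.891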
REDUCTIONS
Plain series and parallel systems are the most fundamental dependability structures. The dependability of such systems is analyzed through the equations described in Section "Reliability Block Diagrams". Other more complex structures, such as k-out-of-n and bridge structures, may also be directly evaluated as single components using the
Figure 19. Parallel reduction
equations also presented in Section "Reliability Block Diagrams". The dependability evaluation of complex system structures may be conducted iteratively by identifying series, parallel, k-out-of-n and bridge subsystems, evaluating each of those subsystems, and then reducing each subsystem to one equivalent block. This process may be applied iteratively to the resulting structures until a single block remains. Consider a series system composed of n components (Figure 18.a) whose functioning probabilities are pi. This system may be reduced to a one-component equivalent system (Figure 18.b) whose functioning probability is

Ps = ∏_{i=1}^{n} pi.
A parallel system composed of n components (Figure 19.a), whose functioning probabilities are pi, may be reduced to a one-component equivalent system (Figure 19.b) whose functioning probability is

QP = 1 − ∏_{i=1}^{n} (1 − pi),

and its respective failure probability is PP = 1 − QP. K-out-of-n and bridge structures may also be represented by a one-component equivalent block,
as described in Section "Reliability Block Diagrams".

Example 7: Consider the system represented in Figure 20, composed of four basic blocks (b1, b2, b3, b5), one 2-out-of-3 block and one bridge structure. The three components of the 2-out-of-3 block are equivalent, that is, the success probability of each component is the same (p4). The reliabilities of blocks b1, b2, b3 and b5 are p1, p2, p3 and p5, and the reliabilities of the five components of the bridge structure are pb1, pb2, pb3, pb4 and pb5, respectively. The 2-out-of-3 structure can be represented by one equivalent block whose reliability is 3p4² − 2p4³ (Equation 34). The bridge structure can be transformed into one component, bb (see Figure 21), whose reliability is pbb = 1 − (1 − pb1pb2)(1 − pb4pb5)(1 − pb1pb3pb5)(1 − pb2pb3pb4). After that, two series reductions may be applied: one reducing blocks b2 and b3 to block b23, and a second that combines blocks b5 and bb into block b5b. The reliability of block b23 is p23 = p2 × p3, and the reliability of block b5b is
Figure 20. Reductions: System
Figure 21. Reductions: After bridge reduction
Figure 22. Reductions: After first series reductions
Figure 23. Reductions: After the parallel reduction
Figure 24. Reductions: Final RBD
p5b = p5 × [1 − (1 − pb1pb2)(1 − pb4pb5)(1 − pb1pb3pb5)(1 − pb2pb3pb4)]. The resulting RBD is depicted in Figure 22. Now a parallel reduction may be applied to merge blocks b23 and b4. Figure 23 shows the RBD after that reduction. Block b234 represents the composition of blocks b23 and b4, whose reliability is p234 = 1 − (1 − p2 × p3) × (1 − (3p4² − 2p4³)). Finally, a last series reduction may be applied to the RBD depicted in Figure 23, generating a one-block RBD (Figure 24) whose reliability is

p12345b = p1 × [1 − (1 − p2 × p3) × (1 − (3p4² − 2p4³))] × p5 × [1 − (1 − pb1pb2)(1 − pb4pb5)(1 − pb1pb3pb5)(1 − pb2pb3pb4)].
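The reduction chain of Example 7 is easy to mechanize. The sketch below is ours; the numeric block reliabilities are assumed purely for illustration, and the bridge block is reduced with the same expression used in Example 7.

from math import prod

def series(*ps):
    # series reduction: the composite works only if every block works
    return prod(ps)

def parallel(*ps):
    # parallel reduction: the composite fails only if every block fails
    return 1 - prod(1 - p for p in ps)

def two_of_three(p):
    # Equation 34 for a 2-out-of-3 block of identical components
    return 3 * p**2 - 2 * p**3

def bridge(p1, p2, p3, p4, p5):
    # bridge block reduced as in Example 7
    return 1 - (1 - p1*p2) * (1 - p4*p5) * (1 - p1*p3*p5) * (1 - p2*p3*p4)

# assumed reliabilities for b1, b2, b3, the 2-out-of-3 components (p4), b5
p1, p2, p3, p4, p5 = 0.99, 0.98, 0.97, 0.95, 0.96
pb = [0.9] * 5                      # assumed bridge component reliabilities
r = series(p1, parallel(series(p2, p3), two_of_three(p4)), p5, bridge(*pb))
print(r)                            # reliability of the one-block RBD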
COMPUTATION BASED ON MINIMAL PATHS AND MINIMAL CUTS

Structure functions define the arrangement of components in a system. These arrangements can also be expressed by path and cut sets. Consider a system S with n components, its structure function ϕ(x), and its set of components SCS = {c1, c2, …, cn}. A state vector x is named a path vector if ϕ(x) = 1, and the respective set of operational components is defined as a path set. More formally, the path set of a state vector is defined by PS(x) = {ci | ϕ(x) = 1, xi = 1, ci ∈ SCS}. A path vector x is called a minimal path vector if ϕ(y) = 0 for any y < x, and the respective path set is named a minimal path set, that is, MPS(x) = {ci | ci ∈ PS(x), ϕ(y) = 0 ∀ y < x}. A state vector x is named a cut vector if ϕ(x) = 0, and the respective set of faulty components is defined as a cut set. Therefore, CS(x) = {ci | ϕ(x) = 0, xi = 0, ci ∈ SCS}. A cut vector x is called a minimal cut vector if ϕ(y) = 1 for any y > x, and the respective cut set is named a minimal cut set, that is, MCS(x) = {ci | ci ∈ CS(x), ϕ(y) = 1 ∀ y > x}.
Example 8: Consider a system represented by the RBD presented in Figure 25. PS1 = {b1, b2}, PS2 = {b1, b3} and PS3 = {b1, b2, b3} are path sets; CS1 = {b1, b2}, CS2 = {b1, b3}, CS3 = {b1, b2, b3}, CS4 = {b1} and CS5 = {b2, b3} are cut sets. v1 = (1, 1, 0), v2 = (1, 0, 1) and v3 = (1, 1, 1) are the respective path vectors. The respective cut vectors are v4 = (0, 0, 1) (for CS1), v5 = (0, 1, 0) (for CS2), v6 = (0, 0, 0) (for CS3), v7 = (0, 1, 1) (for CS4) and v8 = (1, 0, 0) (for CS5). PS1 is a minimal path set, since for every y < v1, ϕ(y) = 0; that is, for v8 = (1, 0, 0), v5 = (0, 1, 0) and v6 = (0, 0, 0), ϕ(v8) = ϕ(v5) = ϕ(v6) = 0. The same is true for PS2, since for v8 = (1, 0, 0), v4 = (0, 0, 1) and v6 = (0, 0, 0) (v8 < v2, v4 < v2, v6 < v2), ϕ(v8) = ϕ(v4) = ϕ(v6) = 0. On the other hand, PS3 is not minimal, since v1 < v3 and v2 < v3, and ϕ(v1) = ϕ(v2) = 1. CS4 is a minimal cut set, since for v3 = (1, 1, 1), the only binary vector larger than v7, ϕ(v3) = 1. The cut set CS5 is also minimal, because for v1, v2 and v3 (the three binary vectors larger than v8), ϕ(v1) = ϕ(v2) = ϕ(v3) = 1. The same is not true for CS1, CS2 and CS3.

Consider a system S with arbitrary structure, with p minimal path sets {PS1, PS2, …, PSp} and k minimal cut sets {CS1, CS2, …, CSk}. The structure function of a particular minimal path set is

ζi(x) = ∏_{j ∈ PSi} xj. (42)

ζi(x) is named the minimal path series structure function. As the system S is working if at least one of the p minimal paths is functioning, the system structure function is
Figure 25. Sets and Cuts
φ(x) = ∐_{i=1}^{p} ζi(x) = 1 − ∏_{i=1}^{p} (1 − ζi(x)). (43)

Hence,

φ(x) = 1 − ∏_{i=1}^{p} (1 − ∏_{j ∈ PSi} xj) = ∐_{i=1}^{p} ∏_{j ∈ PSi} xj. (44)

Alternatively, considering the cut sets, the structure function of a particular minimal cut set is

κi(x) = 1 − ∏_{j ∈ CSi} (1 − xj) = ∐_{j ∈ CSi} xj. (45)

κi(x) is named the minimal cut parallel structure function. As the system S fails if at least one of the k minimal cuts fails, the system structure function is

φ(x) = ∏_{i=1}^{k} κi(x) = ∏_{i=1}^{k} ∐_{j ∈ CSi} xj. (46)
Example 9: The operational probability of the system depicted in Figure 25 can be computed either by Equation 44 or by Equation 46. The structure functions of the minimal path sets S1 = {b1, b2} and S2 = {b1, b3} are ζ1(x) = x1x2 and ζ2(x) = x1x3, respectively. Therefore,

φ(x) = 1 − (1 − x1x2)(1 − x1x3) = x1x2 + x1x3 − x1²x2x3.

As the xi are binary variables, xiⁿ = xi; hence

φ(x) = x1x2 + x1x3 − x1x2x3.

Since P{ϕ(x) = 1} = E[ϕ(x)], then

P{φ(x) = 1} = E[x1x2 + x1x3 − x1x2x3] = E[x1x2] + E[x1x3] − E[x1x2x3] = E[x1]E[x2] + E[x1]E[x3] − E[x1]E[x2]E[x3] = p1p2 + p1p3 − p1p2p3.

If the minimal cut structures S4 = {b1} and S5 = {b2, b3} are considered instead, the structure functions are κ4(x) = 1 − (1 − x1) = x1 and κ5(x) = 1 − [(1 − x2)(1 − x3)], respectively. Since the components are independent,
Figure 26. Minimal path sets
P{φ(x) = 1} = [1 − (1 − p1)] × [1 − (1 − p2)(1 − p3)] = p1 × [1 − (1 − p2)(1 − p3)],
which is equivalent to the first result. Minimal paths and cuts are important structures in the dependability evaluation of systems. Many evaluation methods are based on these structures; hence, methods have been proposed for computing minimal paths and cuts. This topic, however, is not covered in this chapter.
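Although such algorithms are outside the chapter's scope, a brute-force enumeration illustrates the definitions. The sketch below (ours; exponential in n) recovers the two minimal path sets of the RBD in Figure 25.

from itertools import product

def minimal_path_sets(phi, n):
    # all path sets: operational components of each state vector with phi = 1
    paths = {frozenset(i for i in range(n) if x[i])
             for x in product([0, 1], repeat=n) if phi(x)}
    # keep only those with no proper subset that is also a path set
    return [set(ps) for ps in paths if not any(qs < ps for qs in paths)]

phi = lambda x: x[0] * (1 - (1 - x[1]) * (1 - x[2]))  # b1 series (b2 || b3)
print(minimal_path_sets(phi, 3))  # [{0, 1}, {0, 2}], i.e. {b1,b2} and {b1,b3}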
SDP METHOD

This section introduces the sum-of-disjoint-products (SDP) method for calculating dependability measures. The SDP method uses minimal paths and cuts to compute the system failure probability or the system operational probability (the probability that the system is operational) by summing up the probabilities of disjoint terms. Many strategies have been proposed to derive the disjoint-product terms from minimal paths and cuts, and to decide how to consider them when computing the system probability of failure (or of functioning). The union of the minimal paths or cuts of a system can be represented by the system logic function, which may have several terms. If these terms are disjoint, then the dependability measure (reliability, availability or maintainability) can be directly computed by simple summation of the probabilities related to each term. Otherwise, the probability related to one event (path or cut) is summed with the probabilities of the events represented by disjoint products of terms.

Consider a system composed of three independent components b1, b2 and b3, depicted in Figure 25, where the components' operational probabilities are p1, p2 and p3, respectively. S1 = {b1, b2} and S2 = {b1, b3} are minimal path sets, and S4 = {b1} and S5 = {b2, b3} are minimal cut sets. The minimal path sets are depicted in Figure 26. As usual, let si be the Boolean variable and xi the state variable related to component bi. As S1 is a minimal path set, φ1(bs) = s1 ∧ s2 and ζ1(x) = x1x2. Therefore, the system operational probability is at least the operational probability of this path set, that is, P{ϕ(x) = 1} ≥ P{ζ1(x) = 1}, where

PS1 = P{ζ1(x) = 1} = E[x1x2] = E[x1]E[x2] = p1p2.

Now, if the second minimal path set (S2) is considered, its additional contribution should account only for the event in which the first path does not function. The logic function representing the minimal path set S2 is φ2(bs) = s1 ∧ s3, and φ̄1(bs) ∧ φ2(bs) = s1 ∧ s̄2 ∧ s3 (S1ᶜ ∩ S2 = {b3} identifies b3 as the only component of S2 not already in S1). The equivalent structure function is ζ12(x) = x1(1 − x2)x3, and P{ζ12(x) = 1} = p1(1 − p2)p3. As there is no other minimal path, P{ϕ(x) = 1} = P{ζ1(x) = 1} + P{ζ12(x) = 1} = p1p2 + p1(1 − p2)p3. The minimal cut sets may be considered instead of the minimal path sets. In this case, the first minimal cut provides an upper bound on the system operational probability, and each subsequent disjoint product tightens (reduces) that bound, since each additional cut introduces further terms that describe system failures. The above explanation shows how the SDP method uses the system logic function expressed as a union of disjoint products. The disjoint terms are products of events representing components that work or fail.
Figure 27. Event sets
Now, consider three sets representing minimal paths (or minimal cuts), named S1, S2 and S3 (see Figure 27). The sum of disjoint products may be represented by

S1 ∪ S2 ∪ S3 = S1 ∪ (S1ᶜ ∩ S2) ∪ (S1ᶜ ∩ S2ᶜ ∩ S3).

Denoting P{Si} = P{φi(bs)} = P{ζi(x) = 1}, then P{S1 ∪ S2 ∪ S3} = P{φ1(bs) ∨ φ2(bs) ∨ φ3(bs)}; therefore:

P{S1 ∪ S2 ∪ S3} = P{S1} + P{S1ᶜ ∩ S2} + P{S1ᶜ ∩ S2ᶜ ∩ S3}. (47)

The first term, P{S1}, is the contribution related to the first path (cut). The second term, P{S1ᶜ ∩ S2}, is the contribution of the second path (cut) S2 that has not been accounted for by S1, and the third term, P{S1ᶜ ∩ S2ᶜ ∩ S3}, is the contribution related to the path (cut) S3 that has been considered in neither S1 nor S2. Generalizing to n sets, the following expression is obtained:

P{φ(x) = 1} = P{∪_{i=1}^{n} Si} = P{S1} + Σ_{i=2}^{n} P{(∩_{j=1}^{i−1} Sjᶜ) ∩ Si} ⇔ P{φ(x) = 1} = P{φ1(bs)} + Σ_{i=2}^{n} P{φ̄1(bs) ∧ … ∧ φ̄i−1(bs) ∧ φi(bs)}. (48)

For implementing Expression 48, many algorithms have been proposed for the efficient evaluation of the additional contribution toward the union by events that have not been accounted for by any of the previous events (Kuo, W. & Zuo, M. J., 2003).

Example 10: Consider the RBD presented in Figure 25, where the operational probabilities are p1 = p2 = p3 = 0.990099. The minimal path sets are S1 = {b1, b2} and S2 = {b1, b3}, and the minimal cut sets are S4 = {b1} and S5 = {b2, b3}. The operational probability computed in the first iteration of Expression 48, considering the minimal path S1, is 0.980296. If the minimal cuts are adopted instead of the paths, and S4 is taken first, the operational probability bound is 0.990099. So P{ϕ(x) = 1} ∈ [0.980296, 0.990099]. In the second iteration, the operational probability calculated considering S1ᶜ ∩ S2 is 0.9900019. When adopting the cuts, the next (and sole) disjoint product is S4ᶜ ∩ S5, and the operational probability computed considering this additional term is also 0.9900019. The reader may observe that the two bounds have converged; thus, the system operational probability is P{ϕ(x) = 1} = 0.9900019.
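The numbers of Example 10 can be reproduced with a few lines of arithmetic. The sketch below is ours, applying Expression 48 to the two paths and the two cuts:

p1 = p2 = p3 = 0.990099

# path-based lower bounds: S1 = {b1,b2}, then the disjoint part of S2 = {b1,b3}
L1 = p1 * p2
L2 = L1 + p1 * (1 - p2) * p3

# cut-based upper bounds: S4 = {b1}, then the disjoint part of S5 = {b2,b3}
U1 = 1 - (1 - p1)
U2 = 1 - ((1 - p1) + p1 * (1 - p2) * (1 - p3))

print(L1, U1)   # 0.980296... and 0.990099: the first pair of bounds
print(L2, U2)   # both ~0.9900019: the bounds have converged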
DEPENDABILITY BOUNDS

The methods introduced so far derive exact dependability measures of systems. For large
systems, computing exact values may be a very computationally intensive task, which may make the evaluation impractical. Approximation methods provide bounds on the exact solutions in much shorter times. There are many approximation methods for estimating the dependability of systems. Among the most important, we may stress the EP (Esary-Proschan) method (Esary, J. D. & Proschan, F., 1970), the min-max method (Kuo, W. & Zuo, M. J., 2003) (Proschan, F. & Barlow, R. E., 1975), modular decomposition, and the adoption of the SDP method. This section introduces the min-max method and also applies the SDP method for computing bounds.
Min-Max

The min-max method provides dependability bounds for coherent systems using minimal paths and cuts. Consider a coherent system S, whose structure function is ϕ(x), with p minimal path sets (MPS = {PS1, PS2, …, PSp}) and k minimal cut sets (MCS = {CS1, CS2, …, CSk}), where ζi(x) is the structure function of path i and κj(x) is the structure function of a particular minimal cut set. P{ζi(x) = 1} is the operational probability of path i, whereas P{κj(x) = 0} is the probability that every component of minimal cut j has failed; note that P{κj(x) = 1} = 1 − P{κj(x) = 0}. A dependability lower bound may be obtained by

L = Max_{1≤i≤p} {P{ζi(x) = 1}},

and an upper bound may be calculated by

U = Min_{1≤j≤k} {P{κj(x) = 1}}.

Hence,

L ≤ P{φ(x) = 1} ≤ U.

Example 11: Consider the bridge system depicted in Figure 15. The system is composed of a set of five components, CS = {b1, b2, b3, b4, b5}. The sets of minimal paths and cuts are MPS = {PS1, PS2, PS3, PS4} and MCS = {CS1, CS2, CS3, CS4}, respectively, where PS1 = {b1, b2}, PS2 = {b4, b5}, PS3 = {b1, b3, b5}, PS4 = {b2, b3, b4}, CS1 = {b1, b4}, CS2 = {b2, b5}, CS3 = {b1, b3, b5} and CS4 = {b2, b3, b4}. The state variables of each component are named by a labeling function lf: CS → x, where x = {xi | bi ∈ CS}; therefore the set of state variables is x = {x1, x2, x3, x4, x5}. (The reader should bear in mind that x has been interchangeably adopted to represent sets and vectors whenever the context is clear.) The structure functions of the minimal paths are ζ1(x) = x1x2, ζ2(x) = x4x5, ζ3(x) = x1x3x5 and ζ4(x) = x2x3x4. Since the xi are Bernoulli random variables, P{xi = 1} = E[xi] = pi, and as the variables xi are independent, P{ζ1(x) = 1} = p1p2, P{ζ2(x) = 1} = p4p5, P{ζ3(x) = 1} = p1p3p5 and P{ζ4(x) = 1} = p2p3p4. If p1 = p2 = p3 = p4 = p5 = p, then P{ζ1(x) = 1} = P{ζ2(x) = 1} = p² and P{ζ3(x) = 1} = P{ζ4(x) = 1} = p³. As p ∈ (0, 1) and L = Max_{1≤i≤p}{P{ζi(x) = 1}}, thus
L = Max_{1≤i≤4} {P{ζi(x) = 1}} = Max{P{ζ1(x) = 1}, P{ζ2(x) = 1}, P{ζ3(x) = 1}, P{ζ4(x) = 1}} = Max{p², p², p³, p³} = p².
The structure function of a cut set is κj(x) = 1 − ∏_{i ∈ CSj} (1 − xi) = ∐_{i ∈ CSj} xi; hence κ1(x) = 1 − [(1 − x1) × (1 − x4)], κ2(x) = 1 − [(1 − x2) × (1 − x5)], κ3(x) = 1 − [(1 − x1) × (1 − x3) × (1 − x5)],
κ4(x) = 1 − [(1 − x2) × (1 − x3) × (1 − x4)]. Considering the minimal cut set CS1:

P{κ1(x) = 1} = E[1 − (1 − x1) × (1 − x4)] = 1 − E[(1 − x1) × (1 − x4)] = 1 − E[1 − x1] × E[1 − x4] = 1 − (1 − E[x1]) × (1 − E[x4]) = 1 − (1 − p1) × (1 − p4).

As p1 = p4 = p, then P{κ1(x) = 1} = 1 − (1 − p)². Adopting the same process, P{κ2(x) = 1}, P{κ3(x) = 1} and P{κ4(x) = 1} are calculated: P{κ2(x) = 1} = 1 − (1 − p)², P{κ3(x) = 1} = 1 − (1 − p)³ and P{κ4(x) = 1} = 1 − (1 − p)³. As p ∈ (0, 1) and U = Min_{1≤j≤4}{P{κj(x) = 1}}, then

U = Min{1 − (1 − p)², 1 − (1 − p)², 1 − (1 − p)³, 1 − (1 − p)³} = 1 − (1 − p)².

Therefore p² ≤ P{ϕ(x) = 1} ≤ 1 − (1 − p)². If p = 0.99, then 0.9801 ≤ P{ϕ(x) = 1} ≤ 0.9999.

SDP

The SDP method, described in Section "SDP Method", provides consecutive lower (upper) bounds when adopting successive minimal paths (cuts). The basic algorithm adopted in the SDP method orders the minimal paths and cuts from shorter to longer ones. If the system components are similarly dependable, longer paths provide smaller contributions to the lower bound, whereas shorter cuts provide the largest reductions of the upper bound. Consider an RBD with n minimal paths and m minimal cuts. First, take into account only the set of minimal path sets, MPS = {PSi | 1 ≤ i ≤ n}. According to Equation 48, the system dependability (reliability, availability, etc.) may be successively expressed so as to find tighter and tighter lower bounds. Hence

L1 = P{PS1} = P{φp1(bs)},

L2 = L1 + P{PS1ᶜ ∩ PS2} = L1 + P{φ̄p1(bs) ∧ φp2(bs)},

…

P{φ(x) = 1} = Ln = Ln−1 + P{PS1ᶜ ∩ PS2ᶜ ∩ … ∩ PSn−1ᶜ ∩ PSn} = Ln−1 + P{φ̄p1(bs) ∧ φ̄p2(bs) ∧ … ∧ φ̄p(n−1)(bs) ∧ φpn(bs)}.

Now, consider the set of minimal cut sets, MCS = {CSj | 1 ≤ j ≤ m}. As P{κj(x) = 0} is the probability that every component of minimal cut j has failed, it represents a system failure probability (unreliability, unavailability, etc.) related to cut j. Note that φcj(bs) = T ⇔ κj(x) = 0. Again adopting Equation 48, these failure probabilities may be successively computed. Thus

FPL1 = P{CS1} = P{φc1(bs)},

FPL2 = FPL1 + P{CS1ᶜ ∩ CS2} = FPL1 + P{φ̄c1(bs) ∧ φc2(bs)},

…

P{φ(x) = 0} = FPLm = FPLm−1 + P{CS1ᶜ ∩ CS2ᶜ ∩ … ∩ CSm−1ᶜ ∩ CSm} = FPLm−1 + P{φ̄c1(bs) ∧ φ̄c2(bs) ∧ … ∧ φ̄c(m−1)(bs) ∧ φcm(bs)}.

As P{φ(x) = 1} = 1 − P{φ(x) = 0}, then

U1 = 1 − FPL1 = 1 − P{CS1} = 1 − P{φc1(bs)},

U2 = 1 − FPL2 = 1 − [FPL1 + P{CS1ᶜ ∩ CS2}] = 1 − [FPL1 + P{φ̄c1(bs) ∧ φc2(bs)}],

…

P{φ(x) = 1} = Um = 1 − FPLm = 1 − [FPLm−1 + P{CS1ᶜ ∩ CS2ᶜ ∩ … ∩ CSm−1ᶜ ∩ CSm}] = 1 − [FPLm−1 + P{φ̄c1(bs) ∧ φ̄c2(bs) ∧ … ∧ φ̄c(m−1)(bs) ∧ φcm(bs)}].
Example 12: Consider again the bridge system depicted in Figure 15, in which the components are independent. The sets of minimal paths and cuts are MPS = {PS1, PS2, PS3, PS4} and MCS = {CS1, CS2, CS3, CS4}, respectively, where PS1 = {b1, b2}, PS2 = {b4, b5}, PS3 = {b1, b3, b5}, PS4 = {b2, b3, b4}, CS1 = {b1, b4}, CS2 = {b2, b5}, CS3 = {b1, b3, b5} and CS4 = {b2, b3, b4}; bs = {si | bi ∈ CS} and x = {xi | bi ∈ CS} are the sets of the respective Boolean and state variables. Considering the minimal path sets,

L1 = P{PS1} = P{φp1(bs)} = P{ζp1(x) = 1}.

As φp1(bs) = s1 ∧ s2, then ζp1(x) = x1 × x2; hence

L1 = E[ζp1(x)] = E[x1 × x2] = E[x1] × E[x2] = p1 × p2.

Adopting the same process, successive tighter lower bounds are obtained. The second and third bounds are

L2 = p1 × p2 + (1 − p1) × p4 × p5 + p1 × (1 − p2) × p4 × p5,

L3 = p1 × p2 + (1 − p1) × p4 × p5 + p1 × (1 − p2) × p4 × p5 + p1 × (1 − p2) × p3 × (1 − p4) × p5.

The fourth and last bound (the exact value) is then calculated by

L4 = L3 + P{φ̄p1(bs) ∧ φ̄p2(bs) ∧ φ̄p3(bs) ∧ φp4(bs)}.

Hence

L4 = p1p2 + (1 − p1)p4p5 + p1(1 − p2)p4p5 + p1(1 − p2)p3(1 − p4)p5 + (1 − p1)p2p3p4(1 − p5).
If p1 = p2 = p3 = p4 = p5 = p, then

P{φ(x) = 1} = L4 = 2p³(1 − p)² + p³(1 − p) + p²(1 − p) + p².
The same process may be applied considering the minimal cut sets in order to obtain upper bounds. Let p = 0.90909091. If this process is applied to the minimal paths and minimal cuts, the lower and upper bounds shown in Table 3 are obtained.
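Table 3 can be verified by brute force: for each iteration i, compute the probability that at least one of the first i minimal paths works (lower bound) and the probability that none of the first i minimal cuts has entirely failed (upper bound). The sketch below is ours, for the bridge with p = 0.90909091.

from itertools import product

paths = [{0, 1}, {3, 4}, {0, 2, 4}, {1, 2, 3}]   # PS1..PS4 (0-based indices)
cuts = [{0, 3}, {1, 4}, {0, 2, 4}, {1, 2, 3}]    # CS1..CS4
p = 0.90909091

def state_prob(x):
    w = 1.0
    for xi in x:
        w *= p if xi else (1 - p)
    return w

def lower(i):
    # P{at least one of the first i minimal paths is functioning}
    return sum(state_prob(x) for x in product([0, 1], repeat=5)
               if any(all(x[j] for j in ps) for ps in paths[:i]))

def upper(i):
    # 1 - P{at least one of the first i minimal cuts has fully failed}
    return 1 - sum(state_prob(x) for x in product([0, 1], repeat=5)
                   if any(all(not x[j] for j in cs) for cs in cuts[:i]))

for i in range(1, 5):
    print(i, lower(i), upper(i))   # reproduces the rows of Table 3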
State-Space Models

Dependability models can be broadly classified into non-state-space (also called combinatorial) models and state-space models. Non-state-space models such as RBDs and FTs, introduced in the previous sections, can be easily formulated and solved for system dependability
Table 3. Successive lower and upper bounds

Iteration i    Li                Ui
1              0.826446280992    0.991735537190
2              0.969879106618    0.983539375726
3              0.976088319849    0.982918454403
4              0.982297533080    0.982297533080

Note that at the fourth (last) iteration the lower and upper bounds are equal, since we then have the exact result.
under the assumption of stochastic independence between system components; that is, the failure or recovery (or any other behavior) of a component is not affected by the behavior of any other component. To model more complicated interactions between components, we use other types of stochastic models, such as Markov chains or, more generally, state-space models. In this subsection, we briefly look at the most widely adopted state-space models, namely Markov chains. First introduced by Andrei Andreevich Markov in 1907, Markov chains have been in intensive use in dependability modeling and analysis since around the fifties. We provide a short introduction to discrete-state stochastic processes and their fundamental characteristics, including stationarity, homogeneity and the memoryless property. We then briefly introduce the Discrete Time Markov Chain (DTMC), the Continuous Time Markov Chain (CTMC), and the semi-Markov process. Finally, we briefly introduce stochastic Petri nets (SPN), a high-level formalism to automate the generation of Markov chains. A stochastic process is a family of random variables X(t) defined on a sample space. The values assumed by X(t) are called states, and the set of all possible states is the state space, I. The state space of a stochastic process is either discrete or continuous; if the state space is discrete, the stochastic process is called a chain. The time parameter (also referred to as the index set) of a stochastic process is likewise either discrete or continuous. If the time parameter of a stochastic process
is discrete (finite or countably infinite), then we have a discrete-time (parameter) process. Similarly, if the time parameter of a stochastic process is continuous, then we have a continuous-time (parameter) process. A stochastic process can be classified by the dependence of its state at a particular time on the states at previous times. If the state of a stochastic process depends only on the immediately preceding state, we have a Markov process. In other words, a Markov process is a stochastic process whose dynamic behavior is such that the probability distributions for its future development depend only on the present state and not on how the process arrived in that state: at the time of a transition, the entire past history is summarized by the current state. If we assume that the state space, I, is discrete (finite or countably infinite), then the Markov process is known as a Markov chain (or discrete-state Markov process). If we further assume that the parameter space, T, is also discrete, then we have a discrete-time Markov chain (DTMC), whereas if the parameter space is continuous, then we have a continuous-time Markov chain (CTMC). The changes of state of the system are called transitions, and the probabilities associated with the various state changes are called transition probabilities. A Markov chain whose transition probabilities are independent of time (step) and depend only on the state is said to be (time-)homogeneous. In a homogeneous DTMC, the sojourn time in a state follows a geometric distribution. The steady-state and transient solution methods for DTMCs and CTMCs are described in more detail in (Sahner, R. A., Trivedi, K. S., & Puliafito, A., 1996) (Trivedi, K. S., 2002). The analysis of a CTMC is similar to that of a DTMC, except that transitions from a given state to another can happen at any instant of time: the time parameter ranges over a continuous set, while the set of values of X(t) remains discrete. CTMCs are useful models for performance as well as availability prediction. We show CTMC models in the case studies
section. The extension of CTMCs to Markov reward models (MRMs) makes them even more useful. An MRM is obtained by attaching a reward rate (or weight) ri to each state i of the CTMC; we use an MRM in case study 2 to compute capacity-oriented availability. For a homogeneous CTMC, the sojourn time (the amount of time spent in a state) is exponentially distributed. If we lift this restriction and allow the sojourn time in a state to follow any (non-exponential) distribution function, the process is called a semi-Markov process (SMP). The SMP is not covered in this chapter; more details can be found in (Trivedi, K. S., 2002). Markov chains are drawn as directed graphs, with transitions labeled by probabilities, rates, or distributions for homogeneous DTMCs, CTMCs, and SMPs, respectively. In Markov chains, states represent the various conditions of the system; states can keep track of the number of functioning resources and of the recovery state of each failed resource. The transitions between states indicate occurrences of events. A transition can occur from any state to any other state and can represent a simple or a compound event. Hand construction of a Markov model is tedious and error-prone, especially when the number of states becomes very large. The Petri net (PN) is a graphical paradigm for the formal description of the logical interactions among parts, or of the flow of activities, in complex systems. The original PN did not have a notion of time; for dependability analysis, it is necessary to introduce durations for the events associated with PN transitions. PNs can be extended by associating time with the firing of transitions, resulting in timed Petri nets. A special case of timed Petri nets is the stochastic Petri net (SPN), where the firing times are considered to be random variables with exponential distributions. The SPN model can be automatically converted into the underlying Markov model and solved. An SPN is a bipartite directed graph consisting of two kinds of nodes, places and transitions. Places typically represent conditions within the system being modeled. Transitions represent events occurring
in the system that may cause a change in the condition of the system. Tokens are dots (or integers) associated with places; a place containing tokens indicates that the corresponding condition holds. Arcs connect places to transitions (input arcs) and transitions to places (output arcs). An arc cardinality (or multiplicity) may be associated with input and output arcs, whereby the enabling and firing rules are changed accordingly. Inhibitor arcs are represented with a circle-headed arc; the transition can fire iff the inhibitor place does not contain any tokens. A priority level can be attached to each PN transition; among all the transitions enabled in a given marking, only those with the highest associated priority level are allowed to fire. An enabling (or guard) function is a Boolean expression composed from the PN primitives (places, transitions, tokens). Sometimes, when some events take an extremely small time to occur, it is useful to model them as instantaneous activities. SPN models were extended into generalized SPNs to allow for such modeling, by allowing some transitions, called immediate transitions, to have zero firing time. For further details see (Trivedi, K. S., 2002).
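Before turning to the case studies, a tiny concrete illustration of CTMC solution may help (ours, not from the chapter). For a single component that fails at rate λ and is repaired at rate μ, solving πQ = 0 with Σπ = 1 yields the well-known steady-state availability μ/(λ + μ).

import numpy as np

lam, mu = 1 / 1000.0, 1 / 2.0      # assumed failure and repair rates (1/h)
Q = np.array([[-lam, lam],         # state 0: UP
              [mu, -mu]])          # state 1: DOWN (generator matrix)

# solve pi Q = 0 together with the normalization sum(pi) = 1
A = np.vstack([Q.T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi = np.linalg.lstsq(A, b, rcond=None)[0]
print(pi[0], mu / (lam + mu))      # both ~0.998004: steady-state availability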
CASE STUDIES

In this section we present two case studies that illustrate the application of the aforementioned techniques to areas related to service computing. Some of these cases are built on the authors' experience developing dependability research for companies in the IT industry. Even the hypothetical cases are developed as close as possible to real-world scenarios.
Multiprocessor Subsystem

The main focus of this study is to evaluate the availability of a multiprocessor processing subsystem in terms of individual and multiple processor failures. Nowadays, multiprocessor computing platforms are at the core of many service computing
Figure 28. Dual Quad-core CPU subsystem model
projects. The modeling technique used in this case study is the CTMC. This subsystem is an integral part of many computing platforms, and this study can easily be generalized. The modeled processing subsystem is based on a symmetric multiprocessing dual quad-core processor platform, so it is composed of two physical CPUs, where each CPU (processor) contains four cores. Typically, in this case the operating system (e.g., Linux or MS Windows) considers four logical processors for each physical CPU, a total of eight logical processors; the OS believes it is running on an 8-way machine. We assume that this system is serving critical applications in a data center environment, so from a hardware standpoint we assume that, in case of a failure in one physical CPU (e.g., CPU0), the computer is able to work with the second CPU (e.g., CPU1) after rebooting the system. In this situation the system is considered to be running in degraded mode until the failed CPU is replaced. This reboot-oriented recovery property is very important to assure a lower downtime, having a significant influence on the overall system availability. Hence, we assume that the modeled motherboard contains sockets for two physical CPUs on dual independent point-to-point system buses. If only one physical CPU is operational, these individual physical buses connecting each CPU directly to
the motherboard's NorthBridge offer the physical possibility of having either of the two physical processors, independently of the socket used, boot up and run the operating system. Hence, the OS kernel is not restricted to starting up from a specific processor. For example, motherboards compatible with the Intel "Blackford-P" 5000P chipset implement the abovementioned capabilities, and our modeling assumption is based on them. As described above, although each physical processor has four cores, an individual core failure may be considered an unlikely event. For that reason we assume that a processor failure will stop the entire physical processor and therefore its four cores. Figure 28 shows the modeled platform with two quad-core CPUs. States UP and U1 are up states. State UP represents the subsystem with two operational physical processors. When a processor fails, the subsystem transits from state UP to D1. We assume that a possible common-mode failure can bring all processors down with probability 1 − Ccpu, thus transiting from state UP to DW. In states D1 and DW the computer is down due to a system crash caused by a processor failure. We assume that after such an event the next action taken by the administrator is to turn on the computer, and then with probability Capp the computer will come up with only one physical processor (four
cores). The mean time to reinitialize (reboot) the system is 1/βapp. A successful initialization after a processor failure brings the subsystem to state U1; since it is running with only one physical processor, it can fail with a mean time to failure of 1/λcpu1. Parameter λcpu1 is usually assigned a higher value than λcpu, indicating a shorter lifetime for the remaining processor, which runs under a higher workload than in dual-processor mode. While in degraded mode, the system administrator can decide to request a processor replacement in order to restore the system to full processing capacity. This decision is indicated by the θcpu1 value. From states DW and U1 the only possible transition is to FR (the field replacement service state). An important aspect of availability modeling and analysis is the repair service. It has a direct impact on the MTTR (mean time to repair) metric necessary for the system availability analysis. In a data center environment, a specific field support staff is responsible for the replacement of failed parts. These parts are commonly categorized as customer replaceable units (CRU) or field replaceable units (FRU). These are industry terms for those parts of the system that are designed to be replaced either by customer personnel (CRU) or only by an authorized manufacturer representative (FRU). The main difference is the time to repair: an FRU involves a longer repair time than a CRU because it is not available locally, and hence the travel time must be accounted for.

Table 4. Input parameters of the CPU subsystem model

Params      Description                                                                      Value
1/λcpu      mean time for processor failure, operating with 2 processors                    1,000,000 hours
1/λcpu1     mean time for processor failure, operating with 1 processor                     (1/λcpu) / 2
Ccpu        coverage factor for CPU failure                                                 0.99
1/βapp      mean time to reboot the whole computer                                          2 minutes
Capp        coverage factor for appliance failure after reboot due to a processor failure   0.95
θcpu1       decision factor for keeping the system running with one processor               True (1)
1/αspFRU    mean time for new appliance arrival (FRU)                                       4 hours
1/μappFRU   mean time to install the new CPU (FRU)                                          30 minutes
Since the CPU is very often considered an FRU in real data centers, we modeled the repair service under this assumption. Hence, from state DW the replacement of the failed CPU is requested; the service personnel arrive with a mean time of 1/αspFRU and afterward fix and reboot the system (transition from FR to UP). Table 4 shows the CPU subsystem model parameter values. Most of these values are based on industry standards and specialist judgment. The breakdown analysis of the downtime (Table 5 and Figure 29) shows that the contribution of state D1 to the total downtime is very significant. Hence, actions should be taken to reduce the time spent in D1, by reducing the reboot time and by increasing the coverage factor Capp. This multiprocessor subsystem model can be used in conjunction with other models (e.g., cooling, power supply, storage, etc.) to compose a more complex system such as an entire server machine or even a cluster of servers.
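As an aside, such a CTMC can be solved numerically in a few lines of code. The sketch below is a minimal illustration assuming one plausible reading of Figure 28 and Table 4; the exact transition structure (e.g., the factor of 2 on the failure rate out of UP, and sending a second processor failure in U1 to DW) is our guess from the prose, so its output will not necessarily reproduce Table 5 exactly:

    import numpy as np

    # States (one plausible reading of Figure 28): UP = both CPUs up,
    # D1 = down after a covered CPU failure, DW = down (uncovered failure),
    # U1 = degraded, one CPU up, FR = field replacement in progress.
    states = ["UP", "D1", "DW", "U1", "FR"]
    ix = {s: k for k, s in enumerate(states)}

    # Rates per hour, from Table 4; the transition structure is our assumption.
    lam   = 1 / 1_000_000        # per-CPU failure rate, two CPUs running
    lam1  = 2 * lam              # remaining CPU under higher workload
    Ccpu, Capp = 0.99, 0.95      # coverage factors
    beta  = 60 / 2               # reboot: 2 minutes
    alpha = 1 / 4                # FRU service arrival: 4 hours
    mu    = 60 / 30              # CPU installation: 30 minutes

    Q = np.zeros((5, 5))
    Q[ix["UP"], ix["D1"]] = 2 * lam * Ccpu         # covered CPU failure
    Q[ix["UP"], ix["DW"]] = 2 * lam * (1 - Ccpu)   # common-mode failure
    Q[ix["D1"], ix["U1"]] = beta * Capp            # reboot succeeds on one CPU
    Q[ix["D1"], ix["DW"]] = beta * (1 - Capp)      # reboot fails
    Q[ix["U1"], ix["DW"]] = lam1                   # second failure (assumption)
    Q[ix["U1"], ix["FR"]] = alpha                  # replacement requested (θ=1)
    Q[ix["DW"], ix["FR"]] = alpha                  # field service arrives
    Q[ix["FR"], ix["UP"]] = mu                     # new CPU installed

    np.fill_diagonal(Q, -Q.sum(axis=1))            # generator rows sum to zero
    A = Q.copy(); A[:, -1] = 1.0                   # replace one eq. by sum(pi)=1
    pi = np.linalg.solve(A.T, np.eye(5)[-1])       # solve pi Q = 0, sum(pi) = 1

    avail = pi[ix["UP"]] + pi[ix["U1"]]
    print("steady-state availability:", avail)
    print("downtime (min/year):", (1 - avail) * 365 * 24 * 60)
    for s in ("D1", "DW", "FR"):                   # per-state breakdown, as in Table 5
        print(s, pi[ix[s]] * 365 * 24 * 60)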
A Virtualized Subsystem

This case study presents the availability modeling and analysis of a virtualized system (Kim, Machida, & Trivedi, 2009). Service computing is highly dependent on data center infrastructure and virtualization technologies. We develop an availability model of a virtualized system using a hierarchical model in which fault trees are used in the upper level and homogeneous
Table 5. Breakdown of the downtime

State    Downtime contribution
D1       2.08134903e+000
DW       2.50190267e-001
FR       5.25593188e-001
continuous-time Markov chains (CTMC) are used to represent the sub-models in the lower level. We incorporate not only hardware failures (e.g., CPU, memory, power, etc.) but also software failures, including virtual machine monitor (VMM), virtual machine (VM), and application failures. We also consider the high availability (HA) feature and VM live migration in the virtualized system. The metrics we use are the system steady-state availability, the downtime in minutes per year, and the capacity oriented availability. Figure 30 shows a virtualized two-host system. The virtualized system consists of two physical virtualized servers (called hosts from now on), where each host has one VM running on the VMM in the host. The two virtualized hosts share a common SAN (storage area network). This configuration is commonly used to support VM live migration. The applications running on the VMs can be the same or different; we assume that the application is the same and denote them as APP1 and APP2 to distinguish
Figure 29. Breakdown downtime sensitivity analysis
them; this is thus an active/active configuration in a virtualized system. We define the system unavailability as the probability that both hosts are down. Figure 31 shows the virtualized system availability models, in which the top-level fault tree models include both hardware and software failures. H1 and H2 represent host1 and host2 failure, respectively. HW1 and HW2 represent hardware failures of host1 and host2, respectively. HW1 consists of the CPU (CPU1), memory (Mem1), network (Net1), power (Pwr1), and cooler (Coo1). A leaf node drawn as a square means that a sub-model is defined for it. The system can fail if the hardware (including the VMM) of both hosts fails, if the SAN fails, or if both VMs fail. This is because the SAN is shared by the two hosts and the VMs on one host can be migrated to the other host. Below we discuss the VMs availability model in detail. The description of the other sub-models can be found in (Kim, Machida, & Trivedi, 2009). Figure 32 shows the VMs subsystem availability model. We use an active/active configuration and consider the system to be up if at least one application is working properly. In the top state UUxUUx all the components are up. State UUxUUx represents, from left to right: host1 (in short, H1) is up, VM1 (in short, V1) is up, host1 has the capacity to run another VM (in short, x), host2 (in short, H2) is up, and VM2 (in short, V2)
Figure 30. Architecture of the two-host virtualized system
Figure 31. System availability model for the virtualized system
is up, and host2 has the capacity to run another VM. If H1 fails with rate λh, the model enters state FUxUUx. The failure of H1 is detected using one of the failure detection mechanisms (for instance, a
heartbeat mechanism every 30 seconds). Once the H1 failure is detected with rate δh (in state DxxUUR, 'D' stands for detection of the host failure), V1 is restarted on H2 (note that H2 had the capacity
Figure 32. VMs availability model
to receive V1, as denoted by x; the restart of a VM is denoted by 'R'). This is called the VM High Availability (HA) service in VMware. It takes a mean time of 1/rv to restart a VM on the other host. The host is then repaired with rate μh (state UxxUUU, H1 is repaired). Now there are two VMs on H2, i.e., V1 and V2. To make full use of system resources, V1 on H2 may be migrated back to H1 with rate mv (i.e., the transition from state UxxUUU to UUxUUx). We begin again with state UUxUUx, in which both hosts are up and both VMs on the hosts are also up. If H2 fails (state UUxFUx), as soon as the failure is detected, V2 is restarted on H1 (state UURDxx), and it is migrated back to H2 once H2 is repaired. Now we describe a second host failure and recovery. We
can begin with state UxxUUU or UUUUxx. In state UxxUUU, both hosts are up and the two VMs are on H2. If H2 fails with mean time 1/λh, the model enters state UxxFUU, and then the host failure is detected with rate δh (state UxxDUU). In this state there are two VMs; only one VM can be migrated at a time, so the two VMs compete with each other and the migration rate is 2rv. The next state will be UURDxx. Similarly, from state DUUUxx, two VMs can be migrated to the other host. Next, we incorporate VM failure and recovery. If V1 fails, the model goes to state UFxUUx, in which H1 is up but V1 has failed. It takes a mean time of 1/δv to detect the VM failure, and then a mean time of 1/μv to recover. In some cases, if the VM is not recovered, it needs to be restarted on the other host. In this
case, the VM on H1 is migrated to H2 with mean time 1/mv. Then it is restarted on H2 (state UxxUUR) and recovered (state UxxUUU). To capture this imperfect recovery we use a coverage factor for the VM, cv, so the rates become cvμv and (1-cv)μv. Obviously, V2 can also fail (state UUxUFx); it takes a mean time of 1/δv to detect the failure. It can be recovered with rate cvμv; otherwise it needs to be migrated to H1 with rate (1-cv)μv. The coverage factor cv can be determined by fault injection experiments. So far we have incorporated host failures as well as VM failures; we also incorporate application failures. We use the notation UUxf_UUx to represent a failure of application1 on V1 on H1, where 'f' means the failure of application1 on V1 (we use the underscore '_' to distinguish between H1 and H2). If the failure of application1 is detected (UUxd_UUx) with rate δa, it can be recovered with rate caμ1a, after which the model returns to state UUxUUx. Otherwise it sometimes needs an additional recovery action (state UUxp_UUx) with repair rate μ2a. Application2 on V2 on H2 can also fail (UUx_UUxf). The failure can be detected with mean time 1/δa, and it can be recovered with mean time 1/μ1a. Otherwise, it needs additional recovery steps with mean time 1/μ2a.

Table 6. Input parameters of the VMs model

Params    Description                                 Value
1/λh      mean time for host failure                  host MTTFeq
1/λv      mean time for VM failure                    2160 hours
1/λa      mean time for application failure           336 hours
1/δh      mean time for host failure detection        30 seconds
1/δv      mean time for VM failure detection          30 seconds
1/δa      mean time for app. failure detection        30 seconds
1/mv      mean time to migrate a VM                   5 minutes
1/rv      mean time to restart a VM                   5 minutes
1/μv      mean time to repair a VM                    30 minutes
1/μ1a     mean time for first application repair      20 minutes
1/μ2a     mean time for second application repair     1 hour
1/μh      mean time to repair a host failure          host MTTReq
cv        coverage factor for VM repair               0.95
ca        coverage factor for application repair      0.9

The output measures, such as steady-state availability, downtime, and capacity oriented availability (COA), are computed using the hierarchical model. We used SHARPE (Trivedi & Sahner, 2009) to compute them. We compute the mean time to failure equivalent (MTTFeq) and mean time to repair equivalent (MTTReq) of the Markov sub-models (such as the CPU and memory availability models) by feeding in the input parameter values. The MTTFeq and MTTReq of each sub-model are used to compute the MTTFeq and MTTReq of a host, which are in turn used in the VMs availability model (1/MTTFeq and 1/MTTReq of a host equal λh and μh, respectively). Finally, we evaluate the system availability by feeding all the input parameter values from Table 6 into the sub-models of the system availability model shown in Figure 31. The steady-state availability and the downtime in minutes per year of the virtualized two-host system are summarized in Table 7. We also compute the COA by assigning reward rates to each state of the VMs availability model (so this Markov chain becomes a Markov
reward model). Reward rate 1 is assigned to states where one VM is running on each host (e.g., UUxUUx); reward rate 0.75 is assigned to states where two VMs are running on one host (e.g., UUUUxx, UxxUUU, UUUDxx, etc.); reward rate 0.5 is assigned to states where only one VM is running on a host (e.g., UUxFxx, UUxDxx, etc.); and zero reward rate is assigned to all other states. The computed COA of the virtualized system is also shown in Table 7.

Table 7. Output measures of the virtualized system

Output measure                                            Value
Steady state availability (SSA)                           9.99766977e-001
Downtime in minutes per year (DT)                         1.22476683e+002
Capacity oriented availability (COA) of the VMs model     9.96974481e-001

Figure 33. Unavailability vs. mean time to VM/VMM failure

Figure 34. COA vs. mean time to VM restart (migrate)

The sensitivity to some parameters is shown in Figures 33 and 34. Figure 33 shows the unavailability vs. the mean time to VM failure (1/λv in Table 6) and to VMM failure (1/λVMM). As seen from the figure, the system unavailability drops as the mean time to VMM failure increases; beyond about 750 hours, however, further increasing the mean time to VMM failure does not reduce the system unavailability much. Also from Figure 33, the system unavailability does not change much as the mean time to VM failure increases, since another VM on the other host keeps working properly. Figure 34 shows the COA vs. the mean time to restart (migrate) a VM. The COA drops as the mean time to restart and the mean time to migrate a VM increase. Therefore it is important to minimize the mean time to restart (migrate) a VM in order to maximize the COA. More case studies using the hierarchical modeling approach can be found in (Trivedi et al., 2008), (Trivedi, 2002), and (Sahner, Trivedi, & Puliafito, 1996).
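As an illustration of the reward-rate computation, the fragment below computes the COA as the expected steady-state reward. The state names follow the chapter, but the probability values are hypothetical placeholders; in practice they come from solving the VMs CTMC (e.g., with SHARPE or a numerical solver like the one sketched earlier):

    # Sketch: capacity-oriented availability (COA) as the expected reward of
    # a Markov reward model. The steady-state probabilities below are
    # illustrative placeholders, not the ones computed in this chapter.

    reward = {
        "UUxUUx": 1.0,    # one VM running on each host -> full capacity
        "UxxUUU": 0.75,   # two VMs consolidated on one host
        "UUUUxx": 0.75,
        "UUxDxx": 0.5,    # only one VM running
        "UUxFxx": 0.5,
        # all remaining (down) states implicitly earn reward 0
    }

    pi = {"UUxUUx": 0.9990, "UxxUUU": 0.0004, "UUUUxx": 0.0004,
          "UUxDxx": 0.0001, "UUxFxx": 0.0001}   # hypothetical values

    coa = sum(pi.get(s, 0.0) * r for s, r in reward.items())
    print(f"COA = {coa:.6f}")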
CONCLUSION

Dependability modeling and evaluation is an area with a long tradition and sound foundations. Dependability studies are particularly important for the design success of critical systems. In this chapter, we began by introducing some seminal and important works that widely influenced the development of this research area. Then we introduced some model types and important analysis methods. This chapter focused more on non-state-space models than on state-space models; nevertheless, the case studies presented considered both classes of models.
REFERENCES

Anselone, P. M. (1960). Persistence of an Effect of a Success in a Bernoulli Sequence. Journal of the Society for Industrial and Applied Mathematics, 8(2), 272–279. doi:10.1137/0108015
Answers to Questions Relative to High Tension Transmission. (1904). Transactions of the American Institute of Electrical Engineers, XXIII, 571–604. doi:10.1109/T-AIEE.1904.4764484

Avizienis, A. (1997). Toward Systematic Design of Fault-Tolerant Systems. Computer, 30(4), 51–58.

Avizienis, A., Laprie, J.-C., & Randell, B. (2001). Fundamental Concepts of Dependability. LAAS-CNRS, Technical Report N01145.

Barlow, R. E. (2002). Mathematical reliability theory: From the beginning to the present time. In Lindqvist, B. H., & Doksum, K. A. (Eds.), Mathematical and Statistical Methods in Reliability. World Scientific Publishing.

Barlow, R. E., & Proschan, F. (1967). Mathematical Theory of Reliability. New York, NY: John Wiley & Sons.

Basharin, G. P., Langville, A. N., & Naumov, V. A. (2004). The Life and Work of A. A. Markov. Linear Algebra and its Applications, Special Issue on the Conference on the Numerical Solution of Markov Chains, 386, 3–26.

Birnbaum, Z. W., Esary, J. D., & Saunders, S. C. (1961). Multi-component systems and structures and their reliability. Technometrics, 3(1). doi:10.2307/1266477

Blischke, W. R., & Murthy, D. N. P. (Eds.). (2003). Case Studies in Reliability and Maintenance. Hoboken, NJ: John Wiley & Sons.

Bolch, G., Greiner, S., de Meer, H., & Trivedi, K. S. (2006). Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. John Wiley & Sons.

Cox, D. R. (1989). Quality and Reliability: Some Recent Developments and a Historical Perspective. The Journal of the Operational Research Society, 41(2), 95–101.
Dhillon, B. S. (2007). Applied Reliability and Quality: Fundamentals, Methods and Applications. Berlin, Heidelberg: Springer-Verlag.

Ebeling, C. E. (2005). An Introduction to Reliability and Maintainability Engineering. Waveland Press.

Einhorn, S. J., & Thiess, F. B. (1957). Intermittence as a stochastic process. NYU-RCA Working Conference on Theory of Reliability.

Epstein, B., & Sobel, M. (1953). Life Testing. Journal of the American Statistical Association, 48(263), 486–502. doi:10.2307/2281004

Ericson, C. A., II. (1999). Fault Tree Analysis: A History. In Proc. 17th International Systems Safety Conference.

Erlang, A. K. (1909). The Theory of Probabilities and Telephone Conversations. First published in Nyt Tidsskrift for Matematik B, 20, 131–137.

Esary, J. D., & Proschan, F. (1970). A Reliability Bound for Systems of Maintained, Interdependent Components. Journal of the American Statistical Association, 65(329), 329–338. doi:10.2307/2283596

Gnedenko, B. V., & Ushakov, I. A. (1995). Probabilistic Reliability Engineering (Falk, J. A., Ed.). Wiley-Interscience. doi:10.1002/9780470172421

Gorbenko, A., Kharchenko, A., & Romanovsky, A. (2007). On composing Dependable Web Services using undependable web components. Int. J. Simulation and Process Modelling, 3(1/2), 45–54. doi:10.1504/IJSPM.2007.014714

Gray, J. (1985). Why Do Computers Stop and What Can Be Done About It? Tandem Technical Report 85.7.

Kim, D., Machida, F., & Trivedi, K. S. (2009). Availability Modeling and Analysis of a Virtualized System. In Proc. 15th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) (pp. 365–371). IEEE Computer Society.
Kolmogoroff, A. (1931). Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung [in German]. Mathematische Annalen, 104, 415–458. Springer-Verlag. doi:10.1007/BF01457949

Kotz, S., & Nadarajah, S. (2000). Extreme Value Distributions: Theory and Applications. Imperial College Press. doi:10.1142/9781860944024

Kuo, W., & Zuo, M. J. (2003). Optimal Reliability Modeling: Principles and Applications. Hoboken, NJ: John Wiley & Sons.

Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. In Proc. 15th IEEE Int. Symp. on Fault-Tolerant Computing (pp. 2–11).

Laprie, J. C. (1992). Dependability: Basic Concepts and Terminology. Springer-Verlag.

Leemis, L. M. (2009). Reliability: Probability Models and Statistical Methods. Lawrence Leemis.

Marsan, M. A., Balbo, G., Conte, G., Donatelli, S., & Franceschinis, G. (1995). Modelling with Generalized Stochastic Petri Nets. John Wiley and Sons.

Misra, K. B. (Ed.). (2008). Handbook of Performability Engineering. Springer. doi:10.1007/978-1-84800-131-2

Molloy, M. K. (1981). On the Integration of Delay and Throughput Measures in Distributed Processing Models. Ph.D. Thesis, UCLA.

Moore, E. F. (1958). Gedanken-Experiments on Sequential Machines. The Journal of Symbolic Logic, 23(1), 60. doi:10.2307/2964500

Nahman, J. M. (2002). Dependability of Engineering Systems: Modeling and Evaluation. Springer-Verlag.
Natkin, S. (1980). Les Réseaux de Petri Stochastiques et leur Application à l'Évaluation des Systèmes Informatiques. Thèse de Docteur-Ingénieur, CNAM, Paris, France.
Shetti, N. M. (2003). Heisenbugs and Bohrbugs: Why are they different? DCS/LCSR Technical Report, Department of Computer Science, Rutgers, The State University of New Jersey.
Neumann, J. V. (1956). Probabilistic logics and the synthesis of reliable organisms from unreliable components. In Shannon, C., & McCarthy, J. (Eds.), Automata Studies (pp. 43–98). Princeton University Press.
Smith, D. J. (2009). Reliability, Maintainability and Risk. Elsevier.
O'Connor, P. D. T. (2009). Practical Reliability Engineering. New York, NY: John Wiley and Sons.

Pierce, W. H. (1965). Failure-Tolerant Computer Design. New York, NY: Academic Press.

Proschan, F., & Barlow, R. E. (1975). Statistical Theory of Reliability and Life Testing: Probability Models. New York: Holt, Rinehart and Winston.

Rausand, M., & Hoyland, A. (2004). System Reliability Theory: Models, Statistical Methods, and Applications. New York, NY: John Wiley & Sons.

Sahner, R. A., Trivedi, K. S., & Puliafito, A. (1996). Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package. Kluwer Academic Publishers.

Schaffer, S. (1994). Babbage's Intelligence: Calculating Engines and the Factory System. Critical Inquiry, 21(1), 203–227. The University of Chicago Press.

Shannon, C. E. (1948). A Mathematical Theory of Communication. The Bell System Technical Journal, 27, 379–423, 623–656.
Stott, H. G. (1905). Discussion on "Time-Limit Relays" and "Duplication of Electrical Apparatus to Secure Reliability of Services" at New York.

Stuart, H. R. (1905). Discussion on "Time-Limit Relays" and "Duplication of Electrical Apparatus to Secure Reliability of Services" at Pittsburg.

Symons, F. J. W. (1978). Modelling and Analysis of Communication Protocols Using Numerical Petri Nets. Ph.D. Thesis, University of Essex.

Trivedi, K. S. (2002). Probability and Statistics with Reliability, Queueing and Computer Science Applications. New York, NY: John Wiley & Sons.

Trivedi, K. S., & Grottke, M. (2007). Fighting Bugs: Remove, Retry, Replicate, and Rejuvenate. Computer, 40(2), 107–109.

Trivedi, K. S., & Sahner, R. (2009). SHARPE at the age of twenty two. ACM SIGMETRICS Performance Evaluation Review, 36(4), 52–57. doi:10.1145/1530873.1530884

Trivedi, K. S., Wang, D., Hunt, D. J., Rindos, A., Smith, W. E., & Vashaw, B. (2008). Availability Modeling of SIP Protocol on IBM WebSphere. In Proc. 14th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) (pp. 323–330). IEEE Computer Society.

Ushakov, I. (2007). Is Reliability Theory Still Alive? e-journal Reliability: Theory & Applications, 1(2).
Chapter 4
Trends and Research Issues in SOA Validation

Antonia Bertolino, Consiglio Nazionale delle Ricerche, Italy
Guglielmo De Angelis, Consiglio Nazionale delle Ricerche, Italy
Antonino Sabetta, Consiglio Nazionale delle Ricerche, Italy
Andrea Polini, University of Camerino, Italy
ABSTRACT

Service Oriented Architecture (SOA) is changing the way in which software applications are designed, deployed and maintained. A service-oriented application consists of the runtime composition of autonomous services that are typically owned and controlled by different organizations. This decentralization impacts the dependability of applications that consist of dynamic service agglomerates, and challenges their validation. Different techniques can be used or combined for the verification of dependability aspects, spanning traditional off-line testing approaches, monitoring, and on-line testing. In this chapter we discuss issues and opportunities of SOA validation, we identify three different stages for validation along the service life-cycle model, and we overview some proposed research approaches and tools. The emphasis is on on-line testing, which to us is the most peculiar stage in the SOA validation process. Finally, we claim that on-line testing is only possible within an agreed governance framework.
SOA TESTING: ISSUES AND OPPORTUNITIES

In this chapter we discuss issues that testers have to face in the validation of service-oriented archi-
tecture (SOA). Due to the high dynamism and the multi-organizational characteristics of SOA, the traditional testing process is no longer adequate to guarantee an acceptable level of reliability and trust. At the same time, new opportunities arise for SOA validation. In particular, we argue in favor
of moving testing activities from the laboratory to the field, i.e., towards run-time. This is aligned with the emerging trend of a surrounding open world (Baresi et al., 2006), in which software-intensive systems are more and more pervasive and evolve at an unprecedented pace, the boundaries between software and the external world become fuzzy, and no single organization is in control of a whole system anymore. In the future world, SOA development is going to be fully decentralized and multi-supplier, and services are going to be pervasive, dynamic, autonomic, multi-tenant. Quoting (Baresi et al., 2006), developers recognize that they "must shift from the prevailing synchronous approach to distributed programming to a fundamentally more delay-tolerant and failure resilient asynchronous programming approach". Global behaviors emerge by asynchronous combinations of individual behaviors, and bindings and compositions change dynamically. In such a context, where it will be the norm that services that are "stranger" to each other dynamically connect and collaborate (Bertolino, 2010), how can the behavior of a SOA system be validated? In the remainder of this section, to introduce SOA validation, we revisit some typical assumptions behind traditional testing and discuss why these are no longer obvious in a service-based context. Then, in the second section, we outline a three-stage SOA testing process, comprising off-line, admission and on-line testing, and for each stage we briefly overview issues and trends. In this overview we mention some approaches; however, we warn that the references provided must be considered an illustrative sample, somewhat biased towards the authors' own research, and in no way should the overview be taken as an exhaustive literature survey. As the most novel of the three stages is on-line testing, in the third section we depict its potential application scenarios. Finally, in the fourth section we introduce the framework of SOA test governance that we foresee will facilitate
collaborative SOA validation. The last section draws conclusions.
Software Testing Basic Assumptions

Testing of software systems is typically structured in three phases, namely unit, integration and system testing. The objective of unit testing is to check that the behavior of each single module composing the system corresponds to the intended one. Typically, the module undergoing unit testing is executed in a test environment that mimics the behavior of the other connected modules. On the other hand, the objective of integration testing is to check the correctness of module agglomerates. In this case the focus shifts to checking that some interconnected modules, which have possibly already been verified in isolation, are actually able to interoperate and provide correct results. The final step is usually system testing, which aims at checking that the system as a whole correctly provides the functionality for which it has been conceived and built. In each of the three phases, testing activities can be carried out based on some basic assumptions that in the development of "traditional" software are so obvious as to even remain implicit. We identify three main basic assumptions:

1. Software access
2. Pre-run-time model/specification availability
3. Off-line reproducibility (or no side effects)

The software access assumption foresees that in order to check the various modules, either in isolation or in agglomerates, the tester has full access to the functionality provided by the various elements composing the system. Depending on the applied testing strategy this assumption may go even further, requiring the possibility of accessing the source code (white-box testing). The second assumption concerns the availability, before run-time, of appropriate reference information such as a model or the specification,
for each single module, for module agglomerates and for the whole system. In a traditional setting, the developers' organization generally holds a global view and control of the application under development. So we can say that the final application is to some extent the result of a coordinated and centralized effort for which a specification is available. The availability of a reference model or a specification is at the base of many testing strategies, nowadays referred to as Model-based Testing (Utting & Legeard, 2006), and is important both to guide test selection and to decide which is the correct result to expect; in the presence of a formal model, the latter can also be exploited to automatically derive the test cases. The third assumption refers to the fact that the software life-cycle generally foresees a pre-release stage in which a system under development, and its composing elements, can be manipulated off-line within an artificial test environment. Experiments carried out in this phase produce no permanent effect on the resources used by the system after deployment. Moreover, even after release, it is generally possible to continue to modify the system and experiment with it in a duplicated off-line environment without influencing the status and behavior of the deployed applications.
Issues in SOA Testing

In devising a testing approach within a SOA setting, many of the assumptions that hold true in a traditional development process have to be relaxed, if not discarded. In particular, the three basic assumptions listed above become hardly satisfiable. Assumption 1 affects in particular the activities related to integration and system testing. A service developer could test an individual service in isolation in the laboratory, but the same is not generally possible for service agglomerates and their interactions. In fact, as already remarked, service interaction points are not under the control of a single organization. Besides, given service
dynamic discovery and binding, it is also difficult to know in advance which external services a service under test will be bound to at run-time. Thus, when setting up a SOA testing approach, each single organization typically loses control over the run-time environment: testing rather becomes a collaborative activity by all participating members (see, e.g., (Tsai et al., 2006)). Concerning Assumption 2, in the SOA world we cannot assume anymore that one organization owns the specifications of all interacting services. On the contrary, we can say that removing such a central point of control is somehow an objective of the new paradigm. Precisely, with respect to specifications, two different business models emerge. A first model foresees the production of a service by a business unit to pursue individual organization objectives. Service integrators then access and use external services owned and made available by different organizations. In this case there exists no global specification defining services functionality or how the services should be integrated. The common goal emerges from local decisions taken by each service integrator while using other services. This way of integrating services is typical of the so-called orchestration paradigm of interaction. A different business model considers the existence of a reference global specification called a choreography. As stated in (W3C, 2005), a choreography describes collaborations of participants by defining from a global viewpoint their common and complementary observable behavior, where information exchanges occur, when the jointly agreed ordering rules are satisfied. To play a role within a choreography, a service provider must then develop the service according to the global specification, typically released by a super partes organization. Both models present relevant issues for testing and affect what can be achieved. Specifically, in an orchestration, the behavior of the integrated services is specified only as far as the interface bound
to the orchestrator itself is concerned. Generally, there is no control over how the orchestrator or the integrated services deal with a testing session of the orchestration. For example, dynamic run-time binding could allow for uncontrolled modification of the run-time behavior of a tested service beyond the tester's control. In the case of a choreography, the situation is perhaps easier for the tester, given that all the services were developed taking the same global specification as a reference. Nevertheless, there also exist access control limitations preventing a single organization from observing, and hence testing, the behavior of all the services in a choreography; this impacts Assumption 1. Finally, the off-line reproducibility assumption is hardly applicable in a SOA setting. Services are made available to other parties only through the published access points. In some scenarios it may be possible to apply approaches supporting the testing of service agglomerates in a controlled test environment (see the next section); however, this cannot be considered a general solution for carrying out test experiments. To address this exigency, approaches based on service monitoring and self-healing have been proposed, and Section "SOA Monitoring" discusses them. Nevertheless, such solutions passively observe service interactions and permit recovering, only a posteriori, from situations in which incorrect behaviors have already emerged. In this chapter we want to make a case in favor of service on-line testing strategies, i.e., of techniques that are meant to pro-actively launch selected test cases in the real execution environment. We are aware that the applicability of on-line testing presents crucial issues to be solved. In particular, invocations made for test purposes to services belonging to other organizations can generally produce side effects that are outside of the tester's control. This is certainly true when an invoked service refers to stateful resources to provide its functionality: consider for instance the testing of a booking service while it is serving normal requests. Each test invocation would
produce, as a side effect, the reservation of real resources, with all the consequent problems to be handled at the business level. Indeed, there could be side effects also for services that do not directly refer to stateful resources to provide their functionality. For example, test invocations could result in the payment of a usage fee or could degrade the Quality of Service that has been agreed with other clients. Therefore, in planning an on-line testing strategy it is important to be aware of the need to establish compensation techniques through which the possible side effects can be accepted or avoided. Though obvious, it might be useful to explicitly remark that, in our view, this on-line testing stage is not meant to substitute for or prevent the execution of canonical off-line testing activities where feasible (e.g., unit testing of a service by its developers); nor is it to be seen as a replacement for a continuous monitoring activity, which is vital to keep the system under control. Rather, on-line testing must be conceived as an additional means, combining advantages of both off-line testing and monitoring, to increase service trustworthiness in dynamic and large-scale service interactions.
SOA VALIDATION STAGES

With reference to Figure 1, the horizontal axis covers the typical steps of a service life-cycle: after being designed and developed, a service must be installed and deployed to be made accessible to potential customers. Usually, in order to facilitate its discovery, the service provider can then explicitly request to register the service within a service registry (such as the UDDI (OASIS consortium, 2005)). In this case, we distinguish between the request by a service to be included in the registry and the actual publication. In general, software life-cycle processes foresee that the effort in testing activities starts with the design of a service and ends with its deployment. The solid line in Figure 1 depicts how this
Figure 1. Service life-cycle model (bottom) and testing stages (top)
life-cycle model maps onto the effort spent on testing (i.e., off-line testing). The dashed line in Figure 1 represents the additional testing effort that we advocate is necessary for software services. In particular, we promote the extension of testing activities also to the publication of the service (referred to as the "Admission Testing" stage), and afterwards to live usage and maintenance (i.e., on-line testing and monitoring). Traditional off-line testing mainly concerns the application of testing strategies at service development time, where the service execution is exercised in a simulated or controlled testing environment. On the other hand, the monitoring and on-line testing stages foresee controlling and validating the live usage of a service implementation. Basically, it is assumed that the three stages (i.e., off-line testing, admission testing, and on-line testing and monitoring) proceed in sequential order, since a service is first designed and developed, then deployed, and then used. There are however clear relationships among the stages. In particular, the results of the analyses conducted off-line may be used to guide the on-line validation activities. Moreover, the results of the analyses carried out during on-line validation might provide feedback to the service developer or to the service integrator,
highlighting necessary or desirable evolutions of the services, and might therefore be used for the off-line validation of successive, enhanced versions of the services. For the three introduced stages, in the remainder of this section we overview relevant issues, approaches and tools.
Off-Line Testing

In principle, most of the consolidated testing approaches for traditional systems could be applied as well in the off-line testing stage for SOA. Specifically, in our view and where feasible, the combination of canonical off-line testing approaches is needed to gain confidence that a service under development will deliver the expected functionality. In (Canfora & Di Penta, 2008), the authors extensively survey the features and limitations of a set of existing off-line solutions proposed in recent years. The survey includes approaches for different levels of testing: unit testing of atomic services and service compositions, regression testing, and testing of non-functional properties. Recent approaches to test selection for Web Services are surveyed
instead in Chapter 17 of this book (Bartolini et al., 2011), classified into functional, structural and security-based. Nevertheless, as the three basic testing assumptions are not guaranteed, it is very difficult to validate the service behavior when interacting with external services. Thus, as described in the first section, service developers also need tools and techniques that aim at assessing the integration and the interoperability of services before their final deployment. Among others, we have proposed approaches for reproducing a predictable run-time environment by combining functional and non-functional testing approaches, allowing developers to mock up and test different live usage scenarios. Specifically, Jambition (PLASTIC Consortium, 2008a) is a functional testing tool that relies on a model-based testing approach originating from a sound and well-established formal testing theory (Tretmans, 1996). The key idea of this approach is to exploit as much as possible the behavioral description often available for deployed services to automatically derive a conformance test suite for a service under development. Due to the extreme dynamism of the service domain, many authors have suggested augmenting the service WSDL description with operational specifications in order to characterize services in a richer way. Jambition assumes that such specifications are expressed in terms of Symbolic Transition Systems (STS) (Frantzen et al., 2005). Puppet (Bertolino et al., 2007) is a code generator that supports the automated derivation of the elements necessary to recreate a predictable "live" environment that is suitable for the evaluation of the non-functional properties of the service under test. In other words, Puppet allows testers to automatically generate the required services in such a way that they yield the expected functional and non-functional behavior with respect to a given specification. In particular, concerning the non-functional specification, Puppet requires that
the foreseen Quality of Service in the testbed is based on the WS-Agreement language (Forum, 2005). In its complete version, Puppet emulates the functional behavior of a service in the testbed by means of Jambition (Bertolino et al., 2008). As introduced in Section “Issues in SOA Testing”, the orchestration paradigm introduces both pros and cons with respect to testing. For example, the Orchestration Participant Testing (De Angelis et al., 2010) (opt) strategy provides a general approach to derive test cases for services to be composed within an orchestration. As a result, the derived test cases aim at assessing if a service will behave correctly when integrated within an orchestration. Therefore, a fail outcome of a test execution does not necessarily lead to the identification of a fault in the integrated services; instead it just shows that the service should not be integrated in the considered orchestration. The BPEL Participant Testing (BPT) (BPT homepage, 2010) tool is an available open-source implementation of the opt strategy. Specifically, the tool generates test suites for service participants assuming that the orchestration specification is expressed using the BPEL language (Jordan & Evdemon, 2007).
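To illustrate the idea of testing against a predictable environment (without reproducing the actual Puppet or Jambition code, whose APIs are not shown here), the following toy stub emulates a partner service whose latency is kept within agreed bounds; all names and the latency figures are our own assumptions:

    import random
    import time

    # Hypothetical, simplified stand-in for a generated testbed stub: it
    # emulates the agreed QoS (latency bounds) of a partner service, so the
    # service under test can be exercised against a predictable environment.
    # This is our own illustration of the idea, not any tool's actual code.

    class StubService:
        def __init__(self, min_latency_s, max_latency_s, canned_reply):
            self.min_latency_s = min_latency_s    # e.g., from an SLA term
            self.max_latency_s = max_latency_s
            self.canned_reply = canned_reply      # mocked functional behavior

        def invoke(self, request):
            # Sleep for a latency within the agreed bounds, then answer.
            time.sleep(random.uniform(self.min_latency_s, self.max_latency_s))
            return self.canned_reply

    # A partner that promises responses between 50 ms and 200 ms:
    partner = StubService(0.05, 0.2, {"status": "OK"})
    print(partner.invoke({"op": "getQuote"}))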
Admission Testing

The SOA foresees the existence of a service directory that is used by services to search for and obtain references to each other. The idea behind admission testing is to have any service applying for registration undergo a preliminary testing stage, based on which the actual insertion of the service into the directory is decided. If a service fails to show the required behavior, its registration in the directory is not granted. In this sense, in (Bertolino et al., 2008) we called such a framework "Audition", as if the service undergoes a monitored trial before being put "on stage". It is worth noting that, from a scientific point of view, the implementation of the framework does not require the introduction of novel testing approaches.
On the contrary, existing software tools (such as test generators) can be reused. To be able to automatically derive test cases for the services asking for registration, the framework relies on an information model that should provide some description of the service's expected behavior. Such an information model should be provided to the service registry when the service asks for inclusion, and should be suitable for automatic test case derivation. These requirements may have important consequences for the applicability of the framework in a real setting, and presuppose that a SOA test governance context (as we discuss in the fourth section) is in place. Nevertheless, different variations of admission testing can be derived, for instance by relaxing the requisite on automatic derivation of test cases and using instead predefined, static test suites stored in the registry.

Figure 2. The Audition Framework

Figure 2 depicts the process subsumed by the Audition framework (transitions are numbered in logically sequential order). The process starts with a service S1 asking the directory service to be linked in. In addition to the WSDL, S1 provides a behavioral model of the offered service. The registry stores S1, but marks the status of its registration as "pending" (services whose registration is pending are not yet discoverable by querying the registry). At the same time, the WSDL and the behavioral model of S1 are sent
to a Testing Driver Service (TDS) for automatic test case synthesis. During the audition, the TDS invokes S1, acting as the driver of the test session and checking whether the service behaves according to the behavioral model provided at registration. In case of errors the TDS notifies the registry to abort the registration. Also, in case S1 queries the registry to discover a service Sm, the registry redirects S1 to an ad-hoc, controlled implementation of Sm that is automatically generated by means of an appropriate Service Factory. As a consequence, the Proxy/Stub version of the service Sm checks the content and the order of any invocation made by S1; in case a violation of the specification of the invoked service is detected, the Proxy/Stub informs the registry service that S1 is not suitable for registration. WS-Guard (Ciotti, 2007) is a prototype implementation of an enhanced version of a UDDI registry augmented with audition testing services. Admission testing clearly raises issues regarding invocations to fully-operating services (as opposed to services being auditioned). As discussed in the introduction, this may be particularly challenging if the invoked services are related to stateful resources.
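The admission flow can be summarized with the following schematic sketch; the classes, the toy test-derivation step, and the behavioral-model format are our own simplifications and do not reflect the WS-Guard implementation:

    from collections import namedtuple

    # Schematic sketch of the admission-testing flow (all names are our own):
    # a registry keeps a new service "pending" until a test driver has
    # exercised it against its published behavioral model.

    Test = namedtuple("Test", "input expected")

    def derive_tests(behavioral_model):
        # Toy stand-in for automatic test derivation: here the behavioral
        # model is simply a list of (input, expected output) pairs.
        return [Test(i, o) for i, o in behavioral_model]

    class TestDriver:
        def audition(self, behavioral_model, endpoint):
            # Drive the applicant service and fail on the first deviation
            # from the behavior declared at registration time.
            return all(endpoint(t.input) == t.expected
                       for t in derive_tests(behavioral_model))

    class Registry:
        def __init__(self, test_driver):
            self.test_driver = test_driver
            self.entries = {}                  # service name -> status

        def register(self, name, wsdl, behavioral_model, endpoint):
            self.entries[name] = "pending"     # not discoverable yet
            ok = self.test_driver.audition(behavioral_model, endpoint)
            self.entries[name] = "published" if ok else "rejected"
            return self.entries[name]

    # Usage: a toy service that doubles numbers, auditioned against its model.
    registry = Registry(TestDriver())
    print(registry.register("doubler", wsdl="...",
                            behavioral_model=[(1, 2), (3, 6)],
                            endpoint=lambda x: 2 * x))   # -> "published"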
On-Line Testing

In the first section we discussed several difficulties in applying testing techniques before live usage, whereby we suggested extending the testing phase to run-time. We distinguish between proactive and passive testing, a.k.a. monitoring. In some cases, on-line testing and monitoring may be simplistically considered overlapping. However, in our research we recognized that the two approaches actually have different goals, different techniques, different issues, and, in most cases, even different motivations. Specifically, monitoring approaches focus on observing how the system behaves spontaneously. In this sense,
in the literature it is also referred to as passive testing. Instead, on-line testing is considered a proactive process in which testers launch designed test cases in order to validate selected behaviors. The latter can discover potential problems before they manifest themselves to the service client, and becomes most convenient, of course, when on-line repair strategies are available. Even though monitoring and on-line testing are different, they share a common main goal: to dynamically reveal possible deviations from the expected behavior due to run-time scenarios that were not foreseeable at design and development time. In SOA validation, several causes may lead to unexpected scenarios, among others: the release of a new version of an existing service, the change of a service implementation, the run-time binding with another service that is supposed to change dynamically, or the emergence of unplanned usage patterns from a service client. Different scenarios can be envisaged for the application of on-line testing. For example, let us consider a service choreography regulating the access to some shared resources. Access to each resource is protected by means of a combination of access control policies. In recently developed architectures, such as (Pacyna et al., 2009), such access control policies can be dynamically defined, changed and updated. Hence the actors in the choreography (i.e., users and service providers) could dynamically change the access policies to their resources in unpredictable ways. In such scenarios, we foresee that the services participating in a choreography can be regularly submitted to on-line testing in order to assess that they keep complying with their public and manifested access policies. (role)cast is a framework developed in the context of the European Project TAS3 to support on-line testing in choreographies that include authentication/authorization/identification mechanisms. Specifically, (role)cast forwards application-level messages (i.e., the payload of the SOAP body message) to the target services
including into the transport protocol (i.e., the SOAP envelope) the authorization credentials that are defined by a testing plan. In very general terms, authorization credentials consist of information signed by trusted authorities that certifies specific attributes/properties of the sender. Such information will be used by the resource/service provider to grant or deny the provisioning of the service or the access to a resource. Furthermore, let us consider another scenario, aimed at testing whether the run-time binding of a given service with other services wrongly affects its manifested functionality. Even though in this scenario the dynamism of the authorization mechanisms may not be crucial, services in live usage usually still abide by some authorization mechanism. Thus, functional on-line testing should combine service invocations at the application level with authorization credentials that refer to the access policy of the resources under test. In other words, when invoking a method of a service, the functional on-line test driver must also hold the rights to access that resource for the specific behavior; otherwise the testing invocation will be stopped by the policy control system. In this sense, (role)cast can be used both to mock up identities when testing the access control policy of a service in a choreography, and to provide a configurable authorization layer that testers can reuse to authenticate service interactions in various kinds of tests (e.g., functional and non-functional testing).
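The general mechanism, wrapping an unchanged application payload into an envelope whose header carries the credentials prescribed by the test plan, can be sketched as follows. This is not (role)cast code: the element names and the credential format are invented for illustration (a real deployment would rather carry, e.g., signed assertions in a standard security header):

    # Minimal illustration (our own) of injecting test-plan credentials into
    # the SOAP envelope while forwarding the application payload unchanged.

    SOAP_TEMPLATE = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
      <soap:Header>
        <Credential issuer="{issuer}" role="{role}">{token}</Credential>
      </soap:Header>
      <soap:Body>{payload}</soap:Body>
    </soap:Envelope>"""

    def wrap_for_test(payload_xml, test_plan_step):
        """Attach the credentials defined by the current test-plan step."""
        cred = test_plan_step["credential"]    # e.g., signed by a mock IdP
        return SOAP_TEMPLATE.format(issuer=cred["issuer"], role=cred["role"],
                                    token=cred["token"], payload=payload_xml)

    step = {"credential": {"issuer": "test-idp.example.org",
                           "role": "nurse", "token": "eyJhbGciOi..."}}
    print(wrap_for_test("<getRecord patient='42'/>", step))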
SOA Monitoring

We have already argued how the characteristics of systems obtained by composition of existing services, which are typically provided by third parties, are exceptionally difficult to predict and to guarantee. As a consequence, service-oriented systems make extensive use of runtime assessment approaches, such as those based on monitoring. Generally speaking, the goal of a monitoring system is twofold: (1) to observe the operation
of the subject system and (2) to support the interpretation of such observation in a way that is useful for the specific purposes of monitoring. These purposes may be different, and include systems management, security assessment and enforcement, QoS and SLA monitoring, and support for autonomic mechanisms to enable self-tuning, self-repair and other self-* capabilities. Besides these "administration uses" of monitoring, it is a very natural means to support passive testing of service-oriented systems. In this context, the basic idea underlying monitoring is to add observation mechanisms to the SUT (and to the underlying runtime platform) so as to gather evidence of its actual behavior and compare it with the expected behavior. The comparison is usually performed alongside normal system operation, i.e., on-line. The evidence obtained by monitoring usually takes the form of execution traces, when testing functional characteristics, or of QoS measures, when testing non-functional characteristics. By examining the high-level structure of monitoring systems, it is possible to identify a few key functionalities that are common to most existing monitoring solutions. Raw event collection concentrates on the collection of raw information from the execution of the observed components. This is done separately on each component of the system and can be implemented according to varying degrees of intrusiveness. Raw events are subject to local filtering, which discards redundant or irrelevant information before passing it on to an event storage and brokering component. This component is responsible for gathering pre-filtered monitoring information from the different originating sources at a central (possibly not unique) aggregation node. At this point, global interpretation takes place, to make sense of the pieces of low-level information coming from several sources. Global interpretation transforms this raw information into high-level, "business-level" events, which can be detected
only at an aggregated level. This activity is also commonly referred to as "(event) correlation". This general monitoring infrastructure is suitably refined in the case of service-oriented systems. Concerning the collection of raw events, in the most general case services are only accessible (and observable) through their interfaces; therefore an interception mechanism, such as a proxy, is used. On the other hand, if it is possible to have read access to the monitoring facilities provided by the execution environment, these can be used as a source of raw events. In the simplest case, log files can be used as sources. The built-in support for monitoring offered by execution environments is exploited, e.g., in Dynamo-AOP (Bianculli & Ghezzi, 2008). Dynamo-AOP focuses on the functional behavior of orchestrated services, and provides support to augment orchestrating services with checks, in order to verify that the orchestrated services conform to the expected behavior. Other approaches aim at monitoring the QoS of web services. For example, SLAngMon (Raimondi et al., 2008) implements a mechanism to automatically generate on-line checkers of Service Level Agreements (SLAs) starting from a formal specification of their expected (correct) behavior; the specification approach is founded on timed-automata theory. SLAngMon checkers are implemented as modules of the execution environment and target the observation of single services; temporal correlation is done locally. Further information on Dynamo-AOP and SLAngMon, together with their prototype implementations, is available from (PLASTIC Consortium, 2008b). A novel trend that is arising in service-oriented enterprise systems is the idea of engineering monitoring more systematically into new applications. Because of the high costs entailed by adding monitoring in an ad-hoc fashion, monitoring itself is increasingly considered as a crosscutting functionality (Wetzstein et al., 2010)
for which generic instrumentation mechanisms are provided in order to address ever-changing monitoring needs. This view is motivated by, and fits perfectly, the needs of enterprise-level systems, where complex (crosscutting) functionalities, such as monitoring for testing purposes, are more and more delegated externally (Wetzstein et al., 2010). When addressing monitoring across enterprise boundaries, a common trend is to use a publish-subscribe event brokering backbone based on a message-oriented infrastructure (e.g., an Enterprise Service Bus, ESB) (Eugster et al., 2003) coupled with rule engines (Browne, 2009) to implement complex event detection (Luckham, 2001). Besides the usual concern of the overhead caused by the observation infrastructure, cross-organization monitoring stresses other important issues, such as those related to the privacy and security of monitoring information (Chow et al., 2009).
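The collection, filtering, and correlation stages described above can be sketched with a toy example; the event shapes, the latency bound, and the 3-out-of-5 violation rule are all invented for illustration:

    # Toy sketch (our own) of the monitoring pipeline described above:
    # raw event collection -> local filtering -> aggregation -> correlation
    # into a "business-level" event (here: an SLA-violation alarm).

    from collections import deque

    def local_filter(event):
        """Discard irrelevant raw events before forwarding them."""
        return event["kind"] == "response" and "latency_ms" in event

    class Correlator:
        """Raise a business-level event when 3 of the last 5 responses
        from the same service exceed the agreed latency bound."""
        def __init__(self, bound_ms=200, window=5, threshold=3):
            self.bound_ms, self.threshold = bound_ms, threshold
            self.window = window
            self.windows = {}                  # service -> recent booleans

        def feed(self, event):
            w = self.windows.setdefault(event["service"],
                                        deque(maxlen=self.window))
            w.append(event["latency_ms"] > self.bound_ms)
            if sum(w) >= self.threshold:
                return {"kind": "SLA_VIOLATION", "service": event["service"]}
            return None

    correlator = Correlator()
    raw = [{"kind": "response", "service": "quote", "latency_ms": ms}
           for ms in (120, 250, 300, 180, 400)]
    for ev in filter(local_filter, raw):
        alarm = correlator.feed(ev)
        if alarm:
            print(alarm)                       # fires on the third slow response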
ON-LINE TESTING WITHIN FEDERATIONS

The "Future Internet" will be shaped by service federations. A service federation is a network of services, possibly belonging to different organizations, whose members define rules (both technical and organizational) to be followed by all to pursue common wellness. It constitutes a social world within which it is then necessary to put in place some organization to govern the interactions among the participating services, aiming at assuring that everyone abides by the agreed social rules. This set of rules and the associated policies and responsibilities form the SOA governance (see the fourth section). In an often-used metaphor, a service federation and its associated governance correspond somehow to a country and its government. In every country there are laws that must be obeyed by citizens; citizens are responsible and legally pursuable for law violations. Similarly, within a
federation, providers should put in place services that respect the rules. Services are thus the provider's alter ego, developed to realize the provider's intentions; at the same time, services are the instruments used to pursue the provider's objectives. In some cases a misbehaving provider could judge that violating a rule is more beneficial to its objectives, and could therefore maliciously alter the behavior of a deployed service. In this scenario, on-line testing can play a role similar to that of public officers within a country. In particular, we can imagine that the federation inserts into the service infrastructure specific mechanisms for monitoring and checking the behavior of services. In order to discover misbehaving services, the federation can then conduct on-line testing sessions on a "suspicious service" in order to verify that it behaves according to the rules. A key issue for federated services is to establish and maintain trust among the involved organizations. On-line testing can be considered a good opportunity to increase the quality of federated services and of their interactions, and can play the important role of trust enhancer. In our vision, the federation will also publish specific rules, typically concerning different aspects of a service implementation, by which the service provider will have to abide in order to make on-line testing possible. The identification of such mechanisms and of compensation techniques will be, in our opinion, an interesting research domain. For instance, the federation could establish that when a service asks for registration within a federated directory service it will be preventively submitted to an on-line testing session (the concept of Audition testing proposed in (Bertolino et al., 2006) and illustrated in Section "Admission Testing"). Such testing activities will possibly be supported by ad-hoc mechanisms in the infrastructure adopted by the federation. Particularly interesting is the case in which the federation also defines choreographies to be referred to by members when developing their services. In this case the federation members have a clear
interest in checking that a service complies with its role within the choreography. Two different scenarios of on-line testing can emerge in service federations, leading to different kinds of on-line testing support: (i) test-aware and (ii) test-unaware. In the first scenario, a service is aware of, and willing to undergo, an on-line test session, in view of the common good. Such a scenario is acceptable when the tester has no reason to suspect that the service under test will maliciously change its behavior during a testing session with respect to the behavior shown during normal usage. In such a setting, managing side effects becomes a cooperative effort in which all the involved services take part; for instance, copies of stateful resources could be introduced and used in test invocations. A more challenging scenario emerges when service providers and their deployed services may exhibit, during a testing session, behavior that is unfaithful to real life. This could happen either because of malicious intentions of the service provider, or unintentionally, possibly as an effect of the very collaborative mechanisms used for on-line testing purposes, e.g., to avoid permanent effects caused by test invocations. For instance, a booking portal could favor an airline company by returning more entries for flights operated by that airline, in violation of a federation-specified choreography that requires fair and unbiased information provision. Or a service providing access to sensitive data could incorrectly disclose data in violation of the declared access policies, for instance as a consequence of a role that emerged dynamically at run time and was not considered at design time. In this second scenario the service under test should not be able to distinguish an on-line test invocation from a real one. Therefore it is important that the on-line testing invocation can faithfully mock the behavior and the identity of a real service within the federation. This poses
interesting challenges, for example how to provide a test service with a realistic, trustable identity. The (role)cast tool introduced in Section "On-line Testing" provides an example of a framework that can support testers in dealing with authorization layers within service federations. In either case, test-aware or test-unaware, the sine qua non condition for on-line testing remains that a governance agreement is reached and established within the federation. We motivated SOA test governance earlier in (Bertolino & Polini, 2009) and discuss such aspects in the following section.
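To make the test-aware scenario more concrete, the sketch below outlines a testability contract that a federation's governance rules could require registered services to implement. Every name in it is hypothetical; no standard defines such an interface.

```java
/**
 * Hypothetical testability contract for test-aware on-line testing in a
 * service federation. All method names are illustrative assumptions.
 */
public interface FederatedTestable {

    /**
     * Opens a test session: the service snapshots its stateful resources
     * and routes subsequent invocations carrying the returned token to the
     * copies, so that test invocations leave no permanent effects.
     */
    String openTestSession(String testerId);

    /** Discards the resource copies and any state produced by the tests. */
    void closeTestSession(String sessionToken);

    /** Lets the federation's monitor check whether a session is active. */
    boolean isUnderTest();
}
```

In the test-unaware scenario no such interface can be assumed, precisely because any visible test-session machinery would let the service detect that it is under test and adapt its behavior accordingly.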
SOA TEST GOVERNANCE

SOA development is a collaborative effort (Tsai et al., 2006), and therefore every phase, including testing, needs to be redefined as an activity shared among the involved stakeholders. Indeed, while the SOA paradigm is claimed to achieve distributed system integration more cheaply and easily, this ease of use on the client side is counterbalanced by a rather complex structure on the server side. We find the following remark from (Windley, 2006) relevant:

Counterintuitive as it may seem, SOA requires more organizational discipline than previous development models. Your intuition might tell you that flexibility results from fewer rules, not more, but that's not the case.

This concern also applies to SOA validation. In (Bertolino & Polini, 2009), we defined the concept of SOA Test Governance (STG) as the subset of SOA governance concerning the establishment and enforcement of the policies, procedures, notations and tools required to enable distributed and on-line testing of service integrations. SOA governance is introduced in the next subsection, while Section "Dealing with SOA Test Governance" discusses the concept of STG. The last subsection presents our abstract framework
providing high-level guidelines towards the establishment of SOA Test Governance.
SOA Governance

SOA Governance consists of the "development and enforcement of SOA policies and procedures" (Windley, 2006), where in turn a SOA policy consists of a set of design rules combined with enforcement. Nowadays, in a business-oriented view, SOA governance is generally considered at the intra-organization level, i.e., it regulates the SOA development process so that the organization's goals are achieved through the contribution and interoperation of the various business units within the organization. However, in our view, the most adequate perspective from which to look at SOA governance is the cross-organization level: SOA governance should establish the conditions to achieve end-to-end interoperability across different organizations and different platforms. In such a vision, the establishment of a governance framework also calls for the inclusion of trusted and authoritative third parties that govern the relations between participating organizations. It also becomes clear that governance combines technical aspects with social aspects, for example conflict resolution and legal disputes. A set of typical stakeholders in a SOA governance framework, with their respective roles and activities, is discussed in Table 1. Different key components make up SOA governance at design time and at run time (Linthicum, 2008). Design-time SOA governance applies to the whole service life cycle. It relies heavily on standards: the adopted or created standards constitute what is referred to as an Interoperability Framework (IF). To enforce governance at run time, policies and procedures should be embedded into the service management system. For instance, a central role in SOA governance can be played by a registry or a repository; the alternative is to mix the rules for governance with service functionality, which makes services less flexible.
Table 1. Stakeholders in a SOA governance framework

Service User: The organization that accesses and uses a service deployed by a provider. The user has to abide by the established rules when using a service; at the same time, it can assume that providers have to stick to similar rules.

Service Developer: The organization that develops a service, possibly as a consequence of a request made by a service provider. Governance within a given context could impose development standards, the implementation of specific interfaces, and so on.

Service Provider: The organization that deploys the service on a server and makes it available to external organizations. In the case of a pay-per-use service, it is the provider who receives payment; the provider's main objective is thus a service that users appreciate, so as to continuously increase its usage. A provider can play the role of user when one of its services needs to interact with a service deployed by another provider. Here too, the governance could impose rules on the way a service is provided, for instance requiring that specific information be published.

Service Integrator: The organization that composes a set of services in order to derive a complex application. The integrator is influenced by the governance, which can define rules to be followed in the integration.

Directory Provider: The organization that provides a directory and discovery service. The directory provider plays a fundamental and strategic role within a SOA infrastructure: it should behave fairly, providing references to registered services without any direct economic interest. Since the directory service is a highly trusted party, it can be assigned specific active governance roles, for instance checking that registered services provide the interfaces required for testability purposes.

Choreography Specifier: Within a specific domain, the various actors may decide to create a body whose mission is to standardize the way applications cooperate, in terms of application-level interfaces and interactions. This organization defines choreographies specifying the behavior of the services that participate in them in order to reach a specific goal. It is a special kind of governance "agency" that poses rules on service behavior to reach specific goals.

Trusted Third Parties: Trusted parties providing specific services that support tasks such as authentication, security, and so on. They all operate according to the rules defined by the governance and could be deputized to fulfill specific governance roles.

Monitor and Enforcer: Specific organizations that can use mechanisms within a SOA infrastructure to check that what is happening is in line with what the governance specifies. The governance has to define in advance which information is relevant for monitoring and which specific powers the "monitor and enforcer" organization holds.

Governance Board: The body deputed to define the rules, typically constituted by experts from organizations willing to cooperate or interested in establishing an environment to be used for cooperation. The governance board also defines the various roles and assigns specific tasks to each.
An instrument used to formally establish mutual responsibilities and warranties between a service provider and a service consumer is the Service Level Agreement (SLA). In addition to specifying the level of service, in terms of availability, performance, or other attributes of the service, the SLA also establishes costs and penalties in case of non-compliance. SLAs are therefore also an integral part of a SOA governance framework.
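As a minimal illustration of the run-time side of an SLA, the sketch below computes the penalty owed for a billing period from an availability guarantee. The 99.5% target and the flat penalty used in the usage note are invented example figures; real agreements (e.g., WS-Agreement documents) are considerably richer, covering multiple metrics, tiers, and obligations.

```java
/**
 * Minimal sketch of the compliance-checking side of an availability SLA.
 * All figures passed in are example values, not taken from any standard.
 */
public final class AvailabilitySla {

    private final double guaranteedAvailability; // e.g. 0.995
    private final double penaltyPerViolation;    // cost charged to the provider

    public AvailabilitySla(double guaranteedAvailability, double penaltyPerViolation) {
        this.guaranteedAvailability = guaranteedAvailability;
        this.penaltyPerViolation = penaltyPerViolation;
    }

    /** Availability observed over a billing period, as uptime / total time. */
    public double observed(long uptimeMillis, long periodMillis) {
        return (double) uptimeMillis / periodMillis;
    }

    /** Penalty owed for the period: zero when the guarantee was met. */
    public double penalty(long uptimeMillis, long periodMillis) {
        return observed(uptimeMillis, periodMillis) < guaranteedAvailability
                ? penaltyPerViolation
                : 0.0;
    }
}
```

For example, new AvailabilitySla(0.995, 1000.0).penalty(uptime, period) would charge a flat 1000-unit penalty for any billing period in which availability fell below 99.5%.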
The risk for governance is that it can be too restrictive and overwhelming; if well managed, however, it genuinely facilitates interoperability. Governance rules should be conceived and implemented so that applying them is easier than breaking them.
Dealing with SOA Test Governance

The concept of STG makes explicit the fact that service testing involves a collaborative approach among different stakeholders. It sets the stage for
explicitly defining its constituent components and the respective tasks. A useful first sketch of STG can be obtained by mapping a standard generic testing process onto the SOA paradigm. We do this by taking as a reference the multi-layer test process currently being defined by ISO/IEC JTC1/SC7 WG26 in Part 2 of ISO/IEC 29119 (Working Group 26 ISO/IEC JTC1/SC7 Software and Systems Engineering committee, 2008), the new international standard for software testing. A first draft of the testing process, recently released, includes four layers, namely:
1. Organizational Test Policy;
2. Organizational Test Strategy;
3. Project Test Management Level;
4. Test Levels.
The topmost layer is where an organization decides the scope and objectives of testing and the regulating policy. It also establishes organizational testing practices and provides a framework for establishing, reviewing and continually improving the three subsequent layers. If we try to map such a layer onto SOA development, where any established testing policy and rules have to be fulfilled by the different companies participating in the SOA, it is evident that this layer goes beyond the borders of a single organization. An international association representing the interests of a community, such as the OSGi Alliance2 or the WS-I Organization3, would probably have to take over such a role. Currently, no initiative in the SOA world resembles this layer. As already made explicit in ISO/IEC 29119, important activities that such an organization should carry out are publishing the test policy and gaining consensus on it. Whereas the cited standard speaks of establishing a testing policy within an organization, within SOA such activities should be carried out at the cross-organization level; for example, the test
policy should be made public and agreed upon by all those accessing a set of services. Layer 2 is where the policy is actualized into a strategy. The activities that constitute the establishment of a test strategy correspond one-to-one to those included in layer 1: the test strategy must be created, agreed, published and maintained. However, while a test policy is a high-level statement providing general directives, the test strategy is a more concrete and detailed document, which identifies, among other topics, the types of testing, the techniques and tools adopted, and the completion criteria (see (Working Group 26 ISO/IEC JTC1/SC7 Software and Systems Engineering committee, 2008), pp. 28-29). In the context of an organization, the purpose of a test strategy is to establish the testing practice and testing guidelines to be used on all projects. Moving to a SOA context, a test strategy is the place where an implementation plan of a test policy needs to be established. Organizations that adhere to the test policy must abide by the guidelines and techniques defined there. We speculate that, given a generic SOA test policy, communities of service developers and users could agree on a test strategy for interoperability that fulfills the shared policy. Agreed-upon test strategies could then be standardized. Layer 3 describes the testing activities compliant with the organization strategy for a specific project. The project test manager makes plans, identifies risks, instantiates the test strategy, and decides on staffing and scheduling. Moreover, appropriate mechanisms for monitoring test progress must be put in place, in order to decide on completion and to check that the test plan is adhered to. In the context of SOA testing, the analogy with a specific project varies depending on whether the application is obtained as an orchestration or as a choreography. In the former case, the role of the project test manager is served by the service orchestrator, which builds a composite application by aggregating different available services. Referring to a standardized and shared test
strategy, the service orchestrator decides how to develop a specific test plan. Within a choreography-based setting, there is no stakeholder with central control over the application; hence the project test management level must itself be embedded into the choreography. One possible STG scenario following this case is the one presented in Section "Admission Testing". Finally, layer 4, Test Levels, regulates how testing is carried out at the different levels (such as unit, integration, acceptance). In the case of SOA, the different levels of testing pertain to different actors belonging to different organizations. The establishment of, and adherence to, a common shared governance framework can regulate the test activities carried out independently by the individual organizations, so that they all contribute to improved SOA effectiveness and reliability.
An Abstract Framework for SOA Test Governance

As SOA becomes pervasive, services agglomerate into federations, and service applications grow to ultra-large scale, we anticipate that many STGs will be established, resulting in the definition of different SOA infrastructures; a partner willing to enter one will likely have to agree to the rules defined by its STG. For instance, within a given infrastructure (i.e., a service federation), the STG could require that a service entering the scene provide an additional interface for testing purposes. Figure 3 depicts the abstract STG framework introduced in (Bertolino & Polini, 2009). The framework provides high-level guidelines towards the establishment of SOA Test Governance. Furthermore, according to the principles discussed in the previous subsection, the framework defines a general process, and its key actors, that either service federations or third-party services should instantiate in order to implement and run an STG.
As shown, we distinguish three stages: Foundation, Design, and Runtime. Of course, the activities, rules and processes performed in the three stages impact and influence each other. An instantiation of the framework in a specific domain and context should identify:

• the participating actors, e.g., who is involved in the governance board, who is appointed to monitoring and enforcement, how developers and providers enter the scene, and so on;
• the policy and the strategy to be followed by the relevant actors to design, deploy, publish, discover and run the services;
• how and where the testing is documented, regulated and then executed;
• the mechanisms and instruments the STG monitor and enforcer needs to fulfill its tasks.
CONCLUSION

We have discussed issues and research trends in the testing of SOA applications. SOA validation spans three stages:

1. Off-line testing, which corresponds to traditional testing in the laboratory by the service developers. We have explained that this testing requires mocking the surrounding services with which the service under test cooperates.
2. Admission testing, in which services are submitted to a check in the deployment environment when asking for registration; we have outlined a framework in which this validation role is played by an augmented registry.
3. On-line testing and monitoring: the former prolongs testing into live usage by proactively launching test cases in the real execution environment; the latter performs passive testing by observing how the services behave spontaneously.
Figure 3. STG Framework
On-line testing is the most novel of the approaches: it poses interesting research issues, especially when authorization and authentication policies are enforced within service federations. We believe that this approach can be highly effective in addressing SOA validation needs, but it is only feasible in the context of an agreed SOA Test Governance framework such as the one we introduced. We are currently working on the implementation of an on-line testing framework within the TAS3 project, whose aim is to develop a Trusted Architecture for Securely Shared Services. We are also refining the definition of STG, which we intend to apply to support the validation of
ultra-large-scale service choreographies within the new FP7 project CHOReOS.
ACKNOWLEDGMENT

This chapter provides an overview of work that has been partially supported by the European Project FP7 IP 216287: TAS3, by the European Project FP7 IP 257178: CHOReOS, and by the Italian MIUR PRIN 2007 Project D-ASAP.
REFERENCES

W3C (2005). Web Services Choreography Description Language (ver. 1.0 ed.). W3C.
Baresi, L., Di Nitto, E., & Ghezzi, C. (2006). Toward open-world software: Issues and challenges. Computer, 39, 36–43. doi:10.1109/MC.2006.362

Bartolini, C., Bertolino, A., Lonetti, F., & Marchetti, E. (2011). Approaches to functional, structural and security SOA testing. Chapter 17 in this book.

Bertolino, A. (2010). Can your software be validated to accept candies from strangers? Keynote at the SPICE 2010 Conference, 18-20 May 2010, Pisa, Italy.

Bertolino, A., De Angelis, G., Frantzen, L., & Polini, A. (2008). The PLASTIC framework and tools for testing service-oriented applications. In (De Lucia & Ferrucci, 2009), (pp. 106–139).

Bertolino, A., De Angelis, G., Frantzen, L., & Polini, A. (2008). Model-based generation of testbeds for Web services. In Proc. of the 20th IFIP Int. Conference on Testing of Communicating Systems (TESTCOM 2008), LNCS. Springer.

Bertolino, A., De Angelis, G., & Polini, A. (2007). A QoS test-bed generator for Web services. In Proc. of ICWE 2007, number 4607 in LNCS. Springer.

Bertolino, A., Frantzen, L., Polini, A., & Tretmans, J. (2006). Audition of Web services for testing conformance to open specified protocols. In Architecting Systems with Trustworthy Components, volume 3938 of Lecture Notes in Computer Science, (pp. 1–25). Springer.

Bertolino, A., & Polini, A. (2009). SOA test governance: Enabling service integration testing across organization and technology borders. In WebTest '09: Proc. IEEE ICST Workshops, (pp. 277–286). Washington, DC, USA: IEEE CS.

Bianculli, D., & Ghezzi, C. (2008). Dynamo-AOP user manual. PLASTIC EU Project.

BPT homepage (2010). http://bptesting.sourceforge.net/.
Browne, P. (2009). JBoss Drools Business Rules. Packt Publishing.

Canfora, G., & Di Penta, M. (2008). Service-oriented architectures testing: A survey. In (De Lucia & Ferrucci, 2009), (pp. 78–105).

Chow, R., Golle, P., Jakobsson, M., Shi, E., Staddon, J., Masuoka, R., & Molina, J. (2009). Controlling data in the cloud: Outsourcing computation without outsourcing control. In CCSW '09: Proceedings of the 2009 ACM Workshop on Cloud Computing Security, (pp. 85–90). New York, NY, USA: ACM.

Ciotti, F. (2007). WS-Guard - Enhancing UDDI registries with on-line testing capabilities. Master's thesis, Department of Computer Science, University of Pisa.

De Angelis, F., De Angelis, G., & Polini, A. (2010). A counter-example testing approach for orchestrated services. In Proc. of the 3rd International Conference on Software Testing, Verification and Validation (ICST 2010), (pp. 373–382). Paris, France: IEEE Computer Society.

De Lucia, A., & Ferrucci, F. (Eds.). (2009). Software Engineering, International Summer Schools, ISSSE 2006-2008, Salerno, Italy, Revised Tutorial Lectures, volume 5413 of Lecture Notes in Computer Science. Springer.

Eugster, P. T., Felber, P. A., Guerraoui, R., & Kermarrec, A.-M. (2003). The many faces of publish/subscribe. ACM Computing Surveys, 35(2), 114–131. doi:10.1145/857076.857078

Global Grid Forum (2005). Web Services Agreement Specification (WS-Agreement) (Version 2005/09). OGF.

Frantzen, L., Tretmans, J., & Willemse, T. (2005). Test generation based on symbolic specifications. In Grabowski, J., & Nielsen, B. (Eds.), FATES 2004, number 3395 in LNCS (pp. 1–15). Springer.
Jordan, D., & Evdemon, J. (2007). Web Services Business Process Execution Language version 2.0. Technical report, The OASIS Consortium.

Linthicum, D. (2008). Design & validate SOA in a heterogeneous environment. ZapThink White Paper, WP-0171.

Luckham, D. C. (2001). The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.

OASIS Consortium (2005). Universal Description, Discovery, and Integration (UDDI). http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uddi-spec. Accessed on June 30th, 2010.

Pacyna, P., Rutkowski, A., Sarma, A., & Takahashi, K. (2009). Trusted identity for all: Toward interoperable trusted identity management systems. Computer, 42(5), 30–32. doi:10.1109/MC.2009.168

PLASTIC Consortium (2008a). http://plastic.isti.cnr.it/wiki/tools.

PLASTIC Consortium (2008b). http://plastic.isti.cnr.it/wiki/tools.

Raimondi, F., Skene, J., & Emmerich, W. (2008). Efficient online monitoring of Web-service SLAs. In SIGSOFT '08/FSE-16: Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, (pp. 170–180). New York, NY, USA: ACM.
Tretmans, J. (1996). Test generation with inputs, outputs, and quiescence. In Margaria, T., & Steffen, B. (Eds.), Tools and Algorithms for Construction and Analysis of Systems, Second International Workshop, TACAS '96, Passau, Germany, March 27-29, 1996, Proceedings, volume 1055 of Lecture Notes in Computer Science, (pp. 127–146). Springer.

Tsai, W., Huang, Q., Xiao, B., & Chen, Y. (2006). Verification framework for dynamic collaborative services in service-oriented architecture. In Proceedings of the International Conference on Quality Software, (pp. 313–320).

Utting, M., & Legeard, B. (2006). Practical Model-Based Testing: A Tools Approach. Morgan Kaufmann.

Wetzstein, B., Karastoyanova, D., Kopp, O., Leymann, F., & Zwink, D. (2010). Cross-organizational process monitoring based on service choreographies. In SAC '10: Proceedings of the 2010 ACM Symposium on Applied Computing, (pp. 2485–2490). New York, NY, USA: ACM.

Windley, P. (2006). SOA governance: Rules of the game. Online at http://www.infoworld.com.

Working Group 26, ISO/IEC JTC1/SC7 Software and Systems Engineering Committee (2008). ISO/IEC 29119 Software Testing–Part 2. http://www.softwaretestingstandard.org/part2.php.
ENDNOTES

1. As of October 2010, Antonino joined SAP Research, France. He contributed to this chapter while he was a researcher at CNR-ISTI.
2. http://www.osgi.org/
3. http://www.ws-i.org/
Chapter 5
Service-Oriented Collaborative Business Processes Lai Xu Bournemouth University, UK
Peng Liang Wuhan University, China
Paul de Vrieze Bournemouth University, UK
Keith Phalp Bournemouth University, UK
Athman Bouguettaya CSIRO ICT Centre, Australia
Sherry Jeary Bournemouth University, UK
ABSTRACT

The ability to rapidly find potential business partners and rapidly set up a collaborative business process is desirable in the face of market turbulence. Traditional linking of business processes has a largely ad hoc character. Implementing service-oriented business process mashups in an appropriate way will give collaborative business processes more flexibility, adaptability and agility. In this chapter, we describe the new landscape for supporting collaborative business processes. The different solutions and tools for collaborative business process applications are presented, and a new approach for supporting situational collaborative business processes, the process-oriented mashup, is introduced. We highlight the security and scalability challenges of process-oriented mashups and discuss the benefits of using them.
INTRODUCTION

Modeling and managing collaborative business processes that span multiple organizations involves various challenges. The main challenges concern the ability to cope with change, decentralization, and the required support for interoperability. We will have to deal with the rising complexity of collaborative business processes
and a demand to configure those processes to allow them to respond to changing environments and requirements. The Internet lies at the core of a connected world, acting as a conduit for the exchange of information and allowing tasks to be processed collaboratively. It enables the formation of communities amongst users with similar interests. An Internet-interconnected world has increased both
business and personal efficiency and performance (Litan & Rivlin, 2001). As the business environment changes rapidly, the ability to quickly find potential business partners and quickly set up a collaborative business process is desirable in the face of market turbulence. Collaborative business processes are increasingly driven by the need for business agility, adaptability, and flexibility. To stay competitive in the global market, a company and its systems need to be able to adapt to continuously changing business conditions. This leads to increased pressure to build collaborative business applications quickly in order to respond to the situational needs of the business. Collaborative business applications include both data-oriented and process-oriented applications. Within the context of collaborative business applications, data-oriented applications deal with where the data comes from, where it goes, and how it is processed. A process-oriented application handles a different kind of collaborative business application: it is not centered on the processing of data; rather, the control of data, activities, and state plays the central role, for example when, where, how, and by whom data is processed or activities are triggered. Cross-organizational workflow systems and business process management systems are the typical systems that support process-oriented applications. Service orientation offers a way of thinking about business process management in terms of computational infrastructures, services, service-based development and the outcomes of those services (Papazoglou & Georgakopoulos, 2003). Service-oriented architecture (SOA) is a significant computing paradigm and is being embraced by organizations worldwide as the key to business agility. Web 2.0 technologies such as AJAX enable efficient user interactions for successful service discovery, selection, adaptation, invocation and service construction. SOA and Web 2.0 technologies also balance automatic integration of services
and human interactions, separating content from presentation in the delivery of the service. Other Web technologies, such as Web services, implement functionality using predesigned building blocks. Integrating SOA, Web 2.0 technologies and Web services into a service-oriented application connects business processes in a horizontal fashion. In the context of a Web-based service-oriented environment, the tools and applications for handling data-oriented applications are widgets, gadgets, pipes and data-oriented mashups. The traditional tools and applications for handling process-oriented applications are workflow systems and business process management systems, e.g., ERP, CRM, and SCM systems. These are heavyweight systems that are far from trivial to reconfigure for new processes. Inspired by the ideals of data-oriented mashups, i.e., supporting end users, ease of use, integration of Web resources and data sources, and good visualization, we propose a new concept, the process-oriented mashup, which allows users to specify their needs, find related Web resources, and eventually execute the resulting process, for rapidly building business processes. Users are enabled to automate their own processes without the active involvement of IT specialists. In this chapter, we examine the capabilities for building collaborative business using service computing technologies. We first identify potential Internet technologies, present background information and a motivating example, and discuss the needs for supporting collaborative business processes. We review existing solutions for the motivating example and analyze their problems. This is followed by a discussion of the process-oriented mashup and its key issues, and a comparison with similar technologies. We highlight the security and scalability challenges of process-oriented mashups, and introduce a preliminary design of the process-oriented mashup. An enterprise application is used to explain the benefits of using the process-oriented mashup. Finally, we outline future research directions and conclusions.
BACKGROUND

Collaborative business processes have been a worldwide phenomenon for the past four decades (Gereffi & Sturgeon, 2004). The growth of business collaboration is driven by a number of business forces, such as escalating competition, organizational reengineering, and new technology trends. Over the past decade, the number and quality of suppliers offering price-competitive and high-quality business services have increased significantly, and organizations have become better able to focus on their core business strengths. In addition, large organizational size is no longer a necessary advantage in the production of products or services, and neither is small size: quality, flexibility, agility, and the ability to meet diverse consumer demands count for more (Drucker, 1992). Firms now respond to heightened competitive pressure by outsourcing. Traditionally, once part of the business services has been assigned, the originating organization can hardly monitor or control the outsourced services; even a minor change to a service is not easy. The advent of global digital networks, the Internet, the World Wide Web, and more recently, Web services, has drastically lowered the cost of coordination between firms and improved the possibilities for organizations and individuals to communicate in an effective and standard manner. The new environment, newer technology, and rapid technological change provide an avenue for shedding human and equipment resources that do not fit with a company's strategic direction, and for meeting the latest needs with up-to-date resources at competitive rates, by outsourcing those business processes. Furthermore, current technologies also allow the outsourcing organization to retain control of business process collaboration. Service-oriented architecture (SOA) is rapidly becoming the dominant computing paradigm, now embraced by organizations everywhere as the key to business agility. Web 2.0 technologies such as AJAX, on the other hand,
provide effective user interactions for successful service discovery, selection, adaptation, invocation and service construction. SOA and Web 2.0 technologies also balance automatic integration of services and human interactions, separating content from presentation in the delivery of the service. Semantic technologies, such as WSMO, WSMO-Lite (for WSDL Web services) and MicroWSMO (for RESTful Web services), or Semantic Web services, could help to generate a business process out of services (semi-)automatically. The goal-oriented approach of these languages allows users to specify general goal-related requirements; given such a goal, computers can reason over Web service annotations to determine which services should be composed. These technologies thus offer a promising future for flexibly generating collaborative business processes. SOA, Web 2.0, and semantic technologies can be integrated to implement and fulfill collaborative business processes. To run Web services consistently across the enterprise, an infrastructure that provides an enterprise architecture and security foundation is necessary. In short, the Internet and related technologies have changed the way we do business; knowing, adapting to and managing change are an important aspect of today's business environment. In this chapter, we investigate how to apply new Web technologies to establish collaborative business processes, and introduce a new approach, the process-oriented mashup.
Motivating Example: International Moving Services

In this section, we introduce a motivating example: international moving services. We demonstrate how to use existing Web services, widgets, feeds, and other APIs to build a virtual enterprise (VE) for international moving services (IMS). An international moving service aims to facilitate international relocations in various ways. These
services go beyond moving items and can include things such as visa applications and assistance in finding a new residence. In brief, international moving services start with helping customers to find moving companies and request quotes. A brief, incomplete and abstract description of the various services offered by an international moving service would include:

• Find Moving Companies: compare the services of international movers by requesting free quotes for the customer; provide moving tips and the information documents needed for an international move, such as official government customs, visa and immigration, health, and weather information.
• Travel Arrangements: find the cheapest tickets and/or car rental at both the place of departure and the destination, if needed.
• Temporary Stay Arrangement: find hotels or holiday/serviced apartments at both departure and destination, if needed.
• Home Search: pre-select properties according to client requirements, such as proximity to a childcare centre, and provide a neighborhood guide containing information on doctors, shopping, schools, leisure activities, etc.
• School/Childcare Search: explain the local education system and its options, including public, private and international schools; provide information on pre-school options, including nurseries, toddler groups and other childcare facilities; provide a list of possible schools, childcare or other facilities related to the home search area.
• Settling-in Services: advise on banking systems; provide information on health, home, and car insurance; advise on importing a car into the destination country, if applicable.
• Leaving Assistance: arrange property hand-back or sale, close utility accounts and arrange final bills, and manage the property if the client leaves before the end of the tenancy.
An overview of the VE-IMS's services can be found in Figure 1. Being a virtual enterprise, the VE-IMS sources the payment, CRM, and bookkeeping functions that any normal business must include from third parties. However, we concentrate here only on the VE's core business processes; general business-related processes are not discussed. Because of the various requirements of its customers, the services provided by a VE-IMS depend on each customer's particular situation. Different customers require different processes. Each process is supported by a special-purpose piece of software, which we call an enterprise service mashup, with particular services, processes or activities.
Figure 1. Meta Model for the VE-IMS Example
In addition to adding capability, a new service mashup can modify, enhance, customize or extend an existing service mashup, or include and combine parts or components (or both) from multiple existing service mashups. A preliminary service of the VE-IMS is to help customers find a moving company for shipping their household effects to the new place of residence. Figure 2 shows the process of finding a moving company. First, the VE-IMS will request free quotes from moving companies according to the customer's places of departure and destination, arrange visits if needed, and provide a list of competent movers with their quotes. Another, extended, service the VE-IMS may provide is finding an international mover and arranging temporary places of residence at both the location of departure and the destination, based on the dates of moving, travel, and the arrival of the household effects. The temporary place of residence at the destination should be close to a certain address, such as the customer's workplace. The customer can also ask for travel arrangements to be made. The flight dates and the duration of the temporary stays should be worked out to minimize the total cost.
Figure 2. Subprocess of finding a moving company
Figure 3. Process for Scenario 2
Further, the customer may want the VE-IMS to find an available childcare place for the customer's children as soon as possible, and then find a rental home nearby around the time the household effects arrive. This second extended international moving service is shown in Figure 3.
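To make the scenario concrete, the sketch below scripts it the way a process-oriented mashup might: the service interfaces wrap the kinds of feeds and APIs listed in the next subsection, dependencies between steps (the childcare date constrains the travel, the mover's arrival date constrains the stay) are expressed as ordinary control flow, and independent bookings run concurrently. All types and method names are hypothetical; no mashup platform defines them.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical script for the Scenario-2 process of the VE-IMS example. */
public class Scenario2Mashup {

    interface MoverService     { List<Quote> requestQuotes(String from, String to); }
    interface StayService      { Booking book(String city, LocalDate in, LocalDate out); }
    interface TravelService    { Ticket cheapest(String from, String to, LocalDate around); }
    interface ChildcareService { LocalDate earliestPlace(String city); }
    interface HomeService      { String rentalNear(String city, LocalDate from); }

    record Quote(String mover, double price, LocalDate arrival) {}
    record Booking(String ref, double price) {}
    record Ticket(String ref, double price, LocalDate date) {}
    record Plan(Quote mover, Ticket flight, Booking stay,
                String home, LocalDate childcareStart) {}

    Plan run(MoverService movers, StayService stays, TravelService travel,
             ChildcareService childcare, HomeService homes,
             String from, String to) {

        // Control flow: pick the cheapest mover first, because its arrival
        // date constrains both the temporary stay and the home search.
        Quote best = movers.requestQuotes(from, to).stream()
                .min((a, b) -> Double.compare(a.price(), b.price()))
                .orElseThrow();

        LocalDate placeDate = childcare.earliestPlace(to);
        String home = homes.rentalNear(to, best.arrival());

        // Independent bookings can run concurrently.
        CompletableFuture<Ticket> flight = CompletableFuture
                .supplyAsync(() -> travel.cheapest(from, to, placeDate));
        CompletableFuture<Booking> stay = CompletableFuture
                .supplyAsync(() -> stays.book(to, placeDate, best.arrival()));

        return new Plan(best, flight.join(), stay.join(), home, placeDate);
    }
}
```

The point of the sketch is that the process logic (ordering, constraints, concurrency) lives in the mashup itself, which is exactly what distinguishes a process-oriented mashup from a pure data aggregation.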
Available Web Resources Supporting the Example

We have found the following available Web services, feeds, widgets, and mashups on Web sites like syndic8.com and programmableweb.com. The list of available feeds, Web services, widgets, gadgets, and mashups that can be used as components in implementing the example is:

1. Childcare Position Offered in France.
2. Childcare Position Wanted in France.
3. Dental Plan Comparison: compare dental plans side-by-side, dental news, and dental coverage information.
4. Cheap Flights Special Deals: Europe low cost, the worldwide cheap flights search engine; price comparison on low-cost flights in real time.
5. Hotels & Accommodations Special Deals: find hotels anywhere in the world using a worldwide hotel search engine; accommodation price comparison in real time.
6. Home Value Calculator: uses Zillow data to calculate the value of single-family homes in the U.S.; a small widget suitable for placing on your Google home page.
7. My Camps Facebook App: a Facebook application for searching, rating, reviewing, and sharing summer camps; put your favorite camps on the map and meet other friends who share the same experiences.
8. Real Estate Daily Widget: real estate facts, terms, people, Web sites, history, and events in a Google Gadget; each day of the year contains a reference about the real estate and housing industries.
9. Easy One Loan and Home Values: a mashup of Zillow and Yahoo Maps as a supplement to an on-line mortgage service.
10. Child Care Finder: find babysitters, nannies, and other care options visually with Google Maps.
11. PeekaCity API: uses Google Street View for neighborhood amenities. PeekaCity is used primarily by real estate agents as a service to their customers (currently in Chicago and Dallas/Fort Worth); the addition of Street View enables customers to view street-level photos of properties and virtually drive up and down the streets in the neighborhood.
12. Moving Companies feeds from 123movers.
13. Moving Tips feeds from 123movers.

We provide information on which services can be implemented by which available Web services, feeds, widgets, and mashups in Table 1. It shows that the core business of the VE-IMS can be fulfilled
by existing resources, manually, in certain areas, i.e., depending on the user's requirements. The list of available Web services, feeds, widgets, and mashups does not include universal school or childcare information, home search information, or even universal cheap flight ticket information. Therefore, we need a mechanism that supports looking for the required resources, such as Web services, feeds, and widgets, and can compose them automatically. To be able to find related resources, we need semantic technologies; we must thus annotate Web services, feeds, widgets and so on in advance to support their automatic composition by users. In this example case, it would be expensive and difficult to build a traditional workflow system to support the business process: we would have to know all the information in advance, or provide an interface for adding information to the traditional workflow solution. The dependencies vary, such as finding a home close to the best school or to available childcare, or finding good schools close to the home address. A business process mashup solution would, however, be handy, especially if automatic invocation of the needed feeds, Web services, etc. and automatic process execution are supported. The different processes of the VE-IMS can be implemented by different process instances; users (i.e., the owner of the VE-IMS) may only need to edit certain processes to be able to meet all the requirements of new customers.
Table 1. Available Web services for the VE-IMS services

Find Moving Service: 12, 13
Travel Arrangement: 4
Temporary Stay Arrangement: 5, 7, 8, 11
Home Search: 6, 8, 9, 11
School/Childcare Search: 1, 2, 10, 11
Settling-in Service: 3, 11
Leaving Assistance: 6
SOLUTIONS FOR COLLABORATIVE BUSINESS PROCESS AUTOMATION

Solutions exist to support collaborative business processes across multiple organizations. The complexity of implementing such collaborative business process automation depends on the flexibility of the collaborative process, the cardinality of the participating business processes, and the correlation of the collaborating process instances. Establishing e-contracts among the involved partners' workflow systems is one solution for supporting collaborative process automation; Web service orchestration and choreography, exchanging information and data in a loosely coupled environment, is another.
Contracting Among Involved Partners' Workflow Systems

Collaborative business processes are used to facilitate collaborations, and collaborations originate from contracting among organizations. Previous work on contracting (Xu & Jeusfeld, 2003; Milosevic et al., 2006; Colombo et al., 2002; Chiu et al., 2002) has discussed supporting collaborative business processes through process modeling, process tracing, and (pre-active) process monitoring. Applying the contracting approach to implement the motivating example is possible, but it does not satisfy the flexibility requirements: customers of the international moving service can ask for different destinations and different packages of services, so it is difficult to know in advance how many, and which, partners are involved in the processes.
Web Service Orchestration and Choreography

In a loosely coupled mode, the collaboration among the participating businesses of a collaborative business process can be implemented by exchanging information and data. Approaches like WS-BPEL (Andrews et al., 2003), ebXML (http://ebxml.org), RosettaNet (http://rosettanet.org), and IBM WebSphere (IBM, 2005) enhance a collaborative business process with message exchange mechanisms.
The ebXML Business Process Specification Schema (ebXML BPSS) (UN/CEFACT, 2003) provides a standard framework by which business systems may be configured to support the execution of business collaborations consisting of business transactions. However, ebXML BPSS only supports business process collaborations between two partners. When multiple partners are involved in a business process collaboration, the specification needs to be broken down into multiple bilateral relationships. This bilateral model easily results in increased load and complexity when tracking business processes. As a result of this and other limitations, ebXML BPSS-supported business collaborations lack the flexibility and agility needed to respond to a changing environment. Specifications like WS-BPEL and WS-CDL support service orchestration and choreography according to predefined processes or rules (W3C, 2005). Such a static coordination mode cannot easily capture the dynamics of business processes in collaborations. To keep organizations' business processes agile in the face of market dynamics, a new approach should look beyond the traditional solutions towards collaborative interactions and dynamic e-business solution binding. Our process-oriented enterprise mashup aims to handle exactly this: dynamic, flexible, end-user-friendly creation, modification and automation of business processes.
Process-Oriented Mashup and Key Issues

A new, user-friendly tool, the process-oriented mashup, is desired to let business users handle situational business applications. Before business people can self-serve using process-oriented enterprise mashups, many issues need to be resolved. A lightweight business
process modeling tool would allow the users of process-oriented enterprise mashups to specify their requirements more easily. Process-oriented mashups in the enterprise can help to solve both business and IT challenges, especially for small, medium and virtual enterprises that have fewer resources for creating traditional BPM solutions. Businesses seek greater agility, greater configurability, and cross-platform operation, and need to respond faster to an increasing pace of business. A process-oriented enterprise mashup offers the next step in technology to aid business people in finding the best deals on the Web. Process-oriented mashups should also be end-user friendly: the creation of process-oriented applications should be based on reusability and adaptability, and creating or customizing mashups should not require advanced IT skills from end users. A pattern is an abstraction from a concrete form which keeps recurring in a specific, non-arbitrary context (Riehle & Zuellighoven, 1996). The use of patterns is a proven practice in the context of programming, as evidenced by the impact made by the design patterns of Gamma et al. (1995). Process-oriented enterprise mashups need to provide end users with process modeling patterns, modeling fragments/process modeling templates, and even complete process models of typical cases. End users can reuse, edit or add process models using patterns and/or templates. When using process patterns and templates, it is very important to compare similar models (process patterns and templates) and especially to point out their differences. Because end users are not experts in process modeling, the process model needs to be verified before Web services are invoked; process verification should therefore be supported. In this way, end users can run their processes more easily, and reasonable execution results can be expected. Furthermore, current standards for describing Web services use syntactic (XML-based) notations such as WSDL. As these descriptions do not provide machine-understandable service semantics, the automation of Web service discovery, composition and
invocation is not possible. By instead describing Web services with semantic technologies such as SA-WSDL (W3C, 2007), OWL-S (Martin et al., 2004), WSMO (Fensel et al., 2006), and WSMO-Lite (Vitvar et al., 2008), Web service discovery, contracting, mediation, composition, and invocation can be performed automatically: computers use machine-processable descriptions of Web services to reason about, and invoke, services according to user-specified requirements. Within Web 2.0, many services, such as mashups, gadgets and pipes, do not use standard Web services technology to describe their interface, communication or enactment, but work by interpreting strings; these entities should also be annotated. Semantics facilitates the management of categories of process templates, Web services, Web 2.0 services and other resources as a whole, and aids users in discovering, selecting and finally automating services. From a technical standpoint, process-oriented mashups are still an immature technology, although, apart from the challenge of semantic heterogeneity, their technological basis is mature. A big technical challenge for mashups is software evolution. Mashups are built flexibly and fast, and since they are predominantly built up of external components, they are highly sensitive to changes in these components. External components can come from the mashup platform, from within the enterprise, and from outside providers with or without a contract. Components from outside providers in particular can easily change, and then require changes to the mashups.
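The sketch below gives a toy illustration of goal-based discovery over annotated resources. Real semantic approaches (SA-WSDL, WSMO-Lite, MicroWSMO) attach ontology concepts and support reasoning such as subsumption; the flat string tags and the example catalogue entries here are deliberate simplifications invented for illustration.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Toy goal-based discovery over annotated Web resources. */
public class SemanticRegistry {

    record Resource(String name, String kind, Set<String> concepts) {}

    private final List<Resource> catalogue;

    public SemanticRegistry(List<Resource> catalogue) {
        this.catalogue = catalogue;
    }

    /** Returns every resource whose annotations cover all goal concepts. */
    public List<Resource> discover(Set<String> goalConcepts) {
        return catalogue.stream()
                .filter(r -> r.concepts().containsAll(goalConcepts))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SemanticRegistry reg = new SemanticRegistry(List.of(
                new Resource("CheapFlightsFeed", "feed",
                        Set.of("travel", "flight", "price-comparison")),
                new Resource("ChildCareFinder", "mashup",
                        Set.of("childcare", "map", "search"))));
        // A user goal such as "compare flight prices" maps to concepts:
        System.out.println(reg.discover(Set.of("flight", "price-comparison")));
    }
}
```

Note that the same mechanism covers non-WSDL entities (feeds, widgets, pipes) as long as they carry annotations, which is precisely why annotating Web 2.0 resources matters for automatic composition.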
COMPARISON BETWEEN SIMILAR TECHNOLOGIES

In the following sections, we introduce data-oriented mashups, Web service composition environments, traditional workflow systems, and process-oriented mashups.
Figure 4. Process-oriented Enterprise Mashup
Data-Oriented Mashups

Current popular mashups, such as those provided by the IBM mashup tools (IBM), Yahoo Pipes (Yahoo), the Google Mashup Editor (Google), etc., aggregate data from different sources and visualize it. A process-oriented mashup, on the other hand, provides more control over the data, activities, and state information, i.e., when, where, how, and by whom data is processed or activities are triggered. Figure 4 shows the components of a process-oriented mashup, and from it we can see the difference between data-oriented and process-oriented mashups: disregarding the "control flow", the figure shows the structure of current data-oriented mashups, namely where data comes from, which kinds of Web services or APIs process the data, and how the data is presented. Data-oriented mashups can provide collaborative business applications, such as integrating data from different sources, and have the capability to combine data with (possibly external) functionality to create and produce useful outputs. Applications suitable for data-oriented mashups are, e.g., embedding rental or house-sale information in a map with information on all nearby facilities, or automatically forwarding related federal animal-disease information to a veterinarian's customers.

Mashups have a broader range of functionality than Web service compositions; composing services is only a part of what mashups do. For example, service compositions do not have an independent instantiation: they are initiated by calling the composed service. Mashups, on the other hand, especially business process mashups, can have active parts that monitor the environment or are notified of events. Mashups, being broader in scope than Web service compositions, actually differ most in their approach. A mashup is the result of fast development on a small scale. In the UNIX world, scripts are used for various tasks within the system and for administration; these tasks can be performed by complete programs, but for many tasks scripts are more suitable and convenient. Mashups can be compared with scripts, where Web service compositions are programs: the capabilities (looking only at Web service mashups) are similar, but the strengths of the approaches lie in different areas.
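The contrast can be made concrete with a small sketch: a data-oriented mashup is essentially a stateless transformation pipeline, while a process-oriented mashup adds state and control that decide when a step may run and who may trigger it. All types below are invented for illustration.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Toy contrast between data-oriented and process-oriented styles. */
public class MashupStyles {

    // Data-oriented: a stateless pipeline. Only where data comes from,
    // how it is transformed, and where it goes matter.
    static List<String> dataOriented(List<String> rawQuotes,
                                     Function<String, String> normalize) {
        return rawQuotes.stream().map(normalize).sorted()
                .collect(Collectors.toList());
    }

    // Process-oriented: state and control decide when a step may run
    // and by whom it may be triggered, not just how data flows.
    enum State { QUOTES_REQUESTED, QUOTES_RECEIVED, MOVER_SELECTED }

    static class MovingCase {
        private State state = State.QUOTES_REQUESTED;

        void quotesArrived() { state = State.QUOTES_RECEIVED; }

        void selectMover(String role, String mover) {
            if (state != State.QUOTES_RECEIVED)
                throw new IllegalStateException("quotes not received yet");
            if (!"customer".equals(role))
                throw new SecurityException("only the customer may select");
            state = State.MOVER_SELECTED;
        }
    }
}
```

The active, event-driven parts attributed to business process mashups above correspond to the stateful MovingCase half of the sketch, which has no counterpart in a pure data pipeline.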
Mashups have a broader range of functionality than Web service compositions. Composing services is only a part of what mashups do. For example service compositions do not have an independent instantiation. They are initiated by calling the composed service. Mashups on the other hand, especially business process mashups, can have active parts that can monitor the environment or be notified of events. Mashups, being broader in scope than Web service compositions, actually differ most in the area of approach. A mashup is the result of fast development on a small scale. In the UNIX world, scripts are used for various tasks within the system, and for administration. These tasks can be performed by complete programs, but for many tasks, scripts are more suitable and convenient. Mashups can be compared with scripts, where Web service compositions are programs. The capabilities (looking only at Web service mashups) are similar, but the strengths of the approaches are in different areas.
Traditional Workflow Systems

Traditional workflow systems provide users with a stable system better suited to routine work. A business process mashup can complement a traditional workflow system: because of its agility,
a process-oriented mashup can respond better to change: changes in user requirements, changes in process-related resources, etc. It should be cheaper and easier to use. Because it uses the Internet as its application basis, it may also bring accessibility problems, security problems, and so on. Process-oriented mashups are at the very beginning of their journey and have a long way to go. From a technical perspective, besides the need to add control flow to business process mashups, the data flow also needs to cover data visibility; this can be a pain point for supporting business processes in a mashup fashion. The data integration issues of current mashup tools have been analyzed in (Lorenzo et al., 2009). Data flow operators allow operations to be performed either on the structure of the data or on the data itself. In addition, data is generated and updated using different data refresh strategies, namely the pull strategy and the push strategy: while the pull strategy is based on frequent, repeated requests from the client, in the push strategy the client does not send requests but needs to register with the server. There are two possible strategies for handling the pull interval: in the global strategy, the pull interval is set for the whole application; in the local strategy, each data source is given its own refresh interval. According to the definition of the data flow patterns, the data operators and data refresh strategies only cover data transfer, data interaction, and data routing; data visibility is not covered by current data-oriented mashups.
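The two pull-interval strategies can be sketched as follows; the DataSource abstraction and both method names are stand-ins, and fetch() would wrap, e.g., an HTTP request to a feed.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch of the global and local pull-interval strategies. */
public class PullRefresher {

    interface DataSource { void fetch(); }

    private final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(2);

    /** Global strategy: one interval drives every source in the mashup. */
    public void refreshGlobally(Iterable<DataSource> sources, long periodSec) {
        scheduler.scheduleAtFixedRate(
                () -> sources.forEach(DataSource::fetch),
                0, periodSec, TimeUnit.SECONDS);
    }

    /** Local strategy: each source is polled at its own interval. */
    public void refreshLocally(Map<DataSource, Long> periodsSec) {
        periodsSec.forEach((source, period) ->
                scheduler.scheduleAtFixedRate(source::fetch, 0, period,
                        TimeUnit.SECONDS));
    }
}
```

The local strategy trades scheduling overhead for freshness tuned per source, which matters when, say, a flight-price feed changes far more often than a school directory.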
Existing Enterprise Mashups and Related Work

Most existing enterprise mashups, such as the SAP enterprise mashup Rooftop (Hoyer & Stanoevska-Slabeva, 2009), JackBe's enterprise mashup Presto (jackbe.com), and IBM Mashup Center, provide connectors to databases, customer relationship management (CRM) systems, content management systems (CMS), or MS Excel sources. They are aggregation tools at the business data level,
which the original mashup concept covered within corporate environments, or at the lower levels of the business function level. Business functions can be considered at different levels of aggregation: at a high level, business functions like "Procurement", "Sale" and "Operations" may be distinguished, where "Sale" is further decomposed into "Pre-Sales", "Sales Order Processing", "Rebate Processing" and so on. The lowest level of aggregation of business functions is called tasks or activities. The Rooftop mashup only deals with lower-level business functions, such as monitoring the progress of "Contracting", i.e., the updated status of contract performance; it does not deal with higher-level business functions. The on-going EU project FAST (fast.morfeoproject.eu) takes the same position, handling low-level business data aggregations. It aims at providing an innovative visual programming environment that will facilitate the development of next-generation composite user interfaces. The platform provides data-oriented aggregation and does not necessarily take a business process perspective; it can be categorized as a data-oriented mashup platform, but one focusing more strongly on visualization, without business process management features. The on-going EU project SOA4All (www.soa4all.eu) aims to provide a platform for building process-oriented applications for end users. It will provide a lightweight process composition environment. Its main research focus is on semantic technologies and the automatic discovery and composition of semantically annotated Web services; it does not consider the composition or discovery of other Web resources, such as widgets, gadgets, pipes, feeds and mashups.
Process-Oriented Mashup Challenges on Security and Scalability

When exposing enterprise information systems, it is important that proper security and authorization are in place.
Security issues such as privacy and confidentiality of collaborative business process modeling and enactment are of particular importance in the business collaboration scenario, where trust and security are highly featured. Traditional inter-organizational business process approaches present the same process to all participating organizations and therefore neutralize the diversity of participants in terms of authority levels and perception scopes. Mashups, however, are built flexibly and fast. Since mashups are predominantly built up of external components, they are highly sensitive to changes in these components. External components can come from the mashup platform, from within the enterprise, and from outside providers, with or without a contract. Components from outside providers especially can easily change and require changes to the mashups. Handling privacy, confidentiality, and changing components are thus important security issues for process-oriented mashups. Mashups act in the name of their owners, but these owners should not need to log in to each individual Web service. This means that single sign-on technologies need to be supported by the mashup platform. Mashups represent an interesting challenge in the area of IT governance. End-user programming means that control over the application functionality moves to the end users, away from specialized departments. This may create political problems as well as genuine management problems. For example, when an enterprise mashup becomes business critical, who then is responsible for its functioning? The "IT department" has no influence on or knowledge of the design. The "developer" is not schooled in program design and cannot be expected to be held responsible for flaws. This problem, however, is not altogether different from that of mission-critical spreadsheets. It is a problem that must be understood and managed, not an unmanageable problem. A solution could be, for example, to transfer mashups that become
mission critical to specialized departments, or to have a more permanent solution written. Another management issue in the mashup area is duplicated work. When two people work on similar mashups, part of the effort is wasted in duplication. On the other hand, coordination costs are avoided. Given the current difficulties of having custom solutions created, it seems that the benefits gained outweigh the costs of duplication, especially when good sharing mechanisms are used that enable easy retrieval of existing mashups. These possible issues show the limits of mashups. Mashups cannot take the place of more elaborate information systems. The role of mashups is rather to complement existing information systems, providing automation for tasks that now have to be performed manually because traditional information system development for these tasks is not economically viable. Having identified the key issues of process-oriented enterprise mashups, a preliminary design of process-oriented enterprise mashups is presented in the next section. Process-oriented enterprise mashups help to address these challenges through self-service application development, enabling a move to the next level of innovation, speed, and agility by allowing users to combine and remix different sets of data in new ways. In this way, process-oriented enterprise mashups can provide insight into corporate data that was simply not possible before. Process-oriented mashups are not designed for large numbers of users. They are designed for situational business applications, which take the business user's perspective to develop and deploy quick and "good enough" applications. Scalability will become an issue when a mashup becomes popular. To overcome this problem, IT-professional design and development is needed. Such mashup applications will eventually be added to the organization's core business process systems.
PRELIMINARY DESIGN OF PROCESS-ORIENTED MASHUP

To visualize what an enterprise business process mashup may look like, we focus on the simple process mentioned in the motivating example: the request of quotes from removalists. This is at its base a fairly simple process, but it can be enriched in many ways. A key part of the business process mashup is some form of control panel. This control panel at least allows the user to instantiate new processes, view his current processes, and edit and view the available process models. The create process dialog (see Figure 5) allows the user to select a base process model for the process. The dialog also allows the user to name the process instance and to provide a description. Both the name and description are there to allow the user to retrieve the process instance more easily; they are not needed for the processing itself. The process model for requesting quotes is actually a parameterized model. Therefore, a dialog is presented that allows the user to specify when to stop waiting for quotes after sufficient quotes have been received, how many quotes to request, and the minimum number of quotes needed.
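As a rough illustration of the parameterized quote-request model, the following sketch captures the three dialog parameters described above; the class and field names are assumptions made for this example only, not part of any existing tool.

```python
from dataclasses import dataclass

@dataclass
class QuoteProcessParameters:
    stop_waiting_after_hours: float  # when to stop waiting once sufficient quotes arrived
    quotes_to_request: int           # how many quotes to request
    min_quotes_needed: int           # minimum number of quotes before proceeding

@dataclass
class ProcessInstance:
    base_model: str    # base process model chosen in the create process dialog
    name: str          # user-chosen name, used only to retrieve the instance later
    description: str   # free-text description, also used only for retrieval
    params: QuoteProcessParameters

# Example instantiation, mirroring the create process dialog of Figure 5.
instance = ProcessInstance(
    base_model="RequestRemovalistQuotes",
    name="October move",
    description="Removalist quotes for the October relocation",
    params=QuoteProcessParameters(stop_waiting_after_hours=48.0,
                                  quotes_to_request=10,
                                  min_quotes_needed=3),
)
```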
After specifying the process parameters, the user can review and, if necessary, edit the process model. The process viewer (Figure 6) allows users to see all their process instances and take a deeper look into them. The user can, for example, look into sub-processes directly from the main process. The idea is to allow users a better understanding of the entire process, especially where the sub-process is needed to perform activities over a number of items resulting from earlier process activities. If a sub-process is part of an instantiated process, and the sub-process itself has been activated, it is also possible to see the actual process instances. For example, suppose that another removalist has heard about the pending removal, and that the user has allowed them to provide a quote. When the user then receives the quote from this removalist, he would still like to make the quote part of the regular process. This means that the user needs to edit the process instance. First, the user needs to find the related instance, based on the process name and description. When a process instance has been selected, the user can edit it. To do so, he is presented with the process instance editor (see Figure 7). This editor allows the user to see and edit the current progress in the process.
Figure 5. Process creation screen
Figure 6. Process editor for new process
As we see in the figure, at this stage, some removalists have already submitted quotes, and we can see all process instances. The sub-process view can also be collapsed to keep the overall overview; in Figure 8, the sub-process is collapsed. Further, to allow the additional quote to become part of the process, a new activity ("Add additional quote") has been added. As the additional quotes need to be added to the results of the quote requesting and receiving sub-process, this is visualized as the results of the additional quotes activity going to the transition between the sub-process and the conditional that allows the process to go on if there are sufficient quotes and the quote deadline has passed. The way in which we have edited this process instance should be fairly easy and allows the user to still leverage the process system for dynamic processes where exceptions occur. The results of this edit, however, cannot be used as a process model, as the additional quotes activity is not initiated anywhere. The business process mashup must also have the ability to have process edits reflected in the process model, and perhaps even use one edit to update a number of process instances.
Benefits of Using Process-Oriented Mashups

Our example is a process-oriented mashup server acting as an open service platform for a virtual organization of international moving services. The process-oriented mashup can be used by owners of the virtual organization to model and execute the core business processes of international moving. There are many advantages of using these process-oriented mashups. Firstly, the mashup platform allows end users, such as the owners of the virtual organization, to handle simple process development tasks themselves instead of requiring a more expensive and potentially longer IT development project. Moreover, cheaper process development makes it economical to include in the platform those processes that are rarely used, so that the long-tail effect can be harvested. Secondly, development tasks that cannot be handled by end users and are done by the service developers of external service providers will also be faster and cheaper due to the seamless interaction among parties over the platform (e.g., for communicating requirements or performing beta tests in the target environment).
Figure 7. Process instance review
Figure 8. Process instance editing
Thirdly, the mashup platform is a shared process repository, so new processes or modifications become immediately visible to users, reducing propagation times and costs. Finally, the modularity of the underlying SOA allows an owner to buy only those services they really need instead of complete products, reducing the total cost of their IT infrastructure. In addition to these efficiency gains, the platform also has the benefit of an effectiveness gain, because the platform allows owners of the virtual organizations to handle different requests from consumers in a central place, which leads to a better service for the consumers.
For service providers, resellers, and consultants, the main advantages are a potentially larger market reach and easier market access, because the process-oriented enterprise mashup platform provides a central entry point for service consumers so that it is easy to find, try out and integrate services. Moreover, service providers, resellers, consultants, and developers can also benefit from an efficiency gain because of an easier integration and development process, and improved agility because of a faster interaction with customers (e.g., when developing a new service) and market trends in general.
CONCLUSION AND FUTURE RESEARCH ISSUES

The Internet has a continuing impact on everyday life. Web-based technologies not only affect our communication patterns, but also provide opportunities to bring information and knowledge to our daily activities. Process composition in a process-oriented mashup is shown to be a potentially useful technology for SOA-based business integration. It provides an agile approach to adapt to fast-paced business environments. Process-oriented mashups coordinate different process orchestration activities. This allows users to automate their activities. When a user needs to perform a task, this often involves getting information from one place; aggregating, filtering, and shaping it; and then sending the result to a different place, repeating these steps until the process is finished. As an example of an enterprise application, we have provided a scenario of the international moving services of a virtual organization, showing how owners of the virtual organizations can use a process-oriented enterprise mashup to build their applications and how they can benefit from using the mashup. The registration of a business process in an international moving service can be automated, such that those elements which do not by their nature require human intervention can be carried out automatically or, if wanted, with the click of a single button. A process-oriented mashup allows the users to specify this execution themselves, and it will be executed as if they had performed the tasks manually. This approach differs from traditional workflow management approaches in that the mashup is focused on a single user/role, and the actions are performed as they would be if they were done manually. These process mashups are situated at a lower level in the process and are focused on concrete tasks rather than abstract tasks that must be decomposed into other, more detailed tasks.
This chapter explains a new landscape for supporting collaborative business processes: a service-oriented approach. The different solutions and tools for collaborative business process applications were presented. A new approach for supporting situational collaborative business processes, the process-oriented mashup, was introduced. We have highlighted the security and scalability challenges of process-oriented mashups. Further, the benefits of using process-oriented mashups were discussed. To fully exploit the potential of Web-centric compositions, we are starting new work in several areas. We are exploring a lightweight business process modeler (Xu et al., 2010; Xie, de Vrieze & Xu, 2010a; Xie, de Vrieze & Xu, 2010b). We are providing the most popular process templates, which are based on workflow patterns (van der Aalst, 2003), with the aim that users can pick them up and run them. Further, we are also interested in annotating the resources using WSMO-Lite and Micro-WSMO. We currently do not have a unified way to browse or an efficient way to discover different web resources. We are interested in finding a unified way to list all web resources and to access existing enterprise services and data/content repositories. Semantic technology can facilitate web resource discovery. The term web resources here refers to web services, widgets, gadgets, pipes, and mashups. On Site 6, a leading mashup directory, there are over 4100 registered mashups listed; every month, about 100 new mashups are added. Descriptions of feeds can be obtained, for example, from social bookmarking web sites like Site 7, which lists about 560,000 feeds. This shows there are too many web-based resources for end users to manually discover, select, or compose. The process-oriented mashup environment needs to unify the web resources landscape and provide an efficient discovery mechanism; semantic technology may help facilitate this. Secondly, the influence of enterprise security on the use of mashups is critical for the popularity of mashup applications. This needs further exploration.
A further issue is how to improve the executability of service-oriented business process applications. There are special issues for service-oriented business process applications: even if the control-flow of the business process is correct, conflicting pre-conditions and post-conditions for invoking web services can still lead to an unexecutable business process application. Therefore, beyond checking process models, there are further issues for the executability verification of service-oriented business process applications. Finally, we would like to emphasize that process-oriented mashups will not completely replace core business process management systems. Process-oriented mashup applications address different needs: they are built for just a handful of users, for applications that are used for only a few weeks or months, or for situational applications that address a small piece of functionality. For example, perimeter ERP applications, such as vacation scheduling or seminar and presentation management, are normally not included in an organization's ERP system. However, they can be desirable for individuals who manage those matters on a daily basis.
REFERENCES

Andrews, T., Curbera, F., Dholakia, H., et al. (2003). Business process execution language for web services (BPEL4WS) 1.1.

Chiu, D. K. W., Karlapalem, K., Li, Q., & Kafeza, E. (2002). Workflow view based e-contracts in a cross-organizational e-services environment. Distributed and Parallel Databases, 12(2–3), 193–216. doi:10.1023/A:1016503218569

Colombo, E., Francalanci, C., & Pernici, B. (2002). Modeling coordination and control in cross-organizational workflows. In Proceedings of DOA/CoopIS/ODBASE, 91–106.

Drucker, P. (1992). The new society of organizations. Harvard Business Review.

Fensel, D., Lausen, H., Polleres, A., de Bruijn, J., Stollberg, M., Roman, D., & Domingue, J. (2006). Enabling Semantic Web services: The Web Service Modeling Ontology. Springer.

Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1995). Design patterns: Elements of reusable object-oriented software. Reading, MA: Addison-Wesley.

Gereffi, G., & Sturgeon, T. J. (2004). Globalization, employment, and economic development. Sloan Workshop Series in Industry Studies, Rockport, Massachusetts, June 14–16, 2004.

Google mashup editor. http://editor.googlemashups.com

Hoyer, V., & Stanoevska-Slabeva, K. (2009). Towards a reference model for grassroots enterprise mashup environments. In The 17th European Conference on Information Systems (ECIS 2009).

IBM mashup center. http://www-10.lotus.com/ldd/mashupswiki.nsf

Litan, R. E., & Rivlin, A. M. (2001). Projecting the economic impact of the Internet. The American Economic Review, 91(2), 313–317.

Lorenzo, G. D., Hacid, H., Paik, H., & Benatallah, B. (2009). Data integration in mashups. SIGMOD Record, 38(1), 59–66. doi:10.1145/1558334.1558343

Martin, D., Burstein, M., Hobbs, J., Lassila, O., McDermott, D., McIlraith, S., et al. (2004). OWL-S: Semantic markup for web services. http://www.daml.org/services/owl-s/1.1/overview/

Milosevic, Z., Sadiq, S. W., & Orlowska, M. E. (2006). Towards a methodology for deriving contract-compliant business processes. In Proceedings of the 4th International Conference on Business Process Management, 395–400.

O'Reilly, T. (2005). What is Web 2.0. http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html

Papazoglou, M. P., & Georgakopoulos, D. (2003). Service-oriented computing. Communications of the ACM, 46(10), 24–28. doi:10.1145/944217.944233

Riehle, D., & Zuellighoven, H. (1996). Understanding and using patterns in software development. Theory and Practice of Object Systems, 2(1), 3–13. doi:10.1002/(SICI)1096-9942(1996)2:1<3::AID-TAPO1>3.0.CO;2-#

UN/CEFACT. (2003). ebXML Business Process Specification Schema, version 1.09.

van der Aalst, W., ter Hofstede, A., Kiepuszewski, B., & Barros, A. (2003). Workflow patterns. Distributed and Parallel Databases, 14(1), 5–51. doi:10.1023/A:1022883727209

Vitvar, T., Kopecky, J., Viskova, J., & Fensel, D. (2008). WSMO-Lite annotations for web services. In The 5th European Semantic Web Conference.

W3C. (2005). Web services choreography description language, version 1.0.

W3C. (2007). Semantic annotations for WSDL and XML schema. http://www.w3.org/TR/sawsdl/

Xie, L., Xu, L., & de Vrieze, P. T. (2010a). Lightweight business process modelling. In International Conference on E-Business and E-Government (ICEE 2010), 7–9 May 2010, Guangzhou, China.

Xie, L., Xu, L., & de Vrieze, P. T. (2010b). Process modelling in process-oriented enterprise mashups. In The 2nd IEEE International Conference on Information Management and Engineering (IEEE ICIME 2010), 16–18 April 2010, Chengdu, Sichuan, China.

Xu, L., de Vrieze, P. T., Phalp, K. T., Jeary, S., & Liang, P. (2010). Lightweight process modelling for virtual enterprise process collaboration. In PRO-VE 2010: 11th IFIP Working Conference on Virtual Enterprises, 11–13 October 2010, Saint-Etienne, France.

Xu, L., & Jeusfeld, M. (2003). Pro-active monitoring of electronic contracts. In The 15th Conference on Advanced Information Systems Engineering (CAiSE 2003), 16–20 June 2003, Klagenfurt, Austria.

Yahoo pipes. http://pipes.yahoo.com
Section 2
Performance
Chapter 6
Performance Management of Composite Applications in Service Oriented Architectures Vinod K. Dubey Booz Allen Hamilton, USA Daniel A. Menascé George Mason University, USA
ABSTRACT

The use of Service Oriented Architectures (SOA) enables the existence of a market of service providers delivering functionally equivalent services at different Quality of Service (QoS) and cost levels. The QoS of composite applications can typically be described in terms of metrics such as response time, availability, and throughput of the services that compose the application. A global utility function of the various QoS metrics is the objective function used to determine a near-optimal selection of service providers that support the composite application. This chapter describes the architecture of a QoS Broker that manages the performance of composite applications. The broker continually monitors the utility of the applications and triggers a new service selection when the utility falls below a pre-established threshold or when a service provider fails. A proof-of-concept prototype of the QoS broker demonstrates how it maintains the average utility of the composite application above the threshold in spite of service provider failures and performance degradation.
INTRODUCTION

Service Oriented Architecture (SOA) enables a market of service providers delivering functionally equivalent services at different Quality of Service (QoS) and cost levels. This presents a unique opportunity for consumers to pick and
choose services that meet their business and QoS needs. The selected services can be orchestrated in a process flow to optimize the execution of business processes in a cost-effective manner. We assume that service providers publish their QoS levels and offer resource reservation to guarantee them within a certain range. We also assume that
the characteristics of QoS metrics, more specifically the QoS levels offered by service providers, may change over time. For instance, the performance of a service may degrade due to heavy workload conditions of the associated service provider or due to some unforeseen unavailability of the service. As a result, this may affect the end-to-end QoS of a business process that may depend on such services. This requires monitoring the performance of services and business processes at runtime and, if needed, taking corrective measures to ensure that the QoS levels of running business processes are not compromised. This chapter describes the architecture of a QoS Broker that manages the performance of composite applications. The broker facilitates near-optimal service selection for the composite application, continually monitors the utility of the applications, and triggers a new service selection when the utility falls below a pre-established threshold or when a service provider fails. A proof-of-concept prototype of the QoS broker demonstrates how it maintains the average utility of the composite application above the threshold in spite of service provider failures and performance degradation.
Background

We assume there will be a market of service providers delivering services at different QoS and cost levels. The service providers specify their QoS metrics in terms of response time, availability, and throughput. Response time (R) refers to the time it takes for a service to respond to a user's request. It is measured in time units such as sec or msec. Throughput (X) represents the number of requests or transactions completed per unit of time. Availability (A) refers to the fraction of time a system is up and available for use. A composite application in this chapter refers to a system composed of various services that support the execution of a business process. A business process is defined as a collection of activities connected to one another in a certain workflow
to address business needs of an enterprise. The execution time, availability, and throughput of composite applications are computed as a function of the QoS metrics of the individual service providers selected for the business process. This computation of the end-to-end QoS for a business process must take into consideration the constructs used in the business process workflow. For example, the end-to-end execution time of a business process with sequential activities is additive in nature, while for parallel constructs (e.g., fork-and-join) it is the maximum of the response times of the service providers that support the parallel activities. The throughput of composite applications with sequence and flow constructs is the minimum of the throughputs of all service providers chosen to support the business process, while the end-to-end availability is the product of the availabilities of the individual service providers.
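A minimal sketch of these composition rules follows, assuming a business process represented as a tree of sequence and flow (fork-and-join) constructs whose leaves carry the (R, A, X) values of the selected providers; this tree representation is an illustration for this chapter's rules, not the actual BPEL-based implementation.

```python
import math

def end_to_end_qos(node):
    """Return (R, A, X) for a construct tree of ('sequence'|'flow'|'service', ...)."""
    kind, children = node
    if kind == "service":
        return children  # leaf: (R, A, X) of the selected provider
    parts = [end_to_end_qos(c) for c in children]
    A = math.prod(a for _, a, _ in parts)   # availability: product over providers
    X = min(x for _, _, x in parts)         # throughput: minimum over providers
    if kind == "sequence":
        R = sum(r for r, _, _ in parts)     # sequential activities: additive
    else:  # "flow" (fork-and-join)
        R = max(r for r, _, _ in parts)     # parallel activities: maximum
    return (R, A, X)

bp = ("sequence", [
    ("service", (1.5, 0.99, 67.0)),
    ("flow", [("service", (0.8, 0.998, 125.0)),
              ("service", (1.2, 0.97, 83.0))]),
])
print(end_to_end_qos(bp))  # approximately (2.7, 0.9584, 67.0)
```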
We consider a global utility function of these metrics as the objective function used to determine the near-optimal service selection to support the execution of a business process. A utility function measures the usefulness of a service or business process to a consumer in terms of various QoS metrics. The utility is typically represented as a scalar with no units. The utility increases with the decrease in response time and increases with throughput, availability, and security. Typically, utility functions are monotonically decreasing for response time and monotonically increasing for throughput and availability. The monotonicity assumption corresponds to rational user expectations. For example, one would expect a user to see less utility in a system as its response time increases than the other way around. Utility functions have been used for achieving self-optimization in distributed autonomic systems (Kephart and Das, 2007; Tesauro et al., 2005). Bennani & Menascé (2005) used utility functions combined with analytical queuing network models to dynamically allocate servers to applications being hosted by an Internet data center. Menascé et al. (2010) presented a framework for utility-based service-oriented design for self-architecting software systems. An example of a utility function U(r) for response time r is the sigmoid function used in Menascé et al. (2010):

U(r) = \frac{(1 + e^{\alpha\beta})\, e^{\alpha(\beta - r)}}{\left(1 + e^{\alpha(\beta - r)}\right) e^{\alpha\beta}}

where β is the Service Level Objective for the response time and α is a shape parameter. Given a market of service providers delivering services at different QoS and cost levels, mechanisms need to be devised to optimally select services at runtime to support a business process execution so that the selected services together meet the end-to-end QoS and cost requirements of the business process and maximize the utility for a consumer. This problem is referred to in the literature as the QoS-aware Service Selection or Optimal Service Selection problem and is NP-hard. A naïve approach to solve this problem would be to carry out an exhaustive search. However, due to the NP-hard nature of this problem, it may not be practical to find an optimal solution for business processes via exhaustive search, especially when service selection decisions may need to be taken at runtime to respond to service failures or degradation. For example, for a business process with N activities and m service providers per activity on average, the number of possible service selections is m^N. This means that for a business process with 10 activities and 10 service providers per activity, there will be 10 billion possible service provider allocations. Thus, if the evaluation of each service selection takes one tenth of a millisecond, an exhaustive search will take more than 10 days to find the optimal solution. Therefore, one needs to explore efficient heuristic service selection algorithms that provide a near-optimal solution at significantly less computational cost than that required to find an optimal solution.
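The sigmoid utility above can be evaluated directly; the sketch below is a straightforward transcription, with alpha as the shape parameter and beta as the response-time SLO (the example parameter values are arbitrary).

```python
import math

def utility_response_time(r, alpha, beta):
    """Sigmoid utility of response time r; close to 1 well below the SLO beta,
    about 0.5 at r = beta (for large alpha*beta), and decaying toward 0 past it."""
    num = (1 + math.exp(alpha * beta)) * math.exp(alpha * (beta - r))
    den = (1 + math.exp(alpha * (beta - r))) * math.exp(alpha * beta)
    return num / den

# Monotonically decreasing in r, as the rational-expectations assumption requires.
for r in (0.5, 1.0, 2.0):
    print(r, round(utility_response_time(r, alpha=4.0, beta=1.0), 3))
```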
A number of authors have tried to solve the service selection optimization problem using linear integer programming. These authors dealt with either sequential business processes or, when they considered general business processes, they used deterministic QoS metrics and assumed linear and deterministic objective functions. Zeng et al. (2004) proposed using multiple QoS metrics such as price, execution time, and reliability, and a combination of local optimization and a global planning approach by means of integer programming to select an optimal execution plan for a composite service. Cardoso et al. (2004) proposed using QoS distribution functions for workflow activities and applied reduction rules to compute deterministic QoS values for a composite service. They also proposed using simulation to estimate the QoS of composite services. Yu et al. (2007) presented a different approach using a utility function of QoS metrics to optimize end-to-end QoS requirements and defined the problem as a multi-dimension multi-choice 0-1 knapsack problem as well as a multi-constraint optimal path problem. Menascé and Dubey (2007) presented a scheme using utility functions of QoS metrics such as response time, throughput, and probability of rejections as the service selection mechanism, based on a predictive analytical queuing network performance model. Ardagna and Pernici (2006) modeled service composition as a mixed integer linear problem where both local and global constraints are taken into account. Kyriakos and Plexousakis (2009) proposed a mixed-integer programming based approach for QoS-based Web service selection. Cardellini et al. (2007) proposed a flow-based service selection for Web service composition supporting multiple QoS classes. Algorithms for the selection of the most suitable candidate services to optimize the overall QoS of a composition are also discussed in Jaeger et al. (2005). Rosario et al. (2008) presented probability distributions of QoS for service composition and monitoring. A different approach to Web service composition based on a user's qualitative preferences utilizing CP-nets was discussed by Santhanam et al. (2008).
Singh (2004) proposed a QoS ontology-based agent framework for dynamic service selection. Lecue and Mehandjiev (2009) investigated QoS-driven Web service composition by considering their semantic links. Serhani et al. (2005) proposed broker-based verification and certification of Web services for their functional and QoS claims. Ye and Mounla (2008) discussed a hybrid approach to QoS-aware service composition using integer programming, genetic algorithms, and case-based reasoning; they used response time constraints as service selection criteria. A number of authors who have tried to solve the QoS-based service composition optimization problem validated their work focusing on workflows with sequential execution. Experience with the linear programming approach discussed by Zeng et al. (2004), confirmed by Canfora et al. (2005) and Yu et al. (2007), suggests that the approach does not scale with the number of candidate services and activities in the business process. The genetic algorithms proposed by Canfora et al. (2005, 2008) and the heuristic approach presented by Berbner et al. (2006) for service composition with sequential execution are an improvement over the linear programming approach, as the heuristics are very fast and perform significantly better than linear programming based approaches. Menascé et al. (2008) described optimal service selection in SOAs for BPEL-compliant business processes, which allow constructs such as sequence, switch, and flow. The paper considers probability distribution functions of QoS metrics and presents an efficient algorithm that finds the optimal solution for the composite services without having to explore the entire solution space. The paper also presents an efficient heuristic approach that performs very close to the optimal solution while examining a very small portion of the solution space. Dubey and Menascé (2010) considered service provider selection for business processes as a constrained non-linear optimization problem that aims to find an optimal service selection based on maximizing a utility function
of multiple QoS metrics under multiple QoS and cost constraints. They presented an efficient optimal service selection algorithm that addresses service selection for business processes of modest size. They also presented an efficient heuristic algorithm that obtains a sub-optimal solution that is very close to the optimum (99.5% as good as the optimal solution) at a small fraction of the computational cost in terms of the number of points examined in the solution space and computation time. This algorithm is part of the QoS Broker framework and is briefly described in the next section. Several authors have investigated self-adaptation in software systems. For example, Ardagna and Pernici (2007) presented adaptive service composition in flexible processes by considering optimal service selection using mixed integer linear programming. Canfora et al. (2008) developed a framework for QoS-aware binding and rebinding of composite services that focuses on QoS-based dynamic service selection using genetic algorithms. Gjorven et al. (2008) described an adaptation middleware to support service selection and composition. Salehie and Tahvildari (2009) presented a landscape of research and challenges in self-adaptive software by conducting a detailed survey. Menascé et al. developed a self-architecting framework for service-oriented software systems (Menascé, Ewing, et al., 2010). Yau et al. (2009) developed an adaptive service based software system with features to monitor and manage QoS. Sheng et al. (2009) presented a configurable and adaptive service composition model that provides distinct abstractions for service context and exceptions. A recent special issue of the IEEE Transactions on Services Computing guest-edited by Yu and Bouguettaya (2010) included five papers that addressed query models and efficient selection of web services. Chapter 12 of this book, by Merad et al., discusses a game-theoretic solution to the problem of optimal selection of services.
A QOS BROKER

The QoS Broker (QB) discussed in this chapter represents a framework to provide QoS brokering between business process consumers and service providers (SP) for optimal service selection and to facilitate the execution of business processes in a business process engine. The framework assumes that a business process can be designed using proxy services that provide placeholders for an optimal set of concrete services to be invoked at runtime. Proxy services have the same interface and signature as the actual candidate services, which will be used to execute the activities of business processes at runtime. The framework uses a service mediation component that enables a loosely coupled integration scheme and serves as an intermediary between different components of the architecture. Data transformation from one service to another, content-based message routing, and service interactions are also facilitated via the service mediation layer. Content-based routing is used to orchestrate services for business process execution by routing requests to target service providers based on service tokens obtained through the heuristic-based service selection. This eliminates the need for tightly coupled point-to-point integration approaches and provides loose coupling between business consumers, broker, and service providers, which is the hallmark of Service Oriented Architectures. For example, the Enterprise Service Bus (ESB), which is readily available on most service platforms, provides such a service mediation capability. The QB implements algorithms for heuristic-based service selection and for evaluating end-to-end QoS metrics for any BPEL-compliant business process. It provides capabilities for service providers to register their services along with QoS and cost functions. The QB stores registered services and processes in a service registry/repository, referred to as a library of processes and services in Figure 1, and which could be an
extension of any standard-based service registry/repository (e.g., UDDI, ebXML). The registration functionality enables consumers to register their business processes to receive brokering for optimal selection of services and accepts requests to execute business processes using preselected SPs. When a consumer sends a service selection request for his/her business process, the QB selects services that will maximize a utility function of multiple QoS metrics (execution time, availability, and throughput) for the consumer under QoS and cost constraints. Once services are selected for a business process, the QB returns a token containing the selected service providers to the consumer. When a consumer sends a business process execution request, the request contains a selected services token, which is used by the process engine to orchestrate services in a process flow to execute the business process. An optimal service selection set for a business process can be used to execute the business process multiple times. The QB employs a QoS Monitor for runtime QoS monitoring and management of the business process and the selected services. Figure 1 shows the logical architecture of the QoS Broker framework. The following subsections discuss some of its key components.
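The following sketch illustrates, under stated assumptions, how the service mediation layer might use the selected-services token for content-based routing: the token maps each activity of the business process to the provider chosen by the broker, and each request is forwarded accordingly. The token layout, the route function, and the endpoints are all hypothetical names introduced for this example.

```python
def route(activity_id, request, token, endpoints):
    """Forward a request for one activity to the provider the broker selected."""
    provider_id = token[activity_id]              # selection made by the QoS Broker
    endpoint = endpoints[(activity_id, provider_id)]
    return endpoint(request)                      # invoke the concrete service

token = {1: 3, 2: 1, 3: 7}                        # activity -> selected provider id
endpoints = {(1, 3): lambda req: f"validated:{req}",
             (2, 1): lambda req: f"checked:{req}",
             (3, 7): lambda req: f"reserved:{req}"}
print(route(1, "order-42", token, endpoints))     # validated:order-42
```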
Computation of End-to-End QoS Metrics

We assume that service providers specify their response time through a probability distribution function, which could be obtained by fitting observed service provider data to a distribution. The QoS metrics for availability and throughput are assumed to be deterministic. A service provider may need to dynamically allocate resources to meet its agreed upon QoS goals using autonomic computing techniques (Bennani & Menascé, 2005). It is also assumed that service providers will charge more to provide better QoS values since they need to deploy more resources to meet these goals.
Figure 1. The logical architecture of the QoS Broker framework
The cost functions are assumed to be monotonically decreasing for response time and monotonically increasing for throughput and availability. The total execution cost of the business process is the sum of the costs due to response time, availability, and throughput. The first step in solving the service provider allocation is to obtain expressions for the end-to-end QoS and total cost of a business process expressed in the Business Process Execution Language (BPEL). We addressed this problem in (Menascé, Casalicchio, et al., 2010) for end-to-end execution time and cost, where we considered the execution time as non-deterministic and showed how to compute the end-to-end average execution time for business processes. Expressions for the end-to-end availability and throughput of business processes with common workflow constructs such as sequence, switch, and flow can be obtained by parsing BPEL files using the SAX parser and using the algorithms discussed in (Dubey & Menascé, 2010).
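A small sketch of the assumed cost model follows; the concrete functional forms below are illustrative assumptions that merely respect the stated monotonicity (decreasing in response time, increasing in availability and throughput), with the total execution cost being the sum over the selected providers.

```python
def provider_cost(R, A, X, c_r=10.0, c_a=5.0, c_x=0.1):
    """Illustrative per-provider cost; coefficients c_r, c_a, c_x are assumptions."""
    cost_response = c_r / R          # faster service (smaller R) costs more
    cost_availability = c_a * A      # higher availability costs more
    cost_throughput = c_x * X        # higher throughput costs more
    return cost_response + cost_availability + cost_throughput

def total_cost(selection):
    """Total execution cost: sum over the selected providers' (R, A, X) values."""
    return sum(provider_cost(R, A, X) for (R, A, X) in selection)

print(total_cost([(1.5, 0.9887, 67.0), (0.7, 0.9989, 143.0)]))
```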
A Heuristic Algorithm for Near Optimal Service Selection

The heuristic service selection used in the QoS Broker framework is based on the well-known hill-climbing search technique (Russell & Norvig, 2003). The technique at its core has two basic steps: 1) define a neighborhood of the point currently being visited and 2) move to the best point in the neighborhood. The process continues until a near-optimum solution is found given a stopping criterion (e.g., a limit on the number of steps is reached or no better solution can be found). The hill-climbing based heuristic service selection algorithm for business processes involves the following steps:

1. Identify the solution space, which depends on the available service providers, their cost and QoS levels, as well as on QoS and cost constraints from consumers. Each point in the solution space is a possible service selection option for the business process.
2. Select an initial point in the solution space to visit by randomly selecting services for each activity of the business process.
3. Define the neighborhood of the point being visited. An approach to define the neighborhood is described below.
4. Find a point in the neighborhood that maximizes an objective function (a utility function in our case).
5. If such a point exists, make it the next point to visit; otherwise, stop.
6. Go to step 3 if the stopping criterion (e.g., a limit on the number of points to visit) has not been reached; otherwise, stop.

We now describe how we define the neighborhood to obtain a very efficient service selection algorithm. Assume that we are currently visiting point z0. We define three sets of candidate service selections for the business process with respect to three QoS metrics: response time, availability, and throughput (Figure 2). Each set is obtained by considering each activity of the business process and replacing the current service provider for that activity by a service provider that provides the best possible improvement with respect to the
metric associated with the set. The neighborhood of point z0 is the union of these sets. Although we implemented our approach using a utility function based on three QoS metrics, the approach is general and can easily include other QoS metrics. We then compute the utility for each point in the neighborhood and move the search to the point with the highest utility in the neighborhood. If no point in the neighborhood has a utility higher than that of the currently visited point, the search ends. Since a hill climbing-based search technique may stop prematurely at a local optimum, we use several random restarts and select the best allocation among all restarts. The number of random restarts depends on the computation budget established for the search. We used 10 random restarts in our experiments.
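The sketch below shows the overall restart hill-climbing structure under stated assumptions; for brevity it uses a simplified single-swap neighborhood rather than the three per-metric best-improvement sets of Figure 2, and the providers and utility inputs are placeholders supplied by the caller.

```python
import random

def neighborhood(selection, providers):
    """Single-activity moves: for each activity, try every alternative provider."""
    for a in range(len(selection)):
        for p in range(len(providers[a])):
            if p != selection[a]:
                yield selection[:a] + [p] + selection[a + 1:]

def hill_climb(providers, utility, restarts=10):
    """providers[a] lists candidate providers for activity a; utility scores a
    full selection (a list of provider indices, one per activity)."""
    best, best_u = None, float("-inf")
    for _ in range(restarts):                # random restarts to escape local optima
        current = [random.randrange(len(p)) for p in providers]  # step 2
        current_u = utility(current)
        while True:
            # steps 3-5: move to the best neighbor, stop at a local optimum
            cand = max(neighborhood(current, providers), key=utility, default=None)
            if cand is None or utility(cand) <= current_u:
                break
            current, current_u = cand, utility(cand)
        if current_u > best_u:
            best, best_u = current, current_u
    return best, best_u
```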
QoS Monitor

The QoS Monitor serves as a coordinator in the QoS Broker framework. It receives requests from consumers for business process execution, initiates the service selection process to find a near-optimal selection for the business process, and sends the business process execution request to the business process engine.
Figure 2. Identifying neighbors for the service selection algorithm
It also monitors the execution of the business process and its QoS metric levels, as well as the resulting utility values, by evaluating the utility functions. If the utility level of the running business process goes below a certain threshold due to occasional service degradation, or if a service becomes unavailable due to some unforeseen condition, the monitor triggers an event to obtain a new service selection for the business process, excluding the distressed service provider(s). The new service selection is then used for subsequent business process execution requests, bringing the utility of the running business process back to an acceptable range. The process described above for near-optimal service selection is used when the utility falls below a specified threshold. Service providers that are unavailable or whose performance has degraded with respect to their advertised performance SLAs are removed from the search. These service providers are included in future searches as soon as they recover or their performance improves.
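A compact sketch of the monitor's reselection rule described above follows; the function and data shapes (a stream of (utility, failed_provider) observations and a reselect callback) are assumptions introduced for illustration.

```python
def monitor(executions, u_threshold, reselect, window=50):
    """Watch BP executions; reselect services on failure or utility degradation."""
    recent, excluded = [], set()
    for utility, failed in executions:  # each item: (utility, failed provider or None)
        recent = (recent + [utility])[-window:]
        avg_u = sum(recent) / len(recent)   # running average of the utility
        if failed is not None:
            excluded.add(failed)            # drop the failed provider from the search
            reselect(excluded)
        elif avg_u < u_threshold:
            reselect(excluded)              # utility degradation triggers reselection
        # excluded providers would re-enter future searches once they
        # recover or their performance improves (not shown here)
```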
Business Process Engine and Service Mediation

The business process engine receives requests from the QoS monitor and provides the infrastructure to execute the business process. It uses a service mediation module to route requests to the selected service providers using content-based routing. The target service identifiers for different activities are contained in the service allocation z, which is part of the request. The engine orchestrates service invocations in a logical order that represents the sequencing of the activities in the business process. The business process engine catches exceptions thrown by service providers resulting from the failures of the participating services. As a result, the business process engine aborts the execution of the business process and sends a fault message to the QoS monitor discussed above.

Service Providers

Services from service providers contain operations that can be used to implement individual steps of business processes. Service providers register their services, as well as QoS and cost functions, with the QB. Knowledge of service providers' performance metrics is essential for QoS management in SOA. Different approaches can be used to measure the performance metrics of service providers. For example, service providers may provide different distributions (probability distribution functions and cumulative distribution functions) of QoS metrics (e.g., response time) and associated cost functions. QoS distributions could be obtained by service providers themselves by fitting distributions to historical log data, or by external QoS Brokers that monitor service providers at regular intervals and fit the data to distributions.

Consumers

Consumers are the entities that consume the QoS brokering services provided by the QB. They register their business processes for brokering with the QB and send requests for the optimal service selection as well as for the execution of business processes under QoS and cost constraints. Consumers may also provide their utility functions of QoS metrics to be optimized as a service selection criterion. An optimal service selection for a business process could be used to execute the business process multiple times.
Proof of Concept Prototype

We developed a proof-of-concept prototype to demonstrate the feasibility of the QoS Broker framework. We simulated the execution of a business process and used random delays to simulate performance degradation and service failures of service providers. We implemented a QoS Monitor to monitor the execution of business processes
as well as the performance of the service providers participating in the business process execution. The QoS monitor measures performance metrics of the running business process and uses them to compute the running averages of the utility levels resulting from the BP executions. If the utility falls below a certain threshold as a result of performance degradation of the service providers, or due to lower availability as a result of service or business process failures, the monitor coordinates with the QB to obtain another allocation of the services and uses the new allocation for subsequent executions of the business process. This enables the QoS monitor to bring the utility of the business process executions back above the threshold. We provided a dummy implementation of services, which typically respond after some delay according to the published response time metric, R, of the service provider. To simulate service failures, services use random number generators that generate real numbers r uniformly distributed between 0 and 1. If r ≤ P_Failure, where P_Failure = 1 − A (the availability of the SP), then the service does not reply. Instead, it logs an error message and throws a Service-Failed exception to the business process engine, which in turn catches and handles the exception thrown by this service provider. The experiment also considers a warm-up period during which the services respond normally and there are no service degradations or failures. After the warm-up period, a random degradation is added on top of the published QoS information. As a result, services respond after a wait time equal to R + degradation.

Figure 3. A business process example
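Under the stated simulation rules, a dummy service can be sketched as below; the names (ServiceFailed, invoke_dummy_service) and the degradation argument are illustrative, and random.expovariate draws the exponentially distributed response time with mean E[R].

```python
import random

class ServiceFailed(Exception):
    """Raised in place of a reply when the simulated service fails."""
    pass

def invoke_dummy_service(mean_r, availability, degradation=0.0):
    """Return the simulated response delay, or raise ServiceFailed."""
    if random.random() <= 1.0 - availability:   # r <= P_Failure = 1 - A
        raise ServiceFailed("service did not reply")
    delay = random.expovariate(1.0 / mean_r)    # exponential with mean E[R]
    return delay + degradation                  # degradation is 0 during warm-up
```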
Business Process Used

The business use case used in this experiment represents a fictitious Travel Reservation Business Process.
Figure 3 depicts the business process, which contains nine activities that are executed in a certain logical workflow. The business process starts after receiving a Travel Reservation request from a customer. It is assumed that the traveler fills out travel details using an online user interface. The first activity of the business process validates the information contained in the request, such as valid departure and destination cities, travel dates, and credit card expiration. After validating the request, the business process checks whether the traveler is in the Federal No-Fly-List database. If found on the No-Fly-List, then the request is routed to an activity that notifies the Transport Security Administration (TSA) and sends a message to the customer. However, if the traveler is not on the Federal No-Fly-List, the business process checks for the availability of airline seat(s). Assuming the response is positive, the business process invokes the airline reservation activity. Then, it invokes the hotel and rental car reservation activities, which are executed in parallel. The results of the airline, hotel, and car rental reservations are then routed to an activity that generates an invoice followed by payment processing using the customer’s credit card information. Finally, the business process sends a confirmation message to the customer.
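For illustration, the travel reservation workflow of Figure 3 can be encoded with the same sequence/switch/flow tree representation used in the earlier QoS sketch, here with activity placeholders instead of concrete provider QoS tuples; the activity names follow the description above, while the encoding itself is an assumption of this example.

```python
# The nine activities of the travel reservation business process.
travel_bp = ("sequence", [
    ("activity", "ValidateRequest"),
    ("switch", [  # branch on the Federal No-Fly-List check
        ("activity", "NotifyTSAAndCustomer"),        # traveler found on the list
        ("sequence", [                               # traveler not on the list
            ("activity", "CheckSeatAvailability"),
            ("activity", "ReserveAirline"),
            ("flow", [                               # executed in parallel
                ("activity", "ReserveHotel"),
                ("activity", "ReserveRentalCar"),
            ]),
            ("activity", "GenerateInvoice"),
            ("activity", "ProcessPayment"),
            ("activity", "SendConfirmation"),
        ]),
    ]),
])
```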
Experimental Results

We conducted a large number of experiments to assess the efficacy of our service selection heuristic algorithm using randomly generated graphs. The results are discussed in (Dubey & Menascé, 2010). For the experiments reported here, we used the Travel Planner business process described in the previous section. This business process has nine activities, where each activity is supported by seven service providers with equivalent functionality. The QoS characteristics of the service providers are listed in Table 1. There are two sets of five columns in the table. Each set
contains the values of the activity id (ai), the id of an SP that supports that activity, the average response time E[R] of the service provider, its availability A, and its throughput X. The response time follows an exponential distribution with the average given by E[R]. For the sake of this experiment, the service providers' services were implemented as dummy services using Java threads, which respond after an explicit delay following the exponential distributions. The experiment was run for 30% constraint strength, which means that we used a service selection criterion that gives 30% better performance in all QoS categories, but at 30% less cost, compared to the most relaxed constraints. The global utility function used for this experiment is the same as discussed in (Dubey & Menascé, 2010). To compute the utility threshold (UThreshold) levels, we ran experiments without performance degradation or service failures and computed the average global utility from 550 executions of the business process, using a near-optimal service selection. UThreshold was then chosen to be a certain percentage point below the average global utility. We used three levels of UThreshold, ranging from a larger utility deviation tolerance to a smaller one: 10% below the average utility, 4% below the average utility, and 2% below the average utility. For each threshold level, the experiment was run 20 times, where each run consisted of 550 executions of the business process. After a ramp-up period of 50 executions in each run, performance metrics such as the cumulative moving averages of the execution times, availability, and throughput, as well as the cumulative moving average of the resulting utility, were computed from executions of the business process. Subsequently, 95% confidence intervals for the average utility were computed from the observations of the 20 runs, using the t-distribution. We also tracked when service reselections took place during the BP executions as a result of service failures or when running averages of the global utility went below the threshold.
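The confidence-interval computation mentioned above can be sketched as follows, assuming SciPy is available for the t-distribution critical value; the sample utilities are made-up numbers for illustration.

```python
import statistics
from scipy import stats

def t_confidence_interval(samples, confidence=0.95):
    """Two-sided confidence interval for the mean, using the t-distribution."""
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / n ** 0.5       # standard error of the mean
    t = stats.t.ppf((1 + confidence) / 2, df=n - 1)  # two-sided critical value
    return mean - t * sem, mean + t * sem

# One utility observation per run, for 20 runs (illustrative values).
runs = [0.86, 0.84, 0.87, 0.85, 0.88, 0.83, 0.86, 0.85, 0.87, 0.84,
        0.86, 0.85, 0.88, 0.84, 0.86, 0.85, 0.87, 0.83, 0.86, 0.85]
print(t_confidence_interval(runs))
```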
Table 1. Sets of service providers for the implementation of business activities a1–a9

ai  SP  E[R]  A       X      |  ai  SP  E[R]  A       X
1   1   1.5   0.9887  67.0   |  2   1   1.0   0.9984  100.0
1   2   2.0   0.9445  50.0   |  2   2   1.5   0.9893  67.0
1   3   2.5   0.9398  40.0   |  2   3   2.0   0.9795  50.0
1   4   3.0   0.9394  33.0   |  2   4   2.5   0.9600  40.0
1   5   3.5   0.9347  29.0   |  2   5   3.0   0.9527  33.0
1   6   4.0   0.9338  25.0   |  2   6   3.5   0.9497  29.0
1   7   4.5   0.9286  22.0   |  2   7   4.0   0.9295  25.0
3   1   2.0   0.9859  50.0   |  4   1   0.7   0.9989  143.0
3   2   2.5   0.9807  40.0   |  4   2   1.2   0.9973  83.0
3   3   3.0   0.9776  33.0   |  4   3   1.7   0.9903  59.0
3   4   3.5   0.9657  29.0   |  4   4   2.2   0.9724  45.0
3   5   4.0   0.9647  25.0   |  4   5   2.7   0.9693  37.0
3   6   4.5   0.9622  22.0   |  4   6   3.2   0.9688  31.0
3   7   5.0   0.9606  20.0   |  4   7   3.7   0.9675  27.0
5   1   0.8   0.9984  125.0  |  6   1   2.3   0.9849  43.0
5   2   1.2   0.9677  83.0   |  6   2   2.7   0.9563  37.0
5   3   1.5   0.9613  67.0   |  6   3   3.1   0.9535  32.0
5   4   1.8   0.9609  56.0   |  6   4   3.5   0.9387  29.0
5   5   2.1   0.9453  48.0   |  6   5   3.9   0.9282  26.0
5   6   2.4   0.9245  42.0   |  6   6   4.3   0.9221  23.0
5   7   2.7   0.9209  37.0   |  6   7   4.7   0.9184  21.0
7   1   1.9   0.9886  52.6   |  8   1   1.4   0.9898  71.4
7   2   2.4   0.9793  41.6   |  8   2   1.9   0.9695  52.6
7   3   2.9   0.9490  34.4   |  8   3   2.4   0.9509  41.6
7   4   3.4   0.9408  29.4   |  8   4   2.9   0.9508  34.5
7   5   3.9   0.9396  25.6   |  8   5   3.4   0.9431  29.4
7   6   4.4   0.9198  22.7   |  8   6   3.9   0.9418  25.6
7   7   4.9   0.9161  20.4   |  8   7   4.4   0.9383  22.7
9   1   2.4   0.9821  41.6   |  9   2   2.9   0.9654  34.5
9   3   3.4   0.9624  29.4   |  9   4   3.9   0.9598  25.6
9   5   4.4   0.9596  22.7   |  9   6   4.9   0.9516  20.4
9   7   5.4   0.9504  18.5   |
Figure 4. Variation of the average utility under three thresholds

Figures 4 to 7 document the results of the experiments conducted for the three utility threshold levels. The results shown in Figures 4 and 5 depict three utility lines for each of the 550 executions of the business process. The lower horizontal line is the UThreshold line. The upper one is the cumulative moving average of BP executions without any service failures or performance degradations, called the normal condition utility line for ease of discussion in this chapter. The middle curve shows the cumulative moving average of the BP executions when there are random service failures and/or performance degradations of service providers. Figure 4 shows results for experiments run with a utility threshold 10%, 4%, and 2%, respectively, below the average utility. For the experiment with the utility threshold 10% below average, the
normal condition utility was measured to be around 0.88, as shown by the upper horizontal line, and the lower horizontal line indicates the UThreshold line, which is 10% below the average utility level and is set at 0.79. For the other two experiments, UThreshold was 4% and 2% below the average, corresponding to about 0.84 and 0.86, respectively, on the utility axis. The figure shows the variation of the cumulative moving averages of the global utility Ug for each threshold level, for 550 executions of the business process. The graphs show that the QB was able to reselect services for the business process at runtime when service failures occurred or when the average utility of the BP execution was measured below the threshold, as depicted by the middle curves. The service selection locations are indicated by vertical gray lines. The intersections of the vertical lines with the execution numbers on the horizontal axis further highlight the execution points where service selections took place. Horizontal axis labels start from 101; the utility observation at the 101st execution is an average of the last 50 executions (from 51 through 100). No observations were taken during the ramp-up period of the first 50 executions of the business process. The graphs show that the QoS Monitor was able to ensure that the utility of the BP executions stayed above the threshold via service selection and reconfigurations of the business process at runtime.

Figure 5. Variation of the utility against three thresholds (shown with 95% confidence intervals)
Figure 6. Average cost of business process execution
Figure 7. Service selection due to failures and performance degradation
Figure 5 shows the average variation of the cumulative moving average of the global utility Ug computed for 20 runs for each of the 550 executions of the business process (shown between the normal condition utility line and the utility threshold line). The variation in Ug was computed for all three threshold levels when there were service failures
and performance degradation. This line also shows the 95% confidence intervals computed from the measurements of 20 runs using the t-distribution. The results show that at the 95% confidence level, the QB was able to support the runtime service selection and configuration of the business process to ensure that the cumulative moving average of
the global utility of BP executions stayed above the threshold line. The graph also shows that at the beginning of the executions, the running averages are close to the normal condition utility line. As the number of executions increases, the running average utility decreases but still stays above the threshold line. The decrease in utility can be explained by the cumulative performance degradation and failures at service providers as the number of BP executions increases. We also measured the average cost and number of service selections for the experiments at each utility threshold level. Figure 6 shows that as the threshold line tightens from 10% to 2% below average, there is a tendency for the cost of service provider selections to increase; i.e., the QB selects services that may be more expensive because they offer better quality of service. It should be noted that the 95% confidence intervals shown in the figure overlap. Thus, the difference in the average cost of services for the three categories may not be statistically significant. However, as shown in Figure 7, the differences in service selection counts due to failures and degradations are very pronounced. That is, when the threshold tightens from 10% below average to 2% below average, the number of service selections due to failures decreases significantly. This is expected, as the services selected for the 2% below average threshold may have higher quality of service (e.g., higher availability), resulting in a smaller probability of failures. At the same time, the number of service selections due to performance and utility degradation increases significantly from the 10% to the 2% below average threshold. This is also expected, as at the 2% below average threshold, degradation in the cumulative moving average of the utility will hit the threshold more often than in the case of 10% below average.
FUTURE RESEARCH DIRECTIONS

The results above are based on the assumption that failed services recover instantly, which is not an unreasonable assumption for environments that have an infrastructure for automated failover of services. However, as part of our ongoing research activities, we are investigating the behavior of runtime QoS management when service recovery may not be instantaneous. Also, it may not be feasible to have a single QB for a very large distributed system. A centralized broker represents a single point of failure and a potential bottleneck. Distributed architectures (e.g., P2P or hierarchical arrangements) need to be investigated.
CONCLUSION

This chapter described the architecture of a QoS Broker framework that embeds service provider selection, run-time monitoring, QoS goal verification, and service resilience. This also includes methods to compose QoS metrics for business processes in SOA. As part of the framework, we described an efficient heuristic algorithm that obtains a near-optimal solution at a small fraction of the computational cost in terms of the number of points examined in the solution space and computation time. We also described a proof-of-concept prototype developed to demonstrate the runtime QoS monitoring and management of the business process and service providers in SOA, using the heuristic service selection algorithms discussed in (Dubey & Menascé, 2010) and also briefly described in this chapter. The heuristic described here was shown to scale linearly with the number of service providers per activity of the business process (Dubey & Menascé, 2010). We simulated the execution of a business process and used random delays to simulate
performance degradation and service failures. We implemented a QoS Monitor to monitor the execution of business processes as well as the performance of the service providers participating in the business process execution. The Monitor computed cumulative moving averages of the utility levels of business process executions and coordinated with the QB to obtain new configurations of services when services failed or when the utility of the running business process went below a certain threshold. Based on extensive experiments, we showed with 95% confidence that the QB was able to support service selection at run time in the presence of failures and performance degradation to ensure that the mean of the cumulative moving averages of the utility stayed above a pre-established threshold.
ACKNOWLEDGMENT

This work was partially supported by Grant CCF-0820060 from the National Science Foundation.
REFERENCES

Ardagna, D., & Pernici, B. (2006). Dynamic Web service composition with QoS constraints. International Journal of Business Process Integration and Management, 1(3), 233–243. doi:10.1504/IJBPIM.2006.012622

Ardagna, D., & Pernici, B. (2007). Adaptive service composition in flexible processes. IEEE Transactions on Software Engineering, 33(6), 379–384. doi:10.1109/TSE.2007.1011

Bennani, M. N., & Menascé, D. A. (2005). Resource allocation for autonomic data centers using analytic performance models. In K. Schwan & Y. Wang (Eds.), ICAC '05: The Second IEEE International Conference on Autonomic Computing, 229–240.

Berbner, R., Spahn, M., Repp, N., Heckmann, O., & Steinmetz, R. (2006). Heuristics for QoS-aware web service composition. Proceedings of the IEEE International Conference on Web Services, 72–82.

Canfora, G., Di Penta, M., Esposito, R., & Villani, M. L. (2005). An approach for QoS-aware service composition based on genetic algorithms. GECCO '05: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, 1069–1075.

Canfora, G., Di Penta, M., Esposito, R., & Villani, M. L. (2008). A framework for QoS-aware binding and rebinding of composite Web services. Journal of Systems and Software, 81(10), 1754–1769. doi:10.1016/j.jss.2007.12.792

Cardellini, V., Casalicchio, E., Grassi, V., & Lo Presti, F. (2007). Flow-based service selection for Web service composition supporting multiple QoS classes. Proceedings of the IEEE International Conference on Web Services, 743–750.

Cardoso, J., Sheth, A., Miller, J., Arnold, J., & Kochut, K. K. (2004). Quality of Service for workflows and web service processes. Journal of Web Semantics, 1(3), 281–308. doi:10.1016/j.websem.2004.03.001

Dubey, V., & Menascé, D. A. (2010). Utility-based optimal service selection for business processes in service oriented architectures. ICWS '10: Proceedings of the 2010 IEEE International Conference on Web Services, 542–550.

Gjorven, E., Rouvoy, R., & Eliassen, F. (2008). Cross-layer self-adaptation in service-oriented architectures. MW4SOC '08: Proceedings of the 3rd Workshop on Middleware for Service Oriented Computing, 37–42.

Jaeger, M., Muhl, G., & Golze, S. (2005). QoS-aware composition of web services: A look at selection algorithms. Proceedings of the IEEE International Conference on Web Services, 1–2.
Kephart, J. O., & Das, R. (2007). Achieving self-management via utility functions. IEEE Internet Computing, 11(1), 40–48. doi:10.1109/MIC.2007.2

Kyriakos, K., & Plexousakis, D. (2009). Mixed-integer programming for QoS-based Web service matchmaking. IEEE Transactions on Services Computing, 2(2), 122–139. doi:10.1109/TSC.2009.10

Lecue, F., & Mehandjiv, N. (2009). Towards scalability of quality driven semantic Web service composition. ICWS '09: Proceedings of the 2009 IEEE International Conference on Web Services, 469–476.

Menascé, D. A., Casalicchio, E., & Dubey, V. (2008). A heuristic approach to optimal service selection in service oriented architectures. WOSP '08: Proceedings of the 7th International Workshop on Software and Performance, 13–24.

Menascé, D. A., Casalicchio, E., & Dubey, V. (2010). On optimal service selection in Service Oriented Architectures. Performance Evaluation, 67(8), 659–675.

Menascé, D. A., & Dubey, V. (2007). Utility-based QoS brokering in service oriented architectures. Proceedings of the 2007 IEEE International Conference on Web Services, 422–430.

Menascé, D. A., Ewing, J., Gomaa, H., Malek, S., & Sousa, J. (2010). A framework for utility-based service oriented design in SASSY. First Joint WOSP-SIPEW International Conference on Performance Engineering, 27–36.

Rosario, S., Benveniste, A., Haar, S., & Jard, C. (2008). Probabilistic QoS and soft contracts for transaction-based Web services orchestrations. IEEE Transactions on Services Computing, 1(4), 187–200. doi:10.1109/TSC.2008.17
Russell, S. J., & Norvig, P. (2003). Artificial intelligence: A modern approach (2nd ed.). Upper Saddle River, New Jersey: Prentice Hall.

Salehie, M., & Tahvildari, L. (2009). Self-adaptive software: Landscape and research challenges. ACM Transactions on Autonomous and Adaptive Systems, 4(2).

Santhanam, G. R., Basu, S., & Honavar, V. (2008). On utilizing qualitative preferences in web service composition: A CP-net based approach. SERVICES '08: Proceedings of the 2008 IEEE Congress on Services - Part I, 538–544.

Serhani, M. A., Dssouli, R., Hafid, A., & Sahraoui, H. (2005). A QoS broker based architecture for efficient web service selection. ICWS '05: Proceedings of the IEEE International Conference on Web Services, 113–120.

Sheng, Q. Z., Benatallah, B., Maamar, Z., & Ngu, A. H. H. (2009). Configurable composition and adaptive provisioning of Web services. IEEE Transactions on Services Computing, 2(1). doi:10.1109/TSC.2009.1

Singh, M. P. (2004). A framework and ontology for dynamic web services selection. IEEE Internet Computing, 8(5), 84–93. doi:10.1109/MIC.2004.27

Tesauro, G., Das, R., Walsh, W. E., & Kephart, J. O. (2005). Utility-function-driven resource allocation in autonomic systems. ICAC '05: Proceedings of the Second International Conference on Autonomic Computing, 342–343.

Yau, S. S., Ye, N., Sarjoughian, S., Huang, D., Roontiva, A., Baydogan, M. G., & Muqsith, M. A. (2009). Toward development of adaptive service-based software systems. IEEE Transactions on Services Computing, 2(3). doi:10.1109/TSC.2009.17
Ye, X., & Mounla, R. (2008). A hybrid approach to QoS-aware service composition. ICWS '08: Proceedings of the 2008 IEEE International Conference on Web Services, 62–69.

Yu, Q., & Bouguettaya, A. (2010). Guest editorial: Special section on query models and efficient selection of Web services. IEEE Transactions on Services Computing, 3(3), 161–162. doi:10.1109/TSC.2010.43

Yu, T., Zhang, Y., & Lin, K.-J. (2007). Efficient algorithms for Web services selection with end-to-end QoS constraints. ACM Transactions on the Web, 1(1), 1–26. doi:10.1145/1232722.1232728

Zeng, L., Benatallah, B., Ngu, A. H. H., Dumas, M., Kalagnanam, J., & Chang, H. (2004). QoS-aware middleware for web services composition. IEEE Transactions on Software Engineering, 30(5), 311–327. doi:10.1109/TSE.2004.11
KEY TERMS AND DEFINITIONS

Business Process: A collection of activities connected to one another in a certain workflow to address the business needs of an enterprise.
Service Provider Selection: A set of service providers that support all activities of a business process.

Optimal Service Provider Selection: A service provider selection that optimizes some objective function (e.g., global utility), possibly subject to some constraints.

QoS Broker: A broker that serves as an intermediary between consumers of business processes and candidate service providers. It provides a near-optimal service provider selection for business processes by examining a tiny fraction of the solution space, and it also provides the run-time monitoring and management of the QoS of the business processes.

Service Oriented Architecture: A loosely coupled approach to enterprise computing using reusable services. Such services can easily be composed to support dynamic and flexible business processes and bring agility to the enterprise.

QoS Metric: A performance metric, such as response time, availability, or throughput.

Utility Function: A function that measures the usefulness of a service or business process to a consumer in terms of various QoS metrics. Such a utility function can be used as the objective function to determine a near-optimal selection of service providers to execute a business process.
Chapter 7
High-Quality Business Processes Based on Multi-Dimensional QoS

Qianhui Liang, HP Labs, Singapore
Michael Parkin, Tilburg University, The Netherlands
ABSTRACT

An important area of services research gathering momentum is the ability to take a generic business process and instantiate it by selecting services that meet both the functional and non-functional requirements of the process owner. These non-functional or quality-of-service (QoS) requirements may describe essential performance and dependability requirements and apply across different logical layers of the application, from business-related details to system infrastructure; i.e., they are cross-cutting and considered multi-dimensional. Configuring an abstract business process with the "best" services to meet the process owner's multi-dimensional end-to-end QoS requirements is a challenging task, as there may be many services that match the functional requirements but provide differentiated QoS characteristics. In this chapter we explore an approach to discovering services, differentiated by their QoS attributes, to configure an abstract business process by selecting an optimal configuration of the "best" QoS combinations. The approach considered takes into account the optimal choice of multi-dimensional QoS variables. We present and compare two solutions based on heuristic algorithms to illustrate how this approach would work in practice.
DOI: 10.4018/978-1-60960-794-4.ch007

INTRODUCTION

When a business analyst designs a new business process or performs refactoring of an existing process, determining the dependability and reliability of the final end-to-end composition of operations and services that make up the process is of paramount importance; without considering aspects of the business process such as the maximum throughput, the time required to complete the process, or its availability, the process might not meet the expectations of the business or
their customers. Therefore, an abstract business process, which we consider to be a collection of interrelated, structured activities or tasks, must be designed not only with coherent and functionally compatible services but also with a set of qualities of service (QoS) that will fulfill the necessary business objectives. This chapter focuses on the dependability of business processes by considering how an abstract business process is deployed with a selection of services from the Internet of Services that meets the desired end-to-end quality of service (Cardoso et al., 2009) from the process's system-level perspective (Liang et al., 2008) (Liang et al., 2009). The challenge with operating a business process with end-to-end quality of service like this is that no methodology exists for determining resource allocation from the many autonomous service providers available through the Internet of Services. This chapter demonstrates how an approach based on a multi-criteria analysis can be used to decide on the combination of services to select. Later, we describe how we implemented this decision-making technique using two optimization techniques (simulated annealing and genetic algorithms) and carry out an experimental evaluation of the performance of each technique. However, we first present an overview of the context and issues involved and describe the proposed high-level approach before introducing the decision-making techniques.
Differentiated Services

Using business enterprise architecture and service-orientation, an abstract business process designed by a business analyst can be deployed over one or more software services, or a set of related software functionalities, together with the policies that should control their use. OASIS defines a service as "a mechanism to enable access to one or more capabilities, where the access is provided using a prescribed interface
and is exercised consistent with constraints and policies as specified by the service description" (MacKenzie, Laskey, McCabe, Brown & Metz, 2006). Services are differentiated from each other by their functional behavior (what they do) and non-functional characteristics (how well they do it), such as service cost (Papazoglou et al., 2003), availability (Liang et al., 2008b) and security. Such services are opaque in the sense that their implementation is hidden from service consumers except for service description, policy and invocation information and behavioral characteristics exposed through the relevant service descriptions (Helland, 2007) (Andrikopoulos, Fairchild, van den Heuvel, Kazhamiakin, Leitner, Metzger, Nemeth, di Nitto, Papazoglou, Pernici & Wetzstein, 2008). The benefit of software services like this is that process owners (stakeholders responsible for the operation of the business process, who may also be business analysts) can realize abstract business processes by composing third-party services provided via the internet rather than by developing and maintaining their own software, which may be costly or impractical. As described, the services used to implement an abstract business process are differentiated from each other by their functional behavior and non-functional or quality of service (QoS) characteristics. The QoS characteristics of a service, either requested, advertised or agreed (possibly in a Service Level Agreement, or SLA), may be for any logical level of the service-based application, from the (logically highest) business-level application layer to the (lowest) system-level infrastructure. This is depicted in Figure 1, which illustrates in Business Process Modeling Notation (BPMN) (Object Management Group, 2009) an abstract end-to-end business process made up of logistics, payment and billing processes (shown as rounded rectangles with dash-dotted borders). These processes contain five component services (rounded rectangles with solid borders) with QoS at two levels of the application: business-related QoS properties, such as on-time delivery deadlines
and number of active orders, and system-related QoS properties, like process cycle times and process throughput. When a request, advert or agreement for a service contains multiple QoS attributes like this, we refer to the QoS as multidimensional QoS. Using services differentiated by their QoS when configuring an abstract business process into an instantiated business process, or when re-configuring an existing (instantiated) business process to meet an ongoing SLA, allows the process owner to create or configure the business process to meet a desired cumulative QoS for the end-to-end business process. Cumulative QoS is defined as a cumulative result of multiple services to be integrated into a service-based application in order to meet the required end-to-end, total QoS for the entire process. When considering the possibly thousands of services from the Internet of Services, finding the
optimum configuration of services to meet the multi-dimensional and cumulative QoS requirements for the instantiated business process is a challenging task; what is the best selection? This chapter focuses on the selection of services based on their qualities of service, ensuring the quality of business processes by establishing that they meet the desired end-to-end cumulative quality of service across multiple dimensions.
Chapter Objectives

The objectives of this chapter are to review solutions to the problem of differentiated service selection for business processes. This chapter will then propose an approach to formalizing a method for optimizing the 'fitness' of an end-to-end service aggregation by representing it as a function over multiple quality dimensions correlated with the provided QoS of all its component services.
Figure 1. Abstract Business Process Example showing Multidimensional QoS
In presenting this approach we also present a formalization of the constraints associated with QoS parameters and show how we can distinguish between hard and soft constraints.
Approach

In this case, the problem can be described generally as a situation where services should be selected from a collection of services differentiated by their functional behavior and non-functional characteristics. The selection process should result in a service composition that is functionally compatible (i.e., the behavior of collaborating services, their inputs and outputs are all correct) and which also has the desired cumulative end-to-end quality of service. Choosing services to instantiate a business process by matching their functional capabilities with the requirements expressed in the abstract business process is well understood and not covered in this chapter (the interested reader is directed to the section on further reading for references). The services found in this selection process provide an initial set of choices to instantiate the abstract business process. However, the difficulty
in selecting services based on their non-functional characteristics comes from the fact that there will be many possible solutions (i.e., combinations of services) and outcomes in terms of the performance of the instantiated business process. There are many approaches to solving this problem, as described in the related work section below. The general approach we have concentrated on is to evaluate combinations of services for inclusion in the instantiated business process through the calculation of a combination's utility, or value to the owner of the process.
General Service Selection Process

To illustrate our selection approach, Figure 2 shows in pseudo-flowchart notation how a request for a selection of services to populate an abstract business process may be processed. (Note that the abstract business process and service metadata repository are assumed to have been created and populated, possibly using a service monitoring approach like the one described by Wetzstein, Leitner, Rosenberg, Brandic, Dustdar & Leymann (2009), before the selection process starts.)

Figure 2. Operational Framework for Configuring a Business Process from an End-User Request
The process owner, the stakeholder who will be responsible for the process, forms a request based on the abstract business process designed by the business analyst, as described earlier. The formulation of the request involves annotating the abstract business process with the desired end-to-end multidimensional QoS requirements. The annotated request can be seen as a constraint on all possible combinations of services that are compatible with the fixed functional composition and integration aspects determined by the structure of the abstract business process. To allow even greater flexibility for the process owner, the QoS request for the configured application can contain 'hard' constraints the application must conform to and 'soft' preferences describing qualities the process owner would like the running process to have. Simple examples of individual constraints for the business process could be that when an application receives an order it should notify the customer, before a certain number of time units have elapsed, that the order has been processed successfully (or not), or that an invoice is sent within another number of time units. An example of a simple QoS preference for the business process could be that the system processes an order in the minimum amount of time. As described above, the QoS requirements for an abstract business process are often multidimensional and can contain multiple constraints and/or preferences, such as that the running business process is configured to process at least 100 messages/sec (a system-level constraint), process orders at a rate of at least 250 orders/hour with a maximum of 10 defective orders/hour (both business-level constraints), and/or that the process should be configured with services that provide the highest amount of bandwidth possible with the lowest associated processing time and overall cost (system- and business-level preferences). From Figure 2, the multidimensional QoS request based on the annotated business process
is passed to the quality measure algorithm or algorithms used to compute the utility value corresponding to the request (as shown in the upper path in Figure 2). We use the concept of a utility value, or the "measure of the total perceived value resulting from an outcome" (Clapham & Nicholson, 2006), to reduce complex multidimensional QoS criteria to a single comparable value. Note that the utility value is opaque to the process owner; it has no meaning other than that it can be used as a means for comparing service configurations. How the utility value is calculated is described in the sections below. In parallel with the calculation of the utility value of the request, the same request is used to produce a set of utility values for all possible functionally correct permutations of services using data from the service QoS metadata repository, which could be based on a UDDI or similar service. (This process is shown in the lower path in Figure 2.) The resulting set of utility values is then processed and calibrated to determine the set of QoS that optimally matches the utility value of the request (note that there may not be an exact match; in this case the closest match will be used). The outcome of our approach is thus a set of QoS levels for the differentiated services in the end-to-end business process. In this way an abstract business process can be instantiated, or a running business process may be re-configured, with the process owner's desired (or closest possible to the desired) cumulative QoS properties. This completes the description of the general framework through which services can be selected and the abstract business process instantiated to become a usable running business process with a guaranteed end-to-end cumulative quality of service. The key to the success of this framework is in accurately calculating the utility values of the services based on the process owner's constraints and preferences, which we now discuss.
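As an illustration of the calibration step just described, the following sketch picks the candidate service combination whose utility value is closest to the utility computed for the request; as noted above, an exact match may not exist, so the nearest value is used. The Binding type here is a hypothetical stand-in for one candidate service selection and its precomputed utility, not a type from the chapter.

```java
import java.util.List;

/** Sketch of the calibration step of Figure 2; names are illustrative. */
public final class UtilityMatcher {

    /** A hypothetical candidate: one service selection plus its computed utility. */
    public static class Binding {
        final List<String> services;
        final double utility;
        public Binding(List<String> services, double utility) {
            this.services = services;
            this.utility = utility;
        }
    }

    /** Return the candidate whose utility is closest to the request's utility. */
    public static Binding closestMatch(double requestUtility, List<Binding> candidates) {
        Binding best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Binding b : candidates) {
            double distance = Math.abs(b.utility - requestUtility);
            if (distance < bestDistance) {  // keep the nearest, possibly inexact, match
                bestDistance = distance;
                best = b;
            }
        }
        return best;
    }
}
```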
RELATED WORK

Generality and variability in software systems has attracted much attention in the area of domain models and product line architectures (PLAs) (Tratt, 2007). PLAs are a special type of business process model that define a group of common products which share a similar set of characteristics within a specific domain, e.g., the telecom or the transportation market (Sendall & Kozaczynski, 2003). In particular, Scheer (1999) describes a mathematical approach to deriving products from product lines, similar to how we would like to form instantiated, service-based business processes from service instances. However, work in that area concentrates solely on product assembly using the functional characteristics of product lines. Our work has a different focus as it concentrates on non-functional properties over multiple criteria, rather than the single dimension of functional compatibility. Recently, research into the non-functional properties of (Web) services has been a focus for the software service community. This has led to the creation of well-known standards, such as WS-Agreement (Andrieux et al., 2007), which allows the expression of a service's non-functional QoS and an SLA to be agreed on the basis of these requested and advertised service properties. However, WS-Agreement does not solve the problem of finding a correct match for the process owner's requirements, and a significant amount of work has taken place, mainly in the semantic community, to develop ontologies and languages as models for expressing and/or selecting a service's quality of service. An example of this approach is described in Liang, Chung and Miller (2007). Research has also taken place into considering a user's satisfaction level with a service and related business objectives in the form of a global utility function used in the process of service discovery or composition. In Liang et al. (2007), Liang et al. (2009) and Seo, Jeong and Song (2007), possible representations of the quality measure
for a service aggregation (or request for a service aggregation) are studied. However, neither Liang et al. (2007) nor Seo et al. (2007) address possible heuristics for improving the performance of service selection with a proposed representation or model, such as an abstract business process. This is why we have provided an experimental evaluation section in this chapter that contains the performance information for our experimental configuration and choice of parameters to give an indication of relative performance. Our approach to solving the problem described earlier of selecting services to populate an abstract business process is to treat it as a subject for optimization. Generally, optimization techniques attempt to find the best possible solution to a problem by trying many different solutions and scoring each one to determine the most appropriate. Optimization techniques have many applications. For example, they are used by travel agents when selecting connecting flights to maximize or minimize aspects of a traveller's journey, such as maximizing the passenger's comfort whilst minimizing their waiting time at airports. In this chapter, we consider the optimization of the end-to-end cumulative QoS for the abstract business process using a multi-criteria decision analysis methodology and technique known as multi (or multiple) attribute utility theory (MAUT) to design a mathematical optimization model. Johnson, Gratz, Rust and Smith (2007) write about MAUT: "this technique uses gathered data with a specific and sensitive weighting system to assess a given decision regarding various attributes (variables or outcomes), in order to find the optimal decision given a specific set of criteria" (Barron & Barrett, 1996; Herrmann & Code, 1996). This description fits well with our objective of deploying an abstract business process according to the process owner's QoS preferences with an optimal configuration of services. Multi attribute theory emerged in the early 1970s (Gustafson & Holloway, 1975; Kenney,
1970) and has since been applied to many fields, such as economics, management sciences, healthcare and energy production to optimize operations, to provide a methodology for assessments of the impact of environmental, transport and education policies, and as a tool for selecting appropriate nuclear waste sites, airports, power stations and manufacturing facilities (e.g., Canbolat, Chelst & Garg, 2005). MAUT builds on multi attribute theory as it is used to find the utility, or the appropriateness of a solution for the desired outcome, of a single combination of multiple attributes. The utility value, found from a mathematical utility function containing a set of declarative, decision-making axioms, provides a single, one-dimensional value that can be ranked according to its cardinality. Returning to the focus of this chapter, using this theory we can rank selections of services by calculating their utility. The closer their utility is to the desired utility (which can be determined from the annotated business process), the "better" that combination of services is for the QoS preferences stated. The general process of all MAUT optimization calculations is given by von Winterfeldt and Edwards (1986) as:

1. Defining the attributes to be searched over;
2. Evaluating each alternative separately on each attribute;
3. Assigning relative weights to the attributes;
4. Aggregating the weights of attributes to obtain an overall evaluation of alternatives;
5. Performing a sensitivity analysis to make recommendations.

This is the general process we will use to determine the optimum utility value and therefore the optimum combination of services for the end-to-end cumulative QoS attributes given. However, evaluating alternatives and making recommendations requires a method of searching through the possible combinations of services. There are many methods of performing a search like this,
such as a random search, where combinations of variables are tried without a methodology. The problem with this type of approach is that it does not search around good values already discovered and make best use of those solutions to find better ones. Hill climbing is a technique that starts with a random solution and then searches the set of neighboring solutions for those that are better (i.e., have a better utility value). This technique has the problem that if a better solution exists but is far away from the current solution, it is unlikely to be found. Hill climbing is sensitive to the initial set of inputs into the model and only looks at the direct consequences of the next choice, and not at choices two or three steps away or indirect paths to the best solution. To avoid the problem of hill climbing only finding local solutions, we chose two optimization methods from very different domains to search through the solution space: simulated annealing, inspired by physics, and genetic algorithms, inspired by evolutionary biology. The simulated annealing and genetic algorithm approaches are described in more detail in the sections below; we chose them not only because they avoid the problems of the other techniques described earlier, but also because these algorithms have been found successful in many other domains presenting a problem similar to the one we are concerned with, i.e., choosing an optimum selection for dependability and reliability. For example, simulated annealing has been used by Tracey, Clark and Mander (1998) to generate suites of test-data for safety-critical systems, whilst Zhang, Harman and Mansouri (2007) demonstrated how simulated annealing techniques can be used to build more reliable products by resolving complex and conflicting user requirements during the software development process. Genetic algorithms have also enjoyed widespread and long-term use in similar areas; Brito and May (2007) describe how genetic algorithms are used in safety-critical systems, whilst Dolado (2000) and Wegener, Sthamer, Jones and Eyres (1997) describe how genetic algorithms can be
applied in software engineering situations and in the testing of real-time systems to ensure dependable systems. Since a business process based on software services allows systems to be built by integrating reusable services in an automatic manner, a significant factor is the speed of service selection using multiple, multidimensional QoS attributes. Little research has been carried out into this area when considering a large search space; i.e., where there are many functionally compatible services for each phase in the abstract business process. For example, Bonatti and Festa (2005), Berbner, Spahn, Repp, Heckmann and Steinmetz (2006) and Menascé, Casalicchio, and Dubey (2008) provide experimental results when searching for services using single/individual QoS attributes. As described above, the problem of instantiating an abstract business process with multidimensional QoS requires searching across many combinations of QoS attributes, not just single attributes. As a result of this lack of information on the speed of service selection across multidimensional QoS attributes, in this chapter we have provided an experimental evaluation of such a search to give an illustrative indication of the performance of such searches.
SOLUTIONS AND RECOMMENDATIONS

As described above, this chapter is primarily focused on service selection techniques that solve the problem of optimizing a selection of services based on a desired end-to-end QoS for the instantiated processes. In this scenario the problem is choosing the best combination of services from among the many possible solutions (i.e., combinations of services) and outcomes in terms of the desired performance, reliability and/or dependability of the instantiated business process. To satisfy the preferences and constraints of the business process owner when configuring
(or re-configuring) a business process, we now describe our formalization of the problem of service selection for the environment described. The formalization is based on the non-functional aspects of the system, and we model both (soft) preferences and (hard) constraints to maximize the usefulness of the system to its users. As we described above, the solution of the problem (i.e., the optimal match to the preferences and constraints of the process owner) is evaluated by finding the utility value for a combination of services from a utility function, following the approach described earlier. The value of our utility function is calculated by summing the weighted dimensional utilities contributed by each QoS dimension of each service, which are calculated using the quality characteristics of the services along a particular dimension.
Formalization

The formalization of the service selection problem begins with defining the variables given by the process owner at the start of the sequence shown in Figure 2. As described above, we start with the assumption that a repository exists containing service information and QoS metadata and can provide details of all the services available in the network (the non-empty set S) and their quality dimensions (the non-empty set D). We also assume an abstract process description, from which the set of activities or tasks (the non-empty set T) can be determined, and which is annotated with the desired quality dimensions (the non-empty set D). These variables are given for reference in Table 1.

Table 1. Variables used in the service selection problem

T: The set of tasks described in the abstract business process.
D: Quality dimensions requested by the process owner.
S: Services available in the service network.
Q: All available quality measures.

Continuing, let equation (1) denote the quality measures on the quality dimensions D of service s, and let equation (2) signify the function that assigns a dimensional quality measure $q_{s,d} \in Q$ to each quality dimension $d \in D$ for each service $s \in S$ and its operations. If p is a partition of tasks such that $p \subseteq T$, then the overall utility, $V_b$, of the proposed but not yet instantiated business process using binding b (defined as an assignment of one or more services to a task partition according to the service constraints, or a "best" match between the QoS requested by the process owner for a set of tasks and the QoS advertised by a service) is given by (3), where b(p) is given in (4).

$s[D] = \{\, q \in Q \mid \exists d \in D .\; g(s, d) = q \,\}$  (1)

$q : D \times S \to Q$  (2)

$V_b = \frac{1}{n} \sum_{d \in D} f_d(q_{d,s}, \ldots) \cdot w_d$  (3)

$b(p) = \{\, s \in S \mid \exists p .\; \langle p, s \rangle \in b \,\}$  (4)
From (3), we can see the overall utility is determined by three factors: 1) the dimensional quality measures of services and the quality characteristics associated with each invocation ($q_{s,d}$); 2) how these characteristics affect the utility of that dimension ($f_d$); and 3) the weighting ($w_d$) of each dimension. We now discuss $q_{d,s}$ in more detail.
Quality Measures

$q_{d,s}$ is the quality characteristic of a service invocation (e.g., the QoS associated with carrying out a service operation) and is calculated using $c_{d,s}$, a one-time, fixed initialization or registration cost associated with the remote service, and $k_{t,d,s}$, a floating cost associated with invoking a service operation from another service in the business process. We can identify two cases where $q_{d,s}$ takes different forms:

1. Totally Ordered Quality Dimensions: for totally ordered quality dimensions, like time and money, $q_{d,s}$ is obtained by summing the costs of all calls specified in the binding plus the one-time cost associated with the called service. $k_{t,d,s}$ is calculated as a sum over the associated service calls made by every task that uses the corresponding service ($t \in T$, $s \in b(p).t$, $p \in P$). In this case, $q_{d,s}$ takes the form shown in (5), with $\alpha$ allowing the same quality dimension to be normalized across all different services.

$q(d, s) = \alpha \left( c_{d,s} + \sum_{t \in T,\, s \in b(p).t,\, p \in P} k_{t,d,s} \right)$  (5)

2. QoS-like Dimensions: we model these by summing the qualities of each selected task-service match, as shown in (6), where g computes the quality measure by taking into account two measures: the first is the value of $k_{t,d,s}$ associated with the service call made by one of the tasks that use the corresponding service ($t \in T$, $s \in b(p).t$, $p \in P$); the second is the measure $c_{d,s}$ associated with the service. In this case we assume g is a polynomial-complexity function, or a basic verifiable equation.

$q(d, s) = g\!\left( c_{d,s}, \sum_{t \in T,\, s \in b(p).t,\, p \in P} k_{t,d,s} \right)$  (6)
To give an illustrative example of how these quality measures work within the operational framework, consider the network bandwidth quality dimension of a service: here the bandwidth is constrained by the total bandwidth of the server's network interface and the bandwidth of the network at the client side. In this case, $q_{d,s}$, the available bandwidth QoS characteristic, is the minimum of the network bandwidth of the server and that of the end-user invoking the service. What we have presented in this section is a formal model, based on multi-attribute utility theory, of how the utility value for a configuration of services ($V_b$) can be calculated based on an abstract business process model annotated with desired quality of service information. The utility value signifies the appropriateness of the solution for the desired outcome, which in this case is the optimum configuration of services for the multidimensional QoS request submitted. As we have described above, this model only provides a utility value for a single configuration of services; what is also required is an extension to search through utility values to find the optimum configuration from the many service combinations available. The following section describes two algorithms we implemented, based on the simulated annealing and genetic algorithm approaches introduced above, to find the optimum combination of services. As we describe, these algorithms perform a search through these combinations to find the optimum utility value for the end-to-end business process.
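As a concrete reading of equation (3), the sketch below computes the weighted, normalized sum of dimensional utilities for one service configuration. The per-dimension utility functions f_d and weights w_d are supplied by the process owner; all names here are illustrative assumptions, not an API defined in this chapter.

```java
import java.util.Map;

/** Sketch of the overall utility of equation (3); names are illustrative. */
public final class UtilityFunction {

    /** f_d: maps a dimensional quality measure q_{d,s} to a dimensional utility. */
    public interface DimensionalUtility {
        double apply(double qualityMeasure);
    }

    public static double overallUtility(Map<String, Double> qualityMeasures,  // q_{d,s} per dimension d
                                        Map<String, DimensionalUtility> f,    // f_d per dimension d
                                        Map<String, Double> weights) {        // w_d per dimension d
        double sum = 0.0;
        for (Map.Entry<String, Double> entry : qualityMeasures.entrySet()) {
            String d = entry.getKey();
            sum += f.get(d).apply(entry.getValue()) * weights.get(d);
        }
        return sum / qualityMeasures.size();      // the 1/n normalization of equation (3)
    }
}
```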
SELECTION ALGORITHMS

Here the selection problem is cast as a search problem. Various heuristics and metaheuristics have been proposed and studied to iteratively improve the quality of a candidate solution to traditional search problems, such as the TSP (Liang et al., 2006). In order to facilitate service selection, we present the simulated annealing and genetic algorithm approaches to selecting an optimal service configuration that satisfies a process owner's end-to-end QoS requirements for an abstract business process. These two meta-heuristics apply effective ways to allow faster arrival at a satisfactory solution when searching a large search space. Simulated annealing is an analogy of a physical process, constructed so that the search will not become stuck in a local extremum and an approximation of the global extremum can be located. A genetic algorithm is an analogy of the natural evolution process, constructed so that the best solutions survive multiple iterations and surface during the search. In these algorithms we use the same assumptions as in the rest of this chapter, namely:

1. The abstract business process has been annotated with the QoS preferences and constraints of the process owner.
2. Services with the necessary functional properties to instantiate the abstract process exist and are available in the service network; i.e., the process can always be instantiated.

Generally, both implementations of the algorithms operate in the same manner and follow the general 5-step process for all MAUT calculations given by von Winterfeldt and Edwards (1986) described earlier. That is, the attributes to be searched over are the QoS preferences and constraints given in the annotated abstract business process. These are evaluated by calculating $q_{d,s}$ with the weighting $w_d$ before the overall utility values are calculated, and the simulated annealing and genetic algorithms search through the utility values to make a final recommendation of the optimal solution. What is interesting about these algorithms is how they can be combined sequentially to provide a third method of experimental evaluation, and in
this section we also describe how we developed a hybrid algorithm from the simulated annealing and genetic algorithm approaches.
Simulated Annealing

Simulated annealing is a heuristic approach that we have taken in order to find the optimal configuration of services for the abstract business process. The name of this approach comes from the process of annealing in metallurgy, where a substance is heated to provide its atoms with an initial high-energy state and cooled gradually to allow the atoms to find their optimum configuration at the lower-energy state. The idea behind this process is that when the heating temperature is high enough to ensure a sufficiently random state and the cooling process is slow enough to ensure thermal equilibrium, the atoms will arrange themselves in a pattern that corresponds to the global energy minimum of a perfect crystal. Computational algorithms to perform the same process for data were developed in the early 1980s by Kirkpatrick, Gelatt, and Vecchi (1983). In our algorithm, the temperature in the annealing process equates to the willingness to accept a worse solution, or a selection of services that has a less optimal utility value. By starting with a high willingness (temperature) to accept a worse solution at the start of the algorithm and reducing the willingness gradually, the algorithm will, at the end, only accept a better solution. In this way we can avoid the problem of becoming stuck in a local optimum and find the global optimum for the service configuration. The probability, p, of accepting the worse solution is found through (7), which demonstrates how, as temperature T is reduced, the difference between the high and low utility values ($V_h$ and $V_l$, representing a good solution and a not-so-good solution respectively) becomes more significant. At a high temperature the exponent is close to 0 and the probability close to 100%, but as the temperature decreases, $V_h - V_l$ becomes significant and a larger
difference leads to a lower probability, and the algorithm begins to prefer "less worse" over "more worse" solutions (Segaran, 2007).

$p = e^{-(V_h - V_l)/T}$  (7)
The algorithm therefore works as follows: we first perform a random selection from the set of services that provide a functional match for the abstract business process and pass their metadata to the algorithm. As an optimization, we also pre-process this service metadata, possibly retrieved from a central store, to evaluate service instances before they are passed to the algorithm. In the pre-processing, services that have low average quality values or violate any of the constraints are removed from the search set, meaning that the constraints will always be satisfied. This search set is then passed to the simulated annealing algorithm, where we perform a loop that decreases the temperature by a factor set by the cooling rate. Within this loop, we calculate the utility value for the first random selection, change the bindings and re-calculate the utility value for these new bindings. By calculating the probability of choosing the new solution over the old one, we determine if the change in bindings should be made. The temperature is decreased by the cooling rate and the process is repeated until the temperature has been decreased to the lowest possible temperature and the algorithm completes. The simulated annealing process provides us with the largest utility value and its associated binding (i.e., the selection of services for the abstract business process) for the given end-to-end multidimensional QoS request.
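The loop just described can be sketched as follows, under the simplifying assumption that a binding's utility and a random neighboring binding (one changed service assignment) are computed by caller-supplied functions; the parameter names echo Table 3a but are placeholders rather than the exact prototype settings.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;
import java.util.function.UnaryOperator;

/** Sketch of the simulated annealing loop described above; names are illustrative. */
public final class SimulatedAnnealing<B> {
    private final Random rnd = new Random();

    public B search(B initial,
                    ToDoubleFunction<B> utility,
                    UnaryOperator<B> randomNeighbor,
                    double t0, double tMin, double coolingRate) {
        B current = initial;
        double t = t0;
        while (t > tMin) {
            B candidate = randomNeighbor.apply(current);   // change one binding at random
            double vOld = utility.applyAsDouble(current);
            double vNew = utility.applyAsDouble(candidate);
            // Always accept better solutions; accept worse ones with probability
            // p = exp(-(vOld - vNew) / t), as in equation (7).
            if (vNew >= vOld || rnd.nextDouble() < Math.exp(-(vOld - vNew) / t)) {
                current = candidate;
            }
            t *= coolingRate;                              // cool down, e.g. 0.99 per step
        }
        return current;                                    // best binding found
    }
}
```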
Genetic Algorithm

Genetic algorithms are a family of global search heuristics inspired by evolutionary biology (Holland, 1975) and start by selecting a random set of individuals called a population. The population is progressively refined through iterations of selection and reproduction processes until an optimal generation is produced and the algorithm terminates. There are several variations in the implementation of these algorithms in terms of how the selection and reproduction of the population takes place, e.g., through elitism, fitness or the crossover or mutation of genes. The general process, however, is to rank the population, select which members of the population should continue and use them to produce a new generation. In our implementation, the initial population is a set of random service bindings for instantiating the abstract business process. The selection of members (bindings) to form the next population is found by calculating the fitness of each binding, or the application of a QoS-related penalty on the utility value for the binding. The bindings are ranked by fitness and a percentage of the top-ranked population are chosen to form the next population. We use the crossover technique to breed the new population, which takes a random number of bindings from one solution to generate a new one. The process is repeated, with better solutions being kept over time. When the algorithm completes, it returns the final population, containing the optimal selection and configuration of services for the abstract business process. In more detail, the fitness of each member of a generation is defined to be the utility function value of the individual service binding minus a penalty for any unsuitable QoS factors. The mathematical function showing how the fitness is calculated is shown in (8).
dcd .Q g *∑ g 1 Rmax − Rmin
(8)
In more detail, dcd,Ω,i is the amount that the individual exceeds the upper or lower bound specified in ith QoS constraint, where i is the index to the constraint. Rmax and Rmin are the upper and lower bounds of the constraint. g is the index
to the current genome generation and gmax is the maximum number of generations allowed. wg can be used to adjust the weighting of the penalty, which is determined by two factors: where the current generation is in the evolution process and the amount of the ‘negative’ impact on the service’s utility due to its unsuitability. Early in the genome’s evolution, g/gmax is small and a smaller penalty will apply due to the native impact of the unsuitable subpopulation. Later in the evolution process, g/gmax increases and a greater penalty will be given due to the unsuitability. If the unsuitability of a service is only a small amount over the threshold or the unsuitability is caused by a violation of one or a small number of constraints, the penalty is relatively small.
Hybrid Approach

The simulated annealing and genetic algorithms both have their strengths and weaknesses; it is well known that simulated annealing produces a solution more quickly than a genetic algorithm, but genetic algorithms are often more accurate than their simulated annealing equivalents. To get the best from both algorithms (the accuracy of the genetic algorithm with the performance of the simulated annealing), we fed the results of the genetic algorithm into the simulated annealing algorithm. The purpose of doing this is for the simulated annealing algorithm to locate (local) optima based on the final population produced by the genetic algorithm.
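A minimal sketch of this chaining, assuming hypothetical GeneticSearch and Annealer stand-ins for the two algorithms sketched earlier (not a published API): the genetic algorithm's final population seeds the annealer, which refines each candidate, and the best refined binding wins.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Sketch of the hybrid approach described above; names are illustrative. */
public final class HybridSearch<B> {

    public interface GeneticSearch<T> { List<T> finalPopulation(); }

    public interface Annealer<T> { T refine(T start); }

    public B search(GeneticSearch<B> ga, Annealer<B> sa, ToDoubleFunction<B> utility) {
        B best = null;
        double bestUtility = Double.NEGATIVE_INFINITY;
        for (B candidate : ga.finalPopulation()) {   // GA output seeds the annealer
            B refined = sa.refine(candidate);        // SA locates a (local) optimum nearby
            double u = utility.applyAsDouble(refined);
            if (u > bestUtility) { bestUtility = u; best = refined; }
        }
        return best;
    }
}
```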
EXPERIMENTAL VERIFICATION

This section describes the performance results of implementations of the three algorithms from the previous section and a comparison of their results. All experiments were performed on a Windows XP PC with 512MB of RAM and a 2.86GHz Pentium P4 processor. The algorithms
used in the experiments were written and executed using Sun's Java 6 libraries and virtual machine. Before describing the experiment and its results, the notation used in the analysis is shown in Table 2. Most of the symbols have an intuitive meaning. PH needs more explanation, since it is the number of possible execution paths in the abstract business process due to conditional branching. Conditional branching takes place when there is a decision point in the process and a choice of two or more paths through the process is available. As a simplifying assumption for our initial experiments, when a conditional branch appears in an abstract business process, we assume an equal probability of taking any branch.
To investigate the performance and accuracy of the simulated annealing and genetic algorithms, our first experiment was to find the time used to select the optimal configuration of services for a requested configuration of the abstract business process shown in Figure 1 from a set of functionally compatible services. The end-user request used in the experiment was to configure the abstract business process with the highest bandwidth possible, and with the lowest associated processing time and cost. For the example shown in Figure 1, the starting values for the algorithms were as shown in Tables 3a and 3b. The value of N for this example is 5 (as there are five component services) and PH is 3 (as there are three possible execution paths).
Table 2. Symbols used in analysis of experiment results

AV: The average utility value of all the experiment runs for a particular problem instance.
N: The number of component services in the end-to-end process.
PH: The total number of execution paths in the end-to-end process.
S: The total number of service implementations matching the functional request.
T: The average completion time (ms) for the algorithm.
V: The maximum utility value of all the experiment runs for a particular problem instance.
Table 3. (A and B) Initial settings for each algorithm

a. Simulated Annealing
Cooling Rate: 0.99
T0: 100
Tn: 1000
Rate: 0.3
Max Iterations: 10

b. Genetic Algorithm
Number of individuals in population: 10% of all bindings
Rate of crossover: 0.9
Rate of mutation: 0.05
Max Number of Generations (gmax): 10
The calculation of the utility value for each functionally correct combination of services is performed along the following dimensions: the minimum time; the minimum time and cost; and the minimum time and cost with the highest bandwidth. These QoS constraints are shown in Table 4. We investigated two cases where, for each component service of the abstract business process, the optimal service is chosen from either five or ten service instances constrained by the QoS in Table 4. The results are shown in Table 5. Each experiment was repeated 20 times and the results averaged. In these results, a higher value of utility means the algorithm is more accurate in finding the minimum/maximum preferences of the user. A lower value of the computation time represents a faster search algorithm. Both the computation time and the utility values are the raw figures from the calculation, so that we can demonstrate the differences between the two algorithms. As can be seen, we found simulated annealing to be quicker than the genetic algorithm in both cases. In the case of five components, SA took 24 units of time while GA took 33 units of time to arrive at a solution with a certain quality. In the case of ten components, SA took 1479 units of time while GA took 3726 units of time to search for a solution with the same quality. However, the genetic algorithm yields a higher (better) utility than simulated annealing in both cases. In the case of ten components, the solution derived by SA is
5329 units of utility and the one derived by GA is 7380 units of utility. To study the performance of the hybrid algorithm with respect to that of the simulated annealing and genetic algorithms, we performed further comparison experiments in more realistic settings that exhibit more complexity. These experiments used the same QoS constraints as before but, instead of using the generic business process example from Figure 1, we created datasets representing generic business processes as input to the algorithms, with each dataset containing different (random) numbers of component services and conditional branches but a constant number of service instances from which to select the optimal configuration. To compare the performance of the hybrid algorithm, we calculated the ratios of the running times of the simulated annealing and genetic algorithms and of the hybrid of the two. Tables 6 and 7 show the results of our comparison as ratios of the corresponding results of two algorithms, calculated using the running time, T, maximum utility, V, and average utility, AV (calculated over 10 iterations of the algorithms). In general, we found that the hybrid of the simulated annealing and genetic algorithms improves the utility value (i.e., the accuracy), and that the hybrid finds the utility value faster than the genetic algorithm.
Table 4. QoS constraints considered in cases A, B and C

Case   Constraints
A      Time
B      Time + Cost
C      Time + Cost + Bandwidth
Table 5. Performance and accuracy with various numbers of services for 5 tasks

N   PH   S    T (SA)   T (GA)   V (SA)   V (GA)
5   3    5    24       33       8362     8998
5   3    10   1479     3726     5329     7380
Table 6. Average result comparisons for time and utility

               Ratio T          Ratio V          Ratio AV
N    PH    S   HA/SA   SA/GA    HA/SA   HA/GA    HA/SA   HA/GA
5    384   5   1.21    0.88     1.07    1.00     1.24    0.94
6    90    5   1.09    0.91     1.59    1.03     1.41    1.01
7    108   5   1.09    0.91     0.99    1.11     1.03    1.11
8    960   5   1.11    0.85     1.02    1.02     1.07    1.01
9    270   5   1.44    0.80     1.01    1.00     1.01    0.97
10   786   5   1.39    0.74     1.00    1.00     1.22    1.01
Table 7. Average result comparisons on time and utility

                  Ratio T          Ratio V          Ratio AV
N    PH      S    HA/SA   SA/GA    HA/SA   HA/GA    HA/SA   HA/GA
5    39200   10   1.77    0.75     1.42    1.05     1.38    1.02
6    504     10   1.05    0.82     1.54    0.94     1.71    0.91
7    65856   10   1.68    0.88     1.07    0.97     0.95    1.03
8    81      10   1.94    1.02     1.41    0.86     1.02    0.93
9    60      10   5.59    0.78     1.00    1.00     1.10    0.96
10   720     10   76.30   0.47     1.04    1.11     1.51    0.86
Of course, we found several abnormal cases in our experiments; for example, for 8 tasks and 81 functional matches (not shown in these results) the time to complete the hybrid is greater than that of the genetic algorithm. The hybrid's utility value is much smaller (i.e., less accurate) than the genetic algorithm's value, however. To summarize our results, our experiments have shown that, for our experimental configuration and choice of parameters, the genetic algorithm is substantially more accurate than the simulated annealing approach, although it takes longer to achieve this accuracy; that is, the simulated annealing algorithm is faster, but has a much lower accuracy than the genetic algorithm. The hybrid of the genetic and simulated annealing algorithms shows the trade-off in performance and accuracy: it achieves an accuracy higher than running the simulated annealing algorithm by itself, yet offers only a small improvement in accuracy over running the genetic algorithm alone. The results of the hybrid are understandable because the genetic algorithm plays the more significant role in the hybrid and, thus, the final result tends towards that of the genetic algorithm. Regarding execution time, the hybrid is slightly more expensive than both the simulated annealing and genetic algorithms.
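The chapter does not restate the hybrid's construction at this point, so the sketch below is only a guess at one common composition pattern consistent with the observation above that the genetic algorithm dominates the final result: run the GA, then refine its winner with a short annealing pass. The numeric settings are taken from Table 3; the composition itself, and the reuse of the earlier SearchSketches helper, are assumptions.

```java
// Hypothetical GA-then-SA composition; not the chapter's hybrid algorithm.
import java.util.Random;
import java.util.function.ToDoubleFunction;

public final class HybridSelector {
    private static final Random RNG = new Random();

    public static int[] select(int tasks, int instances, int populationSize,
                               ToDoubleFunction<int[]> utility) {
        // GA phase (Table 3b settings), via the SearchSketches sketch above.
        int[] current = SearchSketches.evolve(tasks, instances, populationSize,
                                              utility, 0.9, 0.05, 10);
        int[] best = current.clone();
        double temp = 100;                                  // T0 from Table 3a
        for (int it = 0; it < 1000; it++) {                 // short refinement pass
            int[] candidate = current.clone();
            candidate[RNG.nextInt(tasks)] = RNG.nextInt(instances);
            double delta = utility.applyAsDouble(candidate) - utility.applyAsDouble(current);
            if (delta >= 0 || RNG.nextDouble() < Math.exp(delta / temp)) current = candidate;
            if (utility.applyAsDouble(current) > utility.applyAsDouble(best)) best = current.clone();
            temp *= 0.99;                                   // cooling rate from Table 3a
        }
        return best;
    }
}
```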
FUTURE RESEARCH DIRECTIONS

Dynamic business process management in a service-oriented environment requires the processes of service description, discovery and integration. This chapter has focused on the second and third of these steps to describe the difficulties of selecting participants for an abstract business process from services that advertise various, heterogeneous quality dimensions. Due to these difficulties, business process owners carrying out the service selection need an efficient mechanism for choosing from the many services available in the Internet of Services. We have described how the selection of services and their combination can be carried out using optimization techniques (necessary for situations where there is a large number of services to select from) to satisfy the constraints of the users of the system and achieve the best utility for the users. The selection of the most appropriate services can then be used to populate a process model. As we have described, this area of work can be seen as falling generally within the field of decision making techniques. However, in order for any decision making technique to make an accurate decision, the decision maker (human or algorithm) must have sufficient data on service instances and their non-functional capabilities to perform the recommendation. In our scenario, we have assumed that the data the algorithms use to make their decisions comes from a service QoS metadata repository, which could be based on a UDDI or similar service. This repository could be maintained through a centralized provider, similar to how Google provides a central point of information for web sites. However, although there are initiatives such as the European Commission's Service Finder project (http://www.service-finder.eu/) that record service functions and descriptions, similar repositories or providers for QoS metadata have yet to materialize. The vision for the Internet of Services is to have an unstructured network of services where links between services can be commissioned and decommissioned regularly and with little overhead. Such a decentralized environment lends itself to highly distributed models of searching for service instances, such as those used by peer-to-peer systems and protocols. We favor this approach to searching for QoS metadata over a centralized repository, as it has been shown in theory and in practice to be reliable, free from single points of failure, and robust: three properties that are highly desirable in any distributed system. Recently, decision making techniques using multi-dimensional data such as we have described have found application in many areas of the World Wide Web. These techniques use a wide range of metrics on, for example, previous customer purchases, recently viewed items and membership of social networks to recommend products and services to customers who have similar, but not identical, requirements, matching the optimal combination of product, price and supplier to a customer's profile. These techniques are known as collective intelligence and have enjoyed great success; in late 2009, a hybrid collaborative filtering algorithm (similar to, but much more sophisticated than, the hybrid algorithm presented above) won the $1 million Netflix Prize, an open competition for the best algorithm to predict user ratings for films based on previous user ratings. Many techniques envisioned for searching QoS metadata are based on the adoption of a formal model to describe data (e.g., solutions based on semantic web technologies often require an ontology to define and relate concepts). However, in the future environment for the Internet of Services described above, we believe techniques developed for searching through sparsely populated, semi-structured and unstructured data will be of more use than those which rely on formal descriptions. This is because within the Internet of Services there is no single point of control to define, regulate or mandate the use of these data models; as we have seen on the World Wide Web, even though it may be beneficial to re-use existing 'standard' data models, service designers and developers often design their own data formats. As a result, we believe techniques to search through these heterogeneous formats will be of significance in the future.
CONCLUSION

The ability of a process owner to take an abstract business process created by a business analyst and configure it according to desired quality of service (QoS) requirements constitutes an important and challenging problem for services research. This chapter presents a demand-driven operational framework in which process owners can configure an abstract business process and generate an optimal configuration of customized or differentiated services that are parameterized in terms of multi-dimensional QoS information. This approach goes beyond conventional approaches, which configure generic business processes according to purely functional characteristics. The approach is based on optimization techniques that operate on the fusion of two proven algorithms (a simulated annealing algorithm and a genetic algorithm). The approach has been explored and validated by an initial experimental implementation, which will be fine-tuned and extended in the future; in particular, it will deal with probabilistic data about execution paths gathered from event logs.
REFERENCES

Andrieux, A., Czajkowski, K., Dan, A., Keahey, K., Ludwig, H., Nakata, T., Pruyne, J., Rofrano, J., Tuecke, S., and Xu, M. (2007). Web Services Agreement Specification (WS-Agreement). Specification GFD-R-P.107, Open Grid Forum, Grid Resource Allocation Agreement Protocol (GRAAP) WG.

Andrikopoulos, V., Bertoli, P., Bindelli, S., di Nitto, E., Gehlert, A., Germanovich, L., Kazhamiakin, R., Kounkou, A., Pernici, B., Plebani, P., and Weyer, T. (2008): State-of-the-art Report on Software Engineering Design Knowledge & Survey of HCI and Contextual Knowledge. Software Services and Systems Network (S-Cube) Deliverable CD-JRA-1.1.1.
Andrikopoulos, V., Fairchild, F., van den Heuvel, W.J., Kazhamiakin, R., Leitner, P., Metzger A., Nemeth, Z., di Nitto, E., Papazoglou, M.P., Pernici, B., and Wetzstein B. (2008): Comprehensive Overview of the State-of-the-art of Service-based Systems. Software Services and Systems Network (S-Cube) Deliverable CD-IA-1.1.1. Barron, F. H., & Barrett, B. E. (1996). Decision Quality Using Ranked Attribute Weights. Management Science, 42(11), 1515–1522. doi:10.1287/ mnsc.42.11.1515 Berbner, R., Spahn, M., Repp, N., Heckmann, O., & Steinmetz, R. (2006) Heuristics for QoS-aware Web Service Composition. Proceedings of the IEEE Conference on Web Services (ICWS’06), 72-82. Bonatti, P. A., & Festa, P. (2005): On Optimal Service Selection. Proceedings of 14th International Conference on the World Wide Web (WWW’05), 530-538. Brito, M., and May, J. (2007): Safety Critical Software Process Improvement by Multi-objective Optimization Algorithms. Software Process Dynamics and Agility, LNCS 4470/2007, 96-108. Canbolat, Y. B., Chelst, K., & Garg, N. (2005). Combining Decision Tree and MAUT for Selecting a Country for a Global Manufacturing Facility. International Journal of Management Science, 35(3), 312–315. Cardoso, J., Voigt, K., & Winkler, M. (2009): Service Engineering for the Internet of Services. Revised, Selected Papers from the Proceedings of the 10th International Conference on Enterprise Information Systems, 10-27. Clapham, C., & Nicholson, J. (2006). The Concise Oxford Dictionary of Mathematics. Oxford, New York: Oxford University Press. Dolado, J. J. (2000). A Validation of the Component-Based Method for Software Size Estimation. IEEE Engineering, 26(10), 1006–1021.
Gustafson, D. H., & Holloway, D. C. (1975). A Decision Theory Approach to Measuring Severity in Illness. Health Services Research, 97–106. Helland, P. (2007): Data on the Inside Versus Data on the Outside. Proceedings of the Second Biennial Conference on Innovative Data Systems Research (CIDR’05), 144-153. Herrmann, M., & Code, C. (1996). Weightings of Items on the Code-Muller Protocols: The Effects of Clinical Experience of Aphasia Therapy. Disability and Rehabilitation, 18(10), 509–514. doi:10.3109/09638289609166037 Holland, J. (1975): Adaptation in Natural and Artificial Systems Johnson, B. B., Gratz, E., Rust, K. L., & Smith, R. O. (2007). ATOMS Project Technical Report - Multiattribute Utility Theory Summarizing a Methodology and an Evolving Instrument for AT Outcomes. University of Wisconsin at Milwaukee. Keeney, R. L. (1970). Assessment of Multiattribute Preferences. Science, 198, 1491–1492. doi:10.1126/science.168.3938.1491 Kirkpatrick, S., Gelatt, C. D. Jr, & Vecchi, M. P. (1983). Optimization by Simulated Annealing. Science, 220(4598), 671–681. doi:10.1126/science.220.4598.671 Liang, Q., Lau, H. C., & Wu, X. (2008a): Robust Application Level QoS Management in Service Oriented Systems, IEEE International Conference on E-Business Engineering (ICEBE ‘08), pp. 239-246. Liang, Q. A. Rubin, S. H. (2006): Randomized local extrema for heuristic selection in TSP. IEEE International Conference on Information Reuse and Integration (IRI 2006), pp. 336-340 Liang, Q. A., Chung, J. Y., & Miller, S. (2007). Modeling Semantics in Composite Web Service Requests by Utility Elicitation. Knowledge and Information Systems, 13(3), 367–394. doi:10.1007/ s10115-006-0052-4
Liang, Q. A., Lam, H., Narupiyakul, L., & Hung, P. (2008b): A Rule-Based Approach for Availability of Web Service. International Conference on Web Services (ICWS 2008), pp. 153-160. Liang, Q. A., Wu, X., & Lau, H. C. (2009). Optimizing Service Systems Based on Application-Level QoS. IEEE Transactions on Services Computing, 2(2), 108–121. doi:10.1109/TSC.2009.13 MacKenzie, C. M., Laskey, K., McCabe, F., Brown, P. F., & Metz, R. (2006). Reference Model for Service Oriented Architecture 1.0. OASIS SOA-RM Technical Committee. Menasce, D., Casalicchio, E., & Dubey, V. (2008): A Heuristic Approach to Optimal Service Selection in Service-Oriented Architectures. Proceedings of the ACM International Workshop on Software and Performance (WOSP 2008), 13-24. Object Management Group. (2009): Business Process Model and Notation (BPMN) Version 1.2. OMG Document Number: formal/2009-01-03. Standard document URL: http://www.omg.org/spec/BPMN/1.2. Paludo, M., Burnett, R., & Jamhour, E. (2000): Patterns Leveraging Analysis Reuse of Business Processes. Proceedings of the 6th International Conference on Software Reuse: Advances in Software Reusability, 353-368. Scheer, A. W. (1999). ARIS - Business Process Frameworks. Berlin, Heidelberg: Springer-Verlag. doi:10.1007/978-3-642-58529-6 Segaran, T. (2007): Programming Collective Intelligence. Sebastopol, California: O'Reilly Media Inc. Sendall, S., & Kozaczynski, W. (2003). Model Transformation - the Heart and Soul of Model-driven Software Development. IEEE Software, 20(5), 42–45. doi:10.1109/MS.2003.1231150
Seo, Y. J., Jeong, H. Y., & Song, Y. J. (2004): A Study on Web Services Selection Method Based on Negotiation Through Quality Broker: A MAUTbased Approach. Proceedings of the International Conference on Embedded Software & Systems (ICESS’04), 65-73. Tracey, N., Clark, J., & Mander, K. (1998). Automated Program Flaw Finding Using Simulated Annealing. Proceedings of the International Symposium on Software Testing and Analysis (ISSTA’98), 73-81. Tratt, L. (2007). The MT model transformation language. Science of Computer Programming, 68(3), 169–186. doi:10.1016/j.scico.2007.05.003 von Winterfeldt, D., & Edwards, W. (1986). Decision Analysis and Behavioral Research. Boston, Massachusetts: Cambridge University Press. Wegener, J., Sthamer, H., Jones, B. F., & Eyres, D. E. (1997). Testing Real-time Systems using Genetic Algorithms. Software Quality Journal, 6(2), 127–135. doi:10.1023/A:1018551716639 Wetzstein, B., Leitner, P., Rosenberg, F., Brandic, I., Dustdar, S., & Leymann, F. (2009): Monitoring and Analyzing Influential Factors of Business Process Performance. Proceedings of the 13th International Enterprise distributed Object Computing Conference (EDOC’09), 141-150. Zhang, Y., Harman, M., & Mansouri, S. A. (2007): The Multi-Objective Next Release Problem. Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, 1129-1137.
ADDITIONAL READING

Dustdar, S., & Schreiner, W. (2005). A Survey on Web Service Composition. International Journal of Web and Grid Services, 1(1), 1–30. doi:10.1504/IJWGS.2005.007545
Engelbrecht, A. P. (2007). Computational Intelligence: An Introduction. Wiley-Blackwell. Marmanis, H., & Babenko, D. (2009). Algorithms of the Intelligent Web. Manning Publications. Maximilien, E. M., & Singh, M. P. (2004). A Framework and Ontology for Dynamic Web Services Selection. IEEE Computing, 8(5), 84–93. Papazoglou, M. (2008). Web services: principles and technology. Pearson Prentice Hall. Papazoglou, M., & Georgakopoulos, D. (2003). Service-Oriented Computing. Communications of the ACM, 46(10), 25–28. Prebys, E. K. (1999). The Genetic Algorithm in Computer Science. MIT Undergraduate Journal of Mathematics, 1, 165–170. Qi, Y., & Bouguettaya, A. (2010). Foundations for Efficient Web Service Selection. Springer. Segaran, T. (2007): Programming Collective Intelligence. Sebastopol, California: O’Reilly Media Inc. Singh, M. P., & Huhns, M. N. (2005). ServiceOriented Computing: Semantics, Processes, Agents. John Wiley & Sons, Ltd. Zhang, L. J., Zhang, J., & Cai, H. (2007). Services Computing. Springer Verlag and Tsinghua University Press.
KEY TERMS AND DEFINITIONS

Service: A mechanism to enable access to one or more capabilities, where the access is provided using a prescribed interface and is exercised consistent with constraints and policies as specified by the service description (MacKenzie, Laskey, McCabe, Brown, & Metz, 2006).

Software Service: A service implemented using well-known standards that provides uniform, standards-based methods to offer, discover, interact with and use capabilities to produce desired effects consistent with measurable preconditions and expectations.

Service Network: A collection of well-known and well-defined network-enabled software services; the network spans multiple service providers, whose services are allowed to join and leave the network independently.

Abstract Business Process: An abstract description of a service composition for carrying out a particular process or task; it could also be referred to as a process model containing service interaction details. It is defined in an appropriate notation for the representation of the end-to-end process and its components, such as BPMN, the Business Process Execution Language (BPEL), or some other notation. The model has the objective of promoting the reuse of services, process fragments and available services. An abstract business process is often created by an analyst or process designer from a common, shared vocabulary of 'best practice' business process patterns (also known as process fragments) that help the designer to model and create processes (Paludo, Burnett, & Jamhour, 2000).

Business Process: In the context of this work, an instantiated business process is composed of one or more services that together perform a desired sequence of interconnected tasks. Component services can be provided by one or more service providers. Note that a business process deployed in this way differs significantly from a component-based application: while the owner of a component-based application also owns and controls its components, the owner of a service-based application generally does not own the component services, nor can it influence their execution (Andrikopoulos, Bertoli, Bindelli, di Nitto, Gehlert, Germanovich, Kazhamiakin, Kounkou, Pernici, Plebani & Weyer, 2008).

Quality of Service (QoS) Characteristic, Service Quality Dimension (Quality Attribute): A quality of service capability or requirement. Quality characteristics may apply at any logical level of the service-based application, from the (highest) business-level application layer to the (lowest) system-level infrastructure, and can apply to compositions of services as well as to individual service offerings. Practically, a quality characteristic is a meta-model that describes: the non-functional characteristics in the QoS properties of a service (e.g., latency or throughput); the dimensions in which each characteristic is measurable (e.g., reliability can be measured in mean time between failures or time to repair), as well as the direction of order in the domain, its units and associated statistics; the possibility of grouping several characteristics into higher-level constructs (e.g., in the performance category); and the description of the values that quantifiable QoS characteristics can take (Andrikopoulos, et al., 2008).
Chapter 8
A Game Theoretic Solution for the Optimal Selection of Services

Salah Merad, Office for National Statistics, UK
Rogério de Lemos, University of Kent, UK
Tom Anderson, Newcastle University, UK

DOI: 10.4018/978-1-60960-794-4.ch008
ABSTRACT

This chapter considers the problem of optimally selecting services during run-time with respect to their non-functional attributes and costs. Commercial pressures for reducing the cost of managing complex software systems are changing the way in which systems are designed and built. The reason behind this shift is the need to deal with changes efficiently and effectively, which may include removing the human operator from the process of decision-making. In service-oriented computing, in particular, the run-time selection and integration of services may soon become a reality, since services are readily available. Assuming that each component service has a specific functional and non-functional profile, the challenge now is to define a decision maker that is able to select services that satisfy the system requirements and optimise the quality of services under cost constraints. The approach presented in this chapter describes a game theoretic solution by formulating the problem as a bargaining game.
INTRODUCTION

Services are normally characterised in terms of their functional and non-functional properties. In general terms, the functional properties describe what the system does, while the non-functional
properties impose restrictions on how the system does it. The latter are usually expressed in terms of some observable attributes, such as dependability, usability and performance, and together specify the quality of a service. Depending on the service to be delivered by the system,
component services need to be selected according to their functional and non-functional properties. Although there are several techniques that allow fixing functional mismatches between required and provided services (DeLine 1999), the same cannot be said about the non-functional properties, since there are no simple techniques that easily permit adapting the non-functional profile of a service (except by changing the component providing the service). For example, if the performance of a particular service does not meet what is required, the solutions are either to re-evaluate what is required or to change the component providing the service. The problem is further exacerbated if we consider that we are dealing with multi-dimensional mismatches of non-functional properties. Hence, novel approaches have to be sought if support for self-adaptation in service-oriented computing is to become a reality. In this chapter, a game theoretic approach is proposed for optimally selecting, at run-time, component services with respect to their non-functional attributes. We consider the problem of selecting component services that have similar functionality, but differ in their non-functional attributes (NF-attributes). In addition to the NF-attributes mentioned above, we also consider the cost associated with a service. That is, in the process of selecting component services, we seek to "optimise" the quality of service provided by the overall system, subject to cost constraints. In such circumstances, it might be the case that, depending on the cost, we are able to select more than one component service with distinct, though complementary, NF-attributes. The advantage of selecting more than one component service is that the quality of service that can be obtained will be higher than what could be achieved by employing any single service from a repository of component services. The proposed game theoretic approach for the optimal selection of component services is based on the Nash solution for bargaining games, which relies on a decision maker that delegates the
decision process to self-interested rational agents, each representing a non-functional attribute of the service. In this setting, where each attribute is a player represented in terms of a utility, the optimal solution can entail selecting more than one component service if no single component service is able to meet the non-functional requirements of the system. The solution proposed yields a selection of services that is optimal in some well-defined sense, without relying on the decision maker's (DM) subjective value trade-offs: for every attribute there is a certain preference pattern over the available alternatives, and optimising all the attributes simultaneously leads to a compromise between these preference patterns, obtained when the DM delegates the decision process to agents that bargain with each other over which alternative should be selected. The innovative aspect of the proposed approach, compared with existing ones, is that instead of selecting a single service component based on NF-attributes and cost, we select a collection of component services that are functionally equivalent, but distinct regarding their NF-attributes. This solution optimises all the attributes simultaneously by taking into account their importance. In this chapter we consider the local selection of services, but nothing hinders the application of the proposed approach to composite services once the "abstract" component service is identified. The rest of the chapter is organised as follows. In the section below, on background, we provide a brief introduction to service selection, utility functions, and bargaining games, the key topics of this chapter. The section after that describes the main contribution of this chapter: first, we motivate briefly the game theoretic solution for the
selection of services; second, we present the core of the technical contribution in terms of a general model of a service-oriented system, a solution in the form of a bargaining game, and a small case study to exemplify the proposed approach; finally, we present alternative methods and compare them with the approach proposed here. The last two sections of the chapter present possible avenues for future work and a discussion evaluating our contribution.
BACKGROUND

Service Selection Based on Quality of Service

In service-oriented computing, it is expected that new services may be offered, and that existing services are subject to maintenance and improvements, so that they can be made available to a wide range of consumers. Service selection is usually done dynamically on the basis of a set of policies. Services are usually selected from a directory of available services, known as Universal Description, Discovery and Integration (UDDI), and the selection is based on descriptions of the services using the Web Services Description Language (WSDL). For the composition of services to be successful, it needs to be performed in the context of the quality of service (QoS) being provided, and regulated through service level agreements (SLAs). These SLAs stipulate, for example, how other policies should be respected, how dependability and security requirements should be enforced, and how overall performance should be delivered. SLAs should be monitored and violations reported, so that compensation can be obtained or alternative services selected. Web service brokers, which have the means to collect information about the quality of the services being provided and which complement the role of a service directory, have been advocated as a means for customers to make more
insightful selections between services that may appear to be equivalent (de Lemos 2005). Such an approach works well if all functional and non-functional requirements can be met by a single service. However, when the selection of services depends on multiple QoS attributes, which may include dependability and performance, it is possible that no single service provider is able to meet all the requirements. Under these circumstances, it may be necessary to negotiate the quality attributes, either relaxing the requirements or negotiating an improvement by the provider (Ardagna 2007). A different perspective is to select several service providers depending on the QoS needs and cost constraints, a problem that is usually referred to in the literature as QoS-aware Service Selection or Optimal Service Selection. One solution that has been exploited is the usage of a QoS Broker for negotiating quality attributes and resource reservation mechanisms on behalf of service providers. The QoS Broker selects a service provider that maximises a global utility for a client under a cost constraint, using utility functions and cost functions. The service provider is identified using predictive analytical queuing network performance models (Menascé 2007). Service composition is known to be the collection of generic service tasks, described in terms of service ontologies, which are combined according to a set of control-flow and data-flow dependencies (Dustdar 2005; Zeng 2004; Rao 2005). The major problem in selecting services in the context of composite services is that, for each possible selection of services, different levels of QoS and cost can be obtained; the challenge is to select an optimal composition. As part of the AgFlow middleware platform for quality-driven composition of Web services, two approaches to service selection for Web service composition were proposed (Zeng 2004): local optimisation and global planning. The two approaches were compared, and it was concluded that the global planning approach leads to significantly better QoS of composite service executions with little
extra system cost in static environments. If there is no requirement to specify global constraints, then local optimisation is preferable, especially in dynamic environments. In order to allow decisions to be taken at run-time, a heuristic algorithm for service selection has been proposed to find close-to-optimal solutions in polynomial time (Yu 2007). The objective was to maximise an application-specific utility function under end-to-end QoS constraints. The problem has been modelled in two ways: a combinatorial model that defines the problem as a multi-dimension multi-choice knapsack problem (MMKP), and a graph model that defines the problem as a multi-constrained optimal path (MCOP) problem. A rather different approach, based on genetic algorithms, was devised for supporting the dynamic binding of composite services whenever the actual QoS deviates from initial estimates (Canfora 2008). The authors claim that the main advantage of using GAs is the possibility of applying the approach in the presence of arbitrary, non-linear QoS aggregation formulae, whereas traditional approaches (such as linear integer programming) require linearisation. More recently, an approach has been proposed that combines global optimisation with local selection methods. This heuristic-based solution relies on the decomposition of global QoS constraints into a set of local constraints, such that the satisfaction of the local constraints by a local service broker guarantees the satisfaction of the global constraints (Alrifai 2009).
Utility Functions

Utility is a measure of relative satisfaction or gratification from the consumption of various goods and services. A utility function is simply a "quantification" of a player's preferences, and allows us to examine the measure of relative satisfaction succinctly and in mathematical terms. The assessment of a utility function is a matter of subjective judgment, just like the assessment of subjective probabilities. Despite some doubts
about the foundations of utility theory, mainly due to the fact that players can make different decisions in two identical or equivalent situations when these situations are described in a different way (Kahneman 1982), utility functions have been successfully established. Although most frequently used in financial contexts, utility theory has also been used in a wide range of other domains. In the context of computing science, utility functions have been used as an effective means to support decision making. In autonomic computing, utility functions have been used to express, in high-level business terms or service-level attributes, how resources can be dynamically allocated (Walsh 2004) and how systems can self-manage their behaviour (Kephart 2007). In the dynamic adaptation of component-based applications, utility functions have been employed in the selection of application variants in order to maximise the utility of the application, depending on the properties of the variants and the application environment (Geihs 2009). Indeed, in the wider context of software self-adaptation based on architectural reconfiguration, utility functions have been the preferred decision-making technique for ranking the utilities across a range of alternatives. In order to eliminate the role of a manager during the operation of the system, the self-adaptation language Stitch incorporates a utility-based algorithm for the selection of adaptation strategies (Cheng 2007). In Stitch, strategies are selected depending on the utilities associated with the business preferences regarding the quality of services. Utility functions are usually determined at design-time, which requires the properties of possible alternatives to be specified in considerable detail; this can be a cumbersome task and prone to inconsistencies. When the user is directly involved with the selection of services, based on their respective qualities, the task becomes even more challenging. In order to facilitate the calibration of preferences related to the quality of services, simple tools have been devised for helping users to customise their
preferences for applications that run on resource-constrained devices (Sousa 2008).
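Later sections contrast linear (risk-neutral) utilities with von Neumann utilities that encode an attitude to risk. As a purely hypothetical illustration (the shapes, names and scaling below are assumptions, not elicited preferences), a single-attribute utility scaled to 0 at the required value and 1 at the best value might look like this:

```java
// Hypothetical single-attribute utility shapes, for illustration only.
public final class UtilityShapes {
    /** Linear (risk-neutral) utility: 0 at the required value, 1 at the best value. */
    public static double linear(double x, double required, double best) {
        return (x - required) / (best - required);
    }

    /** A concave (risk-averse) variant for a > 0, rescaled to the same endpoints. */
    public static double riskAverse(double x, double required, double best, double a) {
        double t = linear(x, required, best);
        return (1 - Math.exp(-a * t)) / (1 - Math.exp(-a));
    }
}
```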
Game Theory and Bargaining Games

Game theory has seen increasing popularity in the computing literature because of the nature of the new problems that have appeared with advances in artificial intelligence, automated systems, distributed architectures, and the spread of computing networks (Ito 1995; Rosenschein 1994). In these problems, some entities (for example, robots in a warehouse, computer programs, automated airplanes, software in control systems, problem solvers in distributed systems, users of a shared computer, etc.) are assumed to have some autonomy in their decision making and to be rational, that is, they are utility maximisers. These entities are called agents, and they are usually assumed to be self-interested; that is, they try to maximise their own expected utility function rather than the overall utility of the system to which they belong. The interaction of these agents leads to conflict situations that need to be resolved. Because of the characteristics of the interacting agents, the process of conflict resolution is achieved using the appropriate techniques from game theory. In bargaining games, two or more players must reach agreement on how to distribute the payoff. The purpose is to reach the most favourable agreement that they can, while avoiding the risk of making no agreement at all (Davis 1983). In other words, each player prefers to reach an agreement in these games, rather than abstain from doing so; however, each player prefers the agreement which most favours her/his interests. One way of bypassing the bargaining process, and thus avoiding a possible failure to agree, is to have the terms of agreement fixed by arbitration, which should reflect the strengths of the players so that the results of negotiation can be attained without risk. John Nash proposed a small number of axioms with the striking result that, given certain technical assumptions, there is a unique solution to the
bargaining problem, known as the Nash bargaining solution (Nash 1950). The Nash bargaining solution is the outcome that maximises the product of the players' gains from any agreement.
DYNAMIC SELECTION OF SERVICES

Motivation

One of the stages in the lifecycle of service-oriented computing is the selection of services among several alternatives (Papazoglou 2007). Once the functional and non-functional attributes of these services have been well characterised at their interfaces (Franch 1998; Frolund 1998), the selection of the "best" service or group of services becomes a multi-attribute decision problem. For this reason, we need to define a solution concept that is scalable to a large number of services and attributes, and which has an optimal solution in some well-defined sense. In addition to the above features, in the longer term we would like the method to be able to support the selection of services during run-time, in order to adapt to changes that might occur in the operating environment of the service-oriented system. Moreover, instead of focusing directly on the component services in an attempt to find the "best" component that is able to fulfil all the requirements, our approach considers individual attributes and aims to select a collection of components that is able to satisfy the multi-attribute requirements.
A Game Theoretic Solution

In the following, we propose a solution concept which yields the selection of services that is optimal in some well-defined sense, without relying on the decision maker's (DM) subjective value trade-offs. For every attribute there is a certain preference pattern over the available alternatives. If we try to optimise all the attributes simultaneously, then the solution will be
a compromise between the preference patterns of all the attributes. This solution is obtained when the DM delegates the decision process to self-interested rational agents, each representing an attribute. The agent for attribute $X_n$ has utility function $u_n(x_n)$. The agents will then have to bargain with each other to reach an agreement on which alternative should be selected. This solution will yield the alternative that is as satisfactory as possible for each attribute. In this formulation, the DM's value trade-offs between the attributes are not incorporated into the structure of the game.
General Model of a System of Services

Let $A = \{A_1, A_2, \ldots, A_M\}$ be the set of services which satisfy the user's requirements. Let $X_1, X_2, \ldots, X_N$ be $N$ NF-attributes associated with a service. Let $a_{mn}$ be the value of attribute $X_n$ in service $A_m$, $1 \le m \le M$ and $1 \le n \le N$. Then, the vector $a_m = (a_{m1}, a_{m2}, \ldots, a_{mN})$ represents the profile of service $A_m$. Let $rv_n$ be a required value for attribute $X_n$. For those NF-attributes in which the higher the value the better, the value of attribute $X_n$ of a service must be equal to or greater than $rv_n$. We denote the user's NF-requirements by the vector $rv = (rv_1, rv_2, \ldots, rv_N)$. Let $u_{mn}$, $m = 1, \ldots, M$, be the utilities of attribute $X_n$ of a service $A_m$. The utilities are computed using the method of von Neumann (von Neumann 1947; Keeney 1976), and are scaled so that the utility is 0 at the required value and 1 at the best value (for more details on how to compute the utilities, see (Merad 1999)). The utility profile of service $A_m$ is denoted by $u_m = (u_{m1}, u_{m2}, \ldots, u_{mN})$. Let $\omega_n$ represent the weight or importance of attribute $X_n$. We have $0 \le \omega_n \le 1$, $n = 1, 2, \ldots, N$, and $\sum_{n=1}^{N} \omega_n = 1$.
Let $c_m$ be the cost of service $A_m$, and $c^*$ the budget available for the software system. Without loss of generality, we assume that $c_m \le c^*$ for all $1 \le m \le M$. Table 1 gives a summary of the services' profiles and their weights, together with the user's requirements.
Table 1. Services' profiles and user's requirements

                  X1     X2     …   XN     Cost
Weights           ω1     ω2     …   ωN
Required values   rv1    rv2    …   rvN    c*
A1                a11    a12    …   a1N    c1
A2                a21    a22    …   a2N    c2
…                 …      …      …   …      …
AM                aM1    aM2    …   aMN    cM
Given a user's non-functional requirements (NF-requirements) and a limited budget, the problem is to select the "best" service, or group of services, from a library of services with respect to their NF-attributes and their cost.
The Nash Solution

Nash proposed a solution to the bargaining game in which, to avoid the prospect of not reaching agreement, the players/agents are willing to submit their conflict to a "fair" arbiter (Nash 1950): an impartial outsider who will resolve the conflict by suggesting a solution. The arbitration scheme devised by Nash is defined by a function which associates each conflict with a unique payoff to the players. The arbitration solution should give each player at least as much as the player could get in the worst case, and there should not be any other feasible payoff preferred by all players. In the context of service selection, the worst case corresponds to the required value for an attribute, at which the utility is set at 0 for simplicity. For the mathematical formulation of the subjective intuition of fairness, Nash defined the following four basic axioms (Carmichael 2005; Luce 1957): invariance to equivalent utility representation, which states that the players' preferences should affect the arbitrated outcome, and not the specific utility functions used to represent them; independence of irrelevant alternatives, which states that unselected options should not have an impact on the final outcome; symmetry, which states that the outcome should not depend on the labelling of the players, thus implying that when the players' utility functions are the same they should receive equal shares; and Pareto-optimality, which states that there should not be any other outcome in which two players simultaneously do better; in other words, none of the players can increase their payoff without decreasing the payoff of at least one of the other players. Before we describe the method to find the Nash solution in the context of service selection, we define some terms:
1. A randomised strategy $\delta$ is an $M$-tuple $(\delta_1, \delta_2, \ldots, \delta_M)$, where $0 \le \delta_m \le 1$, $m = 1, 2, \ldots, M$, and $\sum_{m=1}^{M} \delta_m = 1$; $\delta_m$ is the probability of selecting service $A_m$.
2. The expected utility for attribute $X_n$ under the randomised strategy $\delta$ is given by $EU_n(\delta) = \sum_{m=1}^{M} \delta_m u_{mn}$.
3. A strategy $\delta$ is dominated if there exists a strategy $\beta$ such that $EU_n(\delta) \le EU_n(\beta)$ for all $n \in \{1, 2, \ldots, N\}$ and at least one inequality is strict.
4. A strategy $\delta$ is said to be Pareto-optimal if it is not dominated.
5. Let $\Delta$ be the set of randomised strategies that are Pareto-optimal and such that the expected utility (payoff) for every attribute is positive. Then the set of Pareto-optimal payoffs $V(\Delta)$ is defined as $V(\Delta) = \{v = (v_1, v_2, \ldots, v_N) \mid v_n = EU_n(\delta), \delta \in \Delta\}$.
6. Let $v_n$, $1 \le n \le N$, be the expected utility for attribute $X_n$ under strategy $\delta$. Then the (generalised) Nash product (Binmore 1992), which gives a measure of the quality of service, is defined as $F(v_1, v_2, \ldots, v_N) = \prod_{n=1}^{N} v_n^{\omega_n}$.
7. The Nash solution is achieved by maximising the Nash product in the set of Pareto-optimal payoffs.

The Nash product not only satisfies the "fairness" axioms; it can be shown that it is the only function that does so (Nash 1950). Hence, these fairness conditions implicitly define a unique arbitration scheme for bargaining games. In terms of the problem of selecting a service, or a group of services, from a library, the set of Pareto-optimal payoffs is a portion of the boundary of the convex hull generated by the utility profiles of the alternative services. This convex hull, which we denote by $H(\{u_1, u_2, \ldots, u_M\})$, can then be represented as a convex combination of these profiles (Bazaraa 1993), that is,

$H(\{u_1, \ldots, u_M\}) = \{ v = (v_1, \ldots, v_N) : v_n = \sum_{m=1}^{M} \delta_m u_{mn},\ 1 \le n \le N,\ \delta_m \ge 0\ \text{and}\ \sum_{m=1}^{M} \delta_m = 1 \}.$

In the context of our problem, this means that the optimal solution will be a combination of at most N services from the pool. Figure 1 illustrates an example with a library of five services and two attributes. The extreme points of the polytope represent four of the services, and the internal point represents the fifth service. The set of Pareto-optimal payoffs is drawn in bold.

Figure 1. Set of Pareto-optimal payoffs

To find the Nash solution, we need to maximise the Nash product in the set of Pareto-optimal payoffs. This is achieved by maximising the Nash product over the region defined by the convex hull $H(\{u_1, u_2, \ldots, u_M\})$; that is, by solving the nonlinear constrained optimisation problem

P1: maximise $F(v_1, v_2, \ldots, v_N) = \prod_{n=1}^{N} v_n^{\omega_n}$
subject to: $v \in H(\{u_1, u_2, \ldots, u_M\})$.
If we substitute $v_n$ by $\sum_{m=1}^{M} \delta_m u_{mn}$ in the expression of $F(v_1, v_2, \ldots, v_N)$, problem P1 is then equivalent to problem P2:

P2: maximise $G(\delta_1, \delta_2, \ldots, \delta_M) = \sum_{n=1}^{N} \omega_n \ln \sum_{m=1}^{M} \delta_m u_{mn}$
subject to: $\sum_{m=1}^{M} \delta_m = 1$, $\delta_m \ge 0$, for $1 \le m \le M$.

Problem P2 is a non-linear optimisation problem with one linear equality constraint and M linear inequality constraints. There are numerous approximate numerical methods to solve problem P2 (Bazaraa 1993), for instance, the Zoutendijk method (Zoutendijk 1960), the successive linear programming approach (Griffith 1961) and the generalised reduced gradient method (Abadie 1969). The latter method is the basis of the solver used in numerous software packages such as LINGO (Schrage 1991) and GRG2 (Lasdon 1978). Let $\delta^* = (\delta_1^*, \delta_2^*, \ldots, \delta_M^*)$ be the optimal solution of problem P2. All the services for which there is a positive probability of being selected will comprise the optimal group. As was noted above, there will be up to N services in the optimal group. The total cost $C_{s^*}$ of the group, which corresponds to the Nash solution $\delta^*$, is given by $C_{s^*} = \sum_{m=1}^{M} \theta_m c_m$, where $\theta_m = 1$ if $\delta_m^* > 0$ and $\theta_m = 0$ if $\delta_m^* = 0$.
The budget available for expenditure on the system is $c^*$, and if the total cost $C_{s^*}$ does not exceed $c^*$, we say that the solution $\delta^*$ is admissible. To avoid having non-admissible solutions, we could express the budget requirement as a constraint added to the optimisation problem P2. Unfortunately, we can only approximate this constraint by the nonlinear inequality $\sum_{m=1}^{M} \frac{\delta_m}{\delta_m - \varepsilon} c_m \le c^*$, where $\varepsilon$ is a small positive number, and the resulting solution turns out to be very sensitive to the choice of $\varepsilon$. Moreover, when there are nonlinear constraints, the standard algorithms do not guarantee convergence to the optimal solution. We hence adopt an iterative process to find a feasible solution. When the optimal solution is non-admissible, services of the optimal group have to be removed from the pool, and the new Nash solution for the reduced pool has to be found. A tree diagram can represent this removal process: the nodes are the pools together with their optimal solutions. The root node is the original library and its solution, and every node has a number of offspring equal to the number of services in the optimal group associated with a non-admissible solution; that is, at most N offspring. If a node yields either an admissible solution, or a non-admissible solution whose Nash product is lower than the highest Nash product of the existing admissible solutions, then the removal process is discontinued on this node. On the other hand, if the Nash product is higher than that of all the admissible solutions, then the removal process continues. At the end of the removal process, if there is more than one reduced pool for which the Nash solution is admissible, then the DM will choose the solution with the highest Nash product. But the optimal admissible solution may be only marginally better than some of the other solutions, whereas its cost may be much higher. In this case, the DM may prefer to trade off the Nash product of the groups against their total costs. For instance, suppose that there are K reduced pools whose solutions are $\delta^{(1)}, \delta^{(2)}, \ldots, \delta^{(K)}$, with Nash products $F^{(1)}, F^{(2)}, \ldots, F^{(K)}$, respectively. For each solution, we define the value trade-off index $T^{(k)} = F^{(k)} / C_{\delta^{(k)}}$, and the solution with the highest index is chosen. But this exhaustive search may require iterating the removal of services up to $M-1$ times. In the worst-case scenario, at iteration $k$, $1 \le k \le M-1$, there are $N^{k-1}$ pools, and hence a total of $\frac{1 - N^{M-1}}{1 - N}$ pools are examined, which is very large. But if $c^*$ is large enough, say comparable to $N\bar{c}$, where $\bar{c}$ is the average cost of the services in the pool, then the search is very likely to terminate quickly. When $c^*$ is much smaller than $N\bar{c}$, it is more efficient to consider all the groups of size equal to or smaller than N whose total cost does not exceed $c^*$, find the Nash solution for every group, and select the group with the highest Nash product or trade-off index.
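To make problem P2 concrete, the following is a minimal Java sketch of a simple solver. It maximises the logarithmic objective G over the probability simplex by exponentiated-gradient ascent; the chapter itself relies on generalised reduced gradient solvers (LINGO, GRG2), so this iterative scheme, its step size and its iteration count are illustrative assumptions, not the authors' implementation. Here u[m][n] is the utility of service m for attribute n, and w[n] is the attribute weight.

```java
// Illustrative solver for P2: maximise G(delta) = sum_n w_n * ln(sum_m delta_m * u[m][n])
// over the probability simplex, via exponentiated-gradient ascent.
public final class NashSolver {

    /** Returns an approximation of delta*, the Nash-optimal randomised strategy. */
    public static double[] solveP2(double[][] u, double[] w, int iterations, double stepSize) {
        int m = u.length, n = w.length;
        double[] delta = new double[m];
        java.util.Arrays.fill(delta, 1.0 / m);              // start at the interior point

        for (int it = 0; it < iterations; it++) {
            double[] eu = new double[n];                    // expected utility EU_n(delta)
            for (int j = 0; j < n; j++)
                for (int i = 0; i < m; i++) eu[j] += delta[i] * u[i][j];

            double norm = 0.0;
            double[] next = new double[m];
            for (int i = 0; i < m; i++) {
                double g = 0.0;                             // dG/d delta_i = sum_j w_j u[i][j] / EU_j
                for (int j = 0; j < n; j++)
                    if (eu[j] > 1e-12) g += w[j] * u[i][j] / eu[j];
                next[i] = delta[i] * Math.exp(stepSize * g); // multiplicative update
                norm += next[i];
            }
            for (int i = 0; i < m; i++) delta[i] = next[i] / norm; // re-project onto simplex
        }
        return delta;
    }

    /** Nash product F = prod_n EU_n(delta)^{w_n}. */
    public static double nashProduct(double[][] u, double[] w, double[] delta) {
        double f = 1.0;
        for (int j = 0; j < w.length; j++) {
            double eu = 0.0;
            for (int i = 0; i < u.length; i++) eu += delta[i] * u[i][j];
            f *= Math.pow(eu, w[j]);
        }
        return f;
    }
}
```

Since G is concave and the simplex is convex, this ascent converges to the global optimum for a sufficiently small step size, which is why a local method suffices here.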
Example

In the following example, the aim is to optimally select a service, or group of services, from a pool of four services which are characterised by three attributes:

1. $X_1$ represents the reliability of a service, and it is given by the probability of operating without a failure during a given length of time under a given set of operating conditions.
2. $X_2$ represents performance, and it is measured as the number of operations executed per unit time.
3. $X_3$ represents the availability of the system, and it is given by the probability that the system is functioning correctly at any given time.

Table 2 gives the profiles of the four services, together with the user's weights and required values. Assuming linear utility functions for all attributes, and setting utilities at 0 at the required values and at 1 at the maximum values, the utilities are given by $u_{mn} = \frac{a_{mn} - rv_n}{a_n^* - rv_n}$, for $1 \le m \le 4$, $1 \le n \le 3$, where $a_n^* = \max_{1 \le m \le 4} a_{mn}$. The computed utilities are presented in Table 3.
Table 2. Data for the example

                  X1     X2    X3     Cost (£)
Weights           0.5    0.25  0.25
Required values   0.95   100   0.90   500
A1                0.97   130   0.92   100
A2                0.99   110   0.96   170
A3                0.98   120   0.93   160
A4                0.96   140   0.94   150

Table 3. Utility profiles of the services

     X1     X2     X3
A1   0.5    0.75   1/3
A2   1      0.25   1
A3   0.75   0.5    0.5
A4   0.25   1      2/3
Using the package LINGO (Schrage 1991), we find that the optimal solution is to select the group {A2, A4}, where services A2 and A4 are selected with probabilities 0.84 and 0.16, respectively. Its Nash product is 0.72 and its total cost is 320. This solution is hence admissible. Its trade-off index is 0.72/320 ≈ 0.002.

If the available budget is c* = 270, then the above solution is not admissible, because the total cost of the group is 320. We need to remove one of the services composing the group.

1. Service A2 is removed from the pool. The optimal solution in the resulting reduced pool is to select service A3 only. The Nash product is equal to 0.61 and its cost is 160. This solution is hence admissible. Its trade-off index is 0.61/160 ≈ 0.004.
2. Service A4 is removed from the pool. The optimal solution in the resulting reduced pool is to select the group {A1, A2}, where the services are selected with probabilities 0.07 and 0.93, respectively. The Nash product of this solution is 0.71 and its total cost is 270. This solution is hence admissible. Its trade-off index is 0.71/270 ≈ 0.003.

Both solutions are admissible, but group {A1, A2} has the higher Nash product, whereas {A3} has the higher trade-off index. Note that when the available budget is c* = 500, the optimal group {A2, A4} is admissible, but it has a lower trade-off index than the groups {A3} and {A1, A2} obtained if the removal process were carried out on the original pool. The Nash product of group {A2, A4} is only slightly higher than that of group {A1, A2}, but the total cost of the latter is significantly lower. However, the search procedure cannot prevent such situations as, once an admissible solution is found at a node, the search is terminated at that node.
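As a worked illustration, the hypothetical driver below (class and method names are assumptions, building on the NashSolver sketch given earlier) recomputes the Table 3 utilities from the Table 2 data and searches for the Nash solution; up to solver tolerance it should recover the group {A2, A4} with probabilities close to 0.84/0.16 and a Nash product close to 0.72, as reported above.

```java
// Hypothetical driver reproducing the chapter's example with NashSolver.
public final class SelectionExample {
    public static void main(String[] args) {
        double[][] a = {            // profiles from Table 2 (X1, X2, X3)
            {0.97, 130, 0.92},      // A1
            {0.99, 110, 0.96},      // A2
            {0.98, 120, 0.93},      // A3
            {0.96, 140, 0.94}};     // A4
        double[] rv = {0.95, 100, 0.90};   // required values
        double[] w  = {0.5, 0.25, 0.25};   // attribute weights

        // u_mn = (a_mn - rv_n) / (a*_n - rv_n), where a*_n is the pool maximum.
        int m = a.length, n = rv.length;
        double[][] u = new double[m][n];
        for (int j = 0; j < n; j++) {
            double best = rv[j];
            for (double[] row : a) best = Math.max(best, row[j]);
            for (int i = 0; i < m; i++) u[i][j] = (a[i][j] - rv[j]) / (best - rv[j]);
        }

        double[] delta = NashSolver.solveP2(u, w, 20000, 0.05);
        // Expected (up to tolerance): delta ≈ (0, 0.84, 0, 0.16), Nash product ≈ 0.72.
        System.out.println(java.util.Arrays.toString(delta));
        System.out.println(NashSolver.nashProduct(u, w, delta));
    }
}
```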
ALTERNATIVE SOLUTIONS AND EVALUATION CRITERIA

In order to contextualise the application of a game theoretic solution to the optimal selection of services, the following subsection introduces several alternative solutions, and the second subsection discusses criteria that can be used for evaluating the game theoretic solution against the alternatives considered.
Alternative Solutions

Combined Utility Function (CUF)

There is already a large literature outside the field of software engineering that deals with this type of problem; see, for example, (Fishburn 1970; Keeney 1976), where the authors seek to represent the decision maker's preferences with a simple function. One of these approaches shows that, under some independence assumptions between the attributes, it is possible to represent the user's preferences and value trade-offs with a Combined Utility Function $U(x_1, x_2, \ldots, x_N)$ in terms of a simple expression. For example, under the assumption of preference independence, the utility function has the additive form $U(x_1, x_2, \ldots, x_N) = \sum_{n=1}^{N} k_n u_n(x_n)$, where $u_n(x_n)$, $1 \le n \le N$, is the utility function for attribute $X_n$, and $k_1, k_2, \ldots, k_N$ are scaling coefficients through which the user's value trade-offs are evaluated empirically. Hence, the Combined Utility Function (CUF) requires assumptions about the independence between attributes and the empirical evaluation of scaling coefficients. This evaluation process is subjective and often yields inconsistencies, which are difficult to eliminate even after repeating the process many times, especially when the number of attributes is large. For a more detailed account of this method and its applicability to the selection of software components, see (Merad 1999).
Weighted Scoring Method (WSM) and the Analytic Hierarchy Process (AHP)

In the context of software engineering, the complexity of component selection has already been recognised, and a framework that supports multivariable component selection has been developed (Kontio 1996). In Kontio's framework two solution methods are considered: the Weighted Scoring
Method (WSM) and the Analytic Hierarchy Process (AHP) (Saaty 1990). The latter was chosen because it is theoretically sound, it constructs the user’s utility function for each attribute, and it has a strong empirical validation. It is true that the AHP is superior to the WSM, but it nevertheless has some shortcomings. The AHP is a heuristic method, which yields an additive aggregate utility function, which has the same form as the Combined Utility Function (CUF) under the assumption of preference independence. In this aggregate utility function, the utilities for each attribute are separately evaluated by pair-wise comparisons of the alternatives; they are in fact positive numbers which add up to 1, and which represent ratio scale preferences between the alternatives. But these utilities are not as accurate as the ones computed using the method of von Neumann when representing the user’s strength of preferences and attitude to risk (von Neumann 1947). Moreover, the inconsistencies in the pairwise comparisons may be difficult to eliminate when the number of components is large. Also in this aggregate utility function the coefficients of the individual utilities are the weights of the attributes, hence not incorporating the user’s value trade-offs between the attributes. In the methods considered so far, namely CUF, WSM and AHP, we need to construct utility functions, make assumptions, and perform computations. These tasks can be inaccurate, tedious, and time consuming. Depending on the problem, simpler methods can sometimes be adopted for the selection of services. For instance, if one of the attributes is clearly much more important than the others, then we could make the selection with respect to only that attribute. However, in most cases there is a need for methods that are able to handle the selection of services based on more than one attribute. The two solutions presented below are computationally simpler, at the cost of disregarding the information about the DM’s strength of preferences. The solutions obtained using these
A Game Theoretic Solution for the Optimal Selection of Services
methods are all Pareto-optimal. Without loss of generality, it is assumed that for all the attributes that characterise a service the high values are the most desirable.
Minimum Weighted Sum of Ranks (MWSR)

In this method, services are ranked from the best to the worst for every attribute, with the best having rank 1 and the worst rank $M$. Let $r_{mn}$, with $1 \le r_{mn} \le M$, be the rank of service $A_m$ with respect to attribute $X_n$. Taking into account all attributes and their importance, the overall rank of service $A_m$ is given by its weighted sum of ranks, defined as

$R_m = \sum_{n=1}^{N} \omega_n r_{mn}$.

This measure has the same form as the aggregate utility function of the AHP method, but with ordinal individual utility functions. The optimal service is the one with the minimum weighted sum of ranks, i.e., the lowest overall rank.
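A minimal Java sketch of MWSR selection follows; the rank matrix and importance weights are hypothetical and would in practice be derived from the measured service profiles.

```java
public class Mwsr {
    // rank[m][n] = rank of service m for attribute n (1 = best).
    // weights[n] = importance weight of attribute n.
    static int selectService(int[][] rank, double[] weights) {
        int best = -1;
        double bestScore = Double.POSITIVE_INFINITY;
        for (int m = 0; m < rank.length; m++) {
            double rm = 0.0; // R_m = sum over n of w_n * r_mn
            for (int n = 0; n < weights.length; n++) {
                rm += weights[n] * rank[m][n];
            }
            if (rm < bestScore) { bestScore = rm; best = m; }
        }
        return best; // service with the minimum weighted sum of ranks
    }

    public static void main(String[] args) {
        int[][] rank = { { 1, 3 }, { 2, 1 }, { 3, 2 } };
        double[] weights = { 0.6, 0.4 };
        System.out.println("Selected service: A" + (selectService(rank, weights) + 1));
    }
}
```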
Maximum Weighted Product (MWP)

In this method, an index is evaluated for every service, and the service with the highest index is selected. The index $I_m$ of service $A_m$ is defined as

$I_m = a_{m1}^{w_1} a_{m2}^{w_2} \cdots a_{mN}^{w_N}$.

This index is a measure of utility first introduced by Bridgman (Bridgman 1922) and later used by Miller and Starr (Miller 1960) for goal programming problems. However, the rationale behind this solution, the implicit assumptions, and how the solution is related to the decision maker's preference structure were not well understood (Johnsen 1968). This index has in fact the same form as the Nash product when restricted to non-randomised strategies, with the individual utilities being identical to the values taken by the services for the attributes. The individual utility functions are linear and represent ratio scale preferences; however, they are not of the von Neumann type. Hence, this method is not suitable if the decision maker's attitude to risk is not neutral, because a non-neutral attitude to risk gives rise to a nonlinear utility function.
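The sketch below implements MWP under the chapter's assumption that all attribute values are positive and higher values are preferable; the profiles and weights shown are hypothetical. Because $\log I_m = \sum_n w_n \log a_{mn}$, the index can be maximised in the log domain, which avoids numerical underflow when many attributes are involved.

```java
public class Mwp {
    // a[m][n] > 0: value of attribute n for service m; w[n]: attribute weights.
    static int selectService(double[][] a, double[] w) {
        int best = -1;
        double bestLogIndex = Double.NEGATIVE_INFINITY;
        for (int m = 0; m < a.length; m++) {
            double logIm = 0.0; // log I_m = sum over n of w_n * log a_mn
            for (int n = 0; n < w.length; n++) {
                logIm += w[n] * Math.log(a[m][n]);
            }
            if (logIm > bestLogIndex) { bestLogIndex = logIm; best = m; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] a = { { 99.9, 0.5 }, { 98.0, 0.9 } }; // hypothetical profiles
        double[] w = { 0.6, 0.4 };
        System.out.println("Selected service: A" + (selectService(a, w) + 1));
    }
}
```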
Evaluation Criteria

In the previous section, a number of alternative solutions were presented for selecting component services from a repository of services. The solution concept that should be adopted depends on the decision maker's (DM) interests, the information the DM can provide, the practicality of the computation, and, most importantly, the trade-offs between the computational effort and the gain in utility over simpler but "cruder" methods (Chankong 1983). Below, the different solutions presented, including the proposed one, will be evaluated qualitatively with respect to the information required, the type of solution obtained, and the computational effort involved.
Information

To support a DM in making a rational choice between the available alternatives, we need to know the profile of every service for the various NF-attributes and cost, the individual utility functions, and how the DM may trade off between the attributes.

• Profiles: In all the methods, the services need to be measured with respect to every attribute in some chosen scale. In the MWSR method, the services only need to be ranked from the best to the worst for each attribute, so precision in the measurement is less important than in the other methods, so long as the difference between the services is obvious.
• Individual Utility Functions: All the methods use individual utility functions, but the functions used can be different, and some may contain more information than others about the DM's preference patterns. For instance, in the combined utility and the game theoretic solution, the utility functions are constructed using the method of von Neumann, whereas in the AHP and the MWP they represent ratio scale preferences. In the MWSR method, the utility functions are ordinal and hence contain the least information about the DM's strength of preferences.
• Value Trade-offs: Only the combined utility method incorporates the DM's value trade-offs explicitly, through the scaling constants evaluated in indifference experiments that can only be carried out off-line. As noted above, the method of evaluating the constants is subjective and can lead to inconsistencies that can be difficult to eliminate completely. Other methods, such as the MWSR, the MWP and the game theoretic solution, use the importance coefficients as partial measures of the value trade-offs. A simpler method that bases the service selection on the most crucial attribute obviously does not incorporate any measure of value trade-offs.
Type of Solution

All the methods suggested yield Pareto-optimal solutions; this is a minimal requirement for any method. The combined utility function represents the DM's true preference structure when some independence assumptions between the attributes hold. The AHP and the MWSR methods yield an aggregate utility which approximates the true preferences of the DM, but it is not clear what assumptions are implicit and how good the approximation is. The game theoretic solution optimises all the attributes simultaneously, taking into account their importance and the DM's strength of preferences for every attribute. The MWP yields a solution of the same type as the Nash solution, but without including the DM's true structure of preferences for the attributes. The game theoretic solution can yield randomised strategies, and hence lead to the selection of more than one component service. Although this may lead to higher costs depending on how services are charged, it also leads to a solution that is robust to changes in the profiles or the utilities of the services.
Computational Effort

The amount of computation needed varies greatly between the methods. In the game theoretic solution, the main computational effort is the optimisation of a separable function over the boundary of a convex hull, for which efficient methods exist. In the combined utility approach, the main computational requirement is to solve a system of linear equations as part of the evaluation process of the scaling coefficients. The other methods require only very simple arithmetic and sorting operations.
Future Research Directions

As future work, a study is being conducted to identify simpler and more efficient approaches for selecting services in systems that need to be reconfigured at run-time when adapting to changes, either to the system itself or to its environment. Moreover, the optimal selection of services should be done without any human intervention, as would be appropriate for plug-and-play architectures (Lowry 1998). This implies that the whole process of decision making also needs to adapt at run-time, from the ranking of the alternatives to the criteria used for selecting them; otherwise the system will start to deviate from its intended purpose, expressed in terms of goals, for example because the services being selected cease to be representative of the new circumstances. If service selection relies on utility functions, then means have to be found for updating the coefficients of the utility functions in order to maximise the overall utility of the system while the system and/or its services evolve. If the coefficients do not adapt to changes, which may also include the addition of new alternatives, then the model for decision making will be detrimentally affected, which may impact the selection of services based on their non-functional attributes. An envisaged solution to support dynamic utility functions is the provision of a meta-level that is able to reflect on the performance of the utility functions against the established system goals, and to update them in response.

The proposed approach for selecting services was considered mainly in the context of a single service, not composite services. However, there is no reason why the proposed game theoretic solution could not be extended to composite services. A straightforward procedure would be to define an "abstract" component service based on the proposed solution, and then compose these "abstract" component services using one of the existing approaches. One approach that could be employed is the one in which QoS global constraints are decomposed into a set of local constraints (Alrifai 2009). Another issue in the proposed approach that needs attention is how to evaluate the integration cost of a group of services that are to be used together. This cost has to be estimated in advance, and some experience may be needed to reduce the uncertainty in the estimation. Also related to the selection of a group of services that may have distinct non-functional profiles, it is important to evaluate the risk associated with a particular architectural configuration, in the same way that this is done at design-time in order to identify and mitigate risks (Bianco, 2007). An alternative way to facilitate risk evaluation would be to consider architectural selection (Guo, 2007) instead of service selection, since the combinatorial aspects of the latter might hinder a run-time evaluation. However, the selection of architectures may require a totally different approach from the one presented here.
CONCLUSION

As the discipline of service oriented computing gains momentum as a means of building large and complex systems from services originating from different sources, new methods have to be devised to build these systems efficiently. Although the selection of component services is usually done dynamically on the basis of a set of policies, their optimal selection in terms of their non-functional attributes is still an open research topic. We endeavoured to solve this problem by considering the associated costs of component services, taking into account the non-functional requirements of the system to be built and its allocated budget. When building systems out of services, the task of fixing non-functional mismatches is a difficult one, as we cannot rely on non-intrusive techniques such as wrappers and bridges. Moreover, since the non-functional profile of services is a multi-dimensional problem, it would be difficult for a single service to excel in all of the system's non-functional requirements. Hence the approach taken in this chapter, which is to select a group of component services that might have complementary non-functional attributes.

Our game theoretic determination of the optimal selection of component services, according to their non-functional profile and cost, is based on the Nash solution for bargaining games. It relies on a decision maker that delegates the decision process to self-interested rational agents, each representing a non-functional attribute of the service. In this setting, where each attribute is a player represented in terms of a utility, the optimal solution can entail selecting more than one service, because no single service might be able to meet the non-functional requirements of the system. Such a situation would be more likely to arise when functionally equivalent, readily available component services have different non-functional attributes. Indeed, the idea of using multiple services with the same functionality for a single application is not new; in software fault tolerance, for example, off-the-shelf components have been proposed as a means for obtaining diversity.

The feasibility of the proposed approach relies on the availability of quantitative measures for describing the non-functional profiles of services, and the literature already provides a wide range of solutions (de Lemos 2005; Serhani 2005). Moreover, we assume that when composing services the non-functional attributes of the individual services are not affected. Within this context, we could imagine an implementation of the proposed approach in which several identical services would process the same set of inputs, but on each occasion only one service output would be selected according to some optimal solution. Although the proposed approach could be applied to most non-functional attributes, there are some attributes, such as security and safety, for which extra care should be taken. When dealing with systems in which security or safety is critical, the optimal selection of services becomes secondary compared with the need to maintain system integrity.
REFERENCES

Abadie, J., & Carpentier, J. (1969). Generalization of the Wolfe reduced gradient method to the case of nonlinear constraints. In Fletcher, R. (Ed.), Optimization. New York, NY: Academic Press.

Alrifai, M., & Risse, T. (2009). Combining global optimization with local selection for efficient QoS-aware service composition. In Proceedings of the 18th International Conference on World Wide Web (WWW '09) (pp. 881-890). New York, NY: ACM.
Ardagna, D., & Pernici, B. (2007). Adaptive service composition in flexible processes. IEEE Transactions on Software Engineering, 33(6), 369–384. doi:10.1109/TSE.2007.1011

Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (1993). Nonlinear programming: theory and algorithms. New York, NY: John Wiley & Sons, Inc.

Bianco, P., Kotermanski, R., & Merson, P. (2007). Evaluating a service-oriented architecture. Software Engineering Institute Technical Report CMU/SEI-2007-TR-015.

Binmore, K. (1991). Fun and games: a text on game theory. Lexington, MA: D. C. Heath.

Bridgman, P. W. (1922). Dimensional analysis. New Haven, CT: Yale University Press.

Canfora, G., Di Penta, M., Esposito, R., & Villani, M. L. (2008). A framework for QoS-aware binding and re-binding of composite web services. Journal of Systems and Software, 81(10), 1754–1769. doi:10.1016/j.jss.2007.12.792

Carmichael, F. (2005). A guide to game theory. Harlow, England: Pearson Education Limited.

Chankong, V., & Haimes, Y. Y. (1983). Multiobjective decision making: theory and methodology. Amsterdam, The Netherlands: North-Holland.

Cheng, S.-W., & Garlan, D. (2007). Handling uncertainty in autonomic systems. In Proceedings of the International Workshop on Living with Uncertainties (IWLU'07), co-located with the 22nd International Conference on Automated Software Engineering (ASE'07).

Davis, M. D. (1983). Game theory: a nontechnical introduction. New York, NY: Basic Books Inc.

de Lemos, R. (2005). Architecting Web services applications for improving availability. In de Lemos, R., Gacek, C., & Romanovsky, A. (Eds.), Architecting Dependable Systems III. Lecture Notes in Computer Science 3549 (pp. 69–91). Berlin, Germany: Springer.

DeLine, R. (1999). A catalog of techniques for resolving packaging mismatch. In Proceedings of the 5th Symposium on Software Reusability (SSR'99) (pp. 44-53).

Dustdar, S., & Schreiner, W. (2005). A survey on Web services composition. International Journal of Web and Grid Services, 1(1), 1–30. doi:10.1504/IJWGS.2005.007545

Fishburn, P. C. (1970). Utility theory for decision making. New York, NY: Wiley.

Franch, X., & Botella, P. (1996). Putting non-functional requirements into software architecture. In Proceedings of the 9th International Workshop on Software Specification and Design (pp. 60-67). Los Alamitos, CA: IEEE Computer Society.

Frolund, S., & Koistinen, J. (1998). Quality-of-service specification in distributed object systems. Hewlett-Packard Laboratories, Technical Report 98-158.

Geihs, K., Barone, P., Eliassen, F., Floch, J., Fricke, R., & Gjorven, E. (2009). A comprehensive solution for application-level adaptation. Software, Practice & Experience, 39(4), 385–422. doi:10.1002/spe.900

Griffith, R. E., & Stewart, A. (1961). A nonlinear programming technique for the optimization of continuous processing systems. Management Science, 7, 379–392. doi:10.1287/mnsc.7.4.379

Guo, H., Huai, J., Li, H., Deng, T., Li, Y., & Du, Z. (2007). Optimal configuration for high available service composition. In IEEE International Conference on Web Services (ICWS '07) (pp. 280-287).

Ito, A., & Yano, H. (1995). The emergence of cooperation in a society of autonomous agents: The prisoner's dilemma game under disclosure of contract histories. In Proceedings of the First International Conference on Multi-Agent Systems (ICMAS95) (pp. 201-208).

Johnsen, E. (1968). Studies in multiobjective decision models. Lund, Sweden: Studentlitteratur.

Kahneman, D., & Tversky, A. (1982). The psychology of preferences. Scientific American, (January): 160–173. doi:10.1038/scientificamerican0182-160

Keeney, R. L., & Raiffa, H. (1976). Decisions with multiple objectives: preferences and value tradeoffs. New York, NY: Wiley.

Kephart, J. O., & Das, R. (2007). Achieving self-management via utility functions. IEEE Internet Computing, 11(1), 40–48. doi:10.1109/MIC.2007.2

Kontio, J. (1996). A case study in applying a systematic method for COTS selection. In Proceedings of the 18th International Conference on Software Engineering (ICSE96) (pp. 201-209). Los Alamitos, CA: IEEE Computer Society Press.

Lasdon, L. S., Warren, A. D., Jain, A., & Ratner, M. (1978). Design and testing of a GRG code for nonlinear optimization. ACM Transactions on Mathematical Software, 4, 34–50. doi:10.1145/355769.355773

Lowry, M. R. (1998). Component-based reconfigurable systems. Computer, 31, 44–46.

Luce, R. D., & Raiffa, H. (1957). Games and decisions. New York, NY: Wiley.

Menascé, D. A., & Dubey, V. (2007). Utility-based QoS brokering in service oriented architectures. In Proceedings of the IEEE 2007 International Conference on Web Services (ICWS 2007) (pp. 422–430). Los Alamitos, CA: IEEE Computer Society Press.

Merad, S., de Lemos, R., & Anderson, T. (1999). Dynamic selection of software components in the face of changing requirements. Department of Computing Science, University of Newcastle upon Tyne, UK, Technical Report No 664.

Miller, D. W., & Starr, M. K. (1960). Executive decisions and operations research. Englewood Cliffs, NJ: Prentice-Hall.

Nash, J. F. (1950). The bargaining problem. Econometrica, 18, 155–162. doi:10.2307/1907266

Papazoglou, M. P., Traverso, P., Dustdar, S., & Leymann, F. (2007). Service-oriented computing: State of the art and research challenges. Computer, 40(11), 38–45. doi:10.1109/MC.2007.400

Rao, J., & Su, X. (2005). A survey of automated web service composition methods. In Cardoso, J., & Sheth, A. (Eds.), Semantic Web Services and Web Process Composition. Lecture Notes in Computer Science 3387 (pp. 43–54). Berlin, Germany: Springer. doi:10.1007/978-3-540-30581-1_5

Rosenschein, J. S., & Zlotkin, G. (1994). Rules of encounter: Designing conventions for automated negotiation among computers. Boston, MA: MIT Press.

Saaty, T. L. (1990). The analytic hierarchy process. New York, NY: McGraw-Hill.

Schrage, L. (1991). User's manual for LINGO. Chicago, IL: LINDO Systems Inc.

Serhani, M. A., Dssouli, R., Hafid, A., & Sahraoui, H. (2005). A QoS broker based architecture for efficient Web services selection. In Proceedings of the 2005 IEEE International Conference on Web Services (ICWS 2005) (pp. 113-120). Los Alamitos, CA: IEEE Computer Society Press.

Sousa, J. P., Balan, R. K., Poladian, V., Garlan, D., & Satyanarayanan, M. (2008). User guidance of resource-adaptive systems. In International Conference on Software and Data Technologies (pp. 36-44). Setubal, Portugal: INSTICC Press.

von Neumann, J., & Morgenstern, O. (1947). Theory of games and economic behavior. Princeton, NJ: Princeton University Press.

Walsh, W. E., Tesauro, G., Kephart, J. O., & Das, R. (2004). Utility functions in autonomic systems. In International Conference on Autonomic Computing (pp. 70-77).

Yu, T., Zhang, Y., & Lin, K.-J. (2007). Efficient algorithms for Web services selection with end-to-end QoS constraints. ACM Transactions on the Web, 1(1), 1–26. doi:10.1145/1232722.1232728

Zeng, L., Benatallah, B., Ngu, A. H. H., Dumas, M., Kalagnanam, J., & Chang, H. (2004). QoS-aware middleware for Web services composition. IEEE Transactions on Software Engineering, 30(5), 311–327. doi:10.1109/TSE.2004.11

Zoutendijk, G. (1960). Methods of feasible directions. Amsterdam, The Netherlands: Elsevier.
Chapter 9
A Tool Chain for Constructing QoS-Aware Web Services

Bernhard Hollunder, Furtwangen University of Applied Sciences, Germany
Ahmed Al-Moayed, Furtwangen University of Applied Sciences, Germany
Alexander Wahl, Furtwangen University of Applied Sciences, Germany
ABSTRACT

Web services play a dominant role in service computing and for realizing service-oriented architectures (SOA), which define the architectural foundation for various kinds of distributed applications. In many business domains, Web services must exhibit quality attributes such as robustness, security, dependability, performance, scalability and accounting. As a consequence, there is a high demand to develop, deploy and consume Web services equipped with well-defined quality of service (QoS) attributes – so-called QoS-aware Web services. Currently, though, there is only limited development support for the creation of QoS-aware Web services. In this work we present a tool chain that facilitates development, deployment and testing of QoS-aware Web services. The tool chain has the following features: (i) integration of standard components such as widely used IDEs, (ii) usage of standards and specifications, and (iii) support for various application servers and Web services infrastructures.
INTRODUCTION

Today, service-oriented architectures (SOA) are a widely used paradigm for structuring distributed enterprise applications. In this context, services are often characterized as reusable units of functionality, which are aggregated to automate a particular task or business process. Based on well-known protocols and standards such as HTTP, SOAP, and WSDL, Web services are the prevailing technology for implementing services. In fact, almost all application servers and enterprise service buses (ESBs) include runtime environments for Web services. When deploying Web services in the area of so-called mission-critical business applications,
pure Web services are not sufficient. A pure Web service implements the required input/output behavior, but does not address – explicitly or implicitly – non-functional requirements. However, at the business level, quality of service (QoS) attributes such as security, message reliability, availability and scalability are crucial prerequisites to be supported by the deployed service. The importance and necessity of QoS-aware Web services, i.e., Web services with well-defined QoS attributes, has been stressed elsewhere (see, e.g., O'Brien, Merson and Bass, 2007; Ludwig, 2003).

In order to create a QoS-aware Web service, one could encode the required QoS attributes directly in the source code of the service. For example, security aspects such as authentication or authorization could be embedded into the business functionality. However, this approach would bring a number of disadvantages, such as increased maintenance and adaptation costs. Ideally, a QoS-aware Web service would be realized according to the separation of concerns principle (see, e.g., Sommerville, 2004): the Web service implementation encodes the core business functionality, whereas QoS attributes are "outsourced" and can be strictly separated from the service's source code. QoS attributes should be formally described and flexibly attached to a Web service. This strategy yields the following advantages:

• Reduction of the Web service's source code complexity.
• Higher degree of flexibility to react to changing QoS requirements.
• Increased reusability of Web services for different deployment settings.
With the WS-Policy specification (W3C, 2007b), there exists a well-known framework for formally describing QoS attributes for Web services. Basic building blocks in WS-Policy are so-called assertions. A single assertion may represent a domain-specific capability, constraint or requirement. Before we discuss the importance of WS-Policy in more detail, let us take a look at development tools for Web services.

Integrated development environments (IDEs) such as Eclipse, NetBeans and Visual Studio allow the quick construction of pure Web services and offer respective project types. For example, the NetBeans IDE has a project type "Web Application" and Visual Studio comes with a "WCF Library" wizard. There is, however, only limited support for creating QoS-aware Web services. Typically, the support is restricted to WS-Policy assertions for security and message reliability as defined in the standards WS-Security (OASIS, 2006), WS-SecurityPolicy (OASIS, 2009), and WS-ReliableMessaging (OASIS, 2008). The IDEs do not support other QoS attributes. We also observe that a thorough tool chain for creating arbitrary QoS-aware Web services does not yet exist. Except for security and messaging, QoS attributes are typically hard-coded in the Web service, thus violating the separation of concerns principle. Hence, the above-mentioned advantages, which are good software engineering practices, are not applicable.

In this chapter we will present a complete tool chain for developing QoS-aware Web services. The tool chain can be characterized as follows:

• Integration of standard components such as IDEs.
• Support for the separation of concerns principle.
• Usage of widely accepted standards and specifications.
• Support for different application servers and Web services infrastructures.
• Creation, usage and reuse of not yet supported QoS attributes.
• Easy configurability of QoS attributes.
Constructing a completely new development environment would not be realistic in terms of implementation efforts and would have a negative impact on its acceptance. Therefore, our tool chain will reuse and integrate well-known and widely
used components such as NetBeans, Eclipse and Visual Studio. However, these environments will be supplemented by a mechanism for supporting the separation of concerns principle. A basic requirement is that the extension should be based on highly accepted standards and specifications. As a consequence, QoS-aware Web services created with the tool chain elaborated in this chapter can be deployed in different Web services runtime environments and application servers. WS-Policy is a well-known standard formalism for defining "QoS policies". In fact, almost all Web services infrastructures are equipped with a WS-Policy implementation. Our tool chain exploits WS-Policy and supports the development of new QoS attributes – not only for technical domains, but also targeting the application level. Special focus lies on the reuse of QoS attributes. We address this feature by distinguishing two roles: the creator of a QoS attribute and the user of a QoS attribute. The latter is typically a developer of a Web service who wants to equip the Web service under development with some predefined QoS attributes. The QoS creator has the responsibility to introduce the artefacts required to use the QoS attribute. To give an impression of the easy configurability of QoS attributes, let us consider an example from the security domain. Suppose the service provider defines a (security) policy which requires the service parameters to be encrypted during transmission. Based on this description, the runtime environment will check if the client has really encrypted the parameter values when invoking the service. If not, the service request will be rejected. Now suppose that the parameter values should, in addition, be digitally signed. Instead of modifying the service implementation, only the policy description of the service will be adapted. After redeployment, the runtime environment is now also responsible for verifying the signature, which must be included by the invoker of the service. As we will see later, our tool chain
allows the easy configuration of QoS attributes not restricted to the security domain. This chapter is organized as follows. We will start by introducing fundamental technologies, terms and definitions required for the understanding of this chapter. Then, in the third section, we will present related work. The architecture of the tool chain and its components are covered in the fourth section, followed by a description of a proof of concept implementation and its application to formalize selected QoS attributes. Conclusions and future research issues are discussed in the final section.
FOUNDATIONS

This section will cover the technologies the tool chain is built upon. We start by considering the Web services technology, followed by a discussion on quality of service. Then we will introduce the WS-Policy specification as a vehicle to formalize QoS attributes for Web services. The section will conclude with some remarks on application servers and IDEs.
Web Services and WSDL

When developing a Web service, one typically starts with implementing a class in some programming language such as Java or C#. Depending on the technology used, this class has to be augmented with additional information, which is required to deploy the class as a Web service. For example, in a Java environment (Hewitt, 2009) the class must be annotated with @WebService, while in the WCF technology the [ServiceContract] attribute is required (Löwy, 2008). Given a Web service implementation class, there are tools (e.g., Java2WSDL and svcutil) that derive an interface representation for the service in the Web Services Description Language (WSDL). WSDL is an XML-based formalism to specify interfaces of Web services in a programming-language-independent manner.
A WSDL description comprises different parts, each addressing a specific topic:

• What part: specifies the abstract interface, i.e., the service's name including its parameter types, and introduces XML elements such as type, message, portType and interface.
• How part: maps the abstract interface onto a communication protocol such as HTTP and is represented by the binding element.
• Where part: defines a specific Web service implementation through an address such as a URL and is defined within the service element.
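As a minimal illustration of the starting point of this workflow, the following Java class – a hypothetical calculator service matching the running example used later in this chapter – is annotated with @WebService; a tool such as Java2WSDL, or the deployment machinery of a JAX-WS runtime, can then derive the corresponding WSDL interface from it.

```java
import javax.jws.WebMethod;
import javax.jws.WebService;

// A pure Web service: it implements only the input/output behavior and
// carries no QoS attributes; these are attached later via WS-Policy.
@WebService
public class CalculatorService {

    @WebMethod
    public int add(int a, int b) {
        return a + b;
    }

    @WebMethod
    public int multiply(int a, int b) {
        return a * b;
    }
}
```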
This information is utilized by a service consumer to construct so-called SOAP messages that are exchanged with the service implementation. A SOAP message consists of a body, which contains the payload of the message including the current parameter values of the request, and an optional header containing additional information such as addressing or security data.

WSDL itself does not address QoS attributes for Web services. However, the WS-PolicyAttachment specification (W3C, 2007a) describes how a WSDL description can refer to WS-Policy descriptions formalizing QoS attributes. The XML fragment in Figure 1 is part of a WSDL. The PolicyReference element within the binding element refers to a WS-Policy description named QoSPolicy. Before we introduce the structure of WS-Policy descriptions, the term quality of service will be sharpened.

Figure 1. Excerpt of a WSDL interface
Quality of Service

In system and software engineering there are mainly two categories of requirements: functional and non-functional requirements. A functional requirement describes a specific business or technical functionality of a system in terms of its input/output behavior. In contrast, a non-functional requirement addresses a quality of service attribute of the implementation. In software engineering there was (and still is) much research on quality of service for software systems. Standardization organizations such as ISO have identified different kinds of QoS attributes (see, e.g., ISO/IEC, 2005). There are several publications that consider QoS attributes in the specific context of Web services (e.g., O'Brien, Merson and Bass, 2007). OASIS (2010) not only gives a classification of different types of QoS attributes, such as service level quality measurement, business process quality and security quality, but also introduces formal definitions for various QoS attributes. Among others, the following topics are covered:

• Response time: time elapsed between sending a request and receiving a response.
• Throughput: amount of requests that can be processed in a given time period.
• Availability: degree to which the service is available in operational status.
• Standard conformability: degree to which the service is built in conformance with the specifications of standards.
• Messaging reliability: guaranteed delivery and delivery order, guaranteed duplicate elimination.
• Observability: measures how effectively an implementation can provide status information.
• Security qualities: confidentiality, authentication, authorization, integrity, non-repudiation.
• Audit: capability to log activities of all relevant events occurring on service provider and consumer side.
• Pricing and accounting: monetary value for using a service, typically defined by an accounting model.
• Metadata: inclusion of additional data such as time stamps, creation time and expiry time of a message.
• Robustness: degree to which the service continues to function properly under abnormal conditions or circumstances (e.g., erroneous input, presence of defects).
Let us make some remarks. Even though we have characterized some well-known QoS attributes, there are often differences regarding their exact meaning. Some of them can be described by a formula. For instance, response time can be defined as the sum of client latency, network latency and server latency. The behavior of other QoS attributes such as confidentiality and integrity can be formalized in terms of functions for
encryption and digital signature. Robustness is an example of a QoS attribute that has diverse facets, such as error tolerance, often described as the ability to deal with erroneous input. A service, for example, should not crash or run into an inconsistent state if it is called with invalid parameter values. In the section "Examples" we will come back to this issue.

The above-mentioned QoS attributes are application-independent in the sense that in principle every service can be equipped with them. There are also QoS attributes which are meaningful only to a specific category of services. To give an example, consider a calculator service for adding, multiplying, etc. numbers. Each service implementation is able to manage numbers of a particular size (e.g., 4- vs. 8-byte representation). Thus, a QoS attribute of a calculator service is the range of numbers that can be processed properly. This example will also be reconsidered later in this chapter.

We observe that many QoS attributes are orthogonal and hence can be combined. For example, a service may be equipped with both confidentiality and accounting. However, typically there will be an impact on quantitative QoS attributes such as performance and throughput. If, for instance, confidentiality is turned on, the performance of a service will typically slow down due to additional computations for encrypting and decrypting parameter values. It is not good practice to switch on all available and supported QoS attributes for the service under development. Roughly speaking, the more QoS attributes are activated, the slower the overall performance of a service will be – due to additional processing. Which of the available QoS attributes should be activated for a specific service strongly depends on the deployment environment and criticality of the service. If, for example, a Web service returns sensitive data, which are transmitted over an open infrastructure, confidentiality is a must.
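For illustration, the decomposition of response time mentioned above can be written as a simple formula; the availability expression is one common formalisation, added here for comparison rather than taken from the cited OASIS definitions:

$$t_{\text{response}} = t_{\text{client}} + t_{\text{network}} + t_{\text{server}}, \qquad A = \frac{t_{\text{up}}}{t_{\text{up}} + t_{\text{down}}}.$$

With a client latency of 5 ms, a network latency of 40 ms and a server latency of 120 ms, for instance, the response time is 165 ms.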
WS-Policy

WS-Policy (W3C, 2007b) provides a policy language to formally describe properties of the behavior of services. A WS-Policy description is a collection of so-called assertions. A single assertion may represent a capability, a requirement or a constraint, and has an XML representation. As shown in Figure 1, a policy can be attached to a service via a PolicyReference element. Figure 2 shows sample assertions. To explain their meaning, suppose a Web service S is equipped with these assertions. The first assertion states that a client invoking the service S has to include a time stamp into the message. The second assertion expresses that the service will accept the request only if the invoker has some part of the message digitally signed. The child element XPath specifies the message fragment to be signed. The PerformanceAssertion makes use of an XML attribute to describe a specific property: it states that the service guarantees an average execution time of at most one second. The fourth assertion is an example of an application-specific QoS attribute: a calculator service equipped with this assertion should be able to process numbers up to $2^{31} - 1$.
Figure 2. Examples of WS-Policy assertions
To explain the meaning of the ParameterConstraint assertion, suppose the service S requires an input parameter of type Person and person is the formal parameter name. The assertion states that the caller of the service must pass a person not younger than 18 years, otherwise the request will not be accepted.

WS-Policy introduces three operators to group assertions and sets of assertions, respectively. These operators are:

• wsp:ExactlyOne: Exactly one of the child assertions must be satisfied.
• wsp:All: All the child assertions must be satisfied.
• wsp:Policy: All the child assertions must be satisfied. Additionally, a name can be assigned to the assertion set.
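The operator semantics just listed can be made precise with a small evaluation model. The following Java sketch is our own illustration, not part of any WS-Policy API: it represents a policy as a tree of operator nodes over assertion outcomes and evaluates it recursively.

```java
import java.util.Arrays;
import java.util.List;

// Minimal model of the WS-Policy operator semantics; a Leaf stands for
// the outcome of checking one assertion against a concrete message.
public class PolicyEval {

    abstract static class Node { abstract boolean eval(); }

    static class Leaf extends Node {
        final boolean satisfied;
        Leaf(boolean satisfied) { this.satisfied = satisfied; }
        boolean eval() { return satisfied; }
    }

    static class All extends Node {
        final List<Node> children;
        All(Node... c) { this.children = Arrays.asList(c); }
        boolean eval() {
            for (Node n : children) if (!n.eval()) return false;
            return true;
        }
    }

    static class ExactlyOne extends Node {
        final List<Node> children;
        ExactlyOne(Node... c) { this.children = Arrays.asList(c); }
        boolean eval() {
            int satisfied = 0;
            for (Node n : children) if (n.eval()) satisfied++;
            return satisfied == 1;
        }
    }

    public static void main(String[] args) {
        // The policy of Figure 3: either a time stamp or a signed body.
        Node policy = new ExactlyOne(
            new All(new Leaf(true)),    // message carries a time stamp
            new All(new Leaf(false)));  // message body is not signed
        System.out.println(policy.eval()); // prints: true
    }
}
```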
Figure 3 shows a WS-Policy description. The outer wsp:Policy operator has one embedded assertion set, which is built up with wsp:ExactlyOne. The overall policy is satisfied if one of the two embedded wsp:All policy sets is satisfied. In other words, this policy specifies that the message must include either a time stamp or a signature of the message body.

Figure 3. A WS-Policy description

WS-Policy itself does not come with concrete assertions. Instead, related specifications such as WS-SecurityPolicy (OASIS, 2009) and WS-ReliableMessaging (OASIS, 2008) apply WS-Policy to introduce specific assertions (e.g., IncludeTimestamp and SignedElements from the example above) covering domains such as security and reliable messaging. The respective specifications define not only the syntax, but also the meaning of the assertions and their impact on the Web services runtime behavior. The broad acceptance of, e.g., WS-SecurityPolicy demonstrates the applicability and usefulness of the WS-Policy standard.

From an application development point of view, it would be quite helpful to have a wider repertoire of assertions. In fact, it should be possible to construct further, custom-designed assertions covering project-, business- or technology-specific phenomena. Thus, the separation of concerns principle should not be restricted to the already addressed domains. Observe that WS-Policy has been designed in such a way that it allows the creation of further WS-Policy assertions. However, the construction of new assertions turns out to be difficult, covering different activities, results, and stakeholders (see, e.g., Erl, Karmarkar, Walmsley, Haas, Yalcinalp, Liu, Orchard, Tost and Pasley, 2009; Hollunder, Hüller, and Schäfer, 2011).
When defining a new assertion type, its syntax and semantics have to be fixed. While the syntax is defined in XML, the semantics of an assertion is usually defined informally in the respective documentation. In addition, the Web services and WS-Policy runtime environment has to be adapted such that the custom assertions are treated properly. Hollunder (2009a) discusses the introduction of different categories of custom assertions and the adaptation of the Web services runtime environment.
Application Servers and IDEs

Today, application servers such as GlassFish, JBoss, .NET/WCF/IIS, and WebSphere include a Web services runtime environment. This portfolio is supplemented by products such as Axis2, which mainly focus on the Web services technology. For all these infrastructures there exists development support facilitating the implementation, deployment and testing of Web services. This support is often realized via specific project types and plug-ins for IDEs. To give two examples, we consider Visual Studio and NetBeans. The project wizard "WCF Library" of Visual Studio creates a complete set of artefacts required to deploy a standard WCF service
including graphical test clients. Similar support is given through the "Web Application" project type of NetBeans. The application developer can focus on the implementation of the core business functionality; the required files, such as deployment descriptors, build and manifest files, are created automatically. Last but not least, deployment and debugging support is also provided. As we will explain in the next section, there is currently only little support for the development of QoS-aware Web services. In fact, to the best of our knowledge there is no tool chain that gives thorough support for the definition, creation and usage of new QoS attributes. The support of existing IDEs is restricted to the usage of predefined QoS attributes from the security and reliable messaging domains.
RELATED WORK

There are several research areas that are related to the main topic of this chapter: a tool chain for developing QoS-aware Web services. These approaches can be grouped into the following categories:

• Model driven development.
• Policy based approaches.
• QoS runtime infrastructures.
A common principle is their support for an explicit separation of functional aspects from non-functional requirements. Besides this commonality, each category has its own specialties and techniques. In the following we will describe the respective approaches and their relationship to our work.
Model Driven Development

For more than two decades, graphical models have been used to bring more efficiency to the development of software systems. In the early years the focus was mainly on documenting the overall structure of the system (i.e., its architecture) and relevant abstractions. Later, models were exploited to derive artefacts for the software system to be developed. Typically, artefacts such as source code, configuration and build files, deployment descriptors and database schemata are automatically created. In other words, the model driven development (MDD) approach bridges the gap between design and coding by using techniques such as model-to-model and model-to-code transformations.

A prominent instance of MDD is the Model Driven Architecture (MDA) of the Object Management Group (2010b). MDA uses the Unified Modeling Language (UML), also a specification of the Object Management Group (2010a), which comprises various diagram types to formalize different aspects of a software system. With the help of so-called UML profiles, domain- or technology-specific abstractions can be introduced and equipped with the intended semantics. For example, to express that some UML class (or some method of a class) should represent a Web service, it can be equipped with a corresponding stereotype. Defining a specific UML profile mainly consists in declaring the names and semantics of stereotypes.

There are several approaches that apply MDD to explicitly model non-functional requirements. Basically, these approaches use "stereotyped" classes to represent QoS attributes. When creating a UML model for an application, model elements such as classes, methods and associations are introduced. They define the core business model and specify QoS attributes for the application. Although there is a common UML model, different aspects (e.g., modeling of functional vs. non-functional requirements) can be logically separated and hence can be treated differently when generating artefacts for applications.

The work of Wada, Suzuki and Oba (2008) can be viewed as a typical representative of a group of approaches that apply MDD to explicitly model QoS attributes. This and related papers,
e.g., (Hafner & Breu, 2008), (Basin, Doser & Lodderstedt, 2006), (Jürjens, 2004), and Chapter 16 in this book (Rodrigues et al., 2011), mainly focus on security qualities and are often characterized with the phrases "model driven security" and "security engineering". Basically, these approaches cover the following topics: introduction of the QoS attributes to be addressed, proposals for suitable UML profiles, transformation rules, development models, and tool sets. Although there are commonalities with our work, we see two main differences:

1. The above-mentioned approaches emphasize the usage of predefined QoS attributes. In contrast, the tool chain proposed in our work also covers the development of additional QoS attributes.
2. Our tool chain does not require a UML modeling tool; instead, the WS-Policy standard is used as interchange and runtime format for QoS attributes.

An interesting aspect for future research would be an extension of our tool chain towards the usage of UML modeling styles. In the final section of this chapter we will come back to this issue.
Policy Based Approaches

The focus of this category is on the explicit representation of business rules and regulations mandating quality requirements the IT infrastructure has to fulfill. Policy languages are typically not full-fledged programming languages, but allow the declarative description of selected features. Although there are policy languages for various domains, the most popular formalisms mainly address security aspects such as:

• The representation of authorization and entitlement policies (OASIS, 2005a).
• The creation and exchange of security entities and identities (OASIS, 2005b).
• The representation of privacy practices (W3C, 2002).
• The description of permissions, prohibitions, and obligations (Kähmer, Gilliot, & Müller, 2008).
A general discussion of policy based approaches for service oriented systems is given by Phan, Han, Schneider, Ebringer and Rogers (2008). Another interesting field for the usage of policy languages is the formalization of service level agreements (SLAs). Since SLAs typically comprise IT-level parameters such as availability, response time and throughput, "SLA policy languages" are also related to our approach. Chapter 1 in this book gives a survey of how SLAs are created, managed and used in utility computing environments (Wu & Buyya, 2011). With the Web Service Level Agreement (WSLA) language of Ludwig, Keller, Dan, King and Franck (2003) there is a comprehensive specification that defines a policy language as well as standard extensions for, e.g., measuring the response time and the number of invocations. Though the specification has some potential, today WSLA plays only a minor role. This is due to the facts that there has been no update of the specification since 2003 and that it is not supported by application servers.

Out of the set of proposals that apply policy languages to SOA and Web services (e.g., Wang, Wang, Chen and Wang, 2005), we consider in more detail the approach of Phan, Han, Schneider and Wilson (2008), because it has several similarities to our solution. The authors also apply WS-Policy to formally represent QoS attributes on the technical level. From a development point of view, the QoS attributes are modeled on a higher level of abstraction and are subsequently mapped onto WS-Policy assertions. In some sense, the "Policy Editor" component is the counterpart in our tool chain. There are differences, though. In contrast to our approach, the authors apply their approach to a (simplified) security quality model only. The authors "expect that other domain quality models can be expressed"; however, details are not given.
QoS Runtime Infrastructures

Application servers such as JBoss, GlassFish, WebSphere, and WSO2 come with management consoles that provide various functions such as:

• Deployment and undeployment of components (e.g., enterprise and web archives, assemblies).
• Access to configuration data (e.g., logging, keystores, transport protocols).
• User management.
• Monitoring and system statistics.

Some management consoles, such as the one of WSO2, also support the configuration of predefined WS-Policy assertions for security and reliable messaging. In particular, WS-Policy descriptions of deployed services can be modified by adding or removing assertions. However, the consoles do not provide features for the development of additional QoS attributes and WS-Policy assertions, respectively. Another link to our tool chain is the monitoring and statistics functionality of WSO2, which counts requests for services and determines the average, minimum and maximum response time, and hence addresses to some extent the QoS attribute performance. There are approaches that extend the standard monitoring capacity of management consoles. Thereby, QoS metrics such as availability, reliability, and performance have been investigated (see, e.g., Artaiam & Senivongse, 2008; Zeng, Lei & Chang, 2007). It should be noted that these approaches do not directly compete with our solution. Instead, an integration of a full-fledged monitoring system would be an interesting enhancement of our tool chain, as sketched in the final section of the chapter.

ARCHITECTURE OF THE TOOL CHAIN

This section introduces the components of a comprehensive tool chain for developing and deploying QoS-aware Web services. Before we introduce the overall architecture, important requirements and assumptions are presented. Thereafter, each component and its responsibility will be discussed.
Requirements and Assumptions

The architecture of the proposed tool chain is based on the following principles:

• Support for different Web services runtime environments.
• Usage of standards and well-known specifications.
• Reuse of widely used IDEs and development models for Web services.
• Support for the separation of concerns principle.
• Availability of predefined QoS attributes.
• A development model for the creation of additional QoS attributes.
• Server and client side support.
Today, there are well-known specifications that are supported by almost all Web services platforms. Due to a high degree of standardization, different Web services platforms are interoperable. A basic requirement on the tool chain is the development support of QoS-aware Web services that can be deployed on different platforms. In order to achieve this requirement, the tool chain should not exploit specifics of a concrete platform, but should employ accepted and widely used standards. In particular, the following standards should be used:

1. SOAP as message exchange format.
2. WSDL for describing the interfaces of services.
3. WS-Policy as framework for the specification of QoS attributes.

As mentioned before, there are several development tools that support the construction of pure Web services. Widely used IDEs provide helpful features (e.g., auto-completion, syntax highlighting and debugging support) and come with project wizards that generate a complete project infrastructure. Such an infrastructure typically comprises source files to be completed by the service developer, configuration files (such as deployment descriptors and build files), and default test clients. Few IDEs already provide limited QoS support. The proposed tool chain should profit as much as possible from existing IDEs, since constructing a completely new environment would require enormous development efforts and would decrease the tool chain's acceptance. Hence, the solution should extend existing infrastructures such that QoS attributes can be introduced and attached seamlessly to the Web services under development.

Another important requirement already stressed is the separation of concerns principle. According to this principle, QoS attributes are specified declaratively in policy files, which can be associated with a concrete service implementation. The Web service's business logic implementation and the QoS attributes should have different life cycles such that both can be modified independently from each other. Consequently, the tool chain should support a quick adaptation of QoS attributes assigned to a concrete service.

The tool chain should come with predefined QoS attributes. Of particular interest are generic attributes that are applicable to every Web service. Examples of relevant attributes can be found in (OASIS, 2010; O'Brien, Merson and Bass, 2007; Ludwig 2003). Another requirement is support for the creation of application-specific QoS attributes.

Besides the development of QoS-aware services, the tool chain should also facilitate the
construction of QoS-aware client applications. A client developer should be able to formally specify QoS attributes the Web service has to fulfill. The tool chain should comprise a validation component that checks whether or not the client’s requirements can be guaranteed by a concrete service implementation.
KEY COMPONENTS AND THEIR RESPONSIBILITIES: SERVER SIDE

Figure 4 shows the key components of the tool chain. Basically, the tool chain consists of four blocks:

• Development and management of QoS attributes.
• Usage of QoS attributes during the development of QoS-aware Web services.
• Deployment of QoS-aware Web services.
• QoS-aware runtime environment.
Before we discuss the key components, it is helpful to sketch the development model.
Development Model

Basically, we distinguish two roles: the developer of QoS attributes and the developer of QoS-aware Web services. The former has the responsibility to design, implement and publish QoS attributes. To facilitate these tasks, the tool chain contains the assertion creation editor and the QoS attributes repository, which stores the available QoS attributes. The QoS attributes developer also has the responsibility to realize QoS extensions – plug-ins that bring the intended semantics of QoS attributes to a standard Web services runtime environment.

The developer of QoS-aware Web services starts as usual with programming the functionality of the service in some programming language such as Java or C#. For this task the source code editor is needed. The QoS attributes to be guaranteed by the service are configured in another component – the QoS editor. This component has access to the service as well as to the available QoS attributes stored in the QoS attributes repository. The developer can flexibly assign QoS attributes to Web services. After the developer has implemented the service and configured its QoS attributes, the service is packaged into a deployable format. This is achieved by the assembly construction component. The role of the deployment component is to install the QoS-aware service in some standard Web services runtime environment. Such a runtime environment is not aware of the QoS attributes assigned to the Web service. Therefore, the configured QoS extensions have the responsibility for guaranteeing the respective QoS requirements. Each component of the tool chain will now be described in more detail.
Assertion Creation Editor and QoS Attributes Repository These components are designed to create and store QoS attributes. Since QoS attributes are represented as WS-Policy assertions, the former component is called assertion creation editor. According to the WS-Policy specification, every assertion must have a representation in XML (see Figure 2). This representation prescribes the syntax to be used in WS-Policy descriptions. In addition, an assertion has a representation in some programming language, which is used by a WSPolicy runtime when internalizing the assertion’s XML representation. The assertion creation editor, typically realized as GUI component, facilitates the development of WS-Policy assertions. Basically, the user defines the structure of an assertion – both the XML representation and corresponding classes (e.g., in Java) are generated completely from this structure. In general, the intended semantics of an assertion cannot be derived automatically from its XML representation. Thus, the QoS attribute developer
must individually realize the required QoS extensions. Such extensions can be implemented as SOAP handlers (see, e.g., Hewitt 2009) in some general purpose programming language such as Java. Although handlers are not completely standardized, all relevant Web services platforms provide a set of (similar) interfaces needed to implement the specific logic of a handler. Handlers are typically configured declaratively. Once a WS-Policy assertion and its handler have been created, they are stored in the QoS attributes repository.
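To make this concrete, the following minimal sketch shows the typical skeleton of such a handler using the JAX-WS API; the checkAgainstPolicy method is a hypothetical placeholder for the assertion-specific logic, not part of an existing library.

```java
import java.util.Set;
import javax.xml.namespace.QName;
import javax.xml.ws.handler.MessageContext;
import javax.xml.ws.handler.soap.SOAPHandler;
import javax.xml.ws.handler.soap.SOAPMessageContext;

// Minimal sketch of a QoS extension realized as a JAX-WS SOAP handler.
// The handler inspects every incoming request before it reaches the
// service implementation and rejects it if the QoS check fails.
public class QoSExtensionHandler implements SOAPHandler<SOAPMessageContext> {

    @Override
    public boolean handleMessage(SOAPMessageContext ctx) {
        boolean outbound = (Boolean) ctx.get(MessageContext.MESSAGE_OUTBOUND_PROPERTY);
        if (!outbound) {
            // Hypothetical check against the assertion configured in the
            // service's WS-Policy description; returning false stops the
            // handler chain, i.e. the request is not forwarded.
            return checkAgainstPolicy(ctx);
        }
        return true; // let responses pass unchanged
    }

    private boolean checkAgainstPolicy(SOAPMessageContext ctx) {
        // ... extract parameter values from ctx.getMessage() and compare
        // them with the admissible values of the WS-Policy assertion ...
        return true;
    }

    @Override
    public boolean handleFault(SOAPMessageContext ctx) { return true; }

    @Override
    public void close(MessageContext ctx) { }

    @Override
    public Set<QName> getHeaders() { return null; }
}
```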
Source Code Editor

A source code editor is required to realize the functionality of the Web service. To implement the input/output behavior of a service, a developer typically uses a modern programming language and applies some programming model (see, e.g., Löwy, 2008; Hewitt, 2009). There are widely used source code editors on the market. Such editors are part of powerful IDEs such as Eclipse, NetBeans, and Visual Studio, and significantly simplify the development of Web services. With a few mouse clicks, complete project infrastructures including deployment and testing support can be created. The output of the editor component is a set of source files together with the compiled versions and configuration files. If a model driven approach is applied, the required artefacts for a Web service can be derived from a more abstract model such as a UML class diagram. In such an approach, complete class layouts with empty method bodies can be generated. A developer then has to implement manually only the methods' functionality. Such a UML tool is an optional component in our tool chain and is therefore not included in Figure 4.
Figure 4. Tool chain for development of QoS-aware Web services – server side view

QoS Editor

QoS attributes should be separated from the Web service's functionality. Therefore, QoS attributes
are not implemented within the source code editor, but are expressed as WS-Policy assertions. However, instead of directly configuring WS-Policy assertions, higher-level representations are more suitable from an application development point of view. This allows one to apply specific combinations of QoS attributes (for instance the security pattern “encrypt and sign”), without directly configuring the required set of low-level WS-Policy assertions. According to the proposed development model, the QoS attributes are loosely coupled and can be flexibly associated with a Web service. Thus, when assigning QoS attributes the developer gets the following information:

• A list of Web services that can be equipped with QoS attributes.
• A list of QoS attributes retrieved from the QoS attributes repository.
In general, there may be dependencies between different QoS attributes. Examples are:

• QoS attributes may exclude each other.
• A set of QoS attributes may be incompatible or contradictory.
• A particular QoS attribute may require another QoS attribute.
In order to support the developer in resolving possible dependencies, the QoS editor should contain a validation feature that detects both semantically and syntactically incorrect WS-Policy descriptions.
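A minimal sketch of such a validation feature is given below; the representation of the "excludes" and "requires" relations as maps is an assumption made for illustration, since the chapter does not prescribe a concrete encoding of attribute dependencies.

```java
import java.util.Map;
import java.util.Set;

// Illustrative dependency check between QoS attributes (identified by name).
// The 'excludes' and 'requires' relations would be populated from the QoS
// attributes repository; their encoding is an assumption of this sketch.
public class QoSDependencyValidator {

    private final Map<String, Set<String>> excludes;
    private final Map<String, Set<String>> requires;

    public QoSDependencyValidator(Map<String, Set<String>> excludes,
                                  Map<String, Set<String>> requires) {
        this.excludes = excludes;
        this.requires = requires;
    }

    // Returns true if the selected set of QoS attributes is free of
    // exclusion conflicts and all required attributes are present.
    public boolean isValid(Set<String> selected) {
        for (String attr : selected) {
            for (String ex : excludes.getOrDefault(attr, Set.of())) {
                if (selected.contains(ex)) return false;   // mutual exclusion
            }
            if (!selected.containsAll(requires.getOrDefault(attr, Set.of()))) {
                return false;                              // missing dependency
            }
        }
        return true;
    }
}
```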
Assembly Construction and Deployment

A QoS-aware Web service has two parts: the implementation of its business functionality and a description of the associated QoS attributes. The role of the assembly construction component is to bring both parts together and to construct a deployable component. Such a component is a self-contained software archive containing all relevant information to install and run the QoS-aware Web service. The specific format of a deployable
component depends on the underlying technology. In .NET, so-called assemblies have been introduced, while in Java infrastructures enterprise or web archives are the required formats. Independently of the specific format, a QoS-aware Web service has an implementation class, a description of its interface in WSDL, and an associated WS-Policy description that specifies its QoS attributes. The responsibility of the assembly construction component is to create these artefacts. Subsequently, the deployment component will install the QoS-aware Web service in the underlying runtime environment.
Runtime Environment and QoS Extensions

Similar to the source code editor, our tool chain includes a runtime environment from a third party. There are various alternatives such as Axis2, Metro from GlassFish, and .NET/WCF. As already noted, a standard Web services runtime environment is not aware of QoS attributes (except the ones specified in WS-SecurityPolicy and WS-ReliableMessaging). However, Web services infrastructures provide extensibility mechanisms to adapt their behavior. Basically, incoming and outgoing Web services requests are accessible and can be manipulated with the help of (SOAP) handlers. In order to realize a specific QoS attribute, a corresponding handler will be created and installed. In some sense, the handler works as a guard that inspects incoming as well as outgoing messages to guarantee a certain QoS attribute. We give two examples of QoS extensions. Suppose we want to realize a robustness attribute that detects invalid parameter values. Before the request is delegated to the service implementation, a handler validates the values. This can be achieved by comparing the parameter values extracted from the request with the admissible ones specified in some policy description. The second example comes from the security realm. Suppose
the parameter values should be encrypted and signed. A standard way to realize this requirement is to install two handlers. Before transmitting the values to the server, a client side handler will apply suitable encryption and signing algorithms. The second handler is part of the server side runtime environment and performs the following activities: decryption of the parameter values and validation of the signature. Eventually the request will be forwarded to the service implementation.
Figure 5. Tool chain for usage of QoS-aware Web services – client side view

Client Side Components

We now take a look at the part of the tool chain that supports the developer of client applications. In this context, a client application is any program that invokes Web services. To implement client applications, one typically employs an IDE (e.g. NetBeans, Eclipse or Visual Studio), which also offers special support for downloading the Web service's WSDL and generating proxy classes. Before invoking a specific Web service, a client has to discover and select a suitable one. If there are several functionally equivalent services, a client will typically choose a service with QoS attributes that come nearest to the ones demanded by the client. To give an example, suppose there is some accounting model. If the usage cost of a particular service is higher than the maximal amount the client is willing to pay, the client should not use this service and may choose an alternative one. In general, the client should be able to specify the conditions under which he is willing to use the service. In our approach, these conditions will be formally described with the QoS editor, which is also part of the client side tool chain (see Figure 5). As on the server side, the QoS editor has access to the available QoS attributes stored in the QoS assertion repository. The role of the QoS compatibility component is to compare two sets of QoS attributes. While the one set contains the QoS attributes supported by a specific service, the other defines the QoS
attributes demanded by the client. In the literature, several algorithms for comparing QoS attributes have been proposed. Since in our approach QoS attributes are mapped to WS-Policy assertions, the policy intersection algorithm of the WS-Policy specification (see Section 4.5 in W3C, 2007b) and the policy entailment algorithm elaborated by Hollunder (2009b) are of particular interest. The QoS compatibility component should be designed in such a way that several algorithms can be integrated. Other components of the client side tool chain are a deployment component and QoS extensions. As on the server side, QoS extensions will be plugged into a standard client runtime environment for Web services and realize the specific logic of particular QoS attributes. To give an example of a client side QoS extension, consider the QoS attributes confidentiality and integrity. In this case, the QoS extension could be implemented as a SOAP handler that encrypts and signs outgoing messages. Note that every Web services infrastructure comes with a specific, light-weight set of libraries optimized
for client applications (including handler support). Thus, there is no need to install a full-fledged application server on the client side.
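In its simplest form, such a comparison can be sketched as follows; the flat name/value representation of QoS attributes is an assumption of this sketch, whereas a real implementation would operate on WS-Policy alternatives via policy intersection or entailment.

```java
import java.util.Map;

// Naive compatibility check between the QoS attributes offered by a service
// and those demanded by a client. The flat name/value maps are an assumption
// of this sketch; real implementations work on WS-Policy alternatives.
public class QoSCompatibilityChecker {

    // Returns true if every demanded attribute is offered with the same value.
    public static boolean compatible(Map<String, String> offered,
                                     Map<String, String> demanded) {
        return demanded.entrySet().stream()
                .allMatch(e -> e.getValue().equals(offered.get(e.getKey())));
    }

    public static void main(String[] args) {
        Map<String, String> offered  = Map.of("confidentiality", "AES-256",
                                              "integrity", "enabled");
        Map<String, String> demanded = Map.of("integrity", "enabled");
        System.out.println(compatible(offered, demanded)); // prints: true
    }
}
```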
PROOF OF CONCEPT

The abstract architecture from the previous section can be instantiated in several ways. For our proof of concept we have used a Java based infrastructure. To be precise, the following technologies are used:

• NetBeans IDE.
• GlassFish application server.
• WS-Policy implementation.
NetBeans is a widely used IDE for developing various kinds of applications. NetBeans’ plug-in mechanism allows one to seamlessly integrate further features such as specific editors. For the development of Web services, NetBeans provides the project type “Web Applications”, which manages the required artefacts to implement, deploy
and test Web services. This project type together with NetBeans' source code editor yields an advanced infrastructure to implement the functionality of services.
The next question is how to realize the QoS editor. Since QoS attributes are mapped to WS-Policy assertions, one option would be to design a graphical WS-Policy editor. Such an editor would support the selection, composition and creation of WS-Policy descriptions that will be attached to a Web service. However, this approach has the following deficit: the developer gets only minimal support in mapping high-level QoS concepts to the technical level of WS-Policy assertions. For example, to realize the QoS attributes confidentiality and integrity the application developer has to configure a consistent set of several assertions of the WS-SecurityPolicy specification. Even for simple QoS concepts, the resulting WS-Policy description will have a complexity the developer should not see.
An alternative approach is the introduction of a higher-level abstraction. In other words: the developer can flexibly select the required QoS attributes; it is the QoS editor's responsibility to map the QoS abstractions to an equivalent WS-Policy description. This idea is not new. In fact, there is already the WSIT plug-in – available for NetBeans and Eclipse – from the Web Services Interoperability Project (http://wiki.netbeans.org/WSIT). This plug-in supports the configuration of security and messaging properties according to WS-Security, WS-SecurityPolicy, and WS-ReliableMessaging.
Instead of designing a new QoS editor from scratch, we used the WSIT plug-in as a starting point. This approach not only increases acceptance of our solution due to a similar look-and-feel, but also reduces the development effort for realizing the QoS editor. In particular, the plug-in's GUI and data model can be adopted, as well as the logic to obtain meta-data for the current Web service. For example, to configure preconditions and postconditions for a service, the existing
GUI can be extended by an additional panel such that the required expressions can be entered and checked for consistency. We will come back to this example in the following section. Before we proceed with describing the assembly component, it should be noted that the QoS editor could be realized as any GUI application that allows an easy arrangement of QoS attributes.
The main tasks of the assembly component are:

1. Creation of the WSDL description for the Web service under development.
2. Creation of a WS-Policy description as required for the realization of the QoS attributes configured with the QoS editor.
3. Attachment of the WS-Policy description to the Web service's WSDL.
4. Construction of a web archive.

Principally, these tasks could be completely implemented with the APIs provided by NetBeans and a WS-Policy implementation. Since the WSIT plug-in also provides assembly support (for the special case of WS-Security and WS-ReliableMessaging), we could again reduce the implementation effort. While the first and fourth tasks need not be changed, the two other steps are adapted as follows. To realize the second step, we retrieve, for all QoS attributes configured in the QoS editor, the required WS-Policy assertion types from the QoS attributes repository. These types are instantiated by invoking APIs of the WS-Policy implementation and will be transformed into the required XML format according to WS-Policy. Then, all created WS-Policy assertions will be combined into a WS-Policy description. In the third step, the WSDL description of the first step will be slightly extended by a PolicyReference element such that it refers to the created WS-Policy description of the second step.
The web archive constructed by the assembly component is passed to the deployment component. This component installs the web archive in some Web services infrastructure. In our proof
of concept we employ the GlassFish application server, which provides good out-of-the-box integration with NetBeans and the WSIT plug-in. GlassFish, like almost all application servers, has an administration console which can in particular be used to deploy a web archive. Hence, with a few mouse clicks it would be possible to install the created Web service. During the development and testing phase, however, an alternative is more convenient: immediately after its construction, the web archive is automatically deployed. This can be achieved by applying APIs provided by the application server. Since the WSIT plug-in already offers this kind of deployment support, we can reuse this functionality. No special activity is required since the constructed web archive comprises a self-contained QoS-aware Web service.
As shown in Figure 4, the Web service runtime environment requires extensions for guaranteeing the intended semantics of the supported QoS attributes. To obtain a flexible module structure, a single QoS extension should realize a particular QoS attribute. To support several QoS attributes, a corresponding set of QoS extension components has to be configured. Analogously to other Web services infrastructures, GlassFish supports a mechanism which allows the flexible configuration of handlers (i.e., message interceptors) for manipulating incoming and outgoing messages. Multiple handlers can be arranged in a chain, and messages are passed through the configured chain. In our case, each handler represents a particular QoS extension component. To create a specific handler, one implements a Java class that realizes the SOAPHandler interface provided by JAX-WS – a high-level Java API for providing and consuming Web services (Hewitt, 2009). The core functionality of the handler will be realized in the inherited handleMessage method, which is invoked when a request arrives. The handlers to be installed are described in an XML based configuration file. Before forwarding a request to the Web service implementation, the Web service runtime will call handleMessage of
the installed handlers. In the following section we give examples of handlers that realize selected QoS attributes.
So far we have not described how the QoS attributes repository can be populated. As no standard solution exists for this task, we have developed an assertion creation editor. With this component, new WS-Policy assertion types can be designed and added to the QoS repository. To be precise, the component has the following features:

• GUI based design of custom WS-Policy assertions.
• Derivation of XML schemas according to WS-Policy.
• Generation of corresponding implementation classes for a WS-Policy implementation.
The assertion creation editor has been developed as a standalone application, but it can be embedded into NetBeans or Eclipse as a plug-in. The following section gives some more details.
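To make the editor's output tangible, the following sketch shows what a generated implementation class might look like for a custom assertion; the class layout and the namespace are hypothetical illustrations anticipating the RangeRestriction assertion used in the calculator example below.

```java
import javax.xml.namespace.QName;

// Hypothetical example of an implementation class as it could be generated
// by the assertion creation editor for a custom assertion. The corresponding
// XML representation would be derived from the same structure, e.g.:
//   <calc:RangeRestriction minInt="-32768" maxInt="32767"/>
public class RangeRestrictionAssertion {

    // The namespace URI is an invented placeholder for this sketch.
    public static final QName NAME =
            new QName("http://example.org/qos/calculator", "RangeRestriction");

    private final long minInt;
    private final long maxInt;

    public RangeRestrictionAssertion(long minInt, long maxInt) {
        this.minInt = minInt;
        this.maxInt = maxInt;
    }

    public long getMinInt() { return minInt; }
    public long getMaxInt() { return maxInt; }
}
```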
EXAMPLES

In the previous section “Foundations”, we listed some of the most well-known QoS attributes for Web services. In the following we exemplify the usage of the tool chain by means of three examples: i) robustness with respect to erroneous input, ii) application-specific QoS attributes, and iii) accounting.
Robustness With Respect to Erroneous Input

A crucial quality characteristic of each software component is its well-defined behavior – even if it is called with invalid parameter values. The design by contract concept proposed by Bertrand Meyer offers a viable approach to tackle this issue by defining preconditions and postconditions
for interfaces. A precondition is a contract on the state of the system when a service is invoked and typically imposes constraints on its parameter values. Conversely, a postcondition is evaluated when the service terminates. A well-known formalism for expressing constraints on interfaces is the Object Constraint Language (OCL) of the Object Management Group (2010c). We introduce so-called OCL-assertions that enrich a Web service's WSDL by OCL constraints.
Ideally, a precondition specifies exactly the admissible (combinations of) parameter values that the service is able to process. Hence, if called with invalid values, the service does not run into an inconsistent state since the request will be immediately rejected. OCL constraints are usually applied to check range restrictions (e.g., 5 ≤ i ≤ 15 for some parameter i), format expressions (e.g., dd.mm.yyyy), null object references, and dependencies between values (e.g., if month is May, then 1 ≤ day ≤ 31). This approach can also tackle more complex scenarios. For example, suppose that in a banking domain additional actions such as liquidity checks must be performed if the amount of money passed as a Web service parameter exceeds a particular value. Such a rule can be declaratively described and need not
be hard-coded in the source code, which improves the robustness and adaptability of the Web service implementation.
Assertion Creation Editor

In its simplest form, an OCL-assertion has two attributes: a precondition and a postcondition. Both attributes are of type string and should contain a valid OCL expression. Figure 6 shows the user interface of the editor used in the proof of concept implementation.

Figure 6. Graphical user interface of the assertion creation editor
QoS Extension

The QoS extension component has the responsibility to check the preconditions (resp. postconditions) when a service request arrives (resp. terminates). To achieve this, this component – realized as a so-called OCL-handler – is invoked before delegating the request to the service implementation. The OCL-handler, on the one hand, has access to the current parameter values by inspecting the incoming SOAP request. On the other hand, the handler also knows the OCL-assertions contained in the WS-Policy description. Hence, the handler can determine whether or not the preconditions
are satisfied by applying an OCL checker. If the precondition is violated, the request is immediately rejected. Since there are OCL implementations on the market, the design of the handleMessage method of the OCL-handler mainly consists in wrapping a third-party OCL implementation such as the Kent OCL Library, the Dresden OCL Toolkit, or the Open Source Library for OCL (OSLO).
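A minimal sketch of the OCL-handler's core logic is shown below; OclChecker is a hypothetical wrapper interface standing in for whichever third-party OCL implementation is used, and the method names are invented for illustration.

```java
import javax.xml.soap.SOAPMessage;
import javax.xml.ws.handler.soap.SOAPMessageContext;

// Core logic of the OCL-handler (sketch). OclChecker is a placeholder for a
// wrapped third-party OCL implementation (Kent OCL, Dresden OCL Toolkit, OSLO).
public class OclHandlerCore {

    private final OclChecker checker;   // hypothetical wrapper interface
    private final String precondition;  // taken from the OCL-assertion

    public OclHandlerCore(OclChecker checker, String precondition) {
        this.checker = checker;
        this.precondition = precondition;
    }

    // Invoked for inbound requests before they reach the service
    // implementation; rejects the request if the precondition is violated.
    public boolean handleInbound(SOAPMessageContext ctx) {
        SOAPMessage request = ctx.getMessage();
        // Evaluate the precondition against the parameter values extracted
        // from the SOAP body, e.g. "self.day >= 1 and self.day <= 31".
        if (!checker.holds(precondition, request)) {
            throw new IllegalArgumentException("Precondition violated - request rejected");
        }
        return true; // precondition satisfied: forward to the implementation
    }

    // Placeholder for the wrapped OCL implementation.
    public interface OclChecker {
        boolean holds(String oclExpression, SOAPMessage message);
    }
}
```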
QoS Editor

This component should support the application developer in entering valid OCL expressions. In the proof of concept, the QoS editor has been implemented as an extension of the WSIT plug-in. Advanced features are syntax highlighting of OCL expressions and the immediate detection of syntactic as well as semantic errors (e.g., usage of unknown parameter names).
Assembly Construction and Deployment

The assembly component can be realized according to the description given in the previous section. The first activity (i.e., creation of the service's WSDL) and the fourth activity (i.e., construction of the web archive) are already realized by the WSIT plug-in. The creation of a suitable WS-Policy description, which is the second activity of the assembly construction component, is realized as follows: the OCL expressions specified with the QoS editor are packed as OCL-assertions into a WS-Policy description. This description is eventually attached to the service's WSDL. The resulting web archive will be installed by the generic deployment component of the proof of concept implementation, thus no adaptation is required.
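The attachment itself boils down to inserting a wsp:PolicyReference element into the WSDL. The DOM-based sketch below illustrates this step under the assumption that the WSDL is available as a parsed document; the namespace constant is the one defined by the WS-Policy specification.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Sketch of the third assembly task: referencing a WS-Policy description
// from an element of the service's WSDL via wsp:PolicyReference.
public class PolicyAttacher {

    private static final String WSP_NS = "http://www.w3.org/ns/ws-policy";

    // 'wsdl' is the parsed WSDL document, 'subject' the element to which the
    // policy applies (e.g. a binding), and 'policyId' the URI of the policy.
    public static void attach(Document wsdl, Element subject, String policyId) {
        Element ref = wsdl.createElementNS(WSP_NS, "wsp:PolicyReference");
        ref.setAttribute("URI", policyId); // e.g. "#OclPolicy"
        subject.insertBefore(ref, subject.getFirstChild());
    }
}
```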
Application-Specific QoS Attributes

We have already observed that one can distinguish between application-independent and application-specific QoS attributes. This example is concerned
with a QoS attribute that applies to a class of specific services such as calculators. We consider a Web service for adding, multiplying, etc. numbers. Each concrete calculator implementation can be characterized with respect to QoS attributes such as precision and the ability to properly process numbers of a particular range. To keep this example simple, we focus on the latter aspect.
Assertion creation editor. Depending on the calculator's implementation, there are specific values for the minimal and maximal integers that can be represented. This information will be made explicit with the QoS attribute range restriction. We apply the assertion creation editor to construct a WS-Policy assertion RangeRestriction with the attributes minInt and maxInt.
QoS extension. To design and implement a suitable QoS extension, we must define more precisely the intended behavior of this QoS attribute. To do this, consider the add functionality of the calculator, which computes the sum of two numbers passed as parameter values. Prior to adding both numbers, an overflow check is performed. This can be achieved by a server side message handler. Its handleMessage method tests whether the sum of the current parameter values would exceed the maxInt specified in the service's policy. In this case, an appropriate exception would be created and returned to the service invoker; otherwise the numbers will be added.
QoS editor. There are only a few requirements regarding the design of a QoS editor for the range restriction attribute. Basically, the application developer should be enabled to enter plausible values for minInt and maxInt.
Assembly construction and deployment. Again, the first component can be realized according to the description given in the previous section. In this example, the values for minInt and maxInt provided by the user of the QoS editor will be used to instantiate the RangeRestriction assertion. This description will be attached to the service's WSDL by the deployment component.
Finally, the resulting web archive is installed by the generic deployment component.
QoS compatibility. Suppose a potential client requests a calculator service that is able to deal with positive numbers from 0 to 65,535. To express this requirement the client uses his QoS editor to instantiate the RangeRestriction assertion with the value 0 for minInt and 65535 for maxInt. Suppose the client discovers two service implementations: the first one is able to deal with numbers from -32,768 to 32,767, while the second one can process numbers from -2^31 to 2^31 - 1. In order to automatically select a suitable service, the client uses the QoS compatibility component of the tool chain. Since this component is generic (i.e., it compares two arbitrary WS-Policy descriptions), it can be applied without changes.
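A sketch of the overflow check performed by such a server side handler is given below; the extraction of the parameter values from the SOAP message is elided, and the class and method names are illustrative.

```java
// Sketch of the overflow check performed by the calculator's server side
// handler before the add operation is executed. The bounds stem from the
// RangeRestriction assertion in the service's policy.
public class RangeRestrictionCheck {

    private final long minInt;
    private final long maxInt;

    public RangeRestrictionCheck(long minInt, long maxInt) {
        this.minInt = minInt;
        this.maxInt = maxInt;
    }

    // Throws if the sum of the two parameter values would leave the
    // admissible range; otherwise the request may be forwarded.
    public void checkAdd(long a, long b) {
        long sum = a + b; // assumes the operand values themselves fit into long
        if (sum < minInt || sum > maxInt) {
            throw new ArithmeticException(
                "RangeRestriction violated: result " + sum
                + " outside [" + minInt + ", " + maxInt + "]");
        }
    }
}
```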
Pricing Model

In the last years, an alternative to traditional licensing models for software applications has been proposed, which is called Software as a Service (SaaS). This model is based on the “pay-as-you-go” principle. Simply speaking, the user will be charged when really using the software. If a Web service is published according to the SaaS model, the provider must clearly define the pricing model. In addition, a potential consumer should be enabled i) to automatically process this information and ii) to determine whether the pricing conditions are acceptable. In general, one can distinguish different pricing categories such as cost per call, cost per volume, or a flat rate approach. As argued in OASIS (2010), a pricing model is also related to the QoS attribute penalty, which defines the compensation if the service provider fails to keep the defined service quality.

Assertion Creation Editor

In its simplest form, a pricing model consists of two properties: a pricing category and a price. With the help of the assertion creation editor, a PricingAssertion with the attributes pricingCategory and price will be introduced.

QoS Extension

This component has to perform several activities. In a first step, the identity of the service consumer will be discovered. Based on this information, it will be checked whether the invoker is privileged to use the service. In the positive case, the service will be executed and the consumer will be charged according to the specified pricing model. Depending on the contract and pricing model the service consumer has signed up for, he will receive an invoice.

QoS Editor

With the QoS editor, the Web service developer is now able to specify the required pricing categories and prices for the service at hand. This GUI component should support the creation and management of pricing profiles, which can be reused for particular classes of Web services. The Web service consumer will also apply the QoS editor to define one or more pricing models. These pricing models determine the maximal prices the Web service consumer is willing to pay.

Assembly Construction and Deployment

These components can be realized as in the previous examples with the following exception: the pricing model specified with the QoS editor will be translated into corresponding pricing assertions, which are subsequently packaged in the WS-Policy description.

QoS Compatibility

As in the previous example, the QoS compatibility component will be used to determine whether the consumer's pricing requirement can be guaranteed by the service implementation. To be precise, it is checked whether the pricing model defined in the consumer's policy is at least as attractive as the pricing model attached to the service.
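Assuming the simple category/price representation introduced above, "at least as attractive" can be interpreted as "same pricing category and not more expensive"; the sketch below implements this interpretation, which is an assumption of the example rather than a prescription of the chapter.

```java
// Illustrative check whether a provider's pricing model is acceptable for a
// consumer. Pricing models are reduced here to a category and a price;
// "at least as attractive" is interpreted as "not more expensive", which is
// an assumption of this sketch.
public class PricingCompatibility {

    public record PricingAssertion(String pricingCategory, double price) {}

    public static boolean acceptable(PricingAssertion offered,
                                     PricingAssertion maxAccepted) {
        return offered.pricingCategory().equals(maxAccepted.pricingCategory())
                && offered.price() <= maxAccepted.price();
    }

    public static void main(String[] args) {
        PricingAssertion offered = new PricingAssertion("cost-per-call", 0.02);
        PricingAssertion limit   = new PricingAssertion("cost-per-call", 0.05);
        System.out.println(acceptable(offered, limit)); // prints: true
    }
}
```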
CONCLUSION AND FUTURE RESEARCH ISSUES

Today's enterprises demand high flexibility and quick responses to new business and technical requirements. SOA, with its loosely coupled nature and its well-defined abstractions, provides an architectural frame to integrate new services in a structured manner. However, developing SOA services nowadays is not only about developing the functional aspect of the service. It is also about developing, maintaining and guaranteeing certain quality characteristics. Of course, both functional and non-functional aspects of a Web service could be hard-coded into the service's implementation. However, separating these two fundamental aspects gives the service more flexibility and, above all, reduces its complexity.
It has been argued elsewhere that Web services deployed in mission-critical domains must provide quality criteria such as robustness, dependability, security and accounting. Although there is a strong demand to facilitate the development of QoS-aware services, current tool chains provide only little support. In this chapter, we presented a tool chain which helps developers specify non-functional aspects and combine these with the business logic of a Web service. A main goal here is the reuse of well-known components such as IDEs and Web services infrastructures. This increases the acceptance of the tool chain among developers and eases the realization of the desired non-functional requirements. The application of the separation of concerns principle is important as well. It allows developers to add, modify and delete quality attributes without affecting the Web service's functional implementation. While the functional requirements of a service are implemented in some
general purpose programming language such as Java or C#, the QoS attributes are described in a declarative manner and can be flexibly attached to the Web service under development. During the assembly and deployment phases both aspects are combined – transparently for the developer of the QoS-aware service.
There are still several areas where the architecture and proof of concept can be further developed:

• Extension of the tool chain.
• Alternative proof of concept implementations.
• Formalization of further QoS attributes.
• Enhanced monitoring facilities.
As described in the section on related work, there are approaches that apply UML to graphically represent concepts at a higher level of abstraction. In our architecture we have introduced the QoS editor component to capture QoS attributes. An alternative implementation of this component can be built upon a UML modeling tool such as MagicDraw. The elaboration of a meta-model for the QoS attributes to be supported, the definition of a UML profile, and the specification of transformation rules would be the main activities to achieve this goal.
The components of the tool chain can be instantiated in several ways. In the proof of concept implementation described, we have applied a Java-based environment consisting of the GlassFish application server and the NetBeans IDE. In order to support the development of WCF services with QoS attributes, it would be necessary to design extensions for Visual Studio, which include a realization of the QoS editor and assembly components.
At the beginning of this chapter we introduced well-known QoS attributes. As mentioned, there are further QoS attributes that have been discussed elsewhere. Future research may also elaborate which of these QoS attributes can be formalized and how a suitable representation,
for example using WS-Policy assertions, can be found. As a consequence, the QoS attributes repository of the tool chain can be populated with implementations of these QoS attributes.
There are differences between QoS attributes with respect to their impact on the runtime environment. To be precise, while there are QoS attributes that are fulfilled by construction, there are others that may or may not be satisfied – depending on the state of the runtime environment. Confidentiality is an example of a QoS attribute of the first category. If encryption is turned on by, e.g., installing a corresponding handler for encrypting (resp. decrypting) the parameter values, confidentiality is guaranteed every time the Web service is invoked. The QoS attribute performance belongs to the second category. The overall execution time for a Web service request strongly depends on the utilization of resources such as bandwidth, network traffic, and the number of incoming requests, as well as the workload of the application server and operating system. In order to check whether a specified average response time can be guaranteed, the runtime environment must be monitored. Future research is also concerned with a full-fledged monitoring component that is able to control arbitrary QoS attributes.
To sum up, the proposed tool chain is an all-in-one environment which brings more efficiency to the development, deployment, testing, and maintenance of QoS-aware Web services by leveraging widely accepted standards and proven components. It not only covers the usage of predefined QoS attributes when developing QoS-aware Web services, but also addresses the construction of further, generic as well as application-specific QoS attributes.
REFERENCES

W3C (2002). The Platform for Privacy Preferences 1.0 (P3P1.0) Specification.
W3C (2007a). Web Services Policy 1.5 - Attachment.

W3C (2007b). Web Services Policy 1.5 - Framework.

Artaiam, N., & Senivongse, T. (2008). Enhancing server-side QoS monitoring for Web Services. In Proceedings of the International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing. IEEE Computer Society.

Basin, D., Doser, J., & Lodderstedt, T. (2006). Model driven security: From UML models to access control infrastructures. ACM Transactions on Software Engineering and Methodology (TOSEM), 15(1).

Erl, T., Karmarkar, A., Walmsley, P., Haas, H., Yalcinalp, U., & Liu, C. K. (2009). Web Service Contract Design & Versioning for SOA. Prentice Hall.

Hafner, M., & Breu, R. (2008). Security Engineering for Service-Oriented Architectures. Springer.

Hewitt, E. (2009). Java SOA Cookbook. O'Reilly.

Hollunder, B. (2009a). WS-Policy: On Conditional and Custom Assertions. In Proceedings of the IEEE International Conference on Web Services. IEEE Computer Society.

Hollunder, B. (2009b). Domain-Specific Processing of Policies or: WS-Policy Intersection Revisited. In Proceedings of the IEEE International Conference on Web Services. IEEE Computer Society.

Hollunder, B., Hüller, M., & Schäfer, A. (2011). A Methodology for Constructing WS-Policy Assertions. In Proceedings of the International Conference on Engineering and Meta-Engineering. International Institute of Informatics and Systemics.
ISO/IEC (2005). Software engineering - Software product Quality Requirements and Evaluation (SQuaRE) - Guide to SQuaRE. ISO/IEC 25000.

Jürjens, J. (2004). Secure Systems Development with UML. Springer.

Kähmer, M., Gilliot, M., & Müller, G. (2008). Automating Privacy Compliance with ExPDT. In Proceedings of the IEEE Conference on E-Commerce Technology. Springer.

Löwy, J. (2008). Programming WCF Services. O'Reilly.

Ludwig, H. (2003). Web Services QoS: External SLAs and Internal Policies, Or: How Do We Deliver What We Promise? In Proceedings of the International Conference on Web Information Systems Engineering. IEEE Computer Society.

Ludwig, H., Keller, A., Dan, A., King, P., & Franck, R. (2003). Web Service Level Agreement (WSLA) Language Specification. Retrieved May 17, 2010, from http://www.research.ibm.com/wsla/.

O'Brien, L., Merson, P., & Bass, L. (2007). Quality Attributes for Service-Oriented Architectures. In Proceedings of the International Workshop on Systems Development in SOA Environments. IEEE Computer Society.

OASIS (2005a). eXtensible Access Control Markup Language (XACML).

OASIS (2005b). Assertions and Protocols for the OASIS Security Assertion Markup Language (SAML).

OASIS (2006). Web Services Security: SOAP Message Security 1.1.

OASIS (2008). Web Services Reliable Messaging.

OASIS (2009). WS-Security Policy.

OASIS (2010). Web Services Quality Factors.

Object Management Group (2010a). Unified Modeling Language, Infrastructure, Version 2.3.

Object Management Group (2010b). MDA Specifications. Retrieved May 15, 2010, from http://www.omg.org/mda/specs.html.

Object Management Group (2010c). Object Constraint Language, Version 2.2.

Phan, T., Han, J., Schneider, J., Ebringer, T., & Rogers, T. (2008). A Survey of Policy-Based Management Approaches for Service Oriented Systems. In Proceedings of the Australian Conference on Software Engineering. IEEE Computer Society.

Phan, T., Han, J., Schneider, J., & Wilson, K. (2008). Quality-Driven Business Policy Specification and Refinement for Service-Oriented Systems. In Proceedings of the International Conference on Service Oriented Computing. Springer.

Rodrigues, D., Estrella, J., Monaco, F., Branco, K., Antunes, N., & Vieira, M. (2011). Engineering Secure Web Services. Chapter 16 in this book.

Sommerville, I. (2004). Software Engineering. Pearson Education.

Wada, H., Suzuki, J., & Oba, K. (2008). A Model-Driven Development Framework for Non-Functional Aspects in Service Oriented Architecture. International Journal of Web Services Research, 5(4). doi:10.4018/jwsr.2008100101

Wang, C., Wang, G., Chen, A., & Wang, H. (2005). A Policy-Based Approach for QoS Specification and Enforcement in Distributed Service-Oriented Architecture. In Proceedings of the IEEE International Conference on Services Computing. IEEE Computer Society.

Wu, L., & Buyya, R. (2011). Service Level Agreement (SLA) in Utility Computing Systems. Chapter 1 in this book.

Zeng, L., Lei, H., & Chang, H. (2007). Monitoring the QoS for Web Services. In Proceedings of the International Conference on Service-Oriented Computing (ICSOC). Springer.
Chapter 10
Performance, Availability and Cost of Self-Adaptive Internet Services

Jean Arnaud, INRIA – Grenoble, France
Sara Bouchenak, University of Grenoble & INRIA, France

DOI: 10.4018/978-1-60960-794-4.ch010
ABSTRACT

Although distributed services provide a means for supporting scalable Internet applications, their ad-hoc provisioning and configuration pose a difficult tradeoff between service performance and availability. This is made harder as Internet service workloads tend to be heterogeneous and vary over time in the number of concurrent clients and in the mix of client interactions. This chapter presents an approach for building self-adaptive Internet services through utility-aware capacity planning and provisioning. First, an analytic model is presented to predict Internet service performance, availability and cost. Second, a utility function is defined and a utility-aware capacity planning method is proposed to calculate the optimal service configuration which guarantees SLA performance and availability objectives while minimizing functioning costs. Third, an adaptive control method is proposed to automatically apply the optimal configuration to the Internet service. Finally, the proposed model, capacity planning and control methods are implemented and applied to an online bookstore. The experiments show that the service successfully self-adapts to both workload mix and workload amount variations, and presents significant benefits in terms of performance and availability, with a saving of resources underlying the Internet service.
INTRODUCTION

A challenging issue in the management of distributed Internet services stems from the conflicting goals of, on the one hand, high performance and availability, and on the other hand, low cost and
resource consumption. In the limit, high performance and availability can be achieved by assigning all available machines in a data center to an Internet service. Symmetrically, it is possible to build a very low-cost Internet service by allocating very
few machines, which induces bad performance and data center downtime. Between these two extremes, there exists a configuration such that distributed Internet services achieve a desirable level of service performance and availability while cost is minimized. This chapter precisely addresses the problem of determining this optimal configuration, and automatically applying it to build a self-adaptive Internet service. The chapter describes a capacity planning method for distributed Internet services that takes into account performance and availability constraints of services. We believe that both criteria must be taken into account collectively. Otherwise, if capacity planning is solely performance-oriented, for instance, this may lead to situations where 99% of service clients are rejected and only 1% of clients are serviced with a guaranteed performance. To our knowledge, this is the first proposal for capacity planning and control of distributed Internet services that combines performance and availability objectives. To do so:
• A utility function is defined to quantify the performance, availability and cost of distributed Internet services.
• A utility-aware capacity planning method is proposed; given SLA performance and availability constraints, it calculates a configuration of the Internet service that guarantees the SLA constraints while minimizing the cost of the service (i.e. the number of host machines).
• The capacity planning method is based on a queuing theory model of distributed Internet services. The model accurately predicts service performance, availability and cost.
• An adaptive control of online Internet services is proposed to automatically detect both workload mix and workload amount variations, and to reconfigure the service with its optimal configuration.

Finally, the proposed utility-aware methods for modeling, capacity planning and control were implemented to build self-adaptive distributed Internet services. The chapter presents experiments conducted on an industry standard Internet service, the TPC-W online bookstore. The results of the experiments show that the Internet service successfully self-adapts to workload variations, and presents significant benefits in terms of service performance and availability, with a saving of resources of up to 67% on the underlying Internet service.
The remainder of the chapter first presents the background on Internet services. Then, it defines the motivations and objectives of this work. It then presents the utility function of Internet services, the proposed analytic model, the proposed capacity planning method, and the adaptive control of Internet services. An evaluation is then presented, and related work is discussed. Finally, conclusions of this work are drawn.

BACKGROUND
Underlying System

Internet services usually follow the classical client-server architecture where servers provide clients with some online service (e.g. online bookstore, e-banking service, etc.). A client remotely connects to the server and sends it a request; the server processes the request and builds a response that is returned to the client before the connection is closed. We consider synchronous communication systems, that is, when the client sends its request to the server, it blocks until it receives a response. Furthermore, for scalability purposes Internet services are built as multi-tier systems. A multi-tier system consists of a series of M server tiers T1, T2, ..., TM. Client requests flow from the front-end tier T1 to the middle tier and so on until reaching the back-end tier TM.
Each tier is tasked with a specific role. For instance, the front-end web tier is responsible for serving web documents, and the back-end database tier is responsible for storing non-ephemeral data. Moreover, to face high loads and provide higher service scalability, a commonly used approach is the replication of servers on a set of machines. Here, a tier consists of a set of replicated services, and client requests are dynamically balanced between replicated services. Figure 1 presents an example of a replicated multi-tier Internet service with two replicas on the front-end presentation tier T1, three replicas on the business tier T2, and two replicas on the back-end database tier T3.
Figure 1. Replicated multi-tier services

Service Performance, Availability and Cost

SLA (Service Level Agreement) is a contract negotiated between clients and their service provider. It specifies service level objectives (SLOs) that the application must guarantee in the form of constraints on quality-of-service metrics, such as performance and availability. Client request
latency and client request abandon rate are key metrics of interest for quantifying, respectively, the performance and availability of Internet services. The latency of a client request is the time necessary for an Internet service to process that request. The average client request latency (or latency, for short) of an Internet service is denoted as ℓ. A low latency is a desirable behavior which reflects a reactive service. The abandon rate of client requests is the ratio of requests that are rejected by an Internet service to the total number of requests issued by clients to that service. It is denoted as α. A low client request abandon rate (or abandon rate, for short) is a desirable behavior which reflects the level of availability of an Internet service. Besides performance and availability, the cost of an Internet service refers to the economic and energetic costs of the service. Here, the cost ω is defined as the total number of servers that host an Internet service.
Service Configuration

The configuration κ of an Internet service is characterized in the following by a triplet κ(M, AC, LC), where M is the fixed number of tiers of the multi-tier service, and AC and LC are respectively the architectural configuration and local configuration of the Internet service, which are detailed in the following. The architectural configuration describes the distributed setting of a multi-tier Internet service in terms of the number of replicas at each tier. It is conceptualized as an array AC<AC1, AC2, ..., ACM>, where ACi is the number of replica servers at tier Ti of the multi-tier service. The local configuration describes the local setting applied to servers of the multi-tier service. It is conceptualized as an array LC<LC1, LC2, ..., LCM>. Here, LCi represents the servers' MPL (Multi-Programming Level) at tier Ti of the multi-tier service. The MPL is a configuration parameter of a server that fixes a limit on the maximum number of clients allowed to concurrently access the server (Ferguson, 1998). Above this limit, incoming client requests are rejected. Thus, a client request arriving at a server either terminates successfully with a response to the client, or is rejected because of the server's MPL limit.
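Expressed as a data structure, the configuration triplet and the cost metric ω from the previous section can be transliterated as follows; this sketch is merely an illustration of the definitions, not part of the chapter's implementation.

```java
// Illustrative transliteration of the configuration triplet κ(M, AC, LC):
// architecturalConfig[i] is the number of replicas and localConfig[i] the
// MPL limit at tier T(i+1); M is implied by the array lengths.
public record ServiceConfig(int[] architecturalConfig, int[] localConfig) {

    public int tiers() { return architecturalConfig.length; } // M

    // Cost ω: total number of servers hosting the service (see above).
    public int cost() {
        int servers = 0;
        for (int replicas : architecturalConfig) servers += replicas;
        return servers;
    }

    public static void main(String[] args) {
        // e.g. κ(M = 2, AC<1, 1>, LC<500, 500>): one server per tier.
        ServiceConfig k1 = new ServiceConfig(new int[]{1, 1}, new int[]{500, 500});
        System.out.println(k1.cost()); // prints: 2
    }
}
```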
Service Workload

Service workload is characterized, on the one hand, by workload amount, and on the other hand, by workload mix. Workload amount is the number of clients that try to concurrently access a server; it is denoted as N. Workload mix, denoted as X, is the nature of the requests made by clients and the way they interleave, e.g. a read-only requests mix vs. a read-write requests mix. There is no well established way to characterize the workload mix X of an Internet service. In the following, a workload mix is characterized by the n-uplet X(Z, V, S, D) where:
• Z is the average client think time, i.e. the time between the reception of a response and the sending of the next request by a client.
• V<V1,..,VM> are the visit ratios at tiers T1..TM. More precisely, Vi corresponds to the ratio between the number of requests entering the multi-tier service (i.e. at the front-end tier T1) and the number of subsequent requests processed by tier Ti. In other words, Vi represents the average number of subsequent requests on tier Ti when issuing a client request to the multi-tier Internet service. Thus, Vi reflects the impact of client requests on tier Ti. Note the particular case of V1 = 1.
• S<S1,..,SM> are the service times at tiers T1..TM. Thus, Si corresponds to the average incompressible time for processing a request on tier Ti when the multi-tier Internet service is not loaded.
• D<D1,..,DM> are the inter-tier communication delays. Di is the average communication delay between tier Ti-1, if any, and tier Ti, with i > 1. Note the particular case of D1 = 0.
Furthermore, service workload may vary over time, which corresponds to different client behaviors at different times. For instance, an e-mail service usually faces a higher workload amount in the morning than in the rest of the day. Workload variations have a direct impact on the quality-of-service as discussed later.
PROBLEM ILLUSTRATION

Both the configuration of an Internet service and the workload of the service have a direct impact on the quality-of-service. This section illustrates this impact through examples.
Impact of configuration. To illustrate the impact of service configuration, the TPC-W multi-
tier online bookstore service is considered in the following. It consists of a front-end web tier and a back-end database tier (see the Evaluation Section for more details about TPC-W). Two distinct static ad-hoc configurations of the Internet service are considered here: κ1(M = 2, AC<1, 1>, LC<500, 500>) and κ2(M = 2, AC<3, 6>, LC<200, 100>). With κ1, the capacity of the system is increased through an increased setting of the local configuration of servers, while with κ2, the local configuration is set to the default one (i.e.
default MPLs in the Tomcat front-end server and the MySQL back-end server), and the capacity of the system is extended by adding servers to increase the architectural configuration of the system. With both the κ1 and κ2 configurations, the TPC-W online service is run with 1000 concurrent clients accessing the service using a read-only workload mix. Figure 2 presents the client request latency and Figure 3 gives the abandon rate obtained with each configuration of the Internet service. κ1 provides bad service performance and availability, with a latency of 10 s and an abandon rate of 71%. Obviously, this is due to too few resources being assigned to the service. κ2 provides better performance but still induces bad service availability (40% abandon rate) because of the inadequate default local configuration of the service. Thus, both local and architectural configurations have an impact on the performance, availability and cost of Internet services. These configurations should therefore be carefully chosen in order to meet quality-of-service objectives and minimize the cost.

Figure 2. Impact of service configuration on performance

Figure 3. Impact of service configuration on availability

Impact of workload. Figure 4 and Figure 5 respectively present the impact of client workload variation on the performance, availability and cost of the TPC-W two-tier Internet service. Here, several workload variation scenarios are considered. The workload successively varies from a first stage with workload mix X1 to a second stage with workload mix X2 (see the Evaluation Section for more details about TPC-W and workload mixes). During each stage, the workload amount N (i.e. #clients) varies between 250 and 1250 clients. An ad-hoc medium configuration of the multi-tier
Internet service is considered as follows: κ(M = 2, AC<2, 3>, LC<500, 500>). The service latency and abandon rate of this configuration are presented in Figure 4 and Figure 5. Obviously, different workloads have different behaviors in terms of service performance and availability. For instance, with 1250 clients, latency is 4.5 s with workload mix X1, and 10.7 s with workload mix X2. This induces an abandon rate of 29% with mix X1 vs. 39% with mix X2. And within the same workload mix, service latency and abandon rate vary when the number of clients varies.
Nonlinear behavior. Figure 6 illustrates the behavior of the two-tier TPC-W Internet service when the service workload amount varies for workload mix X1 (see the Evaluation Section for more details about TPC-W and workload mixes). Here, the workload amount (i.e. number of concurrent clients) increases linearly over time. However, the service latency does not vary linearly. This clearly shows that linearity assumptions made on multi-tier Internet services and linear control do not hold for these systems.
Figure 4. Impact of workload on performance
Figure 5. Impact of workload on availability
Figure 6. Nonlinear behavior of Internet services
ADAPTIVE CONTROL OF INTERNET SERVICES

Both service workload and service configuration have an impact on the performance, availability
and cost of services. The workload of Internet services is an exogenous input whose variation cannot be controlled. Thus, to handle workload variations and provide guarantees on performance and availability, Internet services must be able to
dynamically adapt their underlying configuration. Several objectives are targeted here:

• Guarantee SLA constraints in terms of service performance and availability, while minimizing the cost of the Internet service.
• Handle the nonlinear behavior of Internet services, taking into account both workload amount and workload mix variations over time.
• Provide self-adaptive control of Internet services that performs online automatic reconfigurations of Internet services.

Figure 7. Adaptive control of Internet services
We propose MoKa, a nonlinear utility-aware control for self-adaptive Internet services. First, MoKa is based on a utility function that characterizes the optimality of the configuration of an Internet service in terms of SLA requirements for performance and availability, in conjunction with service cost. Second, a nonlinear model of Internet services is described to predict the performance, availability and cost of a service. Third, a capacity planning method is proposed to
calculate the optimal configuration of the Internet service. Finally, an adaptive nonlinear control of Internet services is provided to automatically apply the optimal configuration to online Internet services. MoKa is built as a feedback control of multi-tier Internet services, as described in Figure 7, with three main elements: (i) online monitoring of the Internet service, (ii) adaptive control of the Internet service, and (iii) online reconfiguration of the Internet service. Online monitoring aims at observing the Internet service and producing the necessary data in order to, on the one hand, automatically calibrate MoKa's model, and on the other hand, trigger MoKa's capacity planning and control. MoKa's model calibration is performed online and automatically. This allows capturing the dynamics of service workload mix and workload amount without requiring human intervention and manual tuning of model parameters, which makes the model easier to use. Automatic calibration is described in the Section titled “Automatic and online MoKa calibration”. Then, the controller calls the utility-aware capacity planning
method to calculate the optimal configuration κ* for the current workload amount and workload mix. That optimal configuration guarantees the SLA performance and availability objectives while minimizing the cost of the Internet service. Finally, the newly calculated configuration κ* is applied to the Internet service. In the following, we successively present MoKa's utility function, modeling, capacity planning and automatic calibration.
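Putting the three elements together, the feedback loop of Figure 7 can be summarized as follows; all type and method names are placeholders invented for this sketch rather than MoKa's actual interfaces.

```java
// Skeleton of MoKa's feedback control loop (cf. Figure 7). All types and
// method names are placeholders invented for this sketch.
public class MoKaControlLoop {

    interface Monitor        { Observation observe(); }          // online monitoring
    interface Calibrator     { Model calibrate(Observation o); } // automatic calibration
    interface Planner        { Config optimal(Model m, Observation o); }
    interface Reconfigurator { void apply(Config k); }           // online reconfiguration

    record Observation(int workloadAmount, Object workloadMix) {} // N and X
    record Model() {}
    record Config() {}

    void run(Monitor mon, Calibrator cal, Planner plan, Reconfigurator rec) {
        while (true) {
            Observation o = mon.observe();      // observe N and X
            Model m = cal.calibrate(o);         // recalibrate the model online
            Config kStar = plan.optimal(m, o);  // utility-aware capacity planning
            rec.apply(kStar);                   // apply the optimal configuration κ*
        }
    }
}
```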
UTILITY FUNCTION OF INTERNET SERVICES

We consider an SLA of an Internet service that specifies service performance and availability constraints in the form of a maximum latency ℓmax and a maximum abandon rate αmax not to be exceeded. The performability preference (i.e. performance and availability preference) of an Internet service is defined as follows:

PP(ℓ, α) = (ℓ ≤ ℓmax) · (α ≤ αmax) (1)
where ℓ and α are respectively the actual latency and abandon rate of the Internet service. Note that ∀ℓ, ∀α, PP(ℓ, α) ∈ {0, 1}, depending on whether Equation 1 holds or not.
Based on the performability preference and cost of an Internet service, the utility function of the service combines both criteria as follows:

θ(ℓ, α, ω) = (M · PP(ℓ, α)) / ω (2)
where ω is the actual cost (i.e. #servers) of the service, and M is the number of tiers of the multi-tier Internet service. M is used in Equation 2 for normalization purposes. Here, ∀ℓ, ∀α, ∀ω, θ(ℓ, α, ω) ∈ [0, 1], since ω ≥ M (at least one server per tier) and PP(ℓ, α) ∈ {0, 1}. A high value of the utility function reflects the fact that, on the one hand, the Internet service guarantees service level objectives for performance and availability and, on the other hand, the cost underlying the service is low. In other words, an optimal configuration of an Internet service is one that maximizes its utility function.

Figure 8. Model input and output variables
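As a worked example of Equations 1 and 2: a two-tier service (M = 2) hosted on ω = 5 servers that meets both SLA bounds has utility θ = 2/5 = 0.4, while any SLA violation drives θ to 0. The sketch below simply evaluates the two formulas.

```java
// Direct evaluation of Equations 1 and 2.
public class Utility {

    // Performability preference PP(ℓ, α): 1 if both SLA bounds hold, else 0.
    static int pp(double latency, double abandonRate, double lMax, double aMax) {
        return (latency <= lMax && abandonRate <= aMax) ? 1 : 0;
    }

    // Utility θ(ℓ, α, ω) = M · PP(ℓ, α) / ω; since ω ≥ M, θ ∈ [0, 1].
    static double theta(int tiers, int servers,
                        double latency, double abandonRate,
                        double lMax, double aMax) {
        return tiers * (double) pp(latency, abandonRate, lMax, aMax) / servers;
    }

    public static void main(String[] args) {
        // M = 2 tiers hosted on ω = 5 servers; SLA: ℓmax = 2 s, αmax = 5%.
        System.out.println(theta(2, 5, 1.2, 0.03, 2.0, 0.05)); // prints: 0.4
        System.out.println(theta(2, 5, 3.0, 0.03, 2.0, 0.05)); // prints: 0.0
    }
}
```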
MODELING OF INTERNET SERVICES

The proposed analytic model predicts the latency, abandon rate and cost of an Internet service, for a given configuration κ of the Internet service, a given workload amount N and a given workload mix X. The input and output variables of the model are depicted in Figure 8. The model follows a queueing network approach, where a multi-tier system is modeled as an M/M/c/K queue. Moreover, Internet services are modeled as closed loops to reflect the synchronous communication model that underlies these services, that is, a client waits for a request response before issuing another request. Figure 9 gives an example of a three-tier system with a configuration κ(M=3, AC<1,1,2>, LC<20,15,3>), a workload amount of 30 clients and a workload
mix characterized, among others, by tier visit ratios V<1, 0.5, 2>. The example illustrates how the requests of the 30 incoming clients flow through the different tiers and server queues. For instance, among the Nt1 = 30 clients that try to access tier T1, Nr1 = 10 are rejected because of the local configuration (i.e. the MPL limit) at that tier and Na1 = 20 clients are actually admitted in the server at T1. Then, the 20 client requests processed by T1 generate Nt2 = 10 subsequent requests that try to access tier T2 (with a visit ratio V2 = 0.5). All 10 requests are admitted in the server of T2 because they are below T2's MPL local configuration (i.e. Na2 = 10). Finally, the 10 requests on T2 would induce in total 40 requests to T3 (with a visit ratio V3 = 2). However, due to the synchronous communication between the tiers of a multi-tier service, a request on T2 induces at a given time at most one request to T3, and on average 4 successive requests to T3. Thus, Nt3 = 10 subsequent requests tentatively access T3. Among these 10 requests, Nr3 = 4 requests are rejected because of T3's MPL local configuration and Na3 = 6 requests are admitted
and balanced among the two server replicas of that tier. In summary, among the 30 client requests attempting to access the multi-tier service, a total of 14 are rejected and 16 are serviced, resulting in an abandon rate of 47%. More generally, Algorithm 1 (Figure 10) describes how the model predicts the latency, abandon rate and cost of a replicated multi-tier Internet service. This algorithm builds upon the MVA (Mean Value Analysis) algorithm (Reiser, 1980), extending it to take into account server replication (i.e. the architectural configuration), servers' MPL (i.e. the local configuration) and different workload mixes, and to include predictions of abandon rate and service cost. The algorithm consists of the following four parts. The first part of the algorithm (lines 1−13) calculates the admitted requests and rejected requests at each tier of the multi-tier service following the method introduced in the example of Figure 9. It considers, in particular, the following model inputs: the service configuration κ(M, AC, LC), the workload amount N and the workload
Figure 9. Replicated multi-tier Internet services as a queueing network
Figure 10. (Algorithm 1) Modeling replicated multi-tier Internet services
mix X with its service visit ratios. Lines 1−6 of the algorithm calculate Nti, the number of requests that try to access tier Ti, considering the visit ratios of the tiers. Line 6 guarantees that if Nai-1 requests are admitted to tier Ti-1, no more than Nai-1 requests will try to access Ti, because of the synchronous communication between the tiers. Lines 7−9 apply Ti's MPL local configuration to calculate Nai, the number of requests admitted to Ti, and Nri, the number of requests rejected by Ti. Furthermore, because a request admitted to Ti may be rejected by one of the following tiers Ti+1..TM, lines 10−11 calculate Nai′, the number of requests admitted at Ti and not rejected by the following tiers. Finally, lines 12−13 produce the total number of rejected requests, Nr, and the total number of admitted requests, Na. The second part of the algorithm (lines 14−32) is dedicated to the prediction of service request latency. First, lines 15−17 initialize the queue lengths and service demand at each tier (i.e. the amount of work necessary to process a request on a tier Ti, excluding the inter-tier communication delays Di). Lines 18−31 consider the different tiers from the back-end to the front-end in order to estimate the cumulative request latency at each tier Ti: ℓai is the latency of a request admitted at tier Ti and admitted at the following tiers Ti+1..TM, and ℓri is the latency of a request admitted at Ti and rejected at one of the following tiers. The latter requests will not be part of the final admitted requests, but they are considered in the algorithm because they have an impact on queue lengths and, thus, on the response times and latency calculation of admitted requests. Lines 19−22 introduce requests one by one at tier Ti, calculate the service demand for each server replica at Ti, and estimate the request response time for that tier. In line 22, the Max function is used to ensure that the service demand is not lower than the incompressible time Wi induced by the service times characterizing the service workload mix. Then, lines 23−28 accumulate response times to calculate ℓai, the latency of requests admitted at tiers Ti..TM,
and ℓri, the latency of requests admitted at Ti but then rejected at one of the following tiers. These values are then used in lines 29−31 to calculate the queue length using Little's law. Finally, the overall latency ℓ of a client request is provided in line 32 as the latency of a request admitted to T1 and never rejected in the following tiers. The third part of the algorithm (lines 33−37) is dedicated to the estimation of the service abandon rate. It first calculates τa, the throughput of requests admitted at (and never rejected from) tiers T1..TM; τr, the throughput of requests admitted at T1 but then rejected by one of the following tiers; and τr′, the throughput of requests rejected at T1 due to the MPL limit. These values are then used to produce the total service request abandon rate α. Finally, the fourth and last part of the algorithm (lines 38−39) calculates the total cost ω of the replicated multi-tier service in terms of the servers that underlie the system. The algorithmic complexity of the proposed model is O(M ⋅ N), where M is the number of tiers of the multi-tier Internet service and N is the workload amount of the service.
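The first part of Algorithm 1 can be illustrated with the following sketch (our reconstruction from the description above and the Figure 9 example, not the authors' code); it computes the attempted, admitted and rejected requests per tier and the resulting abandon rate.

def admissions(n_clients, visit_ratios, ac, lc):
    # First part of Algorithm 1 (sketch): per-tier attempted (Nt),
    # admitted (Na) and rejected (Nr) requests for a configuration
    # kappa(M, AC, LC); the capacity of tier i is ac[i] * lc[i]
    # (replicas times per-server MPL).
    m = len(ac)
    nt, na, nr = [0.0] * m, [0.0] * m, [0.0] * m
    for i in range(m):
        if i == 0:
            nt[i] = n_clients
        else:
            # Visit-ratio scaling, capped by the upstream admissions
            # because inter-tier communication is synchronous (line 6).
            nt[i] = min(na[i - 1] * visit_ratios[i] / visit_ratios[i - 1],
                        na[i - 1])
        na[i] = min(nt[i], ac[i] * lc[i])   # MPL local configuration
        nr[i] = nt[i] - na[i]
    return nt, na, nr, sum(nr) / n_clients  # last value = abandon rate

# Figure 9 example: kappa(M=3, AC<1,1,2>, LC<20,15,3>), N=30, V<1,0.5,2>
nt, na, nr, alpha = admissions(30, [1, 0.5, 2], [1, 1, 2], [20, 15, 3])
print(nt, na, nr, round(alpha, 2))  # [30,10,10] [20,10,6] [10,0,4] 0.47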
CAPACITY PLANNING FOR INTERNET SERVICES

The objective of capacity planning is to calculate an optimal configuration of a multi-tier Internet service, for a given workload amount and workload mix, that fulfills the SLA in terms of latency and abandon rate constraints while minimizing the cost of the service. Thus, an optimal configuration κ* is a configuration that has the highest value θ* of the utility function. Figure 11 illustrates capacity planning. The main algorithm of capacity planning is given in Algorithm 2 (Figure 12). The algorithm takes as inputs a workload amount and a workload mix of an Internet service. It additionally has as input parameters the number of tiers of the Internet service, the SLA latency and abandon rate
Figure 11. Capacity planning of multi-tier Internet services
constraints to meet, and the underlying analytic service model (see the previous section). The algorithm produces an optimal configuration of the Internet service that guarantees the SLA and minimizes the cost of the service. The algorithm consists of two main parts: a first part that calculates a preliminary configuration guaranteeing the SLA constraints, and a second part that minimizes the service cost. The first part of the algorithm (lines 1−12) first increases the number of servers assigned to all tiers of the Internet service (cf. line 6). It then adjusts the local configuration of servers to distribute client requests among the server replicas at each tier (cf. line 8). However, if a request on a tier induces in total more than one request on the following tier (i.e. Vi-1 ≤ Vi), then, due to the synchronous communication between the tiers, the number of concurrent requests on the following tier must not exceed the number of concurrent requests on the current tier (cf. line 10). Afterward, the resulting service configuration, along with the workload amount and workload mix, is used to predict the latency, abandon rate and cost of the service. This process is repeated until a configuration that meets the SLA is found. The second part of the algorithm (lines 13−45) aims to reduce the number of servers assigned to the service in order to minimize its overall cost, that is, to calculate the minimum values of ACi. Basically, the minimum number of servers on tier Ti that is necessary to guarantee the SLA is
between 1 and the value of ACi calculated in the first part of the algorithm. To efficiently estimate the minimum value of ACi, a dichotomic search on that interval is performed (cf. lines 15−17). The local configuration LCi is adjusted accordingly to distribute requests among the server replicas at tier Ti (lines 18−21). Then, the latency, abandon rate and cost of the resulting service configuration are predicted using the analytic model. If the configuration meets the SLA constraints, the number of servers at that tier is reduced by pursuing the search for a lower value of ACi in the lower dichotomy [minAC..ACi] (lines 23−24). Otherwise, the new configuration does not meet the abandon rate SLO or the latency SLO. The former case means that too few servers are assigned to the service, and the search for a higher value of ACi is conducted in the higher dichotomy ]ACi..maxAC] (lines 26−27). If the abandon rate SLO is met but not the latency SLO, the client request concurrency level may be too high, which increases the latency; in this case, lower values of LC1..LCM are efficiently calculated using a dichotomic search (cf. lines 30−39). If the new value of the local configuration allows the SLA to be met, the algorithm pursues its search for a lower architectural configuration (lines 41−42). Otherwise, too few servers are assigned to the service, and the algorithm pursues the search for a higher architectural configuration (lines 43−44).
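The dichotomic search underlying this part of Algorithm 2 can be sketched as follows (our simplified single-tier illustration; in Algorithm 2 the search is performed per tier and interleaved with the search on local configurations).

def min_servers(meets_sla, ac_low, ac_high):
    # Dichotomic search (sketch) for the smallest server count in
    # [ac_low, ac_high] whose predicted configuration meets the SLA.
    # meets_sla(ac) wraps the analytic model: it predicts latency and
    # abandon rate for ac servers and checks them against the SLA.
    # Assumes meets_sla is monotone in ac (properties P1 and P2).
    while ac_low < ac_high:
        mid = (ac_low + ac_high) // 2
        if meets_sla(mid):
            ac_high = mid        # SLA met: search the lower half
        else:
            ac_low = mid + 1     # SLA violated: search the upper half
    return ac_low

# Hypothetical example: suppose the model says >= 5 servers are needed.
print(min_servers(lambda ac: ac >= 5, 1, 16))  # 5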
Figure 12. (Algorithm 2) Capacity planning of replicated multi-tier Internet services
Thus, the proposed capacity planning method has an algorithmic complexity given by Equation 3, where M is the number of tiers of the multi-tier Internet service, N is the workload amount of the service, and ACmax and LCmax are respectively the maximum values of the architectural and local configurations used in the loops at lines 16 and 33 of Algorithm 2. The logarithmic terms in ACmax and LCmax are due to the dichotomic searches on the architectural and local configurations:

O(M² ⋅ N ⋅ log2(ACmax) + M³ ⋅ N ⋅ log2(LCmax))  (3)

As a comparison, the algorithmic complexity of an exhaustive search for the optimal architectural and local configurations of a multi-tier Internet service is O(M² ⋅ N ⋅ ACmax ⋅ LCmax). The proposed capacity planning method thus outperforms the exhaustive search by orders of magnitude, depending on the size of the service. Moreover, the main capacity planning algorithm is complemented with an optional part presented in Algorithm 3 (Figure 13). This is motivated by the fact that Algorithm 2 may result in a service configuration that is optimal
Figure 13. (Algorithm 3) Additional part to capacity planning algorithm
for a workload amount N but whose local configuration is too restrictive for workload amounts higher than N. Indeed, the configuration produced by Algorithm 2 may meet the SLA for a given workload amount N but not for a higher workload amount, even though the architectural configuration would be sufficient to handle that higher workload. This would result in additional future service reconfigurations to handle higher workloads. Thus, Algorithm 3 aims to reduce future service reconfigurations and system oscillations by calculating, based on the result of Algorithm 2, the highest value of the local configuration that guarantees the SLA for N and still guarantees the SLA for workload amounts higher than N.
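The intent of Algorithm 3 can be sketched as follows (our simplified illustration; the actual algorithm operates on the full local configuration vector and may use a dichotomic search instead of the linear scan shown here).

def relax_local_configuration(meets_sla, lc, lc_max):
    # Sketch of Algorithm 3's intent: starting from the local
    # configuration produced by Algorithm 2, raise the concurrency
    # level as long as the SLA is still predicted to hold, so that
    # moderately higher workloads do not immediately force a new
    # reconfiguration. meets_sla(lc) wraps the analytic model.
    while lc < lc_max and meets_sla(lc + 1):
        lc += 1
    return lc

# Hypothetical example: the model accepts any MPL up to 300.
print(relax_local_configuration(lambda lc: lc <= 300, 200, 500))  # 300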
PROOFS

This section first describes properties that underlie multi-tier Internet services, before presenting the proofs of optimality and termination of the proposed capacity planning method.
Properties

P1. The service level objectives specified in the SLA are expressed in a reasonable way; that is, the latency and abandon rate constraints of the SLA can eventually be achieved with (enough) servers assigned to the Internet service.

P2. Adding servers to a multi-tier Internet service does not degrade the performance and availability of the service. Furthermore, if there is a latency or abandon rate bottleneck at a tier, adding (enough) servers to that tier will eventually improve the latency/abandon rate of the service and eventually remove the bottleneck from that tier.

P3. Augmenting the server concurrency level (i.e. the MPL) will eventually increase the latency of the service and reduce its abandon rate. Decreasing the server concurrency level will eventually reduce the latency of the service and increase its abandon rate.
Proof of Optimality

An optimal configuration of an Internet service for a given workload is a configuration that guarantees the SLA and that induces a minimal cost for the service. Let κ*(M, AC*, LC*) be the optimal configuration of a multi-tier Internet service consisting of M tiers. Thus:

PP(κ*) = 1

This is possible thanks to property P1, which states that the SLA is achievable. Furthermore, let κ(M, AC, LC) be any configuration of the service:

∀κ, PP(κ) = 1 ⇒ Σ ACi ≥ Σ ACi*

That is:

∀κ, θ(κ) ≤ θ(κ*) ∧ θ(κ*) > 0

In the following, we first show that the configuration produced by the proposed capacity planning algorithm meets the SLA, and then we demonstrate that this configuration has a minimal cost. Let κ(M, AC, LC) be the configuration produced as a result of the capacity planning of Algorithm 2. Suppose that κ does not guarantee the SLA. First, lines 1−12 of the algorithm iterate and increase the servers assigned to the Internet service until the SLA is met. Indeed, based on properties P1 and P2, this loop will eventually terminate with a configuration that guarantees the SLA at line 12. Then, suppose that the remainder of the algorithm (lines 13−45) results in a configuration that does not meet the SLA. This corresponds to one of the three following cases: line 26, line 38 or line 43 of the algorithm. In both cases of
lines 26 and 43, the number of servers assigned to the service is increased, which eventually allows the SLA to be met (cf. properties P1 and P2). Line 38 corresponds to the case where the abandon rate constraint is not met and the server concurrency level is augmented. This will either allow the SLA constraints to be met, based on property P3 (cf. line 41), or will be followed by an increase of the servers assigned to the service, which eventually guarantees the SLA based on properties P1 and P2 (cf. line 43). Thus, this contradicts the supposition that the configuration produced by the capacity planning algorithm does not meet the SLA. Suppose now that the configuration κ(M, AC, LC) produced by the capacity planning algorithm, which guarantees the SLA, does not have a minimal cost. That is:

(PP(κ) = 1) ∧ (Σ ACi > Σ ACi*)

By definition, removing any server from the optimal service configuration would result in an SLA violation (i.e. a performability preference violation) and the occurrence of a bottleneck at the tier where the server was removed:

∀κ, ∃i ∈ [1..M], ACi < ACi* ⇒ PP(κ) = 0

Thus, if the configuration κ resulting from the capacity planning algorithm does not have a minimal cost:

∃i ∈ [1..M], (PP(κ) = 1) ∧ (ACi > ACi*)  (4)

That means that, in Algorithm 2, the dichotomic search on ACi iterated on the higher dichotomy ]ACi..maxAC] instead of iterating on the lower dichotomy [minAC..ACi]. This corresponds to one of the two cases at line 27 or line 44 of the algorithm. However, in both cases, the SLA constraints are
not met, which contradicts Equation 4 and thus the supposition that the configuration produced by the capacity planning algorithm does not have a minimal cost.
Proof of Termination

The model algorithm presented in Algorithm 1 obviously terminates in O(M ⋅ N) calculation steps. Furthermore, Algorithm 2, which describes the capacity planning method, consists of three successive parts. The first part (lines 1−3) evidently terminates in M steps. The second part (lines 4−12) iterates until a service configuration that guarantees the SLA is found. Based on properties P1 and P2, this second part of the algorithm eventually terminates, after O(M² ⋅ N ⋅ log2(ACmax)) calculation steps. Finally, the third part of Algorithm 2 (lines 13−45) terminates after O(M² ⋅ N ⋅ log2(ACmax) + M³ ⋅ N ⋅ log2(LCmax)) steps. Thus, the capacity planning algorithm is guaranteed to terminate.
AUTOMATIC AND ONLINE MOKA CALIBRATION

Online monitoring and internal calibration make it possible to automatically calibrate the proposed model and capacity planning methods with their input values, without requiring manual calibration or profiling by a human administrator. This enables MoKa to self-adapt to changes in the workload mix and to precisely reflect the new behavior of the Internet service. Online monitoring of Internet services is first performed using sensors that periodically measure the state of the service and collect data such as the workload amount N, the client think time Z, and the visit ratios V of the multi-tier service. Then, average values of the collected data are calculated using the EWMA (Exponentially Weighted Moving Average) filter (Box, 2009). This filter produces average values of past observations over a time window, and the
weighting, which decreases exponentially for older observations, gives more importance to recent observations. Thus, once collected by the sensors, the data is filtered and the average values are given as inputs to the modeling and capacity planning methods that underlie the adaptive control. Other input variables are needed by the modeling and capacity planning methods, such as the service times S<S1,…,SM> and the inter-tier delays D. Because these variables are difficult to measure accurately through monitoring, they are automatically calculated by an internal calibration process of the proposed adaptive control system, as depicted in Figure 7. This is done using gradient descent, a first-order optimization method that efficiently calculates the parameter values providing the best accuracy for the model predictions (Avriel, 2003). To do so, the latency and abandon rate of the multi-tier service are monitored online as described earlier. Roughly speaking, this monitored data is compared with the predictions of the model when using different values of S and D (in addition to N, Z and V, which were obtained as described earlier), and the values of S and D that maximize the accuracy of the model are chosen.
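Both calibration ingredients can be illustrated with a short sketch (our example: a recursive EWMA and a one-dimensional numerical gradient descent; the actual MoKa calibration fits the full vectors S and D against the monitored latency and abandon rate).

def ewma(samples, alpha=0.3):
    # Exponentially weighted moving average: recent observations
    # dominate, older ones decay geometrically.
    avg = samples[0]
    for x in samples[1:]:
        avg = alpha * x + (1 - alpha) * avg
    return avg

def calibrate(predict_latency, measured_latency, s0, lr=1e-3,
              steps=200, eps=1e-4):
    # Gradient-descent calibration (sketch): find the service time s
    # minimizing the squared error between predicted and measured
    # latency. The gradient is estimated by finite differences.
    s = s0
    for _ in range(steps):
        err = lambda v: (predict_latency(v) - measured_latency) ** 2
        grad = (err(s + eps) - err(s - eps)) / (2 * eps)
        s -= lr * grad
    return s

# Toy example: latency grows linearly with service time, true s = 0.012
print(round(calibrate(lambda s: 30 * s, 0.36, s0=0.005), 4))  # 0.012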
EVALUATION

Experimental Environment

We implemented the MoKa adaptive control of Internet services as a Java software prototype. The MoKa prototype consists of three main parts: one for the modeling of services, one for the capacity planning of services, and one for the service controller. Furthermore, MoKa is designed as an open framework that can easily be extended to include new model and capacity planning algorithms, and to compare them regarding their accuracy, optimality and efficiency. Moreover, MoKa follows a proxy-based approach in order to integrate the proposed adaptive control into an Internet service
in a non-intrusive way. This allows MoKa, for instance, to integrate monitoring sensors and reconfiguration actuators into an Internet service. The evaluation of the proposed MoKa modeling and adaptive control was conducted using the TPC-W benchmark (TPC-W, 2010). TPC-W is an industry-standard benchmark from the Transaction Processing Performance Council that models a realistic web bookstore. TPC-W comes with a client emulator that generates a set of concurrent clients remotely accessing the bookstore application. These clients emulate the behavior of real web clients by issuing requests for browsing the content of the bookstore, requesting the best-sellers, buying the content of the shopping cart, etc. The client emulator generates different profiles by varying the workload amount and workload mix (the ratio of browsing to buying). In our experiments, the online bookstore was deployed as a two-tier system, consisting of a set of replicated web/business front-end servers and a set of replicated back-end database servers. The client emulator ran on a dedicated machine to remotely send requests to the online bookstore. Two workload mixes were used in our experiments: mix X1, representing a version of TPC-W's browsing mix with read-only interactions, and mix X2, which extends X1 with a heavier workload on the back-end tier. Whereas the original TPC-W client emulator generates a given static workload amount and workload mix, we modified the client emulator in order to introduce more dynamics into the generated workload. Thus, during a given experiment, the workload amount and the workload mix vary over time. The experiments with MoKa were conducted on the Grid'5000 experimental platform (Grid'5000, 2010). The machines have Intel Xeon processors running at 2.33 GHz and 4 GB of memory, and are interconnected via a Gigabit Ethernet network. The machines run the Linux 2.6.26 kernel, Apache Tomcat 5.5 for web/application servers, and MySQL 5.0 for database
servers. Round-robin was used to dynamically balance the load among server replicas.
Model Evaluation

This section evaluates the accuracy of the proposed analytic model of Internet services. It considers a workload that varies over time in amount and in mix, as described in Figure 14.

Figure 14. Workload variation

Here, the workload mix varies from mix X1 to mix X2, and for each mix, the workload amount varies between 250 and 1250 clients. In this context, the behavior of the real multi-tier Internet service is compared with the predictions of the proposed model. Furthermore, two (static) configurations of the multi-tier Internet service are considered: κ1(2,<1,1>,<500,500>) and κ2(2,<3,6>,<200,100>). The former represents a minimal architectural configuration where the capacity of the system is increased by increasing the local configuration, while the latter uses the default local configurations of the Tomcat front-end server and the MySQL back-end server and increases the capacity of the system by increasing the architectural configuration. The use of these two configurations
is intended to mimic human administrators who apply ad-hoc configurations to increase the capacity of Internet services. Figure 15 compares the latency measured on the online Internet service with the latency predicted by the model, and Figure 16 compares the real abandon rate of the online Internet service with the abandon rate calculated by the model. Both figures show that the model is able to accurately predict the latency and abandon rate of the multi-tier Internet service. For instance, the average difference between the real latency and the predicted latency is 8% for κ1, and the average absolute difference is 20 ms for κ2. The abandon rate is predicted with an average error not exceeding 2%. Furthermore, since the model's prediction of the cost of an Internet service is straightforward, it is accurate and not presented here.
Capacity Planning Evaluation

This section evaluates the proposed capacity planning method with regard to the optimality of the configuration produced by the method. To do so,
Figure 15. Accuracy of performance – real latency vs. predicted latency
Figure 16. Accuracy of availability – real abandon rate vs. predicted abandon rate
Figure 17. Optimality of architectural configuration
the following SLA is considered, with a maximum service latency ℓmax of 1 s and a maximum service abandon rate αmax of 10%. That means that capacity planning must produce a service configuration with a minimal cost while guaranteeing that at least 90% of client requests are processed within 1 s. Here, different workload mixes and different workload amount values are considered for the two-tier TPC-W Internet service. For each workload value, the proposed capacity planning method calculates an architectural configuration and a local configuration of the Internet service, which are respectively presented in Figure 17 and Figure 18. Furthermore, we compare the result of the proposed capacity planning method with the result of another method based on an exhaustive search. This latter method performs a search over the set of possible architectural and local configurations of the Internet service. It compares all possible configurations and produces the optimal
configuration that guarantees the SLA and minimizes the cost of the service. Figure 17 and Figure 18 compare the two methods with regard to their calculated architectural and local configurations, and show that the proposed capacity planning method produces the optimal configuration of the Internet service.
Adaptive Control Evaluation

This section presents the evaluation of the MoKa-based control applied to the online two-tier TPC-W Internet service. Here, the SLA specifies a maximum service latency ℓmax of 1 s and a maximum service abandon rate αmax of 5% (these values are chosen for illustration purposes; ℓmax and αmax can take other values, e.g. αmax = 0% to represent a service that must always be available). Figure 19 describes the variation over time of the Internet service workload, i.e. the variation of the workload mix from mix X1 to mix X2 and, for
Figure 18. Optimality of local configuration
Figure 19. Workload variation
each mix, the variation of the workload amount between 250 and 1000 concurrent clients. In this context, the behavior of the MoKa-based controlled system is first compared with two baseline systems: one with a small (static) configuration
κ1(2,<1,1>,<500,500>) and another with a larger (static) configuration κ2(2,<3,6>,<200,100>). Figure 20 and Figure 21 respectively present the service latency and service abandon rate of the multi-tier Internet service, comparing the
Figure 20. Service latency with and without MoKa control
Figure 21. Service abandon rate with and without MoKa control
MoKa-based controlled system with the two non-controlled baseline systems κ1 and κ2. These figures show that κ1 is not able to meet the SLA performance constraint, and that κ2 is not able to meet the SLA availability constraint when the
workload is too heavy. In comparison, MoKa is able to control the configuration of the Internet service in order to meet SLA constraints. The different points where the MoKa-based controlled system is above the SLA limits correspond to the
Figure 22. Service architectural configuration with and without MoKa control
Figure 23. Service local configuration with and without MoKa control
occurrence of workload changes and the time necessary for the system to reconfigure and stabilize. Overall, the average latency of the non-adaptive κ1 system is an order of magnitude higher than that of the MoKa-based system, and the abandon rate of κ1 is a factor of 3 higher than that of the MoKa-based system. The adapted architectural and local configurations of the MoKa-based controlled system are shown in Figure 22 and Figure 23, and the cost
is given in Figure 24. This shows that MoKa is able to assign to the Internet service only the servers strictly necessary to guarantee the SLA, with a saving of up to 67% of the servers. Finally, MoKa is able to automatically detect service workload changes and self-calibrate with the current workload amount (N between 250 and 1000 in the present experiment) and the current workload mix (e.g. mix X1 (Z = 7 s; V; S<S1 = 9 ms, S2 = 14 ms>; D) and mix X2 (Z = 7 s; V; S<S1 = 11.5 ms, S2 = 27 ms>; D)). Based on this automatic MoKa calibration, the optimal configuration is dynamically applied to the online service. In addition to the previous comparison of MoKa with ad-hoc static configurations, we also compare MoKa with a linear control approach, a technique classically used for Internet service provisioning. Here again, the SLA specifies a maximum service latency ℓmax of 1 s and a maximum service abandon rate αmax of 5%, and the Internet service workload varies over time, with a workload mix variation from mix X1 to mix X2
and a workload amount variation between 200 and 4000 concurrent clients. Roughly speaking, the linear control is first calibrated with two internal parameters: a reference service configuration κ0 and a reference workload amount N0. The reference workload amount is the maximum number of concurrent clients that the reference service configuration can handle while guaranteeing the SLA. These internal parameters of the linear controller are obtained through preliminary profiling of the Internet service. Afterwards, the controller applies a linear capacity planning method that simply scales the reference service configuration κ0 in proportion to the ratio between the current workload amount N and the reference workload amount N0. In the following experiments, the reference service configuration and workload amount used to calibrate the linear control are respectively κ0(2,<1,3>,<600,200>) and N0 = 630 clients. Figure 25 and Figure 26 respectively present the Internet service latency and abandon rate, and compare the MoKa-based controlled service with
Figure 24. Service cost with and without MoKa control
Figure 25. Service latency with different control techniques
Figure 26. Service abandon rate with different control techniques
Figure 27. Service cost with different control techniques
the linearly controlled service. The figures show that both control approaches are able to meet the SLA requirements. However, MoKa keeps the latency and abandon rate near the SLA limits, which allows it to improve resource usage and thus to reduce the service cost compared to the linear control. This is shown in Figure 27, where, compared to MoKa's optimal control, the linear control induces a cost overhead of 23%.
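For reference, the linear control baseline used in this comparison can be sketched in a few lines (our illustration; it assumes the reference architectural configuration is simply scaled with the workload and rounded up).

import math

def linear_provisioning(n, n0, ac0):
    # Linear control baseline (sketch): scale the reference
    # architectural configuration ac0, which handles n0 clients
    # within the SLA, proportionally to the current workload n.
    factor = n / n0
    return [max(1, math.ceil(ac * factor)) for ac in ac0]

# Reference kappa0(2,<1,3>,<600,200>) calibrated for N0 = 630 clients
print(linear_provisioning(2000, 630, [1, 3]))  # [4, 10]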
RELATED WORK

The control of services to guarantee the SLA is a critical requirement for the successful performance and availability management of Internet services (Loosley, 1997; Marcus, 2003; Menascé, 2001). The management of service performance and availability is usually achieved by system administrators using ad-hoc tuning (Brown, 2010; Microsoft, 2010). However, new approaches are appearing to ease the management of such systems. These approaches differ with regard to several criteria: tackling performance and/or availability
objectives, handling Internet service workload variations in terms of workload amount and/or workload mix, the control techniques used, and the control mechanisms (i.e. actuators) applied. Different control mechanisms may be considered to manage service performance and availability, such as server provisioning, admission control, service differentiation, service degradation, and request scheduling (Guitart, 2010). In the following, we focus on approaches using the first two techniques, namely admission control for the local configuration of the concurrency level of a server, and server provisioning for the architectural configuration of the size of a replicated distributed Internet service. Admission control fixes the MPL concurrency level of a multi-programming system (e.g. multithreaded servers). It has been extensively studied in server systems, and it has been applied to web servers (Elnikety, 2004), database servers (Milan-Franco, 2004), and multi-tier systems (Menascé, EC 2001). Some admission control solutions are proposed in the form of heuristics (Heiss, 1991; Chen, 2003; Milan-Franco, 2004). Hill-climbing
is a well-known heuristic applied in several admission control solutions. These solutions have the advantage of being simple to implement; however, they provide best-effort behavior without guarantees on the quality-of-service and SLA of the Internet services. Other approaches aim to provide strict guarantees on the quality-of-service and are usually based on analytic models to characterize the system and control it. They include linear and nonlinear models (Diao, 2002; Parekh, 2002; Tipper, 1990; Wang, 1996), queueing theory-based or control theory-based models (Parekh, 2002; Diao, 2002; Malrait, 2009), models for centralized systems or for distributed services (Bouchenak, 2006; Sivasubramanian, 2006; Urgaonkar, 2007), models providing guarantees on a single QoS criterion or combining multiple criteria (Chase, 2001; Menascé, EC 2001), and models applying a single or multiple control mechanisms, i.e. actuators (Diao, 2002; Milan-Franco, 2004). Other approaches control Internet services by provisioning/unprovisioning servers for the service. Autonomic provisioning of database servers is presented in (Chen, 2006), and server provisioning in multi-tier systems is described in (Bouchenak, 2006). While these systems are based on heuristics, other approaches characterize multi-tier applications more precisely through analytic modeling for provisioning multi-tier systems (Villela, 2007; Urgaonkar, 2007). However, these approaches are restricted to performance management and do not take into account service availability objectives. Furthermore, they require extensive model calibration with appropriate parameter values; this calibration is tied to a given workload mix and must be redone each time the workload mix changes, which is not easily detectable. In summary, MoKa differs from the other approaches in many aspects: (i) it takes into account and combines service performance and service availability SLA objectives, (ii) it combines admission control with server provisioning for a better usage of resources and service cost minimization,
and (iii) it automatically handles both workload amount and workload mix variations without requiring manual recalibration, enabling an adaptive control of Internet services.
CONCLUSION

This chapter presented MoKa, a system for the adaptive control of Internet services that guarantees performance and availability objectives while minimizing cost. The contribution of MoKa is multifold. First, a utility function is defined to quantify the performance, availability and cost of distributed Internet services. Second, a utility-aware capacity planning method is developed; given SLA performance and availability constraints, it calculates a configuration of the Internet service that guarantees the constraints while minimizing the cost of the service. Third, a queueing theory-based analytic model of multi-tier Internet services is proposed; the model accurately predicts service performance, availability and cost, and is used as the basis of the capacity planning. Finally, an adaptive control of online Internet services is proposed in the form of a feedback control loop that automatically detects workload mix and workload amount variations and reconfigures the service with its optimal configuration. The proposed model, capacity planning and control methods are implemented and applied to an online bookstore. The experiments show that the Internet service successfully self-adapts to both workload mix and workload amount variations, with significant benefits in terms of service performance and availability and a saving of the resources underlying the Internet service. We hope that such a method will lead to more principled, less ad-hoc implementations of resource management in Internet services and cloud computing environments. The reader is also invited to consult Chapter 4, which is related to utility-aware performance management of composite services.
ACKNOWLEDGMENT

Experiments presented here were carried out using the Grid'5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (Grid'5000, 2010).
REFERENCES

Avriel, M. (2003). Nonlinear Programming: Analysis and Methods. Dover Publishing.

Box, G. E. P., Luceno, A., & Del Carmen Paniagua-Quinones, M. (2009). Statistical Control by Monitoring and Adjustment. Broché.

Brown, M. (2010). Optimizing Apache Server Performance. Retrieved May 20, 2010, from http://www.serverwatch.com/tutorials/article.php/3436911

Chase, J. S., Anderson, D. C., Thakar, P. N., Vahdat, A. M., & Doyle, R. P. (2001). Managing Energy and Server Resources in Hosting Centers. The 18th ACM Symposium on Operating Systems Principles (SOSP'01), New York, NY.

Chen, J., Soundararajan, G., & Amza, C. (2006). Autonomic Provisioning of Backend Databases in Dynamic Content Web Servers. The 3rd IEEE International Conference on Autonomic Computing (ICAC 2006), Dublin, Ireland.

Chen, X., Chen, H., & Mohapatra, P. (2003). ACES: An Efficient Admission Control Scheme for QoS-aware Web Servers. Computer Communications, 26(14). doi:10.1016/S0140-3664(02)00259-1

Diao, Y., Gandhi, N., Hellerstein, J., Parekh, S., & Tilbury, D. (2002). Using MIMO Feedback Control to Enforce Policies for Interrelated Metrics with Application to the Apache Web Server. Network Operations and Management Symposium (NOMS).

Diao, Y., Hellerstein, J. L., Parekh, S., Shaikh, H., & Surendra, M. (2006). Controlling Quality of Service in Multi-Tier Web Applications. The 26th International Conference on Distributed Computing Systems (ICDCS 2006), Lisbon, Portugal.

Elnikety, S., Tracey, J., Nahum, E., & Zwaenepoel, W. (2004). A Method for Transparent Admission Control and Request Scheduling in E-Commerce Web Sites. The 13th International Conference on World Wide Web (WWW 2004), New York, NY.

Ferguson, P., & Huston, G. (1998). Quality of Service: Delivering QoS on the Internet and in Corporate Networks. John Wiley & Sons.

Grid'5000. (2010). Grid'5000. Retrieved May 20, 2010, from http://www.grid5000.fr/

Guitart, J., Torres, J., & Ayguadé, E. (2010). A Survey on Performance Management for Internet Applications. Concurrency and Computation, 22(1). doi:10.1002/cpe.1470

Heiss, H.-U., & Wagner, R. (1991). Adaptive Load Control in Transaction Processing Systems. The 17th International Conference on Very Large Data Bases (VLDB 1991), Barcelona, Spain.

Loosley, C., Douglas, F., & Mimo, A. (1997). High-Performance Client/Server. John Wiley & Sons.

Malrait, L., Bouchenak, S., & Marchand, N. (2009). Fluid Modeling and Control for Server System Performance and Availability. The 39th Annual IEEE/IFIP Conference on Dependable Systems and Networks (DSN 2009).

Malrait, L., Bouchenak, S., & Marchand, N. (2010). Experience with ConSer: A System for Server Control Through Fluid Modeling. IEEE Transactions on Computers.

Marcus, E., & Stern, H. (2003). Blueprints for High Availability. New York: Wiley.

Menascé, D. A., & Almeida, V. A. F. (2001). Capacity Planning for Web Services: Metrics, Models, and Methods. Upper Saddle River, NJ: Prentice Hall.

Menascé, D. A., Barbara, D., & Dodge, R. (2001). Preserving QoS of E-Commerce Sites Through Self-Tuning: A Performance Model Approach. The ACM Conference on Electronic Commerce (EC'01), Tampa, FL.

Microsoft. (2010). Optimizing Database Performance. Retrieved May 20, 2010, from http://msdn.microsoft.com/enus/library/aa273605(SQL.80).aspx

Milan-Franco, J., Jimenez-Peris, R., Patino-Martinez, M., & Kemme, B. (2004). Adaptive Middleware for Data Replication. The 5th ACM/IFIP/USENIX International Conference on Middleware (Middleware 2004), New York, NY.

Parekh, S. S., Gandhi, N., Hellerstein, J. L., Tilbury, D. M., Jayram, T. S., & Bigus, J. P. (2002). Using Control Theory to Achieve Service Level Objectives in Performance Management. Real-Time Systems, 23(1-2).

Reiser, M., & Lavenberg, S. S. (1980). Mean-Value Analysis of Closed Multi-Chain Queuing Networks. Journal of the ACM, 27(2), 313–322. doi:10.1145/322186.322195

Sivasubramanian, S., Pierre, G., van Steen, M., & Bhulai, S. (2006). SLA-Driven Resource Provisioning of Multi-Tier Internet Applications. Technical Report, Department of Mathematics and Computer Science, Vrije Universiteit, Amsterdam.

Tipper, D., & Sundareshan, M. (1990). Numerical Methods for Modeling Computer Networks Under Nonstationary Conditions. IEEE Journal on Selected Areas in Communications, 8(9).

TPC-W. (2010). TPC-W: A Transactional Web e-Commerce Benchmark. Retrieved May 20, 2010, from http://www.tpc.org/tpcw/

Urgaonkar, B., Pacifici, G., Shenoy, P., Spreitzer, M., & Tantawi, A. (2007). Analytic Modeling of Multi-Tier Internet Applications. ACM Transactions on the Web, 1(1).

Villela, D., Pradhan, P., & Rubenstein, D. (2007). Provisioning Servers in the Application Tier for E-Commerce Systems. ACM Transactions on Internet Technology, 7(1).

Wang, W.-P., Tipper, D., & Banerjee, S. (1996). A Simple Approximation for Modeling Nonstationary Queues. In Proceedings of the 15th Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE INFOCOM '96), San Francisco, CA.

Zhang, Q., Cherkasova, L., & Mi, N. (2008). A Regression-Based Analytic Model for Capacity Planning of Multi-Tier Applications. Journal of Cluster Computing, 11(3). doi:10.1007/s10586-008-0052-0
Section 3
Dependability
Chapter 11
Performability Evaluation of Web-Based Services

Magnos Martinello, Federal University of Espírito Santo (UFES), Brazil
Mohamed Kaâniche, CNRS; LAAS & Université de Toulouse, France
Karama Kanoun, CNRS; LAAS & Université de Toulouse, France

DOI: 10.4018/978-1-60960-794-4.ch011
ABSTRACT

The joint evaluation of performance and dependability in a unique approach leads to the notion of performability, which usually combines different analytical modeling formalisms (Markov chains, queueing models, etc.) for assessing system behavior in the presence of faults. This chapter presents a systematic modeling approach allowing designers of web-based services to evaluate the performability of the service provided to the users. We have developed a multi-level modeling framework for analyzing the user-perceived performability. Multiple sources of service unavailability are taken into account, particularly i) hardware and software failures affecting the servers, and ii) performance degradation due to, e.g., the overload of servers and the probability of loss. The main concepts and the feasibility of the proposed framework are illustrated using a web-based travel agency. Various analytical models and sensitivity studies are presented, considering different assumptions with respect to user profiles, architecture, faults, recovery strategies, and traffic characteristics.
INTRODUCTION

Business and individuals are increasingly relying on web-based services for a growing number of applications, facilitated among other things by the cloud computing paradigm. Service-oriented
programming is based on composing applications by invoking network-available services to accomplish some tasks. Web services are currently the most promising technology for service-oriented computing. They use the Internet as the communication medium, together with open Internet-based standards, including the Simple Object Access Protocol for transmitting data, the Web Services
Description Language for defining services, and the Business Process Execution Language for Web Services for orchestrating services. However, the quality of service of these offerings, which is usually quantified in terms of performance and dependability measures, can differ significantly (Rosenberg, Platzer, & Dustdar, 2006). Even though some commercial service providers advertise four-9's or five-9's for server availability (i.e., an availability of 99.99% or 99.999%, respectively), highly available servers are not sufficient for providing highly available services, due to the huge variety of failures that may prevent users from accessing services. In practice, these numbers reflect performance under optimal operating conditions rather than in a real-world environment. (Paxson, 1997) has shown that significant routing pathologies may prevent certain pairs of hosts from communicating about 1.5% to 3.3% of the time. Moreover, (Kalyanakrishnan, Iyer, & Patel, 1999) has suggested that the average availability of a typical web service is about two-9's (15 minutes of service unavailability per day, from the end user's perspective), which is far from the announced five-9's (5 minutes of unavailability per year). This chapter concentrates on building analytic models to evaluate the quality of service of service-oriented computing, including several sources of service unavailability such as hardware and software failures, as well as performance degradation-related failures. Such models can be used by the designers of applications based on web services to analyze the quality of service delivered to the users and to support the selection of the best composition of web services that maximizes user satisfaction. Analyses from a pure dependability viewpoint tend to be conservative, because they do not include performance degradation, while analyses from a pure performance viewpoint tend to be optimistic, because they do not integrate failure and repair events. We therefore address the joint evaluation of dependability and performance of web-based services. Integrating performance and dependability in a unique approach leads to
the notion of performability, which is necessary to capture the various degraded operational states, measuring not only whether the service is up or down, but also the degraded levels of service in between. Due to the complexity of the target system and to the difficulty of combining various types of information, we have developed a multi-level modeling framework for analyzing the user-perceived performability. The analysis concerns the identification of the functions and services provided to the users, and of the resources contributing to their accomplishment. It also characterizes the various interactions of the users with the applications. Modeling is carried out considering four levels, namely: user, function, service and resource. The first level describes how the users invoke the application (the operational profile), and the three remaining levels detail how user requests are handled by the application at distinct abstraction levels. The multi-level modeling framework is then applied to the case of a web-based travel agency as an example. The objectives of the case study are twofold: i) to show how to go through the four levels of the modeling framework, and ii) to present examples of performability analyses and evaluation results obtained from the models, to help service providers make objective design decisions. Various techniques can be used to model each level: fault trees, reliability block diagrams, Markov chains, stochastic Petri nets, stochastic activity networks, etc. The selection of the right technique mainly depends on the kind of dependencies between the elements of the considered level, the distributions of the events involved, and the quantitative measures to be evaluated. In this chapter, for the case study considered, we make use of block diagrams and Markov chains. The chapter is organized in six sections. Section 2 gives the background of performability modeling and evaluation, as well as related work. Section 3 presents the modeling framework for evaluating the user-perceived performability of
web-based applications. Section 4 illustrates the main concepts of this modeling framework on a web-based travel agency as an example. Section 5 is dedicated to the performability evaluation, considering examples of web server architectures and taking into account system component failures and recovery, performance degradation due to a significant workload (i.e., system overload), etc. Section 6 concludes the chapter.
BACKGROUND

Concepts

According to the terminology presented in (Avizienis, Laprie, & Randell, 2004; Laprie, 1995), dependability is the ability to deliver a service that can justifiably be trusted. Depending on the applications considered, emphasis may be put on different attributes of dependability, such as reliability, availability, safety, etc. These attributes allow one to: a) express the properties expected from the system, and b) assess the quality of the service delivered, as resulting from the threats and the means used to avoid them. The threats to dependability correspond to the faults, errors, and failures that might affect the service(s) delivered by the system. Several complementary means are needed to cope with such threats. They can be grouped into four major categories: fault prevention, fault tolerance, fault removal and fault forecasting. This chapter addresses fault forecasting in the context of web applications and systems that are based on service-oriented computing architectures. Fault forecasting is aimed at estimating the presence of threats and evaluating their likely consequences on the quality of service delivered to the users. Evaluation is generally based on probabilistic techniques using stochastic models and data from real-life measurements. It requires, in a first step, the definition of the quantitative measures to assess the dependability attributes and the identification of the failure modes to be taken into account in the assessment of these measures. To define the dependability measures, two classes of states of the delivered service should be distinguished: correct and incorrect. Two main categories of measures are generally distinguished:

• Measures that characterize the sojourn time in the states where the correct service is being delivered (before reaching the incorrect service states): these correspond, for example, to reliability and MTTF (Mean Time To Failure), which measure the time of correct service delivery prior to a failure.

• Measures that characterize the delivery of correct service with respect to the alternation of correct and incorrect services: these encompass the various forms used to measure availability (time instant, time interval, or asymptotic).
The impact of many types of failures on the web service availability can be accounted for by adopting a composite performance and dependability evaluation approach, well-known as performability modeling approach (Meyer 1980). The combination of performance related measures and dependability measures has proven to be well suited to evaluate the impact of service degradation on system dependability (De Souza et Silva & Gail, 1992; Trivedi, Muppala, Woolet, & Haverkort, 1992). This chapter presents performability models that can be used to evaluate the availability of web based applications and systems, taking into account explicitly i) performance-related failures, that are due to the overload of servers, and ii) the computer host hardware and software failures leading to web server failure. The main idea of a performability modeling approach consists in combining the results obtained from two models: a performance model and a dependability model. The performance model takes into account the processes of arrival and service quantifying performance related measures conditioned on states determined from the availability model. The availability model is used to evaluate the steady state probability associated to the system states that result from the occurrence of failures and recoveries. This approach relies on the assumption that the system reaches a quasi steady state with respect to the performance-related events, that is, long time intervals between successive occurrences of failure-recovery events. This assumption holds when the failure/recovery rates are much lower than the arrival/service rates, which is typically true in service oriented computing.
RELATED WORK

This section discusses related work addressing dependability and performability analysis and assessment in the context of web-based services, using measurements and analytical models.
A few studies have been devoted to measuring web service availability and reliability in order to characterize the corresponding failure behavior. As an example, (Oppenheimer, Ganapathi, & Patterson, 2003) studied the causes of failures and the potential effectiveness of various techniques for preventing service failure, using data from three large-scale Internet services. Recently, an extensive evaluation of real-world web services was conducted in (Zheng, Zhang, & Lyu, 2010). This study investigated 21,358 web services from the Internet and assessed several user-dependent QoS metrics from different distributed locations. We can also mention the fault injection experiments presented in Chapter 3, which are related to the measurement of response time and the analysis of exception propagation for a number of web services used in e-science and bioinformatics applications. Considering more specifically the evaluation of composed web services, these usually have complex application logic in which any piece of code and any application component deployed on a system can be reused and transformed into a network-available service (Papazoglou & Traverso, 2007). Thus, checking whether or not a composed service will meet its dependability and performance requirements is recognizably a non-trivial problem. Research results aimed at addressing these issues are recent and not yet mature. As an example, (Zeng et al., 2004) presented a middleware platform for the dynamic composition of web services using optimization algorithms that take into account five generic quality criteria (execution price, execution duration, reputation, successful execution rate, and availability). Chapter 4 describes a QoS broker architecture that manages the performance of composite web service applications using a utility function combining various QoS metrics for selecting the service providers supporting the application. In (Lakhal, Kobayashi, & Yokota, 2005), a mathematical model is developed for estimating and analyzing the execution time and the reliability
of fault-tolerant and dynamically executed web service compositions. Recently, methodologies combining business process modeling with quantitative performance and dependability modeling to assess composed web services have also been investigated, e.g., in (Gönczy et al., 2006; Bruneo, Distefano, Longo, & Scarpa, 2010). Clearly, several open problems need to be addressed to enable the dependability assessment of web-based services and systems. The contributions presented in this chapter are aimed at fulfilling this objective. Our research builds on the multi-level modeling approach proposed in (Kaâniche, Kanoun, & Rabah, 2001; Kaâniche, Kanoun, & Martinello, 2003) and presented in Section 3.
Multi-Level Modeling Framework Overview

Web-based service oriented computing is implemented on large, distributed infrastructures, with multiple interconnected layers of software and hardware components, involving various types of servers such as web, application, and database servers. Usually three key players are involved: the users, the web application provider, and the external suppliers. The users interact through a web browser with the web application providing the service. The web application provider implements a set of functions that are invoked by the users. These functions are based on a set of services and resources that are either under the direct control of the application provider (internal services) or provided by external suppliers (external services). Every request initiated by a user is processed in several steps. It starts in the user's browser, flows through the Internet, is executed by the web application provider (or provider, for short), and can also be processed by the application using the external suppliers. For example, a provider can offer an electronic book-selling service while outsourcing shipping, payment, and billing to external service suppliers. At the provider level, the user requests and the interactions with
the external suppliers are supported by a set of complex distributed applications and middleware, such as web servers, application servers, and database servers. Similar infrastructures are used at the external supplier sites. Users may invoke the various functions provided by the web application in different ways and with variable frequencies. Also, the types of functions invoked and the resources involved in supporting them are not necessarily the same. As a consequence, the user-perceived performability is affected not only by the user operational profile (i.e., the workload) but also by the state of the components supporting the functions and by their performance- and dependability-related characteristics. Generally, the service provider architecture is under the direct control of the web application designers. Therefore, a detailed analysis of this architecture can be carried out to support architectural design decisions. However, only limited information is usually available about the infrastructure supporting the external suppliers. For the external suppliers, remote measurements can be performed in order to characterize the dependability of such services. These measures can then be incorporated into the models describing the impact of component failures and repairs on web service performability. From the designers' point of view, it is critical to understand how the different components of the distributed infrastructure supporting the provided service might affect the user-perceived performability. Hierarchical modeling is well suited to support such analysis and to master complexity by progressively describing the target system at different abstraction levels, with a sub-model associated with each level. The performability measures can be computed based on the hierarchical composition of the sub-models. The multi-level modeling framework proposed in (Kaâniche, Kanoun, & Rabah, 2001), illustrated in Figure 1, follows such a compositional approach. Four abstraction levels are distinguished.
• The user level models the execution scenarios performed by the user when visiting the web site(s). Each scenario is defined by the set of functions invoked and the probability of activation of each function in the corresponding scenario.
• The function level describes the set of functions available to the user level. Some of these functions (e.g., Search, Login) may be found in most web sites, whereas others are characteristic of certain web sites or of specific types of web applications.
• The service level describes the main services needed to implement each function and the interactions among them. Two categories of services are distinguished: internal and external services.
Figure 1. Performability modeling framework
• The resource level describes the architecture on which the services identified at the service level are implemented. At this level, the architecture, fault tolerance, and maintenance strategies implemented at the provider site are detailed. However, each service provided by an external supplier is considered as a black box.
The function and service levels describe, according to a top-down approach, how the application software implementing the e-business logic is structured and decomposed, whereas the resource level describes the corresponding execution environment (software, hardware components, etc.). The performability modeling and evaluation is directly related to this hierarchical system description. The outputs of a given level are used in the immediately upper level to compute the performability measures associated with that level (denoted by M(x), where x is a user, a function, a service, or a resource). Accordingly, at the service level, the performability of each service is derived from the performability of the resources involved in its accomplishment. Similarly, at the function level, the availability of each function is obtained from the performability of the services implementing it. Finally, at the user level, the performability measures are obtained from the performability of the functions invoked by the users. At the service/resource level, one or several performability models are built based on the knowledge of the infrastructure and the resources implementing the required services. This level also includes the fault tolerance and recovery mechanisms, as well as the maintenance policies at the service provider site(s). Various techniques can be used to build and solve these models, including non-state-based techniques, also called combinatorial techniques (e.g., fault trees, reliability block diagrams), state-based techniques (e.g., Markov reward models, Generalized Stochastic Petri Nets (GSPNs)), or hybrid approaches combining different techniques. The selection of the right technique mainly depends on: i) the distributions associated with the failure, repair, and other events included in the model; ii) the dependencies between the elements of the underlying level; and iii) the performability measures to be evaluated. For example, the performability model at the function level relies on the knowledge of the performability of all services involved in each function's accomplishment and of all possible execution scenarios associated with each function. The outputs of this level are the performabilities of the functions, denoted M(Fi) and defined as:

M(F_i) = \sum_{j=1}^{N} \omega_j \, M(\sigma_j(F_i)) \quad (1)
where:
• N is the number of execution scenarios for function Fi;
• ωj is the probability of activation of execution scenario j;
• σj(Fi) denotes the set of services involved in execution scenario j;
• M(σj(Fi)) is the performability of the services involved in execution scenario j.
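As a minimal illustration of equation (1), the following sketch (the scenario weights and performability values are hypothetical) computes M(Fi) from (ωj, M(σj(Fi))) pairs:

```python
# Sketch of equation (1); scenario weights and performability values
# below are hypothetical.

def function_performability(scenarios):
    """scenarios: list of (omega_j, m_j) pairs, where omega_j is the
    activation probability of execution scenario j and m_j is
    M(sigma_j(F_i)), the performability of the services it involves."""
    assert abs(sum(w for w, _ in scenarios) - 1.0) < 1e-9
    return sum(w * m for w, m in scenarios)

# A function with three execution scenarios (N = 3):
print(function_performability([(0.5, 0.999), (0.3, 0.995), (0.2, 0.980)]))
```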
This equation is general and can be applied in a similar way at the user level (Kaâniche, Kanoun, & Martinello, 2003; Martinello, 2005). The next section illustrates the main concepts of the multi-level modeling framework using a travel agency application as an example. The objectives are: i) to show how to apply the proposed framework, based on the decomposition of the target application according to the four levels; and ii) to present typical performability modeling and analysis results that can be used to support design decisions.
PERFORMABILITY OF A WEB-BASED TRAVEL AGENCY

As an example, we consider a virtual travel agency (TA) for booking trips over the web. The TA interacts through dedicated interfaces with several flight reservation systems (AF, TAM, ...), hotel reservation systems (Holiday Inn, Sofitel, ...), and car rental systems (Avis, Hertz, ...). The TA application can be seen as composed of two basic components (Periorellis & Dobson, 2001): the client side and the server side. The client side handles user inputs, performs the necessary checks, and forwards the data to the server-side component. The latter is the main component of the TA. It is designed to respond to a number of calls from the client side concerning, e.g., availability checking, booking, payment, and cancellation of each item of a trip. It handles all transactions to and from the booking systems, composes items into full trips, converts incoming data into a common data structure, and handles all exceptions. Starting from this very high-level description, we will further detail it according to the various aspects required for the hierarchical description. We will first focus on the function and user levels together, then the service and function levels, before addressing the resource level.
Function and User Level Modeling

The behavior of the users accessing the TA web site is characterized by the operational profile example presented in Figure 2. The graph is the basis for capturing the navigational pattern of a group of clients, as seen from the server side. The nodes "Start" and "Exit" represent the start and the end of a user visit to the TA web site, and the other nodes identify the functions requested during the visit. For illustration purposes, we have considered five functions for the TA example:

• Home: invoked when a user accesses the TA home page.
• Browse: the customer navigates through the links available at the TA site to view any page of the site (promotions, help, frequent queries pages, etc.).
• Search: the TA checks the availability of trip offers according to the user request. A user request can be composed of a flight, a hotel, and a car reservation. The TA converts the user request into transactions to the hotel, flight, and car reservation systems and returns the results to the user.
• Book: the customer chooses the trip and confirms his reservation.
• Pay: the customer is ready to pay for the trips booked on the TA site.

Figure 2. User operational profile
The transitions among the nodes and the associated probabilities pij describe how the users interact with the TA web site. Different operational profiles with different probabilities pij can be defined to analyze different user profiles: heavy buyers, occasional buyers, etc. The set of probabilities pij can be obtained by collecting data on the web site (see, e.g., Zaiane, Xin, & Han, 1998). As illustrated in Figure 2, all possible user execution scenarios (or scenarios, for short) when visiting the TA web site can be derived from the operational profile. Each scenario is defined by the set of functions invoked and the probability of
activation of each function in the corresponding scenario. The "Start" and "Exit" nodes denote the beginning and the end of a user scenario. The identification of the most frequently activated scenarios gives useful insights into the most significant scenarios to be considered when evaluating the user-perceived performability. Indeed, the higher the activation probability of a given scenario, the higher its impact on the performability perceived at the user level. Such a measure is affected by the performability of the functions, services, and resources involved in this scenario.
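The derivation of scenario probabilities from the operational profile can be sketched as follows. The graph and the transition probabilities pij below are hypothetical placeholders for values that would be estimated from web-site logs; paths through cycles are truncated once their probability falls below a threshold:

```python
# Hedged sketch: deriving user-scenario activation probabilities from
# an operational-profile graph. All transition probabilities p_ij here
# are hypothetical placeholders.
profile = {
    "Start":  {"Home": 0.7, "Browse": 0.3},
    "Home":   {"Browse": 0.3, "Search": 0.4, "Exit": 0.3},
    "Browse": {"Home": 0.1, "Search": 0.5, "Exit": 0.4},
    "Search": {"Book": 0.4, "Exit": 0.6},
    "Book":   {"Pay": 0.6, "Search": 0.2, "Exit": 0.2},
    "Pay":    {"Exit": 1.0},
}

def scenarios(node="Start", prob=1.0, path=("Start",), cutoff=1e-4):
    """Enumerate paths Start -> Exit with their probabilities; cycles
    are truncated once a path's probability drops below `cutoff`."""
    if node == "Exit":
        yield path, prob
        return
    if prob < cutoff:
        return
    for nxt, p in profile[node].items():
        yield from scenarios(nxt, prob * p, path + (nxt,), cutoff)

# Print the five most likely scenarios.
for path, p in sorted(scenarios(), key=lambda x: -x[1])[:5]:
    print("-".join(path), round(p, 4))
```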
Function and Service Level Modeling

The service level identifies the set of servers involved in the execution of each function and describes their interactions. This analysis requires a deep understanding of the business logic and of the technical solutions implemented by the TA system provider. For the sake of illustration, Figure 3-a gives a simplified example of the mapping between the functions provided at the TA site, the internal servers directly controlled by the TA system provider, and the external servers operated and controlled by external suppliers.
Figure 3. Function level modeling
The external services correspond to the flight, hotel, and car reservation services, and to the payment service. The internal services are supported by three types of servers: i) web servers that receive user requests and send back the requested data; ii) application servers that implement the main operations needed to process user requests; and iii) database servers handling data-related operations (storing and retrieving information about flight, hotel, and car reservation companies, as well as information on users' orders). The "Home" function involves the web server only. For the other functions, however, several servers are required. It is thus necessary to analyze, for each function, the interactions among the servers involved and all possible execution scenarios (referred to as function scenarios). Several graphical notations and formalisms (e.g., Unified Modeling Language (UML), Message Sequence Charts (MSC)) can be used not only to describe, for each function, the dynamic interactions between the servers involved in its accomplishment, but also to identify all possible execution scenarios of the function. As an example, the graphical representations given in Figure 3-b for the Browse function and in Figure 3-c for the Search function are based on the Interaction Diagram formalism defined in (Menascé & Almeida, 2000). The "Begin" and "End" nodes identify the beginning and the end of each function execution. Each path from the "Begin" node to the "End" node identifies one possible function scenario. The probability of activation of each scenario can be evaluated by taking into account the probabilities qij associated with the transitions involved in the corresponding scenario. Note that the probability of activation of non-labeled transitions is one. Considering, for example, the Browse function, we identify the following three scenarios:

• 1→2→3: The user sends a request to the web server (node 2). The requested data is available in the local cache and is returned to the user (node 3).
• 1→2→4→5→6: The web server accepts the user request and sends it to the application server (node 4), since the requested data is not available in the local cache. The application server processes the request and returns a dynamically generated page to the web server (node 5), which then forwards it to the user (node 6). The database is not involved in this case.
• 1→2→4→7→8→9→10: The application server requires some specific information that is on the TA database server (node 7). After the database server has answered the application server, the latter processes the user request (node 8) and sends the results to the web server (node 9), which generates an HTML page incorporating the corresponding outputs (node 10).
Considering the Search function, its interaction diagram in Figure 3-c is decomposed into 9 stages. The input data provided in the search request issued by the user (node 1) are first processed by the web server WS (node 2). WS performs the necessary checks and then breaks down the user request into three individual requests corresponding to each aspect of the trip. If the data is correct and in the right format, it is forwarded to the application server AS (node 4); otherwise an exception is sent to the user (node 3). AS uses the request information to formulate a query and asks the database server (node 5) for the list of booking systems to be contacted. Based on the answer received, AS sends a query (node 6) to the selected systems (identified by the Flight, Hotel, and Car nodes). The AND operator means that the request is submitted to the three types of booking systems (nodes 7.a, 7.b, 7.c). The answers returned to AS are formatted (node 8) and sent to WS (node 9), which forwards them to the user (node 10). The number of flight, hotel, and car reservation systems is not indicated in this figure. We assume that the TA always interacts with the same systems. A transaction is successful when, for each service (flight, hotel, and car reservation), at least one system responds.
Service and Resource Level Modeling

The various services are mapped onto the resources involved in their accomplishment. Therefore, we need to take into account the real hardware and software organization of the system. Figure 4 presents a general architecture that can be used to implement the TA. We assume that the external services (flight, hotel, and car reservation) are provided by NF, NH, and NC independent service providers, respectively. A similar assumption is made for the payment service. As the architecture of each of these service providers is not known to the TA provider, we consider each of them as a black box. Concerning the internal services, the system architecture is usually known by the TA provider, and several variants are possible. In particular, different organizations of the servers on the hardware support (e.g., dedicated hosts for each server, or multiple servers on the same host) and different fault tolerance strategies (non-redundant or replicated servers) can be considered. Dispatchers are generally needed to balance the load among the servers. Redundant servers can be either located at one site, with several local area networks (LANs) interconnecting them, or geographically distributed across distinct sites. Also, fault tolerance can be applied to provide redundant accesses to the Internet or redundant communication links between internal resources. Additionally, several maintenance strategies can be adopted by the TA provider (e.g., immediate or deferred maintenance, dedicated or shared repair resources). Section 5 will present examples of web server architectures of the TA provider. In the rest of this section, we will only illustrate how the quantitative measures derived from detailed models (such as those presented in Section 5) can be combined to assess the performability at the service, function, and user levels.
Figure 4. Architecture
Performability Evaluation

Based on simplified assumptions, this section illustrates how the availability measures derived from the performability modeling at the resource level of our multi-level modeling framework can be composed to evaluate availability measures in the context of the TA example. In this section, we assume that the availability A(WS) of the service delivered by the web server architecture presented in Figure 4 is known, and that it has been evaluated based on models similar to those in Section 5.
Service Level Availability

External services. Each external system is modeled as a black box that is assumed to fail independently of all the others. Let us consider the following notations:

• AF(i), AH(j), and AC(k): availabilities of a flight, a hotel, and a car reservation system (i = 1, …, NF; j = 1, …, NH; k = 1, …, NC);
• APS: availability of the payment system;
• Anet: availability of the TA connectivity to the Internet.
Using the failure independence assumption, and considering that the service is provided as long as at least one reservation system for each item of a trip (flight, hotel, and car reservation) is available, the availability of the external services is given in Table 1. It is worth mentioning that if the TA connectivity to the Internet is unavailable, none of these services is provided. Thus, the availability of the TA connectivity to the Internet, Anet, will be accounted for by multiplying the user-perceived availability equation by Anet.

Internal services. These concern the web, application, and database services, and the communication between these servers, which is achieved by a LAN. As the primary objective of this section is to show the applicability of the proposed approach to the TA example, we make simplistic assumptions to show its feasibility. More realistic assumptions are made for the web service in Section 5. We assume that the application service and the database service are implemented on a duplex architecture with two redundant computer hosts for each service and two mirrored disks. Also, it is assumed that the computer hosts and the disks fail independently of each other. Let us denote by A(CAS) and A(CDS) the availabilities of the computer hosts associated with the application and database servers, respectively. The disk availability is denoted by A(Disk) and the LAN availability by ALAN. The application and database service availabilities are given in Table 1.
Function Level Availability

The availability of each function is based on the availabilities of the services involved in its accomplishment. When several execution scenarios are possible, the availability of each function also relies on the activation probability of each scenario. Table 1 gives the availability of the Home, Browse, Search, Book, and Pay functions based on the availabilities of the external and internal services involved. The parameters qij involved in the availability of the Browse function are associated with the three execution scenarios of this function given in Figure 3-b. Note that all function equations include the product Anet ALAN, meaning that if the TA connectivity to the Internet or the internal communication among the servers is not available, none of the TA functions can be invoked by the users. Also, the Book function has the same availability equation as the Search function. This is due to the assumption that the former uses a subset of the resources used by the latter. Indeed, in our example the Book function is achieved only if the Search function has succeeded. This led us to assume that if the Search function succeeds, the Book function automatically succeeds. Clearly, other situations can be modeled.
Table 1. Service, function, and user level availabilities

Service level (external services):
A(Flight) = 1 − ∏_{i=1}^{NF} [1 − AF(i)]
A(Hotel) = 1 − ∏_{j=1}^{NH} [1 − AH(j)]
A(Car) = 1 − ∏_{k=1}^{NC} [1 − AC(k)]
A(Payment) = A(PS)

Service level (internal services):
Application service: A(AS) = 1 − [1 − A(CAS)]^2
Database service: A(DS) = {1 − [1 − A(CDS)]^2} {1 − [1 − A(Disk)]^2}

Function level:
A(Home) = Anet ALAN A(WS)
A(Browse) = Anet ALAN A(WS) [q23 + A(AS)(q24 q45 + q24 q47 A(DS))]
A(Search) = A(Book) = Anet ALAN A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car)
A(Pay) = Anet ALAN A(WS) A(AS) A(DS) A(PS)

User level:
A(user) = Anet ALAN A(WS) [π1 + (π2 + π3){q23 + A(AS)(q24 q45 + q24 q47 A(DS))} + A(AS) A(DS) A(Flight) A(Hotel) A(Car) {(π4 + π5 + π6 + π7 + π8 + π9) + (π10 + π11 + π12) A(PS)}]
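A minimal sketch of the Table 1 composition is given below, using the failure independence assumptions of this section and the parameter values of Table 2 (with four reservation systems per trip item as an illustrative choice):

```python
# Hedged sketch of the Table 1 equations; parameter values follow
# Table 2, and the number of reservation systems (4) is illustrative.
def parallel(avails):
    """Availability of a set of redundant, independent components:
    available as long as at least one of them is available."""
    prod = 1.0
    for a in avails:
        prod *= (1.0 - a)
    return 1.0 - prod

A_F = parallel([0.9] * 4)            # flight reservation systems, NF = 4
A_H = parallel([0.9] * 4)            # hotel reservation systems, NH = 4
A_C = parallel([0.9] * 4)            # car reservation systems, NC = 4
A_AS = parallel([0.996] * 2)         # duplex application service
A_DS = parallel([0.996] * 2) * parallel([0.9] * 2)  # duplex hosts + mirrored disks
A_net = A_LAN = 0.9966
A_WS, A_PS = 0.999995587, 0.9
q23, q24, q45, q47 = 0.2, 0.8, 0.4, 0.6

A_home   = A_net * A_LAN * A_WS
A_browse = A_net * A_LAN * A_WS * (q23 + A_AS * (q24 * q45 + q24 * q47 * A_DS))
A_search = A_net * A_LAN * A_WS * A_AS * A_DS * A_F * A_H * A_C
A_pay    = A_net * A_LAN * A_WS * A_AS * A_DS * A_PS
print(A_home, A_browse, A_search, A_pay)
```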
User Level Availability

The user-perceived availability can be obtained by evaluating the availability of each user execution scenario derived from the operational profile. If several functions appear in a given scenario, there may be dependencies among the functions due to shared services. A careful analysis of these dependencies is needed to evaluate the availability measure associated with the considered user scenario. Based on the availability of the functions involved in each scenario presented in Figure 2, and on the activation probabilities πi of all user scenarios, the user-perceived availability is given in Table 1. It can be seen that the availabilities of the LAN, the Internet connectivity, and the web service are the most influential ones (i.e., their impact is of first order, while the others are of at least second order). This is due to the fact that all requests (i.e., all scenarios) use these three services.
NUMERICAL RESULTS

For the sake of illustration, Table 2 shows two examples of user profiles, A and B, defined by the probabilities associated with the user execution scenarios presented in Figure 2. The notation {Function1-Function2}* means that these functions are activated more than once in the corresponding scenario, due to the presence of cycles in the graph. The scenarios are grouped into four categories according to the activated functions:

• SC1 gathers all scenarios involving only the Home or Browse functions.
• SC2 gathers all scenarios involving the Search function, but not the Book and Pay functions.
• SC3 gathers all scenarios involving the Book function.
• SC4 gathers all scenarios that reach the Pay function.

Their probabilities are obtained by summing the probabilities associated with the corresponding scenarios. In profile A, a high proportion of users are mainly seeking information without a buying intention, whereas profile B is characterized by a higher proportion of users really seeking to book a trip. Indeed, the percentage of transactions that end up with the payment of a trip is around 20% for user profile B, while it is almost 3 times lower for profile A. Moreover, for profile B, around 80% of user transactions lead to the invocation of the Search, Book, or Pay functions, compared to around 50% for profile A. Such scenarios involve not only the TA system but also the external reservation systems. Therefore, the quality of the service supported by these reservation systems has a significant impact on the user-perceived availability.
Each table of Figure 5 gives the availability as perceived by users with profiles A and B, based on the equations of Table 1 and the parameter values given in Table 2. Figure 5 shows that the user-perceived availability increases significantly when the number of reservation systems increases from 1 to 4, and then stabilizes. The availability variation rate is directly related to the availability assigned to each reservation system. These results show that different operational profiles might lead to significant differences in the availability perceived by the users. For instance, for NF = NH = NC ≥ 5, the user-perceived unavailability is about 173 hours per year for profile A and 190 hours for profile B. Figure 5 also gives the user-perceived unavailability according to the four categories of user scenarios, where UA denotes the unavailability perceived by the users and UA(SCi) denotes the contribution of scenario category SCi to UA. It can be seen that the unavailability caused by SC4, which ends up with a trip payment, is higher for profile B than for profile A (43 hours of downtime per year for profile B compared to 16 hours for profile A, when considering the steady values). Therefore, the impact in terms of loss of revenue for the TA provider will be higher. Indeed, if the user transaction rate is 100 per second, the total number of transactions ending up with a payment that are lost is 5.7 million for profile A and 15.5 million for profile B. Assuming that the average revenue generated by each transaction is $100, the loss of revenue amounts to $570 million and $1.55 billion, respectively. This result clearly shows that it is important to have a faithful estimation of the user operational profile to obtain realistic predictions of the impact of failures from the economic and business viewpoints.
Table 2. Model parameters

User scenario                          πi (%), Profile A    πi (%), Profile B
1: St-Ho-Ex                                 10.0                 10.0
2: St-Br-Ex                                 26.7                  6.6
3: St-{Ho-Br}*-Ex                           11.3                  4.2
4: St-Ho-Se-Ex                              18.4                 13.9
5: St-Br-Se-Ex                              12.2                 20.4
6: St-{Ho-Br}*-Se-Ex                         7.6                  9.7
7: St-Ho-{Se-Bo}*-Ex                         3.0                  4.7
8: St-Br-{Se-Bo}*-Ex                         2.0                  6.9
9: St-{Ho-Br}*-{Se-Bo}*-Ex                   1.3                  3.3
10: St-Ho-{Se-Bo}*-Pa-Ex                     3.6                  6.4
11: St-Br-{Se-Bo}*-Pa-Ex                     2.4                  9.4
12: St-{Ho-Br}*-{Se-Bo}*-Pa-Ex               1.5                  4.5

St: Start; Ho: Home; Br: Browse; Se: Search; Bo: Book; Pa: Pay; Ex: Exit
Parameter values: APS = AF(i) = AH(j) = AC(k) = 0.9; Anet = ALAN = 0.9966; A(CAS) = A(CDS) = 0.996; A(Disk) = 0.9; A(WS) = 0.999995587; q23 = 0.2; q24 = 0.8; q45 = 0.4; q47 = 0.6

Scenario category        Profile A    Profile B
SC1                        47.9%        20.8%
SC2                        38.2%        44.0%
SC3                         6.4%        14.9%
SC4                         7.5%        20.3%

Figure 5. User perceived availability and unavailability
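The user-level equation of Table 1 can be evaluated directly with the Table 2 probabilities, as in the following sketch. This naive evaluation only illustrates how the equation is used; it is not meant to reproduce the exact Figure 5 values.

```python
# Hedged sketch: direct evaluation of the user-level equation of
# Table 1 with the Table 2 parameters.

def user_unavailability(pi, n_sys):
    a_ext = 1 - (1 - 0.9) ** n_sys              # A(Flight) = A(Hotel) = A(Car)
    a_as = 1 - (1 - 0.996) ** 2                 # duplex application service
    a_ds = (1 - (1 - 0.996) ** 2) * (1 - (1 - 0.9) ** 2)  # hosts + mirrored disks
    a_net = a_lan = 0.9966
    a_ws, a_ps = 0.999995587, 0.9
    q23, q24, q45, q47 = 0.2, 0.8, 0.4, 0.6
    browse = q23 + a_as * (q24 * q45 + q24 * q47 * a_ds)
    a_user = a_net * a_lan * a_ws * (
        pi[0] + (pi[1] + pi[2]) * browse
        + a_as * a_ds * a_ext ** 3
        * (sum(pi[3:9]) + sum(pi[9:12]) * a_ps))
    return 1.0 - a_user

profile_a = [.100, .267, .113, .184, .122, .076, .030, .020, .013, .036, .024, .015]
profile_b = [.100, .066, .042, .139, .204, .097, .047, .069, .033, .064, .094, .045]
for name, pi in (("A", profile_a), ("B", profile_b)):
    ua = user_unavailability(pi, n_sys=5)
    print(name, round(ua * 8760, 1), "hours/year")
```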
WEB SERVICE ARCHITECTURE PERFORMABILITY MODELING

This section presents examples of performability models that can be used to evaluate the availability of the web service, denoted by A(WS) in the previous section, taking into account explicitly i) performance-related failures due to server overload (probability of loss) and ii) the computer host hardware and software failures leading to web server failure. We will first consider a very simple architecture based on a single computer, and then a fault-tolerant architecture composed of N redundant computers. For the latter, we will analyze the impact of the efficiency of the fault tolerance mechanisms (through the coverage factor) and then the impact of error detection and error recovery strategies following a web server failure.
Non-Redundant Architecture

The web service relies on a single computer host. We assume that requests may not be serviced due either to a buffer overflow or to a failure of the web server computer. Let us denote by Lb the probability that the web server input buffer (whose size is b) is full when a request is received, and by A(Cws) the steady-state availability of the computer hosting the web service. The availability of the web service, A(WS), is given by:

A(WS) = (1 − L_b) \, A(C_{ws}) \quad (2)
This equation shows the inherent dependence between performance and dependability. Lb is derived from the performance model and depends on the distributions of the request arrival process and the request service process. Let us assume that request arrivals are modeled by a Poisson process with rate γ and that request service times are exponentially distributed with rate ν. The system load is thus ρ = γ/ν. The web server behavior can then be modeled by a classical M/M/1/b queue, for which the probability that an arriving request is lost due to buffer overflow is well known (see, e.g., Gross & Harris, 1985):

L_b = \frac{(1-\rho)\,\rho^{b}}{1-\rho^{b+1}} \quad (3)
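Equations (2) and (3) can be evaluated with a few lines of code; the following sketch uses illustrative parameter values:

```python
# Sketch of equations (2) and (3); parameter values are illustrative.

def loss_mm1b(rho, b):
    """Blocking probability of an M/M/1/b queue (equation (3))."""
    if rho == 1.0:
        return 1.0 / (b + 1)
    return (1.0 - rho) * rho ** b / (1.0 - rho ** (b + 1))

gamma, nu, b = 50.0, 100.0, 10   # arrival rate, service rate, buffer size
a_cws = 0.9995                   # host availability (illustrative)
lb = loss_mm1b(gamma / nu, b)
a_ws = (1.0 - lb) * a_cws        # equation (2)
print(lb, a_ws)
```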
Redundant Architecture: Impact of the Coverage Factor

The architecture is composed of N identical redundant web servers. We assume that all component failures are independent and that the web service is provided as long as at least one of the servers is available. The performance model representing this architecture is assumed to be an M/M/k/b queue, where k is the number of available servers and b is the size of the buffer. For a system with k operational servers, the probability that web requests are lost due to buffer overflow, denoted by Lb(k), is given by (see, e.g., Gross & Harris, 1985):

L_b(k) = \frac{a^{b}}{k!\,k^{b-k}} \left[ \sum_{n=0}^{k-1} \frac{a^{n}}{n!} + \sum_{n=k}^{b} \frac{a^{n}}{k!\,k^{n-k}} \right]^{-1}, \quad a = \gamma/\nu \quad (4)

Note that Lb(1) is given by equation (3). The availability model describes the architecture behavior as resulting from the occurrence of failure and repair processes. It is used to evaluate the steady-state probability associated with the system states k (k being the number of operational servers, as denoted above). The availability model presented in Figure 6 for the purpose of illustration is based on the assumption that each web server runs on a dedicated computer host with a constant failure rate λ. The model assumes shared repair facilities with repair rate μ. From each state k, two transitions are considered:

• After a covered failure (transition with rate kcλ, where c is the coverage factor), the system is automatically reconfigured into an operational state with (k−1) web servers.
• Upon the occurrence of an uncovered failure (transition with rate k(1−c)λ), the system moves to a down state Yk, where a manual reconfiguration is required before moving to operational state (k−1). The reconfiguration times are exponentially distributed with mean 1/β.

Figure 6. Markov model of a redundant architecture with imperfect coverage

Let us denote by πk the steady-state probability of state k (k = 0, 1, …, N), in which k servers are available to the request arrival process. The probabilities πk and πYk can be obtained using traditional techniques for solving Markov models (Ross, 2003). As the states Yk correspond to down states, the unavailability of the web service, UA(WS), is computed as:

UA(WS) = \sum_{k=1}^{N} \pi_k \, L_b(k) + \sum_{k=1}^{N} \pi_{Y_k} + \pi_0 \quad (5)

where Lb(k) is given by equation (4). Figure 7 shows examples of results for UA(WS) versus the number of web servers, N, for two values of the web server failure rate λ (10^−2/h and 10^−4/h). For N = 1, the results correspond to the non-redundant architecture. The parameters used are indicated on the figure (the buffer size b is assumed to be 10). The figure shows that increasing N up to about 4 (depending on the failure and request arrival rates) significantly reduces UA(WS). The trend is reversed for N higher than 4. This is due to the fact that, when the coverage is imperfect, increasing the number of servers also increases the probability of the system being in the states Yk, where the web service is unavailable and a manual reconfiguration action is required. Actually, the probability of loss plays a significant role for small values of N. When N is higher than the
threshold value, the total service rate and the buffer capacity are sufficient to handle the flow of arrivals without rejecting requests. In this case, UA(WS) mainly results from hardware and software failures leading the web server architecture to a down state. Design decisions can be made based on the results of Figure 7. In particular, we can determine the number of servers needed to achieve a given availability requirement, or evaluate the maximum availability that can be obtained when the number of servers is fixed. For example:
• The number of servers needed to satisfy UA(WS) < 10^−5 (i.e., UA(WS) less than 5 min/year) with a failure rate λ = 10^−4/h is at least N = 2 if the request arrival rate is 50/second, and at least N = 4 if the request arrival rate is 100/second. Such a requirement cannot, however, be satisfied with λ = 10^−2/h.
• If the number of web servers were set to 3, then UA(WS) would be about 1 hour/year if the failure rate were in the range 10^−2/h to 10^−4/h and the system load ρ = γ/ν less than 1.
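As a numerical companion to Figure 6 and equation (5), the following sketch builds the generator matrix of the availability model and solves for the steady-state probabilities. The rates are illustrative, and only the down-state probability is printed; the full UA(WS) adds the πk·Lb(k) overflow terms of equation (5):

```python
# Numerical sketch of the Figure 6 model: states k = 0..N (k operational
# servers) plus down states Y_k entered upon uncovered failures.
# Rates are illustrative, per hour.
import numpy as np

N, lam, mu, beta, c = 4, 1e-4, 0.5, 0.25, 0.95

states = list(range(N + 1)) + [f"Y{k}" for k in range(1, N + 1)]
idx = {s: i for i, s in enumerate(states)}
Q = np.zeros((len(states), len(states)))
for k in range(1, N + 1):
    Q[idx[k], idx[k - 1]] = k * c * lam          # covered failure
    Q[idx[k], idx[f"Y{k}"]] = k * (1 - c) * lam  # uncovered failure
    Q[idx[f"Y{k}"], idx[k - 1]] = beta           # manual reconfiguration
for k in range(N):
    Q[idx[k], idx[k + 1]] = mu                   # repair (shared facility)
np.fill_diagonal(Q, -Q.sum(axis=1))

# Steady-state probabilities: solve pi Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(len(states))])
b = np.append(np.zeros(len(states)), 1.0)
pi = np.linalg.lstsq(A, b, rcond=None)[0]

# Down-state probability; per equation (5), the full UA(WS) adds the
# overflow terms sum_k pi_k * Lb(k) from the performance model.
down = pi[idx[0]] + sum(pi[idx[f"Y{k}"]] for k in range(1, N + 1))
print("P(down) =", down)
```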
Figure 7. Web service unavailability
Redundant Architecture: Impact of Error Detection and Recovery

The performability model discussed in this section is aimed at providing insights into the impact of error detection latency and error recovery strategies, following a web server failure, on the web service availability. We consider a refined version of the redundant architecture with N servers, including a dispatcher that distributes the load among the servers according to a round-robin strategy. It is also assumed that each server has an associated buffer with limited capacity; the requests sent to a server whose buffer is full are lost. In addition, the dispatcher runs a monitoring process, based, e.g., on heartbeat messages, in order to detect server failures. The objective is to detect failed servers early and disconnect them from the system. Failures of servers have a direct impact on the performance capabilities of the system. Indeed, when servers fail and are disconnected, the remaining healthy servers have to handle all the original traffic, including the queries previously serviced by the failed servers. This increases the workload on each of the remaining servers, with a potential degradation of the corresponding quality of service. For example, the failure of two servers in a five-server architecture implies a 66% increase in the load that has to be processed by each of the three remaining servers. Clearly, an accurate estimation of the web service availability should take into account such performance degradation related to server failures. Moreover, detailed assumptions need to be stated to describe the consequences of web server failures. For this purpose, two recovery strategies are compared in this section: Non-Client-Transparent (NCT) and Client-Transparent (CT). From the modeling point of view, NCT and CT are defined as follows:

• NCT recovery: all requests in progress, as well as the input requests directed by the dispatcher to the failed server before the failure is detected, are lost;
• CT recovery: all requests in progress and the input requests directed by the load manager to the failed server during the failover latency time¹ are not lost; they are redirected to the non-failed servers.
Let us assume that the times to failure of the servers are exponentially distributed with rate λ and that the detection of failures occurs with rate α. After detection, the system is reconfigured by disconnecting the failed server, which is reintegrated after recovery. The recovery times are assumed to be exponentially distributed with rate μ. Figure 8 shows the Markov model describing the behavior of the system governed by the server failure, detection, and recovery processes. In states k = 1, …, N, the system has k servers available for processing the input traffic. However, requests can be rejected in these states due to overload conditions. The failure of a server in state k leads the system to state Dk with a transition rate kλ. In states Dk, although the server has failed, this failure is not yet perceived by the dispatcher. Accordingly, client requests can still be directed by the dispatcher to the failed server during the failover latency time. Upon detection, the system moves to state k−1 after the disconnection of the failed server, and the restoration of this server is initiated. In this model, it is assumed that no other failure can occur while the system is in a state Dk. This assumption is acceptable because the failover latency times are generally very small compared to the times to failure. The web service unavailability UA(WS) is defined as the steady-state probability that a web request received by the system is not processed successfully. This means that the request is lost: i) upon arriving at the system, due to overload conditions, or ii) while it is being processed or waiting for service, due to the failure of the server handling the request. Let us denote by:
Figure 8. Markov model of a redundant architecture with error detection latency
• L(k): the request loss probability in state k due to overload conditions;
• L(kλ): the request loss probability caused by a transition from state k to state Dk due to a server failure;
• L(Dk): the request loss probability in state Dk during the server failure detection time.
The web service unavailability UA(WS) can be computed as follows:
UA(WS) = \sum_{k=1}^{N} \left[ \pi_k \left( L(k) + L(k\lambda) \right) + \pi_{D_k} \, L(D_k) \right] \quad (6)

where πk and πDk are the steady-state probabilities associated with states k and Dk of Figure 8. Assuming that each server is modeled by an M/M/1/b queue, closed-form expressions for L(k), L(kλ), and L(Dk) have been obtained in (Martinello, 2005) for both the NCT and CT recovery strategies. The traffic received by the system is balanced by the dispatcher among the k operational servers. Two traffic models representing the requests submitted to the web servers are considered: first, a Poisson process capturing the average arrival rate of requests, and secondly, a multi-stage Markov Modulated Poisson Process (MMPP), which is well suited to describe request arrivals with a rate varying according to the period of the day (daily cycles). These models reflect the access patterns observed in different web environments (see, e.g., Iyengar, Squillante, & Zhang, 1999; Muscariello, Mellia, Meo, Marsan, & Cigno, 2005).
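The combination rule of equation (6) is easy to express in code once the loss terms are available. The closed-form expressions of (Martinello, 2005) are not reproduced here, so all numerical values in this sketch are placeholders:

```python
# Sketch of the combination rule of equation (6); the loss values
# below are placeholders, not the closed-form results of
# (Martinello, 2005).

def ua_ws(pi_k, pi_dk, l_k, l_klam, l_dk):
    """pi_k, pi_dk: steady-state probabilities of states k and D_k;
    l_k, l_klam, l_dk: the loss probabilities defined above."""
    return (sum(pi_k[k] * (l_k[k] + l_klam[k]) for k in pi_k)
            + sum(pi_dk[k] * l_dk[k] for k in pi_dk))

print(ua_ws(pi_k={2: 0.980, 1: 0.015},
            pi_dk={2: 0.004, 1: 0.001},
            l_k={2: 1e-4, 1: 5e-3},
            l_klam={2: 1e-6, 1: 1e-6},
            l_dk={2: 0.5, 1: 0.5}))
```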
Figure 9 presents examples that illustrate the impact of the recovery strategy and of the traffic model for the NCT and CT recovery strategies. Figure 9-a plots UA(WS) versus the number of web servers N, for two values of the mean time to detection, MTTD = 1/α (2 and 20 seconds). Each server is assumed to have a mean time to failure MTTF = 1/λ = 10 days. For both strategies, increasing the number of servers from 1 to 8 leads to a significant unavailability decrease. However, for a larger number of servers, we observe different trends:

• For NCT, the trend is reversed from N = 8. This is explained by the fact that the larger the number of servers, the higher the probability of the system being in the states Dk, which increases the probability of request loss during the failure detection time. This behavior is more pronounced for longer MTTD.
• For CT with MTTD = 20 seconds, UA(WS) increases for N = 9 to 40. In this case, the loss probability in the states Dk is only due to overload conditions. This probability decreases as the load submitted to each server decreases, which is why UA(WS) decreases substantially for N > 40, especially for fast failure detection (MTTD = 2 seconds).
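A two-state MMPP of the kind used to capture daily traffic cycles can be simulated as follows (all rates are illustrative, not those used in the experiments reported here):

```python
# Simulation sketch of a two-state MMPP arrival process; the Poisson
# arrival rate is modulated by a continuous-time Markov chain.
import random

def mmpp_arrivals(horizon, rates=(20.0, 80.0), switch=(1 / 3600, 1 / 3600)):
    """Return arrival times over [0, horizon] seconds. `rates` are the
    per-state Poisson arrival rates; `switch` are the rates at which
    the modulating Markov chain changes state."""
    t, state, arrivals = 0.0, 0, []
    while True:
        dt_arrival = random.expovariate(rates[state])
        dt_switch = random.expovariate(switch[state])
        if t + min(dt_arrival, dt_switch) > horizon:
            return arrivals
        if dt_arrival < dt_switch:
            t += dt_arrival
            arrivals.append(t)
        else:
            t += dt_switch
            state = 1 - state

print(len(mmpp_arrivals(600.0)))  # arrivals seen in a 10-minute window
```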
Figure 9. Web service unavailability

Figure 9-a shows that there is a notable difference between the NCT and CT strategies, especially for small MTTD values and a large number of servers. When the number of servers increases, the loss probability due to the overload of the servers decreases significantly, especially when failure detection is fast. However, for NCT, even for small values of MTTD, all the requests that were queued when the server fails, as well as those that are directed to this server before failure detection, are lost. Figure 9-b shows the impact of the traffic model on UA(WS) for different values of the server load ρ = γ/(Nν). The latter corresponds to the initial load of each server when all the servers are available. Considering NCT, similar unavailability results are obtained with the MMPP and Poisson traffic models when the load is less than 0.3. The difference between the two traffic models becomes significant for heavier loads. However, for the CT strategy, MMPP and Poisson may differ even for light loads (ρ < 0.3). These results show the importance of a precise knowledge of the input traffic for accurate performability analysis.
CONCLUSION AND FUTURE RESEARCH ISSUES

This chapter has addressed the problem of evaluating the performability of web-based services, relying on analytical modeling approaches. The developed models include multiple sources of service unavailability, taking into account in particular i) hardware and software failures affecting the servers, and ii) performance-related failures due, e.g., to server overload or the probability of loss. We have presented a hierarchical modeling framework that provides designers with information at distinct layers, allowing them to analyze and quantify the performability of the service delivered to the users. The proposed framework distinguishes four abstraction levels, namely the user, function, service, and resource levels. Such a decomposition enables the designers to better understand how the various components of the web-based application and infrastructure might impact the quality of service delivered to the users from a performability viewpoint. The proposed performability models are designed to lead to the selection of the best composition of web services, that is, the one that maximizes user satisfaction. These models are based on Markov chains and queuing models to evaluate quantitative availability measures. Various sensitivity analyses are carried out considering different assumptions on the architecture, the fault types, the recovery strategies, the user profiles, and the traffic characteristics. It is noteworthy that, although most of the examples are focused on availability evaluation, the multi-level modeling framework can be applied to other performability analyses.
Several directions can be explored in future research to extend the contributions presented in this chapter towards the performability modeling and evaluation of web-based services. First, it is important to apply our hierarchical multi-level modeling to a more complex case study with detailed information describing the web-based application. In addition, it would be useful to complement the presented models with measurement-based analyses in order to validate some of the assumptions and model parameters. A challenging example would be to model and evaluate the dependability and quality of service that can be achieved by, e.g., the cloud computing paradigm. Indeed, computing clouds exhibit significant challenges that include the growing and evolving complexity of applications, non-stationary workloads, virtualized and consolidated environments, and dynamic service composition, together with the need to provide predictable performance, continuous availability, and end-to-end dependability. These challenges require the development of efficient dependability and performability assessment techniques combining analytical models, simulation, and experimental evaluation. Our multi-level modeling approach paves the way for evaluating the performability of such systems. Finally, our offline evaluation approach could be adjusted to dynamically assess, at runtime, the performability of evolving systems for which online evolution may result from changes in the operational environment, failure and degradation conditions, or user requirements (i.e., online assessment).
REFERENCES

Avizienis, A., Laprie, J. C., & Randell, B. (2004). Dependability and its threats: A taxonomy. IFIP World Computer Congress (WCC-2004), Toulouse, France (pp. 91-120).
Ben Lakhal, N., Kobayashi, T., & Yokota, H. (2005). A failure-aware model for estimating and analyzing the efficiency of web services compositions. 11th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2005), Changsha, Hunan, China (pp. 114-124).

Bruneo, D., Distefano, S., Longo, F., & Scarpa, M. (2010). QoS assessment of WS-BPEL processes through non-Markovian Stochastic Petri Nets. 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), Atlanta, USA (pp. 1-12).

De Souza e Silva, E., & Gail, H. R. (1992). Performability analysis of computer systems: From model specification to solution. Performance Evaluation, 14(3-4), 157–196.

Gönczy, L., Chiaradonna, S., Di Giandomenico, F., Pataricza, A., Bondavalli, A., & Bartha, T. (2006). Dependability evaluation of web service-based processes. LNCS, Formal Methods and Stochastic Models for Performance Evaluation (pp. 166-180).

Gross, D., & Harris, C. M. (1985). Fundamentals of queueing theory. New York: John Wiley.

Iyengar, A. K., Squillante, M. S., & Zhang, L. (1999). Analysis and characterization of large-scale web server access patterns and performance. IEEE World Wide Web Conference (pp. 85-100).

Kaâniche, M., Kanoun, K., & Martinello, M. (2003). User-perceived availability of a web based travel agency. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2003), San Francisco, CA, USA (pp. 709-718).

Kaâniche, M., Kanoun, K., & Rabah, M. (2001). A framework for modeling the availability of e-business systems. International Conference on Computer Communications and Networks (ICCCN) (pp. 40-45).
Kaâniche, M., Kanoun, K., & Rabah, M. (2003). Multi-level modeling approach for the availability assessment of e-business applications. Software, Practice & Experience, 33, 1323–1341. doi:10.1002/spe.550

Kalyanakrishnan, M., Iyer, R. K., & Patel, J. U. (1999). Reliability of Internet hosts: A case study from the end user's perspective. Computer Networks, 31, 47–57. doi:10.1016/S0169-7552(98)00229-3

Laprie, J. C. (1995). Dependable computing: Concepts, limits and challenges. 25th IEEE International Symposium on Fault-Tolerant Computing (FTCS-25) – Special Issue, Pasadena, CA, USA (pp. 42-54).

Martinello, M. (2005). Availability modeling and evaluation of web-based services: A pragmatic approach. PhD thesis, Institut National Polytechnique de Toulouse, France.

Menascé, D. A., & Almeida, V. A. F. (2000). Scaling for e-business: Technologies, models, performance, and capacity planning. Prentice Hall PTR.

Meyer, J. F. (1980). On evaluating the performability of degradable computer systems. IEEE Transactions on Computers, 29(8), 720–731. doi:10.1109/TC.1980.1675654

Muscariello, L., Mellia, M., Meo, M., Marsan, A., & Cigno, R. (2005). Markov models of Internet traffic and a new hierarchical MMPP model. Computer Communications, 28, 1835–1851. doi:10.1016/j.comcom.2005.02.012

Oppenheimer, D., Ganapathi, A., & Patterson, D. A. (2003). Why do Internet services fail, and what can be done about it? USITS'03: Proceedings of the 4th Conference on USENIX Symposium on Internet Technologies and Systems.
Papazoglou, M. P., & Traverso, P. (2007). Service-oriented computing: State of the art and research challenges. IEEE Computer, 40, 38–45.

Paxson, V. (1997). Measurements and analysis of end-to-end Internet dynamics. PhD thesis, University of California.

Periorellis, P., & Dobson, J. (2001). The travel agency case study. DSoS Project IST-1999-11585.

Rosenberg, F., Platzer, C., & Dustdar, S. (2006). Bootstrapping performance and dependability attributes of web services. IEEE International Conference on Web Services (ICWS '06) (pp. 205-212).

Ross, S. M. (2003). Introduction to probability models (9th ed.). Elsevier.

Trivedi, K., Muppala, J., Woolet, S., & Haverkort, B. (1992). Composite performance and dependability analysis. Performance Evaluation, 14, 197–215. doi:10.1016/0166-5316(92)90004-Z

Zaiane, O. R., Xin, M., & Han, J. (1998). Discovering web access patterns and trends by applying OLAP and data mining technology on web logs. Advances in Digital Libraries Conference (pp. 19-29).

Zheng, Z., Zhang, Y., & Lyu, M. R. (2010). Distributed QoS evaluation for real-world web services. 8th International Conference on Web Services (ICWS 2010), Miami, Florida, USA.
ENDNOTE

1. The failover latency corresponds to the time taken by the dispatcher to detect the failure and remove the failed server from the list of alive servers.
Chapter 12
Measuring and Dealing with the Uncertainty of SOA Solutions

Yuhui Chen, University of Oxford, UK
Anatoliy Gorbenko, National Aerospace University, Ukraine
Vyachaslav Kharchenko, National Aerospace University, Ukraine
Alexander Romanovsky, Newcastle University, UK
ABSTRACT

The chapter investigates the uncertainty of Web Services performance and the instability of their communication medium (the Internet), and shows the influence of these two factors on the overall dependability of SOA. We present our practical experience in benchmarking and measuring the behaviour of a number of existing Web Services used in e-science and bio-informatics, provide the results of statistical data analysis, and discuss the probability distribution of the delays contributing to the Web Services response time. The ratio between the delay standard deviation and its average value is introduced to measure the performance uncertainty of a Web Service. Finally, we present the results of error and fault injection into Web Services. We summarise our experiments with the SOA-specific exception handling features provided by two web service development kits and analyse exception propagation and performance as the major factors affecting fault tolerance (in particular, error handling and fault diagnosis) in Web Services.
INTRODUCTION

The paradigm of Service-Oriented Architecture (SOA) is a further step in the evolution of the well-known component-based system development with off-the-shelf components. SOA and Web Services (WSs) were introduced to ensure the effective interaction of complex distributed applications. They are now evolving within critical infrastructures (e.g., air traffic control systems), holding various business systems and services together (for example, banking, e-health, etc.).
DOI: 10.4018/978-1-60960-794-4.ch012
Their ability to compose and implement business workflows provides crucial support for developing globally distributed large-scale computing systems, which are becoming integral to society and the economy. Unlike common software applications, however, Web Services work in an unstable environment as part of globally distributed and loosely coupled SOAs, communicating with a number of other services deployed by third parties (e.g., in different administration domains), typically with unknown dependability characteristics. When complex service-oriented systems are dynamically built, or when their components are dynamically replaced by new ones with the same (or similar) functionality but unknown dependability and performance characteristics, ensuring and assessing their dependability becomes genuinely complicated. It is this fact that is the main motivation for this work. By their very nature, Web Services are black boxes, as neither their source code, nor their complete specification, nor information about their deployment environments are available; the only known information about them is their interfaces. Moreover, their dependability is not completely known and they may not provide sufficient Quality of Service (QoS); it is often safer to treat them as "dirty" boxes, assuming that they always have bugs, do not fit well enough, and have poor specification and documentation. Web Services are heterogeneous, as they might be developed following different standards, fault assumptions, and conventions, and may use different technologies. Finally, Service-Oriented Systems are built as overlay networks over the Internet, and their construction and composition are complicated by the fact that the Internet is a poor communication medium (e.g., its quality is low and unpredictable). Therefore, users cannot be confident of service availability, trustworthiness, reasonable response time, and other dependability characteristics (Avizienis et al., 2004), as these can vary over wide ranges in a very random and unpredictable manner.

In this work we use the general synthetic term uncertainty to refer to the unknown, unstable, unpredictable, changeable characteristics and behaviour of Web Services and SOA, exacerbated by running these services over the Internet. Dealing with such uncertainty, which is in the very nature of SOA, is one of the main challenges that researchers are facing. To become ubiquitous, Service-Oriented Systems should be capable of tolerating faults and potentially harmful events caused by a variety of reasons, including low or varying (decreasing) quality of components (services), shifting characteristics of the network media, component mismatches, permanent or temporary faults of individual services, composition mistakes, service disconnection, and changes in the environment and in policies.

The dependability and QoS of SOA have recently been the aim of significant research effort. A number of studies (Zheng & Lyu, 2009; Maamar, Sheng, & Benslimane, 2008; Fang et al., 2007) have introduced several approaches to incorporating resilience techniques (including voting, backward and forward error recovery mechanisms, and replication techniques) into WS architectures. There has been work on benchmarking and experimental measurements of dependability (Laranjeiro, Vieira, & Madeira, 2007; Duraes, Vieira, & Madeira, 2004; Looker, Munro, & Xu, 2004) as well as dependability and performance evaluation (Zheng, Zhang, & Lyu, 2010). But even though the existing proposals offer useful means for improving SOA dependability by enhancing particular WS technologies, most of them do not address the uncertainty challenge, which exacerbates the lack of dependability and the varying quality.

The uncertainty of Web Services has two main consequences. First, it makes it difficult to assess the dependability and performance of services, and hence to choose between them and gain confidence in their dependability. Secondly, it becomes difficult to apply fault tolerance mechanisms efficiently, as too much of the data necessary to make choices is missing. The purpose of the chapter is to investigate the dependability and uncertainty of SOA and the instability of the communication medium through large-scale benchmarking of a number of existing Web Services. Benchmarking is an essential and very popular approach to measuring web service dependability. Apart from the papers (Laranjeiro et al., 2007; Duraes et al., 2004), we should mention such recent and ongoing European research projects as AMBER (http://www.amber-project.eu/) and WS-Diamond (http://wsdiamond.di.unito.it/). Mostly relying on stress-testing and failure injection techniques, these works analyse service robustness and behaviour in the presence of failures or under stressed load, and compare the effectiveness of the technologies used to implement web services. Hardly any of these studies, however, address the web services instability issue or offer a strong mathematical foundation or proofs, mostly because, we believe, there is no general theory to capture the uncertainties inherent to SOA. In this chapter we present our practical experience in benchmarking and measuring a number of existing WSs used in e-science and bio-informatics (Blast and Fasta, providing an API for bioinformatics and genetic engineering, available at http://xml.nig.ac.jp, and BASIS, the Biology of Ageing E-Science Integration and Simulation System, available at http://www.basis.ncl.ac.uk/WebServices.html). This chapter summarises our recent work in the area; for more information, the readers are referred to (Gorbenko et al., 2007; Gorbenko et al., 2008; Chen et al., 2009). In the first section we describe the experimental techniques used, investigate the performance instability of the Blast, Fasta, and BASIS WSs, and analyse the delays induced by the communication medium. We also show the results of statistical data analysis (i.e., minimal, maximal, and average values of the delays and their standard deviations) and present probability distribution series.
The second section analyses the instability of the different delays that make up the web service response time. In this section we report the latest results of advanced BASIS web service measurements, capable of distinguishing between the network round trip time (RTT) and the request processing time (RPT) on the service side. The section also provides the results of checking hypotheses about the distribution law of the web service response time and its component values RPT and RTT. The uncertainty discovered in web service operations affects the dependability of SOA and requires additional, specific resilience techniques. Exception handling is one of the means widely used for attaining dependability and supporting recovery in SOA applications. The third section presents the results of error and fault injection into web services. We summarise our experiments with the SOA-specific exception handling features provided by two toolkits for developing web services: the Sun Microsystems JAX-RPC and the IBM WebSphere Software Developer Kit. In this section we examine the ability of the built-in exception handling mechanisms to eliminate certain causes of errors, and analyse exception propagation and performance as the major factors affecting fault tolerance (in particular, error handling and fault diagnosis) in web services.
MEASURING DEPENDABILITY AND PERFORMANCE UNCERTAINTY OF SYSTEM BIOLOGY APPLICATIONS

Measuring Uncertainty of Blast and Fasta Web Services

In our experiments we dealt with the DNA Data Bank of Japan (DDBJ), which provides an API for bioinformatics and genetic engineering (Miyazaki & Sugawara, 2000). We benchmarked the Fasta and Blast web services provided by DDBJ, which implement algorithms commonly used in in silico experiments in bioinformatics to search for gene and protein sequences that are similar to a given input query sequence.

Figure 1. Response delay trends: (a) Fasta web service; (b) Blast web service
Experimental Technique

A Java client was developed to invoke the Fasta and Blast WSs at DDBJ over five days, from 04 June 2008 to 08 June 2008. In particular, we invoked the getSupportDatabaseList operation supported by both the Fasta and Blast WSs. The size of the SOAP request for the Fasta and Blast WSs is 616 bytes, whereas the SOAP responses are 2128 and 2171 bytes respectively. The services were invoked simultaneously, using separate threads, every 10 minutes (in total, more than 650 times during the five days). At the same time, the DDBJ server was pinged to assess the network RTT (round trip time) and to take into account the Internet effects on the web service invocation delay. The total number of ICMP Echo requests sent to the DDBJ server was more than 200,000 (one every two seconds).
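For illustration, a minimal sketch of how such a periodic measurement loop can be structured in Java is given below. This is not the original client: the invokeGetSupportDatabaseList wrapper is a hypothetical stand-in for the generated JAX-RPC service stub.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WsBenchmarkClient {

    // Hypothetical wrapper around the generated JAX-RPC stub.
    static void invokeGetSupportDatabaseList(String service) throws Exception {
        // ... call the Fasta or Blast getSupportDatabaseList operation here ...
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
        for (String service : new String[] {"Fasta", "Blast"}) {
            // Invoke both services simultaneously, once every 10 minutes.
            scheduler.scheduleAtFixedRate(() -> {
                long start = System.currentTimeMillis();
                try {
                    invokeGetSupportDatabaseList(service);
                    long delay = System.currentTimeMillis() - start;
                    System.out.println(service + " response time: " + delay + " ms");
                } catch (Exception e) {
                    // Log the exception type and time stamp for later analysis.
                    System.err.println(service + " failed: " + e);
                }
            }, 0, 10, TimeUnit.MINUTES);
        }
    }
}
```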
Performance Trends Analysis

Figure 1 shows the response delays of the Fasta (a) and Blast (b) WSs. In spite of the fact that we invoked similar operations of these two services, with similar sizes of SOAP responses, simultaneously, they had different response times. Moreover, the response time of Blast was more unstable (see the probability distribution series of Fasta (a) and Blast (b) in Figure 2). This difference can be explained by internal reasons (such as a varying level of CPU utilization and memory usage while processing the request, differences in the implementations, etc.). Besides, we noted a period of time, Time_slot_2 (starting on June 05 at 23:23:48 and lasting for 3 hours and 8 minutes), during which the average response time increased significantly for both Fasta and Blast (see Figure 1). Table 1 presents the results of the statistical analysis of the response times of the Fasta and Blast WSs for the stable network route period, Time_slot_1, and for the period when the network route changed, Time_slot_2. The standard deviation of the response time for Fasta is about 16% of its average value, whereas for the Blast web service it equals 27% and 45% for Time_slot_1 and Time_slot_2 respectively. We believe this shows that a significant time uncertainty exists in Service-Oriented Systems. Further investigation of the ping delays confirmed that this was a period during which the network route between the client computer at Newcastle University (UK) and the DDBJ server in Japan changed.
Figure 2. Response time probability distribution series: (a) Fasta web service; (b) Blast web service
Table 1. Response time statistics summary (invocation response time RT, ms)

Time slot        min.    max.    avg.       std. dev.
Fasta WS
  Time slot 1     937    1953     996.91     163.28
  Time slot 2     937    4703    1087.28     171.12
Blast WS
  Time slot 1    1000    1750    1071.17     291.57
  Time slot 2    1015    3453    1265.72     572.70
Moreover, during the third time slot we observed 6 packets lost in 20 minutes. Together with the high RTT deviation, this indicates that significant network congestion occurred.
PINGing Delay Analysis

Through monitoring the network using ICMP Echo requests, we discovered that the overall testing interval can be split into three time slots, each with its own particular characteristics of the communication medium, as shown in Figure 3 and Table 2. Time_slot_1 is a long period of time characterized by a highly stable network condition (see Figure 4-a), with an average round trip time (RTT) of 309.21 ms. This was observed over most of the testing period. According to the TTL parameter returned in the ICMP Echo replies from the DDBJ server, the network route contained 17 intermediate hosts (routers) between the Newcastle University campus LAN and the DDBJ server. Time_slot_2 began on June 05 at 23:23:48 and ended on June 06 at 02:31:30. This was a sufficiently stable period with an average RTT of 332.72 ms (see Figure 4-b). The ratio of the standard deviation of the delay to its average value (referred to as the coefficient of variation), used in this chapter as a measure of uncertainty, was about 1% for this period. The higher average RTT is accounted for by the fact that during this time slot the network route changed: the number of intermediate hosts (routers) grew from 17 to 20. This also affected the average response time of the Fasta and Blast WSs.
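For reference, the summary statistics used throughout this chapter (average, standard deviation and coefficient of variation) can be computed from a series of measured delays as in the following illustrative sketch, which is our own code and not part of the original measurement tooling:

```java
// Computes the average, standard deviation and coefficient of variation
// (cv) of a series of measured delays; the cv is used in this chapter
// as an uncertainty measure. Illustrative sketch only.
public final class DelayStats {

    public static double average(double[] delays) {
        double sum = 0;
        for (double d : delays) sum += d;
        return sum / delays.length;
    }

    public static double stdDev(double[] delays) {
        double avg = average(delays);
        double sq = 0;
        for (double d : delays) sq += (d - avg) * (d - avg);
        return Math.sqrt(sq / (delays.length - 1));
    }

    // cv, in per cent: the ratio of the standard deviation to the average.
    public static double coefficientOfVariation(double[] delays) {
        return 100.0 * stdDev(delays) / average(delays);
    }
}
```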
Figure 3. PINGing time slots
Time_slot_3 is a short period (of about 20 minutes) characterized by high RTT instability, with a higher value of the standard deviation than in time slots 1 and 2 (see Figure 4-c and Table 2 for more details). It was, however, too short to analyse its impact on the Fasta and Blast response times. Packet losses occurred during all of the time slots, on average once every two hours (the total number of losses was 44, 8 of which were double losses). Sometimes the RTT increased significantly over a short period, indicating that transient network congestions occurred periodically throughout the testing period. At the same time, we were surprised by the high stability of the network connection during long periods. We had expected a greater instability of the round trip time due to the use of the Internet as a medium and the long distance between the client and the Web Services. To understand this, the DDBJ server was pinged from the KhAI University LAN (Kharkiv, Ukraine) for another two days. As a result, we observed a significant instability of the RTT (see Figure 4-d). The standard deviation of the RTT is about 16% of its average value. Besides, packet losses occurred, on average, once every 100 ICMP Echo requests. We used the tracert command to discover the reason for such instability and found that the route from Ukraine to Japan includes 26 intermediate hosts and goes through the UK (the host's name is ae-2.ebr2.London1.Level3.net, IP address "4.69.132.133"), but that the main instability took place on the side of the local Internet Service Provider (ISP). The RTT standard deviation for the first five intermediate hosts (all located in Ukraine) was extremely high (about 100% of its average value).
Table 2. PINGing statistics summary (PING round trip time RTT, ms)

Time slot                                              min.   max.   avg.     std. dev.
PINGing from Newcastle University LAN (UK)
  Time slot 1                                           309    422   309.21     1.40
  Time slot 2                                           332    699   332.72     3.48
  Time slot 3                                           309    735   312.94    12.73
PINGing from KhAI University LAN (Kharkiv, Ukraine)
  -                                                     341    994   396.27    62.14
Figure 4. PING probability distribution series of network round trip time: (a) Time_slot_1; (b) Time_slot_2; (c) Time_slot_3; (d) pinging from KhAI University LAN
As a consequence, the standard deviation of the response time for the requests sent to the Fasta and Blast WSs from the KhAI University LAN increased dramatically as compared to those sent from Newcastle University. This is the result of the superposition of the high network instability and the performance uncertainty inherent to the Fasta and especially the Blast WSs.
Measuring Uncertainty of the BASIS System Biology Web Service

In this section we present a set of new experiments conducted with an instance of the System Biology Web Service (BASIS WS) to continue our research on measuring the performance and dependability of Web Services used in e-science experiments. In the study reported in the previous section we found evident performance instability existing in SOA that affects the dependability of web services and their clients.
The Fasta and Blast WSs we experimented with were part of the DNA Data Bank of Japan (Miyazaki & Sugawara, 2000), which was outside our control. Thus, we were unable to capture the exact causes of the performance instability. The main difference between that work and our experiments with the BASIS web service, hosted by the Institute for Ageing and Health (Newcastle University), is the fact that this WS is under our local administration. Thus we are able to look inside its internal architecture and to perform error and time logging for every external request. Moreover, we have used several clients from which the BASIS WS was benchmarked, to give us a more objective view and to allow us to see whether the instability affects all clients in the same way or not. The aims of the work are as follows: (i) to conduct a series of experiments similar to those reported in the previous section but with access to inside information, to obtain a better understanding of the sources of exceptions and performance instability; (ii) to conduct a wider range of experiments by using several clients in different locations over the Internet; (iii) to gain an inside understanding of the bottlenecks of an existing system biology application to help in improving it in the future.
BASIS System Biology Applications

Our experiments were conducted in collaboration with a Systems Biology project called BASIS (Biology of Ageing E-Science Integration and Simulation System) (Kirkwood et al., 2003). The BASIS application is a typical, representative example of the SOA solutions found in e-science and grid computing. As one of the twenty pilot projects funded under the UK e-science initiative for the development of UK grid applications, BASIS, at the Institute for Ageing and Health in Newcastle University, aims at developing web-based services that help the biology-of-ageing research community in the quantitative study of the biology of ageing by integrating data and hypotheses from diverse biological sources. With the association and expertise of the UK National e-Science Centre on building Grid applications, the project has successfully built a system that integrates various components, such as model design, simulators and databases, and exposes their functionalities as Web Services (Institute for Ageing and Health, 2009).

Figure 5. The architecture of the BASIS system

The architecture of the BASIS Web Service (basis1.ncl.ac.uk) is shown in Figure 5. The system is composed of a BASIS Server (2x2.4GHz Xeon CPU, 2GB DDR RAM, 73GB 10,000 rpm U160 SCSI RAID), including a database (PostgreSQL v8.1.3) and the Condor v6.8.0 Grid Computing Engine; a sixteen-computer cluster; an internal 1Gbit network; and a web service interface deployed on the Sun Glassfish v2 Application Server with the JAX-WS + JAXB web service development pack. BASIS offers four main services to the community:

• BASIS Users Service allows users to manage their account;
• BASIS Simulation Service allows users to run simulations from ageing research;
• BASIS SBML Service allows users to create, use and modify SBML models. The Systems Biology Markup Language (SBML) is a machine-readable language, based on XML, for representing models of biochemical reaction networks. SBML can represent metabolic networks, cell-signalling pathways, regulatory networks, and other kinds of systems studied in systems biology;
• BASIS Model Service allows users to manage their models.
The most common BASIS usage scenario is: (i) to upload an SBML simulation model to the BASIS server; (ii) to run the uploaded SBML model with the biological statistics from the BASIS database; (iii) to download the simulation results. The size of the SBML models and simulation results uploaded to and downloaded from the BASIS server can vary over a wide range and can be really huge (up to tens and even hundreds of megabytes). This can be a real problem for remote clients, especially those with low-speed or low-quality Internet connections.
Experimental Technique

To provide a comprehensive assessment we used five clients deployed in different places over the Internet: Frankfurt (Germany), Moscow (Russia), Los Angeles (USA) and two clients in Simferopol (Ukraine) that use different Internet service providers. Figure 6, created by tracing the routes between the clients and the BASIS WS, shows the different numbers of intermediate routers between the BASIS WS and each of the clients. Note that there are parts of the routes common to different clients. Our plan was to perform prolonged WS testing to capture the long-term performance trend and to disclose performance instabilities and possible failures. The GetSMBL method, returning an SBML simulation result of 100 KB, was invoked simultaneously from all clients every 10 minutes during five days starting from Dec 23, 2008 (more than 600 times in total). At the same time, the BASIS SBML Web Service was pinged to assess the network round trip time (RTT) and to take into account the Internet effects on the WS invocation delay. The total number of ICMP Echo requests sent to the BASIS server was more than 10,000.
Figure 6. Internet routes used by different clients
In addition, we traced the network routes between the clients and the web service to find the exact points of network instability. The experiment was run over the Christmas week for the following reasons. The University's internal network activity was minimal during this week. At the same time, the overall Internet activity typically grows during this period, as social networks (e.g. Facebook, YouTube, etc.) and forums experience a sharp growth in use during the holidays (Goad, 2008). A Java-based application called the Web Services Dependability Assessment Tool (WSsDAT), aimed at evaluating the dependability of Web Services (Li, Chen, & Romanovsky, 2006), was used to test the BASIS SBML web service from the remote hosts. The tool supports various methods of dependability testing by acting as a client invoking the Web Services under investigation. It enables users to monitor Web Services by collecting the following characteristics: (i) availability; (ii) performance; (iii) faults and exceptions. During our experimentation we faced several organizational and technical problems. Thus, the test from Los Angeles started 16 hours late, and the Moscow client was suddenly terminated after the first thirty requests and restarted only five days later, when the first step of the experiment had already finished.
Performance Trends Analysis

Figure 7 shows the response time statistics for the different clients. A summary of the statistical data describing the response times, together with the client instability ranks, is presented in Table 3. The average request processing time of the BASIS WS was about 163 ms; thus, the network delay makes the major contribution to the response time. To analyse the performance instability for each particular client we estimated what percentage the standard deviation (std. dev.) of the response time constitutes of its average (avg) value, i.e. the coefficient of variation (cv). The fastest response time (on average) was observed for the client in Frankfurt, whereas the Los Angeles client was the slowest.
Figure 7. BASIS WS response time trends from different user-side perspectives
This situation was easy to predict. However, we also found that the fastest client was not the most stable. Quite the contrary: the most stable response time was observed by the client in Los Angeles, and the most unstable response time by Simferopol_1. From time to time all clients (except for Los Angeles) faced very high delays. Some of these were ten times larger than the average response time and twenty times larger than the minimal values. The clients located in Moscow and Simferopol_1 faced high instability of the response time due to high network instability (as was found from the ping statistics analysis). A deeper analysis of the traceroute statistics revealed the remarkable fact that the network instability (instability of the network delay) happened on the part of the network route that was closer to the particular client than to the web service. Access to the inside information (the server log) and additional network statistics (such as the ping and traceroute logs) allowed us to get a better understanding of the sources of performance instability and exceptions. For example, in Figure 8-a, showing the response time of the Frankfurt client, we can see five time intervals characterized by high response time instability. These were caused by different reasons (see Table 4). During the first and the fourth time intervals all clients were affected by BASIS service overload, due to a high number of external requests and the database backup.
Table 3. BASIS WS response time statistics summary and client instability ranks

                     Response time, ms
Client location      min     max      avg       std. dev.   cv, %   Instability rank   Number of intermediate routers
Frankfurt            317     6090     383.17    71.91       18.77   IV                 11
Moscow               804     65134    1228.38   437.69      35.63   III                13
Simferopol_1         683     125531   1186.74   895.18      75.43   I                  22
Simferopol_2         716     11150    1272.12   634.53      49.88   II                 19
Los Angeles          1087    3663     1316.54   129.79      9.86    V                  22
Figure 8. WS response time for the German client: (a) response time trend; (b) response time probability distribution series
Table 4. Response time instability intervals experienced by the Frankfurt client

Interval   From               To                 Instability cause
1          Dec 23/12:23:59    Dec 23/13:23:59    BASIS Service overload
2          Dec 23/23:03:59    Dec 24/01:44:00    BASIS Service failure and maintenance actions
3          Dec 24/11:34:00    Dec 24/17:44:01    Network delay instability due to network congestion
4          Dec 25/14:24:15    Dec 26/00:14:15    BASIS Database backup
5          Dec 27/02:14:23    Dec 27/07:14:23    Local host overload
The second time interval was the result of BASIS Service maintenance. The BASIS server was restarted several times; as a result, all clients periodically caught exceptions and suffered from response time instability. Response time instability during the third time interval was caused by extremely high network instability that occurred between the second and the third intermediate routers (counting from the Frankfurt client towards the BASIS service). During this interval the network RTT suddenly tripled on average (from 28.3 ms up to 86.7 ms) and showed a large deviation (32.2 ms). The last unstable interval was observed by the Frankfurt client on December 27 (from 02 a.m. to 07 a.m.). In fact, the Frankfurt host is an integration server involved in software development: at the end of the week it runs automatic code merging and unit testing procedures. As a result, the host was overloaded by these local tasks, and our testing client even caught several operating system exceptions ("java.io.IOException: error=24, Too many open files").
Response Time Probability Density Analysis

The probability distribution series of the response time obtained for the different clients are shown in Figures 8-b and 9. As can be seen, all the probability distribution series of the service response time, taken in our experiments from the different clients' perspectives, tend to be described by the Poisson law, whereas the network RTT and the request processing time (RPT) of the BASIS WS match the Exponential distribution well. However, unlike the Poisson and Exponential distributions, all the probability distribution series obtained in our experiments have heavy tails, caused by the 'real world' instability when delays increase suddenly and significantly for reasons that are hard to predict. This finding is in line with (Reinecke, van Moorsel, & Wolter, 2006) and other existing experimental works. Thus, more realistic assumptions and more sophisticated distribution laws are needed to fit the practical data better. It may be the case that the Exponential distribution of RTT and RPT can be replaced with one of the heavy-tailed distributions, such as log-normal, Weibull or Beta. At the same time, the service RT for different clients could be described in a more complex way as a composition of two distributions: RTT (which is unique for each particular client) and RPT (which is unique to the service used and is, hence, the same for all clients with identical priority).
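To make the composition idea concrete, the sketch below draws a response time as the sum of a heavy-tailed RTT sample and a Gamma-distributed RPT sample using Apache Commons Math. The distributions and parameters are illustrative assumptions, not values fitted to our data:

```java
import org.apache.commons.math3.distribution.GammaDistribution;
import org.apache.commons.math3.distribution.LogNormalDistribution;

public class ResponseTimeModel {
    public static void main(String[] args) {
        // Assumed, illustrative parameters: RTT is client-specific and
        // heavy-tailed; RPT is service-specific. Not fitted to the data.
        LogNormalDistribution rtt = new LogNormalDistribution(Math.log(400), 0.3);
        GammaDistribution rpt = new GammaDistribution(2.0, 80.0);

        // RT = RTT + RPT: sample the composed response time distribution.
        for (int i = 0; i < 10; i++) {
            double rt = rtt.sample() + rpt.sample();
            System.out.printf("simulated RT = %.1f ms%n", rt);
        }
    }
}
```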
Figure 9. Probability distribution series of WS response time from different user-side perspectives: (a) Moscow, Russia; (b) Los Angeles, USA; (c) Simferopol_1, Ukraine; (d) Simferopol_2, Ukraine
Errors and Exceptions Analysis

During our experiments, several clients caught different errors and exceptions with different error rates. Most of them (types 1-3 in Table 5) were caused by BASIS Service maintenance, when the BASIS WS, server and database were restarted several times. The first one ('Null SOAP body') resulted in a null-sized response from the web service. It is a true failure that may potentially cause a dangerous situation, as it was not reported as an exception! According to the server-side log, the failures were caused by errors that occurred when the BASIS business logic processing component was trying to connect to the database. As the database was shut down due to an exception, this component failed to handle the connection exception and returned empty results to the client. Apparently, the BASIS WS should be improved to provide better mechanisms for error diagnosis and exception handling. The second exception was caused by a BASIS WS shutdown, whereas the third one was likely the result of a BASIS server shutdown while the BASIS WS was operating. However, we cannot be sure, because 'Null pointer exception' gives too little information for troubleshooting. The reasons for the fourth and fifth exceptions were network problems. It is noteworthy that 'UnknownHostException', caused by the silence of the DNS server, takes about 2 minutes (too long!) to be reported to the client.
Table 5. BASIS WS errors and exceptions statistics

                                                                            Number of exceptions per client
Error/Exception                                                             Germany   Simferopol_1   Simferopol_2
Error: Null SOAP body                                                          4           4              6
Exception: HTTP transport error: java.net.ConnectException:
  Connection refused                                                           2           0              0
Exception: java.lang.NullPointerException                                      3           4              3
Exception: HTTP transport error: java.net.NoRouteToHostException:
  No route to host                                                             0           1              2
Exception: HTTP transport error: java.net.UnknownHostException:
  basis1.ncl.ac.uk                                                             0           0              1
Error rate                                                                     0.015       0.015          0.02
Discussion

The purpose of the first section was to examine the uncertainty inherent to Service-Oriented Systems by benchmarking three bioinformatics web services typically used to perform in silico experiments in systems biology studies. The main finding is that the uncertainty comes from three sources: the Internet, the web service, and the client itself. Network instability, as well as the internal instability of web service throughput, significantly affects the service response time. Occasional transient and long-term Internet congestions and difficult-to-predict network route changes affect the stability of Service-Oriented Systems. Because of network congestions causing packet losses and multiple retransmissions, the response time can sharply increase by an order of magnitude.

Because of the Internet, different clients have their own view of Web Service performance and dependability. Each client has its own unique network route to the web service; however, some parts of the route can be common to several clients or even to all of them (see Figure 6). Thus, the number of clients simultaneously suffering from Internet instability depends on the point where the network congestion or failures happen. More objective data might be obtained by aggregating the clients' experience, for example, in the way proposed in (Zheng et al., 2010), and/or by having internal access to the Web Service operational statistics.

We can also conclude from our practical experience that the instability of the response time depends on the quality of the network connection used rather than on the length of the network route or the number of intermediate routers. The QoS of a particular WS cannot be ensured without guaranteeing network QoS, especially when the Internet is used as a communication medium for a global service-oriented architecture.

During WS invocation different clients caught different numbers of errors and exceptions, but not all of them were caused by service unreliability. In fact, some clients were serviced successfully whereas others, at the same time, faced various problems due to timing errors or network failures. These errors might occur in different system components depending on the relative position in the Internet of a particular user and the web service, and also on the instability points appearing during the execution. As a result, web services might be compromised by client-side or network failures which are not actually related to web service dependability; most of the time, clients are not very interested in their exact cause. Thus, from different client-side perspectives the same web service usually exhibits different availability and reliability characteristics. A possible approach to solving the problem of predicting the reliability of SOA-based applications, given their uncertainty, through user collaboration is proposed in (Zheng, & Lyu, 2010).

Finally, the Exponential distribution that is typically used for network simulation and response time analysis does not represent well such unstable environments as the Internet and SOA. We believe that the SOA community needs a new exploratory theory and more realistic assumptions to predict and simulate the performance and dependability of Service-Oriented Systems by distinguishing the different delays contributing to the WS response time.
INSTABILITY MEASUREMENT OF DELAYS CONTRIBUTING TO WEB SERVICE RESPONSE TIME

This section reports a continuation of our previous work with the BASIS System Biology Applications, aiming to measure the performance and dependability of e-science WSs from the end user's perspective. In the previous investigation we found evident performance instability existing in these SOAs and affecting the dependability of both the WSs and their clients. However, we were unable to capture the exact causes and shapes of the performance instability. In this section we focus on distinguishing between the different delays contributing to the overall Web Service response time. Besides, new theoretical results are presented at the end of this section, where we rigorously investigate the real distribution laws describing the response time instability in SOA.
Experimental Technique

Basically, we used the same experimental technique described in the previous section. The BASIS WS, returning an SBML simulation result, was invoked by the client software placed in five different locations (Frankfurt, Moscow, Los Angeles and two in Simferopol) every 10 minutes during eighteen days starting from April 11, 2009 (more than 2500 times in total). Simultaneously, we traced the network route (by sending ICMP Echo requests) between the client and the BASIS SBML web service to understand how the Internet latency affects the WS invocation delay and to find, where possible, the exact points of network instability. At the same time, there are significant differences between the measurement techniques presented in the previous section and the work reported in this section. The main one is that in our new experiments we measure four time stamps, T1, T2, T3 and T4 (see Figure 10), for each request instead of only T1 and T4 (as was done in the previous experiments). This became possible because during our new experiments we had internal access to the BASIS WS and were able to install monitoring software directly into the BASIS WS to capture the exact times when a user's request arrives at BASIS and when it returns the response. This allowed us to separate the two main delays contributing to the WS response time (RT): the request processing time (RPT) by the web service and the network (Internet) round trip time (RTT), i.e. RT = RPT + RTT. Besides, we investigated how the performance of the BASIS WS and its instability changed during the 3 months since our previous large-scale experiment, to check the hypothesis that such measurements, once made, stay true. Finally, when we set up these new experiments we wanted to know whether there is a way to predict and represent the performance uncertainty in SOA by employing one of the theoretical distributions used to describe random variables like the web service response time. A motivation for this is the fact, shown by many studies (e.g., Reinecke et al., 2006), that the Exponential distribution does not represent well the accidental delays in the Internet and SOA. After processing the statistics for all the clients located in different places over the Internet, we found the same uncertainty tendencies. Thus, in this section we report the results obtained only for the Frankfurt client.
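Assuming, as in Figure 10, that T1 and T4 are taken on the client side (just before sending the request and just after receiving the response) and T2 and T3 on the service side (request arrival and response departure), the decomposition can be expressed as in the following sketch. Note that only same-side time differences are used, so clock skew between client and service cancels out; this is our own illustration, not the original monitoring software:

```java
// Decomposing the measured response time, assuming the time-stamp
// semantics of Figure 10: T1/T4 are client-side, T2/T3 service-side.
public final class DelayDecomposition {

    // Request processing time on the service side: RPT = T3 - T2.
    public static long rpt(long t2, long t3) {
        return t3 - t2;
    }

    // Network round trip time: the total client-observed delay minus
    // the time spent processing the request on the service side.
    public static long rtt(long t1, long t2, long t3, long t4) {
        return (t4 - t1) - (t3 - t2);
    }

    // Total response time observed by the client: RT = RPT + RTT = T4 - T1.
    public static long rt(long t1, long t4) {
        return t4 - t1;
    }
}
```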
Figure 10. Performance measurement
Performance Trends Analysis

The performance trends and probability density series of RPT, RTT and RT captured during the eighteen days are shown in Figure 11. It can be seen that RTT and especially RPT exhibit significant instability, which together contribute to the instability of the total response time RT. Sometimes delays were twenty times (and even more) longer than their average values (see Table 6). In brackets we give estimates of the maximal and average values of RPT, RTT and RT and of their standard deviations, obtained after discarding the ten most extreme delay values. The ratio between the standard deviation of a delay and its average value is used as the uncertainty measure. Compared with our experiments three months earlier, we observed a significant increase in the average response time recorded by the Frankfurt client (889.7 ms instead of 502.15 ms; see Table 6). In addition, the uncertainty of BASIS performance from the client-side perspective increased several times over (94.1% instead of 18.77%). The network route between the BASIS WS and the Frankfurt client also changed significantly (18 intermediate routers instead of 11). In the current work we set the number of bars in the histogram representing the probability density series (see Figure 11) equal to the square root of the number of elements in the experimental data, which is similar to the Matlab histfit(x) function. This allowed us to find new interesting properties. In particular, we could see that about 5% of the RPT, RTT and RT values are significantly larger than their average values. It is also clear that the probability distribution series of RTT has two extreme points. Besides, more than five per cent of the RTT values are 80 ms (about 1/5) lower than the average one. Tracing the routes between the client and the service allowed us to conclude that these fast responses were caused by a shortening of the network routes. This would be very unusual for RPT but is typical for the Internet. Finally, this peculiarity of RTT causes the observable left tail in the RT probability distribution series and makes it difficult to find a theoretical distribution representing RTT. As for availability, we should mention that the BASIS WS was unavailable for four hours (starting from 19:00 on April 11) because of network rerouting. Besides, the WS twice reported an exception instead of returning the normal results.
Retrieval of Real Distribution Laws of Web Service Delays

Hypothesis Checking Technique

In this section we provide the results of checking hypotheses about the distribution law of the web service response time (RT) and its component values RPT and RTT.
Figure 11. Performance trends and probability density series: RPT, RTT and RT
Table 6. Performance statistics: RPT, RTT, RT (values in brackets were obtained after discarding the ten most extreme delays)

           Min, ms   Max, ms               Avg, ms            Std. Dev.          Cv, %
RPT        287.0     241106.0 (8182.0)     657.7 (497.6)      4988.0 (773.5)     758.4 (155.4)
RTT        210.0     19445.0 (1479.0)      405.8 (378.2)      621.1 (49.2)       153.1 (13.0)
RT         616.0     241492.0 (11224.0)    1061.5 (889.7)     5031.0 (837.4)     474.1 (94.1)
Ping RTT   26.4      346.9 (50.4)          32.0 (31.9)        3.6 (0.9)          11.3 (2.8)
In our work we used the Matlab numeric computing environment (www.mathworks.com) and its Statistics Toolbox, a collection of tools supporting a wide range of general statistical functions, from random number generation to curve fitting. The technique of hypothesis checking consists of two basic procedures.
First, the values of the distribution parameters are estimated by analysing the experimental samples. Second, the null hypothesis that the experimental data follow a particular distribution with those parameters is checked. To perform the hypothesis check itself we used the kstest function: [h, p] = kstest(x, cdf) performs a Kolmogorov-Smirnov test comparing the distribution of x to the hypothesized distribution defined by the matrix cdf. The null hypothesis for the Kolmogorov-Smirnov test is that x has the distribution defined by cdf; the alternative hypothesis is that x does not have that distribution. The result h equals "1" if we can reject the hypothesis, or "0" if we cannot reject it. The function also returns the p-value, which measures how strongly the data x contradict the null hypothesis. We reject the hypothesis if the test is significant at the 5% level (i.e., if the p-value is less than 0.05).
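The same kind of check can also be run outside Matlab. The sketch below uses the Kolmogorov-Smirnov test from Apache Commons Math against a hypothesized Gamma distribution; the sample values and Gamma parameters are made up for illustration, and in practice the parameters would first be estimated from the experimental sample:

```java
import org.apache.commons.math3.distribution.GammaDistribution;
import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest;

public class GoodnessOfFit {
    public static void main(String[] args) {
        // Illustrative sample of measured delays, in ms (made-up values).
        double[] delays = {612, 498, 530, 701, 455, 520, 610, 489, 543, 575};

        // Hypothesized Gamma distribution; the shape and scale used here
        // are assumptions, not parameters fitted to the chapter's data.
        GammaDistribution gamma = new GammaDistribution(25.0, 22.0);

        KolmogorovSmirnovTest ks = new KolmogorovSmirnovTest();
        double p = ks.kolmogorovSmirnovTest(gamma, delays);

        // Reject the null hypothesis at the 5% significance level.
        System.out.println("p-value = " + p + ", reject = " + (p < 0.05));
    }
}
```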
Goodness-of-Fit Analysis

In our experimental work we checked six hypotheses, namely that the experimental data conform to the Exponential, Gamma, Beta, Normal, Weibull or Poisson distribution. These checks were performed for the request processing time (RPT), the round trip time (RTT) and the response time (RT) as a whole. Our main finding is that none of the distributions fits the whole body of performance statistics gathered during the 18 days. Moreover, the more experimental data we used, the worse the approximation provided by all the distributions! This means that, in the general case, the instability existing in a Service-Oriented Architecture cannot be predicted and described by an analytic formula.

Further work focused on finding a distribution law that fits the experimental data within limited time intervals. We chose two short time intervals, with the most stable (from 0:24:28 of April 12 until 1:17:50 of April 14) and the least stable (from 8:31:20 until 22:51:36 of April 23) response time. The first time interval includes 293 request samples. The results of hypothesis checking for RPT, RTT and RT are given in Table 7, Table 8 and Table 9 respectively. The p-value returned by the kstest function was used to estimate the goodness-of-fit of each hypothesis. As can be seen, the Beta, Weibull and especially the Gamma (Equation 1) distributions fit the experimental data better than the others. Besides, RPT is approximated by these distributions better than RT and RTT.

y = f(x \mid a, b) = \frac{1}{b^{a}\,\Gamma(a)}\, x^{a-1} e^{-x/b}    (1)
Typically, the Gamma probability density function (PDF) is useful in reliability models of lifetimes. This distribution is more flexible than the Exponential one, which is a special case of the Gamma distribution (when a=1). It is remarkable that the Exponential distribution describes our experimental data worst of all. However, a close approximation, even when using the Gamma distribution, can be achieved only within a limited sample interval (25 samples in our case). Moreover, RTT (and sometimes RT) can hardly be approximated even at such limited sample lengths. For the second time interval all six hypotheses failed because of low p-values (below the 5% significance level). Thus, we can state that the deviation of the experimental data significantly affects the goodness of fit. However, we should also mention that here, too, the Gamma distribution gave a better approximation than the other five distributions.
Discussion

In these experiments the major uncertainty came from the BASIS WS itself, whereas in the experiments conducted three months before (during the Christmas week) the Internet was most likely the main cause of the uncertainty.
Table 7. RPT goodness-of-fit approximation (p-values)

Number of requests   Exp.          Gam.         Norm.        Beta         Weib.        Poiss.
293 (all)            7.8 10^-100   1.1 10^-06   9.5 10^-63   9.3 10^-25   2.3 10^-11   4.9 10^-66
First half           1.1 10^-99    0.0468       1.2 10^-62   0.0222       0.00023      1.1 10^-65
Second half          1.3 10^-47    0.2554       5.1 10^-30   0.2907       0.0729       1.6 10^-31
First 50             6.9 10^-18    0.2456       2.3 10^-11   0.2149       0.0830       7.5 10^-12
First 25             2.3 10^-09    0.9773       5.1 10^-06   0.9670       0.5638       2.9 10^-06
Second 25            2.5 10^-09    0.2034       5.2 10^-06   0.1781       0.0508       3.1 10^-06
Table 8. RTT goodness-of-fit approximation (p-values)

Number of requests   Exp.          Gam.         Norm.        Beta         Weib.        Poiss.
293 (all)            2.1 10^-94    5.1 10^-30   4.4 10^-59   7.0 10^-39   5.0 10^-38   7.5 10^-85
First half           6.5 10^-52    2.6 10^-17   9.1 10^-33   1.1 10^-16   2.6 10^-19   1.0 10^-45
Second half          5.0 10^-44    2.5 10^-11   1.8 10^-27   4.6 10^-16   4.6 10^-13   8.1 10^-40
First 50             8.1 10^-18    1.9 10^-04   2.1 10^-11   2.9 10^-04   2.0 10^-07   2.1 10^-15
First 25             2.7 10^-09    0.004        4.2 10^-06   0.0043       0.0133       4.6 10^-08
Second 25            1.6 10^-09    6.0 10^-04   4.0 10^-06   5.4 10^-04   3.5 10^-04   4.8 10^-08
Table 9. RT goodness-of-fit approximation (p-values)

Number of requests   Exp.          Gam.         Norm.        Beta         Weib.        Poiss.
293 (all)            1.6 10^-96    1.8 10^-14   4.4 10^-60   4.4 10^-29   1.0 10^-19   4.0 10^-67
First half           2.6 10^-52    0.0054       9.4 10^-33   0.0048       1.1 10^-06   2.6 10^-35
Second half          1.0 10^-45    9.8 10^-08   1.9 10^-28   5.2 10^-15   9.1 10^-09   2.2 10^-32
First 50             6.1 10^-18    0.1159       2.1 10^-11   0.1083       0.1150       6.1 10^-12
First 25             2.4 10^-09    0.8776       4.2 10^-06   0.8909       0.7175       2.7 10^-06
Second 25            1.9 10^-09    0.0843       4.5 10^-06   0.0799       0.0288       2.8 10^-06
An important fact we found is that RPT has a higher instability than RTT; in spite of this, however, RPT can be better represented by a particular theoretical distribution. At the same time, the probability distribution series of RTT has unique characteristics that make it really difficult to describe theoretically. Among the existing theoretical distributions, Gamma, Beta and Weibull capture our experimental response time statistics better than the others. However, the goodness of fit is acceptable only within short time intervals.
We should also mention here that the performance and other dependability characteristics of WSs can become out of date very quickly. The BASIS response time changed significantly over three months, in spite of the fact that there were no essential changes in its architecture apart from changes in the usage profile and the Internet routes. The BASIS WS is a typical example of the SOA solutions found in e-science and grid computing.
It has a rather complex structure, integrating a number of components such as an SBML modeller and simulator, a database, a grid computing engine and a computing cluster, typically used for many in silico studies in systems biology. We believe that the performance uncertainty, which is partially due to the systems themselves, can be reduced by further optimisation of the internal structure and by the right choice of components and technologies that suit each other and fit the system requirements better. Finally, our concrete suggestion for bioscientists using BASIS is to set a timeout that is 1.2 times longer than the average response time estimated over the last 20-25 requests (a sketch of this policy is given below). When the timeout is exceeded, a recovery action based on a simple retry can be effective most of the time in dealing with transient congestions happening in the Internet and/or the BASIS WS. A more sophisticated technique that would predict the response time more precisely and set the timeout accordingly should assess both the average response time and the coefficient of variation. To be more dependable, clients should also distinguish between different exceptions and handle them in different ways depending on the exception source. All experimental results can be found at http://homepages.cs.ncl.ac.uk/alexander.romanovsky/home.formal/Server-for-site.xls, including the invocation and ping RTT statistics for the Frankfurt client and the probability distribution series (RPT, RTT, and RT). An extended version of this section has been submitted to SRDS'2010.
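A minimal sketch of this suggested client-side policy follows; it is our own illustration, and the class and method names are assumptions:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Callable;

// Sketch of the suggested policy: use a timeout 1.2 times the average
// response time of the last 20-25 requests, and retry once when it is
// exceeded. Names and structure are our own illustration.
public class AdaptiveTimeoutClient {

    private final Deque<Long> window = new ArrayDeque<>();
    private static final int WINDOW_SIZE = 25;

    // Record a completed request's response time in a sliding window.
    void record(long responseTimeMs) {
        window.addLast(responseTimeMs);
        if (window.size() > WINDOW_SIZE) window.removeFirst();
    }

    // Current timeout: 1.2 x the windowed average (with a fallback default).
    long currentTimeoutMs() {
        if (window.isEmpty()) return 5000;
        long sum = 0;
        for (long t : window) sum += t;
        return (long) (1.2 * sum / window.size());
    }

    // Invoke with the adaptive timeout; on timeout, retry once, which is
    // usually enough to ride out a transient congestion.
    <T> T invokeWithRetry(Callable<T> call) throws Exception {
        for (int attempt = 0; attempt < 2; attempt++) {
            long start = System.currentTimeMillis();
            try {
                T result = invokeWithTimeout(call, currentTimeoutMs());
                record(System.currentTimeMillis() - start);
                return result;
            } catch (java.util.concurrent.TimeoutException e) {
                // fall through and retry once
            }
        }
        throw new Exception("service did not respond within the adaptive timeout");
    }

    private <T> T invokeWithTimeout(Callable<T> call, long timeoutMs) throws Exception {
        java.util.concurrent.ExecutorService ex =
                java.util.concurrent.Executors.newSingleThreadExecutor();
        try {
            return ex.submit(call).get(timeoutMs, java.util.concurrent.TimeUnit.MILLISECONDS);
        } finally {
            ex.shutdownNow();
        }
    }
}
```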
BENCHMARKING EXCEPTION PROPAGATION MECHANISMS

Exception handling is one of the popular means used for improving dependability and supporting recovery in the Service-Oriented Architecture. Knowing the exact causes and sources of the exceptions raised during the operation of a Web Service allows developers to apply the most suitable fault tolerance and error recovery techniques (AmberPoint, 2003). In this section we present an experimental analysis of the SOA-specific exception propagation mechanisms and provide some insights into the differences in error handling and propagation delays between two implementations of web services, in the IBM WebSphere SDK and the Sun Java application server SDK. We analyse the ability of the exception propagation mechanisms of the two Web Services development toolkits to disclose the exact roots of different exceptions, and seek to understand their implications for the performance and uncertainty of the SOA applications using them. To provide such an analysis we used fault injection, a well-proven method for assessing the dependability and fault tolerance of a computing system. In particular, (Looker et al., 2004) and (Duraes et al., 2004) present practical approaches to dependability benchmarking and to the evaluation of the robustness of Web Services. However, the existing works neither consider the propagation behaviour of the exceptions raised because of the injected faults nor study the performance of exception propagation across different Web Services platforms.
Experimental Technique

To conduct our experiments we first implemented a Java class, WSCalc, which performs a simple arithmetic operation upon two integers, converting the result into a string. Then we implemented two testbed Web Services using two different development toolkits: i) the Sun Java System (SJS) Application Server and ii) the IBM WebSphere Software Developer Kit (WSDK). The next steps were the analysis of SOA-specific errors and failures and their injection into the testbed web service architecture. Finally, we analysed and compared the exception propagation mechanisms and their performance implications.
Web Services Development Toolkits

In our work we experimented with two widely used technologies: the Java cross-platform technology developed by Sun, and the IBM Web Service development environments and runtime application servers. The reasons for this choice are that Sun develops most of the standards and reference implementations of Java Enterprise software, whereas IBM is the largest enterprise software company.

NetBeans IDE/SJS Application Server. NetBeans IDE is a powerful integrated environment for developing applications on the Java platform, supporting Web Services technologies through the Java Platform, Enterprise Edition (J2EE). The Sun Java System (SJS) Application Server is the Java EE implementation by Sun Microsystems. NetBeans IDE with the SJS Application Server supports JSR-109, a development paradigm suited to J2EE development, based on JAX-RPC (JSR-101).

IBM WSDK for Web Services. The IBM WebSphere Software Developer Kit Version 5.1 (WSDK) is an integrated kit for creating, discovering, invoking, and testing Web Services. WSDK v5.1 is based on WebSphere Application Server v5.0.2 and provides support for the following open industry standards: SOAP 1.1, WSDL 1.1, UDDI 2.0, JAX-RPC 1.0, EJB 2.0, Enterprise Web Services 1.0, WSDL4J, UDDI4J, and WS-Security. WSDK can be used with the Eclipse IDE, which provides a graphical interactive development environment for building and testing Java applications. Supporting the latest specifications for Web Services, WSDK enables developers to build, test, and deploy Web Services on the industry-leading IBM WebSphere Application Server. The functionality of WSDK v5.1 has been incorporated into the IBM WebSphere Studio family of products. Note that, at the time of writing, the JAX-RPC framework has been extensively replaced by the newer JAX-WS framework (with SOAP 1.2 compliance); however, we believe our findings still apply to present and future Web Services technologies, as they face the same dependability issues.
Web Service Testbed

The starting point for developing a JAX-RPC WS is the coding of a service endpoint interface and an implementation class with public methods that must throw java.rmi.RemoteException. To analyse the features of the exception propagation mechanisms in the service-oriented architecture we developed a testbed WS executing simple arithmetic operations. The implementation bean class of the Web Service providing arithmetic operations is shown in Figure 12. The testbed service was implemented using the two different development kits provided by Sun and IBM. The two diverse web services obtained in this way were deployed on two hosts using the same runtime environment (hardware platform and operating system) but different application servers: i) IBM WebSphere and ii) SJS AppServer. These hosts, operating under Windows XP Professional Edition, were located in the university LAN. Thus, transfer delays and other network problems were insignificant and affected both testbed services in the same way.
Figure 12. The implementation bean class of the Web Service providing simple arithmetic operations
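Since Figure 12 survives here only as a caption, the following is a hedged reconstruction of what such a JAX-RPC implementation bean typically looks like; the exact interface, method names and arithmetic operation of the original WSCalc are assumptions:

```java
import java.rmi.RemoteException;

// Hypothetical reconstruction of the WSCalc implementation bean:
// a JAX-RPC service implementation whose public methods declare
// java.rmi.RemoteException, performing a simple arithmetic operation
// upon two integers and converting the result into a string.
public class WSCalcImpl implements WSCalcSEI {

    public String calculate(int a, int b) throws RemoteException {
        int result = a + b;            // assumed operation
        return String.valueOf(result); // result converted into a string
    }
}

// The corresponding service endpoint interface (assumed shape).
interface WSCalcSEI extends java.rmi.Remote {
    String calculate(int a, int b) throws RemoteException;
}
```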
Error and Failure Injection

In our work we experimented with 18 types of SOA-specific errors and failures occurring during service binding and invocation, SOAP messaging and request processing by a web service (see Table 10), divided into three main categories: (i) network and remote system failures, (ii) internal errors and failures, and (iii) client-side binding errors. They are general (not application specific) and can appear in any Web Service application during operation. Common-case network failures are a down state of the DNS or packets lost due to network congestion.

Table 10. SOA-specific errors and failures

Error/failure domain            Type of error/failure
Network and system failures     Network connection break-off
                                Domain Name System is down
                                Loss of request/response packet
                                Remote host unavailable
Service errors and failures     Application Server is down
                                Suspension of WS during transaction
                                System run-time error
                                Application run-time error
                                Error causing user-defined exception
Client-side binding errors      Error in Target Name Space
                                Error in Web Service name
                                Error in service port name
                                Error in service operation name
                                Output parameter type mismatch
                                Input parameter type mismatch
                                Error in name of input parameter
                                Mismatching of number of input params
                                WS style mismatching ("Rpc" or "Doc")
Besides, the operation of a WS depends on the operation of the system software, such as the web server, application server and database management system. In our work we analysed failures occurring when the application servers (WebSphere or SJS AppServer) were shut down. Client errors in early binding or dynamic interface invocation (DII) (like "Error in Target Name Space", "Error in Web Service name", etc.) occur because of changes in the invocation parameters and/or inconsistencies between the WSDL description and the service interface. Finally, the service failures are connected with program faults and run-time errors causing system- or user-defined exceptions. System run-time errors like "Stack overflow" or "Lack of memory" result in exceptions at the system level as a whole. The operation "Division by zero" is also caught and generates an exception at the system level, but this system error is easier to simulate than the others. Typical examples of application run-time errors are "Operand type mismatch", "Product overflow" and "Index out of bounds". In our experiments we injected the "Operand type mismatch" error, hangs of the WS due to its program getting into a loop, and an error causing a user-defined exception (an exception defined by a programmer during WS development). Service failures (6, 7, 8) were simulated by fault injection at the service side. Client-side binding errors (10-18), which are, in fact, a set of robustness tests (i.e., invalid web service call parameters), were applied during web service invocation in order to reveal possible robustness problems in the web services middleware. We used a compile-time injection technique (Looker et al., 2004), in which the source code is modified to inject simulated errors and faults into the system. Network and system failures were simulated by manually shutting down the DNS server, the application server and the network connections at the client and service sides.
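As an illustration of the compile-time injection technique, a service method can be modified so that a chosen fault is raised during request processing. The sketch below is our own example, not the original injector; it injects a division-by-zero system error and a user-defined exception:

```java
// Illustrative compile-time fault injection: the service source is
// modified so that a chosen fault is raised during request processing.
public class FaultyWSCalc {

    // Injected system run-time error: "Division by zero" is raised
    // instead of the normal computation.
    public String calculateWithSystemError(int a, int b) {
        int zero = 0;
        return String.valueOf(a / zero);   // throws java.lang.ArithmeticException
    }

    // Injected application error causing a user-defined exception.
    public String calculateWithUserException(int a, int b) throws UserException {
        throw new UserException("injected application-level fault");
    }
}

// A user-defined exception, as declared by a programmer during WS development.
class UserException extends Exception {
    UserException(String msg) { super(msg); }
}
```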
Errors and Exceptions Correspondence Analysis

Table 11 describes the relationship between errors/failures and the exceptions raised at the top level on the different application platforms. As was discovered, some injected errors and failures cause the same exception, so we were not always able to determine the precise cause of an exception. There are several groups of such errors and failures: 1 and 2 (Sun); 3 and 6 (Sun); 4 and 5 (Sun); 1, 2 and 5 (IBM); 3 and 6 (IBM). Some client-side binding errors (11, "Error in Web Service name", and 12, "Error in service port name") neither raise exceptions nor affect the service output. This happens because the WS is actually invoked by the address location, whereas the service and port names are only used as supplementary information. Moreover, the WS developed using IBM WSDK and deployed on the IBM WebSphere application server tolerates the following binding errors internally: 10, "Error in Target Name Space"; 14, "Output parameter type mismatch"; and 16, "Error in name of input parameter". These features are supported by the WSDL description and a built-in function for automatic type conversion. Errors in the name of the input parameter were tolerated because, in the IBM implementation of the web service, checking the order of the parameters has priority over the coincidence of the parameter names. On the other hand, it seems that WebSphere is unable to detect the potentially dangerous situation resulting from such a mix-up of parameters.
Exception Propagation and Performance Analysis

Table 11 shows the exceptions raised at the top level on the client side. However, a particular exception can be wrapped dozens of times before it finally propagates to the top. This process takes time and significantly reduces the performance of exception handling in a service-oriented architecture. An example of the stack trace corresponding to the "Operand Type Mismatch" run-time error caught by a web service is given in Figure 13. The exception propagation chain has four nested calls (starting with the "at" preposition) when the WS development kit from Sun Microsystems is used.
Table 11. Examples of top-level exceptions raised by different types of errors and failures

Network connection break-off; DNS is down
  Sun Microsystems WS Toolkit: "HTTP transport error: java.net.UnknownHostException: c1.xai12.ai"
  IBM WS Toolkit (WSDK): "{http://websphere.ibm.com/webservices/} Server.generalException"

Remote host unavailable (off-line)
  Sun: "HTTP Status-Code 404: Not Found - /WS/WSCalc"
  IBM: "{http://websphere.ibm.com/webservices/} HTTP faultString: (404)Not Found"

Suspension of Web Service during transaction
  Sun: waiting for a response for too long (more than 2 hours) without an exception
  IBM: "{http://websphere.ibm.com/webservices/} Server.generalException faultString: java.io.InterruptedIOException: Read timed out"

System run-time error ("Division by Zero")
  Sun: "java.rmi.ServerException: JAXRPC.TIE.04: Internal Server Error (JAXRPCTIE01: caught exception while handling request: java.lang.ArithmeticException: / by zero)"
  IBM: "{http://websphere.ibm.com/webservices/} Server.generalException faultString: java.lang.ArithmeticException: / by zero"

Application error causing user-defined exception
  Sun: "java.rmi.RemoteException: ai.c1.loony.exception.UserException"
  IBM: "{http://websphere.ibm.com/webservices/} Server.generalException faultString:(13)UserException"

Error in name of input parameter
  Sun: "java.rmi.RemoteException: JAXRPCTIE01: unexpected element name: expected=Integer_2, actual=Integer_1"
  IBM: OK - correct output without exception
Figure 13. Stack trace of failure No 8, raised in the client application developed in NetBeans IDE using the JAX-RPC implementation of Sun Microsystems
For comparison, the stack trace of the IBM-based implementation has 63 nested calls for the same error. The full stack traces and technical details can be found in (Gorbenko et al., 2007). The results of the exception propagation and performance analysis are presented in Table 12. The table includes the number of exception stack traces (the length of the exception propagation chain, i.e. the count of different stack traces for a particular failure) and the propagation delay (minimal, maximal and average values), which is the time between the invocation of a service and the capture of the exception by a catch block. As can be seen from Table 12, the IBM implementation of the web service shows almost twice as good performance in failure-free invocations as the service implemented with the Sun technology. The performance of the exception propagation mechanisms was monitored in the university LAN on heterogeneous server platforms. The first row of the table corresponds to the correct service output without any exceptions. The rows marked in bold correspond to the cases of correct service outputs without exceptions in spite of the injected errors. It is clear from the table that the exception propagation delay is several times greater than the normal working time. However, the exception propagation delay of the Web Service developed with NetBeans IDE using the JAX-RPC implementation from Sun Microsystems was two times shorter than the delay we experienced when we used IBM WSDK. This can be accounted for by the fact that the exception propagation chain in the IBM implementation of the web service is usually much longer.
The factors affecting the performance, and the differences between the two web service development environments, most probably depend on the internal structure of the toolkits and the application servers used. We believe that the most likely reason for this behaviour is that the IBM WSDK implementation has a larger number of nested calls than the Sun JAX-RPC implementation. In the case of service suspension or packet loss, the service client developed using the Sun WS toolkit may not raise an exception for as long as 2 hours. This results in retarded recovery actions and complicates the developers' work. Analysing the exception stack trace and the propagation delay can help in identifying the source of the exception. For example, failures 1 ("Network connection break-off") and 2 ("Domain Name System (DNS) is down") raise the same top-level exception, "HTTP transport error: java.net.UnknownHostException: loony.xai12.ai". However, if we use the Sun WS toolkit, we can distinguish between these failures by comparing the numbers of stack traces (38 vs. 28). If we use IBM WSDK, we are able to distinguish failure 5 ("Application Server is down") from failures 1 and 2 by analysing the exception propagation delay (the former is greater by an order of magnitude).
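A sketch of how a client might exploit these two signals (propagation delay and stack trace depth) to classify a caught failure is shown below; the threshold values are assumptions loosely based on Table 12, not fixed rules:

```java
// Illustrative failure diagnosis based on the two measurable signals
// discussed above: exception propagation delay and stack trace depth.
// The thresholds are assumptions derived from Table 12.
public class ExceptionDiagnoser {

    static String diagnose(Exception e, long propagationDelayMs) {
        int depth = e.getStackTrace().length;
        String msg = String.valueOf(e.getMessage());

        if (msg.contains("UnknownHostException")) {
            // Same top-level exception for failures 1 and 2: use the
            // stack trace depth to tell them apart (e.g. 38 vs. 28).
            return depth > 30 ? "network connection break-off" : "DNS is down";
        }
        if (propagationDelayMs > 5000) {
            // An order-of-magnitude larger delay suggests the application
            // server is down rather than a plain transport failure.
            return "application server is down";
        }
        return "unclassified: " + msg;
    }
}
```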
Table 12. Performance analysis of the exception propagation mechanism. For each WS development toolkit, the table gives the number of exception stack traces and the exception propagation delay (min / max / av., in ms).

| № | Type of error/failure | NetBeans IDE (Sun): stack traces | min | max | av. | IBM WSDK: stack traces | min | max | av. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| – | Without error/failure | 0 | 40 | 210 | 95 | 0 | 15 | 120 | 45 |
| 1 | Network connection break-off | 38 | 10 | 30 | 23 | 16 | 10 | 40 | 28 |
| 2 | Domain Name System is down | 28 | 16 | 32 | 27 | 16 | 15 | 47 | 34 |
| 3 | Loss of packet with client request or service response | – | >7200000 | >7200000 | >7200000 | 15 | 300503 | 300661 | 300622 |
| 4 | Remote host unavailable (off-line) | 9 | 110 | 750 | 387 | 11 | 120 | 580 | 350 |
| 5 | Application Server is down | 9 | 70 | 456 | 259 | 16 | 100 | 550 | 287 |
| 6 | Suspension of Web Service during transaction (getting into a loop) | – | >7200000 | >7200000 | >7200000 | 15 | 300533 | 300771 | 300642 |
| 7 | System run-time error (“Division by Zero”) | 7 | 90 | 621 | 250 | 62 | 120 | 551 | 401 |
| 8 | Calculation run-time error (“Operand Type Mismatch”) | 4 | 90 | 170 | 145 | 63 | 130 | 581 | 324 |
| 9 | Application error causing user-defined exception | 4 | 100 | 215 | 175 | 61 | 150 | 701 | 366 |
| 10 | Error in Target Name Space | 4 | 100 | 281 | 180 | 0 | 10 | 105 | 38 |
| 11 | Error in Web Service name | 0 | 40 | 120 | 80 | 0 | 10 | 125 | 41 |
| 12 | Error in service port name | 0 | 30 | 185 | 85 | 0 | 15 | 137 | 53 |
| 13 | Error in service operation name | 4 | 90 | 270 | 150 | 58 | 190 | 511 | 380 |
| 14 | Output parameter type mismatch | 14 | 80 | 198 | 160 | 0 | 15 | 134 | 48 |
| 15 | Input parameter type mismatch | 4 | 80 | 190 | 150 | 76 | 90 | 761 | 305 |
| 16 | Error in name of input parameter | 4 | 70 | 201 | 141 | 0 | 10 | 150 | 47 |
| 17 | Mismatching of number of input service parameters | 4 | 80 | 270 | 160 | 61 | 130 | 681 | 350 |
| 18 | Web Service style mismatching | 4 | 70 | 350 | 187 | 58 | 90 | 541 | 298 |

For failures 3 and 6, the client built with the Sun toolkit raised no exception within the two-hour (7,200,000 ms) observation window.

Discussion

Exception handling is widely used as the basis for forward error recovery in service-oriented architecture. Its effectiveness depends on the features of exception raising and on the propagation mechanisms. This work allows us to draw the following conclusions.

1. Web services developed by using different toolkits react differently to some DII client errors (“Output parameter type mismatch”, “Error in name of input parameter”). Sometimes this diversity can allow us to mask client errors, yet in other cases it will lead to an erroneous outcome. Moreover,
the exception messages and stack traces gathered in our experimentation were not always sufficient to identify the exact cause of these errors. For example, it is not possible to know whether a remote host is down or merely unreachable due to transient network failures. All this contributes to SOA uncertainty and prevents developers from applying an adequate recovery technique.

2. Clients of web services developed using different toolkits can experience different response time-outs. In our experimentation with simple Web Services we also observed substantial delays in client software developed using the Sun Microsystems toolkit, caused by WS hangs or packet loss.

3. Web Services developed using different toolkits have different exception propagation times. This affects failure detection and failure notification delay. We believe that WSDK developers should make an effort to reduce these times.

4. Analysing exception stack traces and propagation delays can help identify the exact sources of exceptions even if we have caught the same top-level exception messages. This makes for better fault diagnosis, which identifies and records the cause of an exception in terms of both location and type, as well as better fault isolation and removal.

5. Knowing the exact cause and sources of exceptions is useful for applying appropriate failure recovery or fault-tolerant means during exception handling. Several types of failures resulting in exceptions can be effectively handled on the client side, whereas others should be handled on the service side. Exception handling of client-side errors in early binding procedures may include a retry with the help of dynamic invocation. Transient network failures can be tolerated by a simple retry. In other cases redundancy and majority voting should be used.

6. Gathering and analysing exception statistics allows improvement of fault handling, which prevents located faults from being activated again through system reconfiguration or reinitialization. This is especially relevant to a composite system with several alternative WSs.

7. Analysing exception stack traces helps identify the application server, WSDK, libraries and packages used for WS development. This information is useful for choosing diverse variants from a set of alternative Web Services deployed by third parties and for building effective fault-tolerant systems using WS redundancy and diversity.

Below is a summary of our suggestions as to how exception handling should be implemented in SOA systems to help develop systems that handle exceptions optimally. First of all, a Web Service should return exceptions as soon as possible. Long notification delays can significantly affect SOA performance, especially in complex workflow systems. To decrease the exception propagation delay, developers should avoid unnecessary nesting of exceptions and reduce the overall number of exception stack traces. It is also essential that exceptions contain more detailed information about the cause of the error and provide additional classification attributes to help error diagnosis and fault tolerance. For example, if an exception reports whether the error seems to be transient or permanent, a user’s application will be able to automatically choose and perform the most suitable error recovery action (a simple retry in case of transient errors, or more complex fault-tolerant techniques otherwise).
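To make the last suggestion concrete, here is a minimal Java sketch of what such a classified exception and the corresponding client-side recovery logic could look like. All class and method names are hypothetical; no toolkit discussed above provides this API out of the box.

```java
// A hypothetical fault type carrying a transient/permanent classification attribute.
public class ClassifiedServiceException extends Exception {
    private final boolean transientError;

    public ClassifiedServiceException(String msg, boolean transientError) {
        super(msg);
        this.transientError = transientError;
    }

    public boolean isTransient() { return transientError; }
}

class RecoveringClient {
    static final int MAX_RETRIES = 3;

    String invokeWithRecovery() throws Exception {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                return invokeService();
            } catch (ClassifiedServiceException e) {
                if (!e.isTransient()) {
                    // Permanent fault: retrying is pointless, so switch to a
                    // redundant (diverse) service instead.
                    return invokeAlternativeService();
                }
                // Transient fault: a simple retry is usually enough.
            }
        }
        throw new Exception("service unavailable after " + MAX_RETRIES + " retries");
    }

    // Hypothetical stand-ins for real web-service invocations.
    String invokeService() throws ClassifiedServiceException { return "result"; }
    String invokeAlternativeService() { return "result from backup"; }
}
```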
CONCLUSION AND FUTURE WORK

Service-Oriented Architecture and Web Services technologies support rapid, low-cost and seamless composition of globally distributed applications, and enable effective interoperability in a loosely-coupled heterogeneous environment. Services are autonomous, platform-independent computational entities that can be dynamically discovered and integrated into a single service to be offered to the users or, in turn, used as a building block in further composition. The essential principles of SOA and WS form the foundation for various modern and emerging IT technologies, such as
service-oriented and cloud computing, SaaS (software as a service), grid, etc. According to International Data Corporation (2007), Web Services and service-oriented systems are now widely used in e-science, critical infrastructures and business-critical systems. Failures in these applications adversely affect people’s lives and businesses. Thus, ensuring the dependability of WSs and SOA-based systems is a must, as well as a challenge. To illustrate the problem, our earlier extensive experiments with the BASIS and BLAST bioinformatics WSs show that the response time varies greatly because of such unpredictable factors as Internet congestion and failures and WS overloads. In particular, the BASIS WS response time ranges from 300 ms to 120,000 ms; the response time of 22% of the requests is at least twice the observed minimal value, and the response time of about 5% of the requests is more than 20 times that value. We believe it is impossible to build fast and dependable SOAs without tackling these issues. Our recent experimental work supports our claim that dealing with the uncertainty inherent in the very nature of SOA and WSs is one of the main challenges in building dependable SOAs. Uncertainty needs to be treated as a threat, in a way similar to and in addition to the faults, errors and failures traditionally dealt with by the dependability community (Avizienis et al., 2004). Response time instability can cause timing failures, when the time of response arrival or the time at which information is delivered at the service interface (i.e. the timing of service delivery) deviates from the time required to execute the system function. A timing failure may take the form of an early or a late response, depending on whether the service is delivered too early or too late (Avizienis et al., 2004). In complex Service-Oriented Systems composed of many different Web Services, some users may receive a correct service whereas others may receive incorrect services of different types due to timing errors. These errors may occur in different system components
depending on the relative position of a particular user and particular Web Services in the Internet, and on the instability points appearing during the execution. Thus, timing errors can become a major cause of the inconsistent failures usually referred to, after Lamport, Shostak, & Pease (1982), as Byzantine failures. The novel concepts of Service-Oriented Systems and their application in new domains clearly call for continued attention to SOA-specific uncertainty issues. For open intra-organisational SOA systems using the Internet, this uncertainty is unavoidable and the systems should be able to provide trustworthy service in spite of it. This, in turn, will require developing new resilience engineering techniques and resilience-explicit mechanisms to deal with this threat. Good measurement of uncertainty is important (and our work contributes to this topic), and yet it is just the beginning because, once measured, the non-functional characteristics of WSs cannot be assumed to hold true forever. This is why developing dynamic fault-tolerant techniques and mechanisms that set timeouts on-line and adapt the system architecture and its behaviour on the fly is crucial for SOA. In fact, there is a substantial number of dependability-enhancing techniques that can be applied to SOA (Zheng & Lyu, 2009; Maamar et al., 2008; Laranjeiro & Vieira, 2008; Fang et al., 2007; Salatge & Fabre, 2007, etc.), including retries of lost messages, redundancy and replication of WSs, variations of recovery blocks trying different services, etc. These techniques exploit the flexibility of the service infrastructure, but the major challenge in utilising them is the uncertainty inherent in services running over the Internet and clouds. This uncertainty exhibits itself through the unpredictable response times of Internet messages and data transfers, the difficulty of diagnosing the root cause of service failures, the lack of ability to see beyond the interfaces of a service, unknown common-mode failures, etc. The uncertainty of the Internet and service
performance instability are such that on-line optimisation of redundancy can make a substantial difference in perceived dependability. There are, however, no good tools available at the moment for a company to carry out such optimisation in a rigorous manner. We believe that uncertainty can be resolved by two means: uncertainty removal, through advances in data collection, and uncertainty tolerance, through smart algorithms that improve decisions despite a lack of data (e.g. by extrapolation, better mathematical models, etc.). The user can intelligently and dynamically switch between Internet service providers or WS providers if he or she understands which delay makes the major contribution to the response time and its instability. The more aware the user is of the response time, of the different delays contributing to it and of their uncertainty, the more intelligent his or her choice will be. Future solutions will need to deal with a number of issues, such as the uncertainty of fault assumptions, of redundant resource behaviour, of error detection, etc. Traditional adaptive solutions based on control feedback will not be directly applicable, as they are designed for predictable behaviour. One possible way to resist uncertainty is to use the service and path redundancy and diversity inherent to SOA. In (Gorbenko, Kharchenko, & Romanovsky, 2009) we propose several patterns for dependability-aware service composition that allow us to construct composite Service-Oriented Systems resilient to various types of failure (signalled or unsignalled; content, timing or silent failures) by using the inherent redundancy and diversity of the Web Service components that exist in the SOA and by extending the mediator approach proposed by Chen and Romanovsky (2008).
ACKNOWLEDGMENT

A. Romanovsky is partially supported by the UK EPSRC TrAmS platform grant. A. Gorbenko is partially supported by the UA DFFD grant GP/F27/0073 and the School of Computing Science, Newcastle University.
REFERENCES

AmberPoint, Inc. (2003). Managing Exceptions in Web Services Environments. An AmberPoint Whitepaper. Retrieved from http://www.amberpoint.com

Avizienis, A., Laprie, J.-C., Randell, B., & Landwehr, C. (2004). Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33. doi:10.1109/TDSC.2004.2

Chen, Y., & Romanovsky, A. (2008, Jan/Feb). Improving the Dependability of Web Services Integration. IT Professional: Technology Solutions for the Enterprise, 20–26.

Chen, Y., Romanovsky, A., Gorbenko, A., Kharchenko, V., Mamutov, S., & Tarasyuk, O. (2009). Benchmarking Dependability of a System Biology Application. Proceedings of the 14th IEEE Int. Conference on Engineering of Complex Computer Systems (ICECCS’2009), 146–153.

Duraes, J., Vieira, M., & Madeira, H. (2004). Dependability Benchmarking of Web-Servers. In M. Heisel et al. (Eds.), Proceedings of the 23rd Int. Conf. on Computer Safety, Reliability and Security (SAFECOMP’04), LNCS 3219 (pp. 297–310). Springer-Verlag.

Fang, C.-L., Liang, D., Lin, F., & Lin, C.-C. (2007). Fault tolerant web services. Journal of Systems Architecture, 53(1), 21–38. doi:10.1016/j.sysarc.2006.06.001
Goad, R. (2008, Dec). Social Xmas: Facebook’s busiest day ever, YouTube overtakes Hotmail, Social networks = 10% of UK Internet traffic [Web log comment]. Retrieved from http://weblogs.hitwise.com/robin-goad/2008/12/facebook_youtube_christmas_social_networking.html

Gorbenko, A., Kharchenko, V., & Romanovsky, A. (2009). Using Inherent Service Redundancy and Diversity to Ensure Web Services Dependability. In Butler, M. J., Jones, C. B., Romanovsky, A., & Troubitsyna, E. (Eds.), Methods, Models and Tools for Fault Tolerance, LNCS 5454 (pp. 324–341). Springer-Verlag. doi:10.1007/978-3-642-00867-2_15

Gorbenko, A., Kharchenko, V., Tarasyuk, O., Chen, Y., & Romanovsky, A. (2008). The Threat of Uncertainty in Service-Oriented Architecture. Proceedings of the RISE/EFTS Joint International Workshop on Software Engineering for Resilient Systems (SERENE’2008), ACM, 49–50.

Gorbenko, A., Mikhaylichenko, A., Kharchenko, V., & Romanovsky, A. (2007). Experimenting With Exception Handling Mechanisms Of Web Services Implemented Using Different Development Kits. Technical report CS-TR 1010, Newcastle University. Retrieved from http://www.cs.ncl.ac.uk/research/pubs/trs/papers/1010.pdf

Institute for Ageing and Health. (2009). BASIS: Biology of Ageing e-Science Integration and Simulation System. Retrieved June 1, 2010, from http://www.basis.ncl.ac.uk/. Newcastle upon Tyne, UK: Newcastle University.

International Data Corporation. (2007). Mission Critical North American Application Platform Study. IDC White Paper. Retrieved from www.idc.com

Kirkwood, T. B. L., Boys, R. J., Gillespie, C. J., Proctor, C. J., Shanley, D. P., & Wilkinson, D. J. (2003). Towards an E-Biology of Ageing: Integrating Theory and Data. Journal of Nature Reviews Molecular Cell Biology, 4, 243–249. doi:10.1038/nrm1051
Lamport, L., Shostak, R., & Pease, M. (1982). The Byzantine Generals Problem. ACM Trans. Programming Languages and Systems, 4(3), 382–401. doi:10.1145/357172.357176

Laranjeiro, N., & Vieira, M. (2008). Deploying Fault Tolerant Web Service Compositions. International Journal of Computer Systems Science and Engineering (CSSE): Special Issue on Engineering Fault Tolerant Systems, 23(5).

Laranjeiro, N., Vieira, M., & Madeira, H. (2007). Assessing Robustness of Web-services Infrastructures. Proceedings of the International Conference on Dependable Systems and Networks (DSN’07), 131–136.

Li, P., Chen, Y., & Romanovsky, A. (2006). Measuring the Dependability of Web Services for Use in e-Science Experiments. In D. Penkler, M. Reitenspiess, & F. Tam (Eds.), International Service Availability Symposium (ISAS 2006), LNCS 4328 (pp. 193–205). Springer-Verlag.

Looker, N., Munro, M., & Xu, J. (2004). Simulating Errors in Web Services. International Journal of Simulation Systems, Science & Technology, 5(5).

Maamar, Z., Sheng, Q., & Benslimane, D. (2008). Sustaining Web Services High-Availability Using Communities. Proceedings of the 3rd International Conference on Availability, Reliability and Security, 834–841.

Miyazaki, S., & Sugawara, H. (2000). Development of DDBJ-XML and its Application to a Database of cDNA [Tokyo: Universal Academy Press Inc.]. Genome Informatics, 2000, 380–381.

Reinecke, P., van Moorsel, A., & Wolter, K. (2006). Experimental Analysis of the Correlation of HTTP GET invocations. In A. Horvath & M. Telek (Eds.), European Performance Engineering Workshop (EPEW’2006), LNCS 4054 (pp. 226–237). Springer-Verlag.
Salatge, N., & Fabre, J.-C. (2007). Fault Tolerance Connectors for Unreliable Web Services. Proceedings of the International Conference on Dependable Systems and Networks (DSN’07), 51–60.

Zheng, Z., & Lyu, M. (2009). A QoS-Aware Fault Tolerant Middleware for Dependable Service Composition. Proceedings of the International Conference on Dependable Systems and Networks (DSN’09), 239–248.

Zheng, Z., & Lyu, M. (2010). Collaborative Reliability Prediction for Service-Oriented Systems. Proceedings of the ACM/IEEE 32nd International Conference on Software Engineering (ICSE’10), 35–44.
Zheng, Z., Zhang, Y., & Lyu, M. (2010). Distributed QoS Evaluation for Real-World Web Services. Proceedings of the IEEE International Conference on Web Services (ICWS’10), 83-90.
ENDNOTES

1. http://www-128.ibm.com/developerworks/webservices/wsdk/
2. http://www.sun.com/software/products/appsrvr_pe/index.xml
3. http://www.netbeans.org
4. http://www.eclipse.org
Chapter 13
Achieving Dependable Composite Services through Two-Level Redundancy

Hailong Sun, Beihang University, China
Jin Zeng, China Software Testing Center, China
Huipeng Guo, Beihang University, China
Xudong Liu, Beihang University, China
Jinpeng Huai, Beihang University, China
ABSTRACT

Service composition is a widely accepted method to build service-oriented applications. However, due to the uncertainty of infrastructure environments, service performance and user requests, service composition faces a great challenge to guarantee the dependability of the corresponding composite services. In this chapter, we provide an insightful analysis of the dependability issue of composite services. And we present a solution based on two-level redundancy: component service redundancy and structural redundancy. With component service redundancy, we study how to determine the number of backup services and how to guarantee consistent dependability of a composite service. In addition, structural redundancy aims at further improving dependability at business process level through setting up backup execution paths.

DOI: 10.4018/978-1-60960-794-4.ch013
INTRODUCTION

Service oriented computing is known as a new computing paradigm that utilizes services as fundamental elements for developing applications. A traditional way of software development is based on the “divide-and-conquer” manner, in which software is divided into several modules and the modules are implemented separately. Instead, service oriented computing provides a novel approach to building new software applications through the reuse of existing services. A service is a self-describing entity that can be discovered and accessed through standards-based protocols. Thus software development with service oriented computing is more about integrating existing services to meet application requirements. Intuitively, software productivity can be much improved with service oriented computing technologies. The underlying technologies generally include SOA (Service Oriented Architecture) and Web services. Following the service-oriented architecture, Web services support better interoperability, higher usability and increased reusability compared to traditional software technologies such as RMI and CORBA. According to (Zhang, Zhang, & Cai, 2007), the lifecycle of an SOA solution consists of modeling, development, deployment, publishing, discovery, composition, collaboration, monitoring and management. In particular, service composition is widely considered an effective method to support the development of business applications using loosely-coupled and distributed web services over the Internet. As a result, a software application built in this approach exists in the form of a composite service, which is composed of a business process structure and a set of component services. One of the challenges faced by service composition is how to make a system adaptable to rapidly changing user requirements and runtime environments. One way to address this issue is late binding, which means that only abstract services are specified in the process of modeling and development while concrete services are chosen
at runtime. To be more specific, service oriented software is finally implemented and instantiated at runtime. In other words, the development and running of service oriented software cannot be separated as clearly as for traditional software. Therefore, service composition incorporates both design-time and runtime issues. Due to the highly dynamic and uncontrolled Internet environment, composite service dependability is one of the most important challenges to deal with in the field of service composition. The dependability property of composite services is of great importance to users and includes many critical factors, such as availability, reliability and so on. A service with higher availability promises more functioning time, and a more reliable service is less likely to fail when it is invoked. A key issue for service composition is that the dependability of a composite service may change dynamically over time. Users, however, desire that a composite service delivers consistent dependability, which makes it possible for them to obtain the expected results. This is particularly important when the application is mission-critical, e.g., disaster management. When the dependability of a composite service degrades, users may severely suffer from the decreased performance. The intrinsic dynamicity of composite services stems from the fact that a composite service is composed of many component services that potentially belong to different providers distributed over the open Internet. First, each component service is subject to a specific environment and various changing factors, such as varying system load and available bandwidth. Second, a more complicated situation is that multiple component services may simultaneously fail to deliver the required dependability. This chapter is devoted to discussing the dependability of composite services, which is a critical issue for service-oriented applications. Redundancy is a well-known method to obtain the needed dependability in distributed systems. According to the aforementioned analysis, we
aim at dealing with this issue from two aspects: component service redundancy and structural redundancy. Component service redundancy is adopted to select a substitute service for a failed component service, while structural redundancy is designed to deal with the simultaneous failure of multiple component services, which involves changing the business process behind a composite service. In summary, we propose a two-level redundancy mechanism to achieve highly-dependable composite services. This chapter is organized as follows. First, we describe the background problem, the state of the art, and the general solution to the dependability issues in service composition. Then we provide a two-level redundancy framework for achieving dependable composite services. Finally, we summarize this chapter and point out further research directions.
BACKGROUND

There are many attributes associated with a web service, and dependability can be considered an aggregate property of availability, reliability, safety, integrity and so forth (Avizienis, Randell, & Landwehr, 2004). These attributes reflect the capability of a service from different perspectives. The dependability of composite services is difficult to achieve because the components are autonomous, heterogeneous and usually come from different administrative domains. In traditional software development theories and technologies, many software reliability models have been presented, such as the Jelinski-Moranda model, the Littlewood-Verrall model and the Nelson model (Goyeva-Popstojanova, Mathur, & Trivedi, 2001; Kai-Yuan, 1995; Ramamoorthy & Bastani, 1982; Tsai et al., 2004), to solve the reliability problem. But these models do not suit service-oriented software. First, service providers and users are distributed, and the processes of service publication, search and invocation are separated.
When certain components fail, a service user cannot modify them but has to choose other services with the same function. Second, components with the same function can be viewed as independent and can be used to improve the availability of composite services. Finally, the available state of a composite service may change due to the autonomy and dynamism of its components. In a word, the distinct characteristics of service composition require new reliability models. To achieve dependable composite services, many efforts have been made in terms of modeling, service selection, service composition, service replication, monitoring and adaptation. The problem of service dependability in terms of modeling technologies has been investigated (Hamadi & Benatallah, 2003), and a Petri net-based algebra has been proposed to model control flows. The discriminator operator is used to place redundant orders with different suppliers offering the same service, so as to increase reliability. ASDL (Abstract Service Design Language) has been proposed for modeling Web Services and provides a notation for the design of service composition and interaction protocols at an abstract level (Solanki, Cau, & Zedan, 2006). With dynamic service composition methods (Baresi & Guinea, 2006; Mennie & Pagurek, 2000; Sun, Wang, Zhou, & Zou, 2003), only the necessary functions are defined at design time, and component services are bound and instantiated at runtime. Composite services need to communicate with the service registry dynamically so as to find the necessary component services according to pre-defined strategies. Service selection methods based on QoS attributes (Casati, Ilnicki, Jin, Krishnamoorthy, & Shan, 2000; Liu, Ngu, & Zeng, 2004) or on semantics (Verma et al., 2005) have been developed to improve the flexibility of composition and dynamic adaptability. But dynamic searching and several remote interactions with the service registry will affect system efficiency. And the quality of services cannot be guaranteed because the existing service registries
cannot guarantee the authenticity of data and the state of registered services. Service management research aims at improving availability by monitoring components and recovering them after failure. The Web Services architecture has been extended to gain high availability and autonomic behavior by tracking the health of components and integrating self-monitoring, self-diagnosis and self-repair mechanisms (Birman, Renesse, & Vogels, 2004). In other work, process description, monitoring methods and recovery strategies are proposed to achieve self-healing service composition (Guinea, 2005). Web Service replication has also been studied in this regard. An infrastructure, WS-Replication, has been proposed for WAN replication of Web Services (Salas, Pérez-Sorrosal, Patiño-Martínez, & Jiménez-Peris, 2006), and a middleware supporting reliable Web Services based on active replication has been presented (Ye & Shen, 2005). Adaptive methods such as VOC and VOCε (Harney & Doshi, 2006, 2007) have been proposed to improve service dependability. The VOC method can avoid unnecessary inquiries and reduce overhead by calculating the potential value of changed information. VOCε is an improvement of VOC that only monitors services whose information has passed its expiration time, so as to gain better efficiency. These methods are mainly concerned with efficient and economical verification of attributes. Based on the concepts of cooperative atomic action and web service composition action, a forward error recovery method has been proposed to achieve fault tolerance and dependable composite services (Tartanoglu, Issarny, Romanovsky, & Levy, 2003). However, these monitoring and recovery technologies do not consider monitoring and recovery efficiency, costs and rewards, and they do not assess the effect of the monitoring and recovery in quantitative terms. In contrast to existing work, we try to deal with the dependability issue of composite services through redundancy mechanisms at the two
levels of component services and business processes. The two types of redundancy complement each other to form an integrated solution to the dependability problem.
A TWO-LEVEL REDUNDANCY MECHANISM FOR DEPENDABLE COMPOSITE SERVICES

Problem Analysis

As we have mentioned, a composite service is composed of a business process structure and a set of component services. Therefore, we analyze the dependability of a composite service from two angles: component services and business process structure. First, the dependability of a component service is affected by many factors, including its implementation, the runtime and network environment, and external attacks. The malfunction of a component service will prevent the relevant composite service from providing dependable service to end users. Hence there must be a redundancy mechanism for the dynamic replacement of component services when needed. In addition, the dependability of a component is not invariant throughout its lifecycle. The dynamic changing of a component service’s dependability will result in changes to the respective composite service. However, from a user’s point of view, a software application should deliver dependable function consistently. Thus, it is an important issue to ensure the dependability of a composite service in dynamic environments. Second, the business process of a composite service can also affect dependability. In some cases, no matter how component services are selected, it is impossible to obtain a dependable composite service. This can be attributed to the following facts: (1) the business protocol is defined inappropriately; (2) no dependable component services are available for the current business process. This kind of problem can only be found
after the software application has run for a period of time. However, after a new business process is defined, if we can incorporate redundancy for a sub-process in advance, this will increase the dependability of the corresponding composite service. In this chapter, we call this mechanism structural redundancy, which means that, given a sub-process, we set up a certain number of backups with equivalent functions but different structures.
A TWO-LEVEL REDUNDANCY FRAMEWORK FOR DEPENDABLE COMPOSITE SERVICES

To address the dependability issue in service composition, especially the aforementioned problems, we propose a two-level redundancy framework, as shown in Figure 1. The bottom level is concerned with the dependability of component services. At this level, we propose a component service redundancy mechanism to improve the dependability of component services. Our method supports three redundancy modes: active, passive and hybrid redundancy. However, redundant services cause extra costs, so we propose heuristics to determine which services should be selected as backups. Additionally, to provide consistent dependability of a composite service to users, we propose an adaptive control method based on the Kalman filter. The upper level focuses on the influence of the business process structure on the dependability of a composite service. At this level, we propose to use structural redundancy to further improve dependability. In essence, the goal of structural redundancy is to add backup execution paths to a fragile segment of a composite service. In general, the two kinds of redundancy discussed in this chapter complement each other to form a complete solution to the dependability issue.
Figure 1. Two-Level Redundancy Framework
DEPENDABLE SERVICE COMPOSITION BASED ON COMPONENT SERVICE REDUNDANCY

Design of the KAF Scheme

To obtain composite services with desirable dependability, we propose an innovative scheme called KAF that constructs a closed-loop control for the adaptive maintenance of composite services (Guo, Huai, Li, & Deng, 2008). The KAF architecture is shown in the lower part of Figure 1 and consists of three main components. The Monitor & Estimator component monitors the current attributes of all services and estimates their future states based on the current and historical information. The estimated service information is needed by the Decision Maker to produce the corresponding maintenance strategy. The decision output is fed to the Enforcement Point, which implements the decision and dynamically adjusts the composition of the composite service. In addition, SIM (Service Information Manager) is an extended service registry supporting the description and update of QoS information. SIM is responsible for the reporting mechanism, collecting and managing service information. SIM’s functions include monitoring service dependability values, detecting service availability under different strategies (such as different frequencies, different incentives, and so forth), and evaluating and processing statistics from the sampling clients’ monitoring reports. In KAF, two kinds of caching mechanisms are involved: a component services cache (S-Cache) and a composite services cache (CS-Cache). We call the cached services controlled objects. The guarantee strategy of composite services is the corresponding controller and is performed by the Decision Maker and the Enforcement Point. The Monitor & Estimator, SIM and the sampling clients together form a feedback loop. Composite
services, the guarantee strategy and the feedback loop constitute a closed-loop feedback control system. In Figure 1, Ok (0≤k≤N) represents the desired values of the dependability attributes of a composite service, where N is the number of dependability attributes. Mk (0≤k≤N) denotes the feedback values of the dependability attributes from the sampling clients, estimated by the Estimator based on SIM’s summarization of the sampling clients’ reports. Ck (0≤k≤N) are control values determined by the difference between Ok and Mk; Ck denotes the dependability value to be adjusted. Based on the adjustment strategies, the Enforcement Point chooses candidate services and completes the reconstruction of the service composition. In the KAF architecture, the feedback loop is composed of two mechanisms: execution monitoring and reporting. Execution monitoring obtains the service state information from the execution engine, while the reporting mechanism summarizes the real service information from the sampling clients. We assume all the sampling clients are honest client nodes selected by SIM. The Estimator estimates the dependability of each component service based on the results of SIM and then computes the dependability of the composite service. To adjust the dependability effectively, the Decision Maker determines the service selection and update strategy, such as increasing the number of backup services or replacing undesirable services. In this chapter, the guarantee strategy is a kind of meta-strategy. According to a meta-strategy, the Enforcement Point runs the APB algorithm to select new services or to replace declining or failed services, implementing the composite service’s re-construction. In addition to the dependability of services, we must also consider the cost of replacement and the expected rewards, so as to maximize the long-term revenue. To achieve this goal, we propose to use Markov decision process theory to support strategy choices.
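A minimal sketch of this feedback computation is given below; the names are illustrative and the real KAF components are, of course, considerably more elaborate.

```java
// The control values Ck are the differences between the desired (Ok) and the
// measured/estimated (Mk) dependability attributes; positive values signal
// that the Enforcement Point must adjust the composition.
public class KafControlLoop {
    public static double[] controlValues(double[] desired, double[] measured) {
        double[] c = new double[desired.length];
        for (int k = 0; k < desired.length; k++) {
            c[k] = desired[k] - measured[k];  // Ck = Ok - Mk
        }
        return c;
    }

    public static void main(String[] args) {
        double[] o = {0.99, 0.95};  // desired availability, reliability (illustrative)
        double[] m = {0.97, 0.96};  // values estimated from sampling-client reports
        double[] c = controlValues(o, m);
        // c[0] > 0: availability below target, so an adjustment is needed.
        System.out.println(java.util.Arrays.toString(c));
    }
}
```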
Adaptive Control Mechanism Based on MDP and Kalman Filter
Due to the dynamic nature of component services (e.g., load, network conditions and churn), the dependability of a composite service will be affected. Thus we need to select some redundant services to ensure that the composite service can still deliver the desired dependability when component service dependability degrades. We design the dependability factor f = ln O / ln D, where O is the desired value of the dependability property to be satisfied and D is the designed value of dependability. The dependability factor, which reflects the importance attached to dependability, is used to guide the choice of component services. Generally, f is larger than 1, which means D is usually larger than O. We can see that, when the dependability of component services declines rarely, a low dependability factor can reduce the cost of constructing composite services. When dependability drops significantly, increasing the dependability factor can improve the quality of the composite service, though the cost may increase. To adjust the dependability factor of a composite service reasonably, we select MDP (Markov Decision Process) (Puterman, 1994) to model the composite service maintenance process: RSC = 〈S, A, {A(x) | x ∈ S}, Q, RV〉, where S is the set of all possible states, A is the set of decision strategies, {A(x) | x ∈ S} is the strategy set when the state is x, Q(B | x, a) = Q{Xt+1 ∈ B | Xt = x, At = a} specifies the probability of moving to the next state Xt+1 when the state is Xt = x and strategy At = a has been executed, and RV is the reward of executing the strategy. If we assume that the state of the system at moment t is Xt = x ∈ S and the decision is At = a ∈ A(x), then under the transfer function Q the system moves to state Xt+1 with reward RV(a, x). After the transfer, the system enters a new state and then makes a new decision, continuing the decision-making process. Therefore we establish the Markov decision process model of component service dependability maintenance as follows.
• The dependability of a composite service CS is represented as d = αa + βr + (1 − α − β)t, in which a, r and t are the availability, reliability and trust of the composite service in a certain period, respectively. Note that α, β and (1 − α − β) are the weights of availability, reliability and trust. The assignment of weights can be determined by the users to reflect different emphasis on these attributes.
• The dependability value range [0, 1] is divided into k parts. When d ∈ [(i−1)/k, i/k), we define the state of the composite service to be si, marked as si = [(i−1)/k, i/k). The state set of composite service dependability is denoted by S = {s1, …, sm}. By dividing [0, 1] we reduce the state space and hence the complexity of decision-making in service composition.
• Let Δf be the changing amount of the dependability factor. We define the adjustment strategy set as A = {−Δf, 0, +Δf}, where the strategies −Δf and +Δf reduce or increase the dependability factor respectively, and the strategy 0 leaves the dependability factor unchanged.
• For a composite service CS, suppose that d, ψ, w and η represent the dependability, revenue, service adjustment cost and importance degree of CS, where η is a positive real number. We denote by rv = η(−ln(1 − d)) + ψ − w the reward of the composite service CS. This means that if composite service properties such as availability, reliability and trust are higher, or the maintenance cost of a composite service is smaller, we get more reward, and composite service developers will be more satisfied with the composite service, i.e. the value of rv will be larger. The revenue is related to the number of service executions and the prices of services. At moment t, supposing that CS is in state si, we denote by rv = η(−ln(1 − d)) + ψ − w(At, si) the immediate reward of CS after adopting strategy At, where w(At, si) represents the cost of the dependability adjustment. (A small numeric sketch of this model is given after this list.)
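The following Java sketch illustrates the model under the stated assumptions (all parameter values are made up for the example): the aggregated dependability d, the k-way state discretisation and the reward rv = η(−ln(1 − d)) + ψ − w.

```java
public class DependabilityModel {
    /** d = alpha*a + beta*r + (1 - alpha - beta)*t. */
    static double dependability(double a, double r, double t,
                                double alpha, double beta) {
        return alpha * a + beta * r + (1 - alpha - beta) * t;
    }

    /** State index i such that d lies in [(i-1)/k, i/k); states run from 1 to k. */
    static int state(double d, int k) {
        return Math.min((int) Math.floor(d * k) + 1, k);
    }

    /** Immediate reward of adopting a strategy with adjustment cost w. */
    static double reward(double d, double eta, double psi, double w) {
        return eta * (-Math.log(1 - d)) + psi - w;
    }

    public static void main(String[] args) {
        double d = dependability(0.98, 0.95, 0.90, 0.5, 0.3);
        System.out.println("state = " + state(d, 10)
                + ", reward = " + reward(d, 1.0, 10.0, 2.0));
    }
}
```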
For an MDP problem, if the state transition probability function and the reward function are known, we can compute optimal decision strategies by using dynamic programming methods. However, in the service-oriented software development process it is difficult to observe all the historical actions of component services, which means that the transition probability function and the reward function are unknown; therefore we cannot use dynamic programming techniques to determine the optimal decision strategy. Instead we apply a Kalman filter-based approach to estimate service states, and then adjust the controller according to the estimated values.
The Kalman filter (Haykin, 2002) uses the recurrence of the state equation to achieve the optimal estimate of the state variables in a linear dynamic system. The Kalman filter is unbiased and has the smallest variance; it is easy to realize in computer programs and is suitable for on-line analysis. Furthermore, the extended Kalman filter (EKF) can be used in nonlinear systems. Therefore, we introduce the extended Kalman filter to estimate the dependability of component services so as to implement the adaptive control maintenance of a composite service's dependability (Guo, Huai, Li, & Deng, 2008). According to the aforementioned analysis, we design the adaptive control algorithm for the dependability maintenance of the composite service. The basic idea is as follows. First, we collect and summarize the sampling clients' usage data about service execution. Second, we calculate the dependability of the composite service and compare it with the expected value. Third, we estimate the dependability value of each selected component service by the Kalman filter formula, compute the immediate reward of every action and choose the corresponding action following the MDP framework. Finally, we execute the service selection algorithm (see Box 1) and the strategy executing module to select new services and replace degenerated services.
Box 1. KAF adaptive control algorithm

input: constructed composite service, available component service properties measurement values, expected dependable values Ok of the composite service.
output: dependability factor, selected services

1. Read the Ok, determine the weight of every attribute;
2. Collect the initial dependable attribute values Mk from SIM;
3. Calculate the dependable attribute values for each component service respectively;
4. Calculate d*k of the composite service;
5. Collect sampling measure values Mk of the component services;
6. Predict the value of the next period by the Kalman filter following the measured values;
7. Compute the immediate reward;
8. Determine the action and modify the dependable factor f;
9. Calculate the updated dk;
10. Calculate Δdk = dk − d*k;  // Δdk is the Ck
11. Select new services following the HAF algorithm;
12. d*k = dk;
goto step 5.
The input of the adaptive control algorithm includes the constructed composite service, the measured property values Mk of the available component services, and the expected dependability values Ok of the composite service. The output includes the dependability factor and the selected services. The first part of the KAF algorithm (lines 1–4) is initialization: reading the expected values Ok of the composite service, collecting the parameters of the component services from SIM, and calculating the dependable attribute values and the real dependability of the composite service. Then the sampled measured values of the component services are obtained and, using the Kalman filter formulas, the values for the next period are estimated. Afterwards, within the MDP, the immediate reward is computed and the action of modifying the dependability factor f is determined (lines 5–8). In the third part (lines 9–12), the dependability deviation values Δdk are computed and new services are selected following the HAF algorithm (Guo et al., 2007). Finally, the component services are measured again and another prediction and control cycle starts.
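Step 6 of Box 1 can be pictured with a simple scalar filter. The sketch below is a generic textbook Kalman filter for a one-dimensional dependability estimate, not the exact EKF formulation used in KAF; the noise parameters and measurements are illustrative.

```java
public class DependabilityEstimator {
    private double x = 0.9;   // current state estimate (dependability)
    private double p = 1.0;   // estimate covariance
    private final double q;   // process-noise variance
    private final double rn;  // measurement-noise variance

    public DependabilityEstimator(double processNoise, double measurementNoise) {
        this.q = processNoise;
        this.rn = measurementNoise;
    }

    /** Feed one sampled measurement and return the updated estimate. */
    public double update(double measured) {
        // Predict (an identity state transition is assumed here).
        double xPred = x;
        double pPred = p + q;
        // Correct with the new measurement.
        double gain = pPred / (pPred + rn);
        x = xPred + gain * (measured - xPred);
        p = (1 - gain) * pPred;
        return x;
    }

    public static void main(String[] args) {
        DependabilityEstimator est = new DependabilityEstimator(1e-4, 1e-2);
        for (double m : new double[] {0.97, 0.95, 0.96, 0.90}) {
            System.out.printf("estimate = %.4f%n", est.update(m));
        }
    }
}
```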
Service Redundancy and Selection

Our KAF scheme relies on certain service redundancy and selection mechanisms to enforce the adaptive dependability adjustment. For example, the HAF algorithm used in the KAF algorithm is responsible for selecting appropriate backup services to obtain the desirable dependability. In this section, we are concerned with service redundancy and selection issues; in particular, we take availability as an example of a dependability attribute. Replication is a well-known technique to improve service availability. Generally, there are three different redundancy mechanisms: active redundancy, passive redundancy and hybrid redundancy (Barrett et al., 1990; Budhiraja, Marzullo, & Schneider, 1992; Schneider, 1990). Service Redundancy (SR). A service redundancy SR is denoted as a tuple 〈AR, BR〉, |AR| > 0, |BR| ≥ 0, where AR and BR are the service subsets
that provide the same or similar functions. AR is the primary service set and BR is the backup service set. Three service redundancy approaches are determined by the number of services in AR and BR (Guo et al., 2007).

1. Active redundancy. When |AR| > 1, |BR| = 0, execution of the service redundancy SR means executing all services in AR.
2. Passive redundancy. When |AR| = 1, |BR| > 0, execution of the service redundancy SR means executing only the primary service in AR; only if the primary service fails is one backup service in BR selected to execute as the primary service.
3. Hybrid redundancy. When |AR| > 1, |BR| > 0, execution of the service redundancy SR means executing all services in AR; when one or more primary services fail, some backup services in BR are selected to execute as primary ones.
In practice, with a fixed total size of the service subsets, a larger |AR| and smaller |BR| mean higher costs; in contrast, a smaller |AR| and larger |BR| can cause a worse user experience. Therefore, deciding on an appropriate service redundancy approach is a trade-off, and we will not discuss this issue in this chapter.

Service availability. Service availability represents the probability that a service is up in a specified period of time under specific conditions. Service availability can be measured as (Ran, 2003):

A(s_j^i) = time_s / (time_s + time_u)    (1)

where s_j^i denotes the ith service selected to complete task t_j, A(s_j^i) denotes the availability of service s_j^i, time_s is the available time of s_j^i, time_u is the unavailable time of s_j^i, and time_a = time_s + time_u is the total time over which s_j^i is measured.

The availability of a composite based on the redundancy mechanism is influenced by the availability of the switch, message replication and consensus voting components. To simplify the calculation we assume that these components are always available, i.e. their availability is 1, and we get

A_AR = A_PR = A_HR = 1 − ∏_{i=1}^{n} (1 − A(s_j^i)),

where A_AR, A_PR and A_HR represent the availability of a task using the three redundancy approaches, respectively.

The availability of a composite is obtained from the availability of its tasks and the service composition modes. In the process of service composition, four basic composition modes are used: sequence (•), parallel (||), choice (+) and iteration (◦). In principle, all processes can be modeled using these four modes (W. v. d. Aalst & Hee, 2002). To compute the availability of a composite, we use the formulae in Table 1, based on (Cardoso, 2002).

Assume that a process Proc includes m tasks t1, …, tm, and that every task can be completed by several services with the same function. Suppose that after some redundant services have been selected, the availability of task t_j is A(t_j), and assume A(t_1) = min{A(t_j) | j = 1, …, m}. Let A(proc)_{t_j}^{s_j^i} denote the availability of the process after adding service s_j^i to task t_j. If candidate services s_j^i exist for every task t_j (j = 1, …, m) and the availability of every service s_j^i is the same, we have the following theorems.

Theorem 1. If the tasks t1, …, tm in Proc only satisfy the parallel and sequence modes, then A(proc)_{t_1}^{s_1^i} = max{A(proc)_{t_j}^{s_j^i} | j = 1, …, m}.

Proof 1. Because A(proc)_{t_j}^{s_j^i} = A(t_1) · A(t_2) · … · A(t_{j−1}) · (1 − (1 − A(t_j))(1 − A(s_j^i))) · A(t_{j+1}) · … · A(t_m) for j = 1, 2, …, m, for any 1 < j ≤ m we have

A(proc)_{t_1}^{s_1^i} − A(proc)_{t_j}^{s_j^i} = ((1 − (1 − A(t_1))(1 − A(s_1^i))) · A(t_j) − A(t_1) · (1 − (1 − A(t_j))(1 − A(s_j^i)))) · A(t_2) · … · A(t_{j−1}) · A(t_{j+1}) · … · A(t_m).

From the assumptions A(s_1^i) = A(s_j^i) and A(t_1) = min{A(t_j) | j = 1, …, m}, we have (1 − (1 − A(t_1))(1 − A(s_1^i))) · A(t_j) − A(t_1) · (1 − (1 − A(t_j))(1 − A(s_j^i))) = (A(t_j) − A(t_1)) · A(s_j^i) ≥ 0. So for any 1 < j ≤ m we have A(proc)_{t_1}^{s_1^i} ≥ A(proc)_{t_j}^{s_j^i}, that is, A(proc)_{t_1}^{s_1^i} = max{A(proc)_{t_j}^{s_j^i} | j = 1, …, m}.

Theorem 2. If the tasks t1, …, tm in Proc only satisfy the choice mode and the probability of every task execution is the same, then A(proc)_{t_1}^{s_1^i} = max{A(proc)_{t_j}^{s_j^i} | j = 1, …, m}. The proof of Theorem 2 is similar to the proof of Theorem 1 and is omitted.
Table 1. Formulae of composite service availability (formulae (2)–(5), one for each composition mode)
Using redundant services can improve the availability of composites. However, the cost of constructing composites will also increase. To improve the availability of the overall system, both quality and cost should be taken into consideration at the same time. The problem of choosing redundant services can be modeled as a nonlinear mixed integer programming problem. In the travel agent example, using the service redundancy approach we get the following objective function.

Objective: maximize

∏_{j=1}^{5} ( 1 − ∏_{i=1}^{n_j} (1 − A(s_j^i))^{y_j^i} )    (6)

where

y_j^i = 1 if service s_j^i is selected, and y_j^i = 0 if it is not selected, for i = 1, 2, …, n_j and j = 1, 2, …, 5.    (7)

The objective will differ according to the actual process of the composite service.

Cost Constraint

C_sum = ∑_{j=1}^{5} ∑_{i=1}^{n_j} y_j^i · c(s_j^i) ≤ C_const    (8)

where c(s_j^i) represents the cost of invoking service s_j^i, C_sum is the cost of all selected services and C_const is the constraint value of the cost. In this chapter we are not concerned with the costs of testing and maintenance.

Integer programming is an NP-hard problem. In a service composition process, if there are m tasks and n physical services to finish every task, the computational complexity in the worst case is O(2^{mn}). If the number of tasks and candidate services is small, we can use an exhaustive search algorithm to find the optimal result. But in fact, with the rapid acceptance of Web Service technology, the number of services with the same function will become enormous. At the same time, complicated services that need more partners to work together will become more and more popular. Therefore a more efficient algorithm is needed. To select redundant services, we design a heuristic algorithm based on Theorem 1. The basic idea is as follows: when choosing multiple physical services for tasks, we select a service for the task that has the lowest availability, and we choose the service with the best availability. The algorithm is called the HAF (High Availability First) algorithm and is shown in Box 2.

Box 2. HAF algorithm

1: arrange the s_j^i ordered by A(s_j^i), sorted by t_j;
2: select the s_j^i with the maximum A(s_j^i)  // modify y_j^i
3: calculate every A(t_j), A_SC, C_sum
4: arrange the t_j ordered by A(t_j)
5: for the task t_j with the lowest A(t_j) do
6:   for all unselected s_j^i that can complete task t_j do
7:     select the s_j^i with the largest A(s_j^i)
8:     if the constraint is violated then
9:       the last selected service is excluded from further consideration
10:      run Step 7
11:  end for
12:  if no service is available for task t_j then
13:    task t_j is excluded from further consideration
14:    run Step 3
15:  if the constraint is precisely satisfied then
16:    run Step 21
17:  else
18:    run Step 3
19: end for
20: if all tasks have been excluded then
21: compute A_SC, output A_SC and the y_j^i.

In the HAF algorithm (see Box 2), the input information includes the availability and price of the candidate services, the tasks and the relationships among these tasks. The output includes the selected services and the availability value of the process. The term A(t_j) denotes the availability of task t_j. The first part of the HAF algorithm (lines 1–2) arranges the candidate components ordered by availability, sorted by task, and the services with the best availability are selected. Then, for every task,
one service is selected. Next, the availability of every task, the availability of the process and the total cost of the selected services are calculated, and the tasks are arranged in order of their availability values (lines 3–4). In the third part (lines 5–7), for the task with the lowest availability, one more candidate service, with the highest availability, is selected. In the fourth part (lines 8–19), the algorithm checks whether the constraint function is violated. If it is violated (C_sum > C_const), the last selected service is excluded (line 9) and another service with the best availability in the remaining service set is chosen for this task (line 10). If the constraint function is precisely satisfied (C_sum = C_const), the availability of the process is calculated (line 16). If the constraint function is not violated (C_sum < C_const), the availability of the tasks is recalculated and the third part of the algorithm is repeated (line 18). In the last part (lines 20–21), the availability of the composite is computed and the selected services are output.
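A compact sketch of the HAF idea follows. It captures only the greedy core (the most available unused candidate is added to the least available task while the cost budget allows it) and deliberately omits the exact-budget shortcut and the task-exclusion bookkeeping of Box 2; all names and values are illustrative.

```java
import java.util.*;

public class HafSketch {
    record Candidate(int task, double availability, double cost) {}

    /** Availability of a task given its currently chosen redundancy group. */
    static double taskAvailability(List<Candidate> chosen, int task) {
        double allFail = 1.0;
        for (Candidate c : chosen) {
            if (c.task() == task) allFail *= (1 - c.availability());
        }
        return 1 - allFail;
    }

    static List<Candidate> select(int tasks, List<Candidate> pool, double budget) {
        List<Candidate> chosen = new ArrayList<>();
        double spent = 0;
        boolean progress = true;
        while (progress) {
            progress = false;
            // Find the task with the lowest current availability (Box 2, step 5).
            int worst = 0;
            for (int t = 1; t < tasks; t++) {
                if (taskAvailability(chosen, t) < taskAvailability(chosen, worst)) worst = t;
            }
            // Pick its most available unused candidate that fits the budget (steps 6-10).
            Candidate best = null;
            for (Candidate c : pool) {
                if (c.task() == worst && !chosen.contains(c) && spent + c.cost() <= budget
                        && (best == null || c.availability() > best.availability())) {
                    best = c;
                }
            }
            if (best != null) {
                chosen.add(best);
                spent += best.cost();
                progress = true;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Candidate> pool = List.of(
                new Candidate(0, 0.95, 2), new Candidate(0, 0.90, 1),
                new Candidate(1, 0.80, 1), new Candidate(1, 0.85, 2));
        System.out.println(select(2, pool, 5.0));
    }
}
```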
Structural Redundancy

Structural redundancy deals with composite service dependability at the business process level. After a business process is implemented by a composite service, the runtime data can be collected and analyzed to determine whether the business process is satisfactory. In terms of dependability, if a composite service cannot deliver dependable services no matter how its component services are selected, the problem probably lies in the business process itself. Thus we need an approach to evolve a business process to make it meet dependability requirements. In this section, we are mainly concerned with the availability attribute. A method called AvailEvo (Zeng, Sun, Liu, Deng, & Huai, 2010) is introduced to implement structural redundancy. Since structural redundancy is based on changing a business process, how to preserve the correctness of the process becomes an important issue. We start from the soundness of a business process.

Business Process Soundness

For process-aware composite services, the most important correctness criterion is structural soundness (or soundness for short; see the definition below) (W. v. d. Aalst & Hee, 2002; Rinderle, Reichert, & Dadam, 2004). For a complex business process it may be intractable to verify soundness; especially in online and time-critical scenarios, there is not enough time to do the verification. In a word, it is highly important and challenging to find an effective method that can satisfy structural redundancy requirements while preserving the soundness and functions of the original composite services. In AvailEvo, we adopt WF-nets to formally describe composite services. A WF-net is a special Petri net and possesses the advantages of a formal semantics definition, a graphical nature and analysis techniques.

WF-net (W. v. d. Aalst & Hee, 2002): A Petri net WFN = (P, T, F) is a WF-net (workflow net) if and only if:
• There is one source place i ∈ P such that •i = Φ;
• There is one sink place o ∈ P such that o• = Φ; and
• Every node x ∈ P ∪ T is on a path from i to o.
It should be noted that a WF-net specifies the dynamic behavior of a composite service. In a WF-net, tasks (component services) are represented by transitions, conditions are represented by places, and flow relationships specify the partial ordering of tasks. In addition, the state of a WF-net is indicated by the distribution of tokens among its places. We use a |P|-dimensional vector M to represent the state of a running procedure in a WF-net, where every element gives the number of tokens in a place. We denote the number of tokens in place p in state M by Mp. Specifically, we use M0 (Mend) to denote the initial state (final state), which has only one token in the source (sink) place. A transition t ∈ T is enabled in state M if and only if Mp>0 for every place p such that (p,t) ∈ F. If t is enabled, t can fire, leading to a new state M′ such that M′p = Mp−1 for each (p,t) ∈ F and M′p′ = Mp′+1 for each (t,p′) ∈ F, which is denoted by M [t> M′.

Soundness (W. v. d. Aalst & Hee, 2002): A procedure modeled by a WF-net WFN=(P, T, F, i, o, M0) is sound if and only if:

For every state M reachable from the initial state M0, there exists a firing sequence leading from state M to the final state Mend. Formally: ∀M: (M0 →* M) ⇒ (M →* Mend)
State Mend is the only state reachable from state M0 with at least one token in place o. Formally: ∀M: (M0 →* M ∧ M ≥ Mend) ⇒ (M = Mend)
There are no dead transitions in WFN. Formally: ∀t ∈ T, ∃M, M′ such that M0 →* M and M [t> M′
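Before examining these requirements, note that the enabling and firing rules defined above translate directly into code. A minimal sketch (hypothetical Java representation; places and transitions are plain string identifiers and a marking is a token-count map):

import java.util.*;

// Minimal marking interpreter for a WF-net; places and transitions are strings.
class WfNet {
    final Map<String, Set<String>> pre = new HashMap<>();  // t -> its input places
    final Map<String, Set<String>> post = new HashMap<>(); // t -> its output places

    // t is enabled in M iff Mp > 0 for every input place p of t.
    boolean enabled(String t, Map<String, Integer> m) {
        return pre.getOrDefault(t, Set.of()).stream()
                  .allMatch(p -> m.getOrDefault(p, 0) > 0);
    }

    // M [t> M': remove one token from each input place, add one to each output.
    Map<String, Integer> fire(String t, Map<String, Integer> m) {
        if (!enabled(t, m)) throw new IllegalStateException(t + " is not enabled");
        Map<String, Integer> next = new HashMap<>(m);
        for (String p : pre.getOrDefault(t, Set.of())) next.merge(p, -1, Integer::sum);
        for (String p : post.getOrDefault(t, Set.of())) next.merge(p, 1, Integer::sum);
        return next;
    }
}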
Soundness is an important correctness criterion that guarantees the proper termination of a composite service. The first requirement of the Soundness definition means that, starting from the initial state M0, it is always possible to reach the state with only one token in place o. The second requirement states that when a token is put in place o, all the other places should be empty. The last requirement states that there are no transitions that are dead from the initial state M0. Generally speaking, for a complex WF-net it may be intractable to decide soundness (Cheng, Esparza, & Palsberg, 1993). In this chapter we do not verify the soundness of an evolved composite service; instead, we focus on preserving soundness by using a basic set of change operations. In addition, all of the composite services discussed here,
modeled as WF-nets, are acyclic structures, and a component service (transition) appears only once in a composite service (WFN). We do not consider data flow and data constraint issues in this work. A set of change operations preserving soundness is defined, including replacement, addition, deletion, and process structural adjustments (Zeng, Huai, Sun, Deng, & Li, 2009). We show that a sound WF-net evolved with this set remains sound. The replacement operation replaces a sound sub WF-net of a sound WF-net by another sound WF-net.

Replacement Operation. Let WFN1=(P1,T1,F1,i1,o1,M01) be a sound sub net of a sound WF-net WFN=(P,T,F,i,o,M0) and WFN2=(P2,T2,F2,i2,o2,M02) be a sound WF-net such that P1∩P2=Φ, T1∩T2=Φ, F1∩F2=Φ. Then WFN'=(P', T', F', i', o', M0') is the WF-net obtained by replacing WFN1 with WFN2, such that P'=(P\P1)∪P2, T'=(T\T1)∪T2, F'=(F\F1)∪F2∪F", where F"={(x, i2)∈T×P2 | (x, i1)∈F}∪{(o2, y)∈P2×T | (o1, y)∈F}, and the initial state M0' is a |P'|-dimensional vector. Here S\S' denotes the difference between sets S and S', i.e., the set of elements that are in S and not in S'. (Figure 2)

We know that if a transition in a sound WF-net is replaced by another sound WF-net, the resulting WF-net is also sound (Theorem 3 in (W. M. P. v. d. Aalst, 2000)). Our replacement operation is an extension of this result, because a sound WF-net behaves like a transition. We therefore have the following conclusion:

Proposition 1. The replacement operation preserves the soundness of a sound WF-net.
Figure 2. Replacement
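As an illustration of how mechanical the replacement operation is, the following hedged Java sketch performs the set-level construction from the definition, assuming a sets-only net representation (hypothetical Net record) and assuming the sub net connects to the rest of the net only through its source and sink:

import java.util.*;

// Sets-only view of a WF-net: places, transitions, arcs (id pairs), source, sink.
record Net(Set<String> p, Set<String> t, Set<List<String>> f, String i, String o) {}

class Replace {
    // Replace the sound sub net n1 of 'net' by the sound net n2 (disjoint ids):
    // drop n1's elements, add n2's, and rewire boundary arcs from i1/o1 to i2/o2.
    static Net replace(Net net, Net n1, Net n2) {
        Set<String> p = new HashSet<>(net.p()); p.removeAll(n1.p()); p.addAll(n2.p());
        Set<String> t = new HashSet<>(net.t()); t.removeAll(n1.t()); t.addAll(n2.t());
        Set<List<String>> f = new HashSet<>();
        for (List<String> arc : net.f()) {
            if (n1.f().contains(arc)) continue;                   // internal arc of n1
            String x = arc.get(0), y = arc.get(1);
            if (y.equals(n1.i())) f.add(List.of(x, n2.i()));      // (x,i1) -> (x,i2)
            else if (x.equals(n1.o())) f.add(List.of(n2.o(), y)); // (o1,y) -> (o2,y)
            else f.add(arc);
        }
        f.addAll(n2.f());
        return new Net(p, t, f, net.i(), net.o());
    }
}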
Addition Operations. Let WFN1=(P1,T1,F1,i1,o1,M01) and WFN2=(P2,T2,F2,i2,o2,M02) be two sound WF-nets such that P1∩P2=Φ, T1∩T2=Φ, F1∩F2=Φ.

• Sequence_add (WFN1 → WFN2): WFN=(P, T, F, i1, o2, M0) is the WF-net obtained by sequence-adding WFN1 and WFN2, where P=(P1\{o1})∪P2, T=T1∪T2, F={(x,y)∈F1 | y≠o1}∪{(x, i2)∈T1×P2 | (x, o1)∈F1}∪F2, and M0 is a |P|-dimensional vector (whose first element is 1 and remaining elements are 0). (Figure 3a)
• Choice_add (WFN1+WFN2): WFN=(P, T, F, i1, o1, M0) is the WF-net obtained by choice-adding WFN1 and WFN2, where P=P1∪(P2\{i2, o2}), T=T1∪T2, F=F1∪{(x, y)∈F2 | x≠i2 ∧ y≠o2}∪{(i1, y)∈P1×T2 | (i2, y)∈F2}∪{(x, o1)∈T2×P1 | (x, o2)∈F2}, and M0 is a |P|-dimensional vector (whose first element is 1 and remaining elements are 0). (Figure 3b)
• Parallel_add (WFN1||WFN2): WFN=(P, T, F, i, o, M0) is the WF-net obtained by parallel-adding WFN1 and WFN2, where P=P1∪P2∪{i, o}, T=T1∪T2∪{tSPLIT, tJOIN}, F=F1∪F2∪{<i, tSPLIT>, <tSPLIT, i1>, <tSPLIT, i2>, <o1, tJOIN>, <o2, tJOIN>, <tJOIN, o>}, and M0 is a |P|-dimensional vector (whose first element is 1 and remaining elements are 0). (Figure 3c)

Figure 3. Addition

The addition operations show how to combine two sound WF-nets. The proposed addition operations sequentially, by choice, or in parallel add one sound WF-net to another sound WF-net; the new WF-net so obtained is obviously sound. Hence we have the following conclusion:

Proposition 2. The addition operations preserve the soundness of a sound WF-net.

Delete operations are the reverse of addition operations; we do not discuss them in detail due to space limitations. For process evolution, besides the above replacement, addition, and deletion operations, we sometimes need to deal with process structure adjustments. In this chapter, five basic structural adjustment operations are presented that allow modifying sound sub nets of a sound WF-net.

Sequence Structure Adjustments. Let WFN=(P, T, F, i, o, M0) be a sound WF-net consisting of WFN1=(P1,T1,F1,i1,o1,M01) and WFN2=(P2,T2,F2,i2,o2,M02) in sequence, such that P=P1∪P2, T=T1∪T2, F=F1∪F2, i=i1, o1=i2, o=o2 (Figure 4a).
Figure 4. Sequence Structure Adjustment
• Reversal_sequence adjustment: WFN′=(P, T, F, i′, o′, M0) is the WF-net obtained by the reversal sequence adjustment between WFN1 and WFN2, where i′=i2, o′=o1, o2=i1. (Figure 4b)
• Sequence_to_parallel adjustment: WFN||=(P||, T||, F||, i||, o||, M0||) is the WF-net obtained by the sequence_to_parallel adjustment between WFN1 and WFN2, where P||=P∪{i||, o||}, T||=T∪{tSPLIT, tJOIN}, F||=F∪{<i||, tSPLIT>, <tSPLIT, i1>, <tSPLIT, i2>, <o1, tJOIN>, <o2, tJOIN>, <tJOIN, o||>}, and M0|| is a |P|||-dimensional vector (whose first element is 1 and remaining elements are 0). (Figure 4c)
• Sequence_to_choice adjustment: WFN+=(P+, T+, F+, i+, o+, M0+) is the WF-net obtained by the sequence_to_choice adjustment between WFN1 and WFN2, where P+=(P\{i1, i2, o1, o2})∪{i+, o+}, T+=T, F+={(p, q)∈F | p≠i1, i2 ∧ q≠o1, o2}∪{(i+, q) | (i1, q)∈F1 ∨ (i2, q)∈F2}∪{(p, o+) | (p, o1)∈F1 ∨ (p, o2)∈F2}, and M0+ is a |P+|-dimensional vector (whose first element is 1 and remaining elements are 0). (Figure 4d)
The structure adjustment operations indicate how two sound sub WF-nets of a sound WF-net can be reassembled. The "parallel_to_sequence adjustment" and "choice_to_sequence adjustment" can be defined similarly; we omit them due to space limitations. Because the proposed structural adjustments act on two sound sequential sub WF-nets, the new WF-net obtained is clearly sound. Hence we have the following conclusion:

Proposition 3. The structural adjustment operations preserve the soundness of a sound WF-net.
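All of these change operations are simple set constructions. As one concrete example, here is a sketch of Parallel_add in the same hypothetical sets-only representation used earlier (the fresh identifiers i, o, tSPLIT, and tJOIN are assumed not to collide with existing ones):

import java.util.*;

record Net(Set<String> p, Set<String> t, Set<List<String>> f, String i, String o) {}

class ParallelAdd {
    // Parallel_add(n1 || n2): a fresh source i and sink o plus an AND-split and
    // AND-join transition, wired exactly as in the definition of Parallel_add.
    static Net parallelAdd(Net n1, Net n2) {
        Set<String> p = new HashSet<>(n1.p()); p.addAll(n2.p());
        p.add("i"); p.add("o");
        Set<String> t = new HashSet<>(n1.t()); t.addAll(n2.t());
        t.add("tSPLIT"); t.add("tJOIN");
        Set<List<String>> f = new HashSet<>(n1.f()); f.addAll(n2.f());
        f.add(List.of("i", "tSPLIT"));
        f.add(List.of("tSPLIT", n1.i()));
        f.add(List.of("tSPLIT", n2.i()));
        f.add(List.of(n1.o(), "tJOIN"));
        f.add(List.of(n2.o(), "tJOIN"));
        f.add(List.of("tJOIN", "o"));
        return new Net(p, t, f, "i", "o");
    }
}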
AvailEvo: A Structural Redundancy Method

The AvailEvo method is an extension and optimization of conventional planning-based dependability guarantee methods for service composition. The main procedure is as follows. First, by analyzing the execution logs of component services, the set of component services with low availability and the execution paths with the highest frequency are identified. From these, a basic execution sequence is acquired; certain component services in this sequence seriously affect the overall availability of the composite service. Second,
by analyzing the structural relationships between component services, a redundant path is obtained by expanding the basic execution sequence, and the original composite service is then evolved; that is, the redundant path is added into the composite service's process structure, yielding a new process structure. Finally, classical planning methods (e.g., HTN planning) are used to select new component services and to produce a new, executable composite service with high dependability. To introduce AvailEvo in detail, we first define Transition Path and Transition Relationships.

Transition Path. Let WFN=(P, T, F, i, o, M0) be a sound WF-net. A transition sequence (firing sequence) from the initial state M0 to the final state Mend is called a transition path (denoted by tp), with tp∈T*. In addition, the set containing all of the execution paths is called the Log over the WF-net (denoted by L).

Transition Relationships. Let WFN=(P, T, F, i, o, M0) be a sound WF-net and L be the Log over WFN, where a and b are transitions of WFN, i.e., a, b ∈ T:

• a is a predecessor of b on tp (or b is a successor of a on tp): there is a tp=t1t2…tn ∈ L such that ti=a, tj=b, j≥i+1; denoted a>tpb.
• a is a predecessor of b on L (or b is a successor of a on L): for each tp ∈ L, if a ∈ tp and b ∈ tp, then a>tpb; denoted a→Lb.
• a and b are parallel on L: there are tp, tp′ ∈ L such that a>tpb and b>tp′a; denoted a||Lb.
• a and b are choice on L: for each tp ∈ L, if a ∈ tp and b ∈ tp, then a≯tpb and b≯tpa; denoted a#Lb.
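Once the Log is available, the four relations can be checked mechanically. A hedged Java sketch (transition paths as lists of transition identifiers; all names are ours, for illustration only):

import java.util.List;

class Relations {
    // a >tp b: a occurs before b on the single path tp.
    static boolean before(List<String> tp, String a, String b) {
        int ia = tp.indexOf(a), ib = tp.lastIndexOf(b);
        return ia >= 0 && ib >= 0 && ia < ib;
    }

    // a ->L b: on every path containing both, a precedes b.
    static boolean precedes(List<List<String>> log, String a, String b) {
        return log.stream()
                  .filter(tp -> tp.contains(a) && tp.contains(b))
                  .allMatch(tp -> before(tp, a, b));
    }

    // a ||L b: some path orders a before b and some path orders b before a.
    static boolean parallel(List<List<String>> log, String a, String b) {
        return log.stream().anyMatch(tp -> before(tp, a, b))
            && log.stream().anyMatch(tp -> before(tp, b, a));
    }

    // a #L b: no path containing both orders them either way (with one
    // occurrence per transition, this means they never co-occur on a path).
    static boolean choice(List<List<String>> log, String a, String b) {
        return log.stream()
                  .filter(tp -> tp.contains(a) && tp.contains(b))
                  .noneMatch(tp -> before(tp, a, b) || before(tp, b, a));
    }
}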
The AvailEvo algorithm provides an implementation of the composite service evolution method with an availability guarantee. The input of the algorithm consists of a sound WF-net WFNO describing the
business process structure of a composite service, a transition path tp with the highest execution frequency (obtained by analyzing the execution log), and a component service set S that does not achieve the expected availability target. The output of the algorithm is a new, evolved composite service WFNN with redundancy paths. First, a basic transition sequence ts is easily obtained as the intersection of the transition sets contained in tp and in S. However, besides the transitions in S, a whole transition sequence ts needs to contain the necessary transitions in tp (see lines 1-2). Here, the special cases of an empty ts or a ts with only one transition are not considered, i.e., |ts|≥2. Thus, the fact that the availability of some components in ts has not reached the expected target results in a low availability of the whole composite service. As the sub WF-net composed of the transitions in ts is not always sound, evolving with ts as a redundancy path would destroy the execution semantics of the original composite service. Therefore, the core of the proposed algorithm is to find a minimum transition sequence ts', based on ts, such that the transitions contained in ts' compose a sound sub WF-net WFN' of the WF-net. In fact, the process of finding ts' is the process of extending ts. ts is extended by comparing the relations between adjacent transitions. Since ts is a directed sequence, it is first extended in the forward direction. The transitions (Set1) that have a parallel relation with ti and the transitions (Set2) that have a parallel relation with ti+1 are found. When Set1 and Set2 are not equal, ti and ti+1 are regarded as belonging to two different parallel structures, so all transitions in the two parallel structures need to be added into ts (here the split transition tSPLIT and the join transition tJOIN also need to be added). Then the transitions (Set3) that have a choice relation with ti and the transitions (Set4) that have a choice relation with ti+1 are computed. When Set3 and Set4 are not equal, ti and ti+1 are regarded as belonging to different choice structures, so the transitions in the same choice structures need to be added into ts (see lines 3 to 14). In a similar
way, the backward extension is carried out (see lines 15 to 23). Finally, the transition sequence tsFORWARD obtained by forward extension and the transition sequence tsBACKWARD obtained by backward extension are merged to obtain ts'. At the same time, a sound sub WF-net WFN' composed of the transitions in ts' is found in WFNO; then WFN' is made redundant using ts' (see lines 24 to 27) by applying the Choice_add operation from the soundness-preserving change operations. The key of the AvailEvo algorithm is finding redundancy paths according to the different relations between transitions. Determining the relation between transitions requires knowing the Log of the WF-net. It is difficult to compute the Log of a sound WF-net directly, so we convert the net into a reachability graph (Ye, Zhou, & Song, 2003). The Log can then be found by means of a Breadth-First-Search based
on the reachability graph. A necessary remark is that, in the worst case, the size of the reachability graph of a WF-net is exponential in the size of the WF-net, so the worst-case complexity of the AvailEvo algorithm is exponential (see Box 3). In addition, the main part of the AvailEvo algorithm consists of the forward and backward extension operations on the basic transition sequence ts, so the algorithm obviously terminates. During its operation, the algorithm effectively includes the sequence, parallel, and choice structures contained in ts. According to the literature (W. v. d. Aalst & Hee, 2002), the sub WF-net WFN' corresponding to the redundancy path ts' must be sound. Figure 5 illustrates how parallel and choice structures are handled when the AvailEvo algorithm constructs redundancy paths. Suppose that in the composite service shown in Figure 5a, the transition path with the highest
Box 3. AvailEvo Algorithm

input: WFNO=(PO, TO, FO, iO, oO, M0O), tp, S
output: WFNN=(PN, TN, FN, iN, oN, M0N)
1  begin
2    ts=SubPath(S, tp)
3    k=|ts|
4    tsFORWARD=ts
5    tsBACKWARD=ts
6    for(i=1; i≤k; i++)
7    { ti+1=NEXT(ti)
8      if ti+1≠NULL
9      { if (Set1={x∈T | ti||Lx}) ≠ (Set2={x∈T | ti+1||Lx})
10       { tsFORWARD=ADD_FORWARD(Set1∪{x∈T | {y∈T | x||Ly}==Set1})
11         tsFORWARD=ADD_FORWARD(Set2∪{x∈T | {y∈T | x||Ly}==Set2}) }
12       if (Set3={x∈T | ti#Lx}) ≠ (Set4={x∈T | ti+1#Lx})
13       { tsFORWARD=ADD_FORWARD({x∈T | {y∈T | x#Ly}==Set3})
14         tsFORWARD=ADD_FORWARD({x∈T | {y∈T | x#Ly}==Set4}) } } }
15   for(i=k; i≥1; i--)
16   { ti-1=PREVIOUS(ti)
17     if ti-1≠NULL
18     { if (Set1={x∈T | ti||Lx}) ≠ (Set2={x∈T | ti-1||Lx})
19       { tsBACKWARD=ADD_BACKWARD(Set1∪{x∈T | {y∈T | x||Ly}==Set1})
20         tsBACKWARD=ADD_BACKWARD(Set2∪{x∈T | {y∈T | x||Ly}==Set2}) }
21       if (Set3={x∈T | ti#Lx}) ≠ (Set4={x∈T | ti-1#Lx})
22       { tsBACKWARD=ADD_BACKWARD({x∈T | {y∈T | x#Ly}==Set3})
23         tsBACKWARD=ADD_BACKWARD({x∈T | {y∈T | x#Ly}==Set4}) } } }
24   ts'=SYNCRETIZE(tsFORWARD, tsBACKWARD)
25   WFN'=GENERATION(ts')
26   WFNN=Choice_add(ts' + WFN')
27 end
Figure 5. Example of constructing a redundancy path
execution frequency is t1t2t3t4t5 and the availability of t2 and t4 is poor; then the basic transition sequence ts is t2t3t4. Because the transitions in ts do not correspond to a sound sub WF-net, ts is obviously not a redundancy path. If ts were nevertheless treated as a redundancy path, an unsound evolved WF-net would result. As shown in Figure 5a, if the redundancy path t2′t3′t4′ is executed, the WF-net will be unable to terminate normally, because the new composite service
after evolution is unsound. In Figure 5b, t2′t3′t4′ is still regarded as a redundancy path, and the Choice_add operation is executed with a sound sub net of the WF-net. In this case, when the redundancy path is executed, transition t5 is skipped; obviously, the functions of the original composite service are destroyed. According to the way the AvailEvo algorithm processes parallel structures, the "join" transition t5 of the
parallel structure is added. Finally, the redundancy path becomes t2t3t4t5. This redundancy path corresponds to a sound sub net, and the functions of the new composite service are consistent with those before evolution when the redundancy path is executed (see Figure 5c). Dealing with a choice structure is correspondingly simple. Suppose that in the composite service shown in Figure 5d the transition path with the highest execution frequency is t1t2t6t4t7t5 and the availability of t6 and t7 is poor; then the basic transition sequence ts is t6t4t7, which is also the redundancy path. In this case, no other transitions share t6's choice relation (i.e., a choice relation with t3); if there were such transitions, they would need to be added into the redundancy path (see Figure 5d).
CONCLUSION AND FUTURE WORK

In this chapter, we are concerned with the dependability issue in service-oriented software applications. Service composition is widely considered an effective method to implement a service-oriented application; thus the dependability of a composite service becomes the core problem in achieving dependable service-oriented applications. We analyze the challenges of achieving a dependable composite service. Unlike traditional software applications, the dependability of service-oriented software is largely affected by the uncertainty of component services and runtime environments. Moreover, as component services are provided by third-party providers, it is almost impossible to modify an undependable component service. We find that component services and the business process structure are the two important factors that impact the dependability of the corresponding composite service. Redundancy is a typical and effective method to deal with dependability issues in distributed systems. We propose a two-level redundancy framework for dependable composite services. At the component service level, we study three redundancy mechanisms, including active, passive,
and hybrid redundancy. In particular, we focus on how to determine the number of backup component services so as to comply with cost constraints. We also propose an adaptive control mechanism using a Kalman filter and a Markov model to guarantee the consistent dependability of a composite service. At the business process level, we propose structural redundancy to deal with a possibly fragile business process structure. In practice, one important problem is how to integrate the two redundancy mechanisms to improve composite service dependability. In fact, the two mechanisms can complement each other to form a complete redundancy solution. In KAF, one assumption is that there are abundant component services to select from, which is not always true in practical service-oriented systems. Given an expected dependability value Ok, it is possible that a real system cannot meet the dependability requirement. In such a situation, an alternative approach is to change the business process structure, i.e., to adopt structural redundancy. When a redundant execution path is added to a composite service, more component services can be used to improve its dependability. Our present work provides a general framework and some relevant ideas to address the dependability issue in service-oriented applications. As a matter of fact, a few open problems remain along this direction. For component service redundancy, more comprehensive and finer-grained QoS models for composite services need to be investigated further, e.g., models that define parameters in a more flexible way. On the other hand, richer composition models that go beyond the general modes of sequence, parallel, choice, and iteration, e.g., more complex relationships such as compensation and transaction, should be considered. As for structural redundancy, our present work requires that a business process segment have the same structure as its backups. When service resources become more abundant, there may be heterogeneous business process segments satisfying the same function.
Therefore, how to discover and match these heterogeneous business process segments is an open problem for the structural redundancy method.
REFERENCES

Aalst, W. d., & Hee, K. v. (2002). Workflow Management: Models, Methods, and Systems. Cambridge, MA: The MIT Press.

Aalst, W. M. P. d. (2000). Workflow Verification: Finding Control-Flow Errors Using Petri-Net-Based Techniques. In Business Process Management: Models, Techniques, and Empirical Studies (Vol. 1806, pp. 161–183). Berlin: Springer-Verlag.

Avizienis, A., Randell, B., & Landwehr, C. (2004). Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33. doi:10.1109/TDSC.2004.2

Baresi, L., & Guinea, S. (2006). Towards Dynamic Web Services. Paper presented at the 28th International Conference on Software Engineering (ICSE).

Barrett, P. A., Hilborne, A. M., Veríssimo, P., Rodrigues, L., Bond, P. G., Seaton, D. T., et al. (1990). The Delta-4 Extra Performance Architecture (XPA). Paper presented at the 20th International Symposium on Fault-Tolerant Computing (FTCS).

Birman, K., Renesse, R. v., & Vogels, W. (2004). Adding High Availability and Autonomic Behavior to Web Services. Paper presented at the International Conference on Software Engineering (ICSE 2004).

Budhiraja, N., Marzullo, K., & Schneider, F. B. (1992). Primary-backup protocols: Lower bounds and optimal implementations. Paper presented at the Third IFIP Conference on Dependable Computing for Critical Applications.
Cardoso, J. (2002). Quality of Service and Semantic Composition of Workflows. Unpublished PhD thesis, University of Georgia.

Casati, F., Ilnicki, S., Jin, L.-J., Krishnamoorthy, V., & Shan, M.-C. (2000). eFlow: A Platform for Developing and Managing Composite e-Services. Paper presented at the Academia/Industry Working Conference on Research Challenges (AIWORC'00).

Cheng, A., Esparza, J., & Palsberg, J. (1993). Complexity Results for 1-safe Nets. Foundations of Software Technology and Theoretical Computer Science, 761, 326–337.

Goyeva-Popstojanova, K., Mathur, A. P., & Trivedi, K. S. (2001). Comparison of Architecture-Based Software Reliability Models. Paper presented at the 12th International Symposium on Software Reliability Engineering.

Guinea, S. (2005). Self-healing web service compositions. Paper presented at ICSE 2005.

Guo, H., Huai, J., Li, H., Deng, T., Li, Y., & Du, Z. (2007). ANGEL: Optimal Configuration for High Available Service Composition. Paper presented at the 2007 IEEE International Conference on Web Services (ICWS).

Guo, H., Huai, J., Li, Y., & Deng, T. (2008). KAF: Kalman Filter Based Adaptive Maintenance for Dependability of Composite Service. Paper presented at the International Conference on Advanced Information Systems Engineering (CAiSE).

Hamadi, R., & Benatallah, B. (2003). A Petri Net-based Model for Web Service Composition. Paper presented at the Fourteenth Australasian Database Conference (ADC2003).

Harney, J., & Doshi, P. (2006). Adaptive web processes using value of changed information. Paper presented at the International Conference on Service Oriented Computing (ICSOC).
Harney, J., & Doshi, P. (2007). Speeding up adaptation of web service compositions using expiration times. Paper presented at the International World Wide Web Conference (WWW).
Schneider, F. B. (1990). Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys, 22, 299–319. doi:10.1145/98163.98167
Haykin, S. (2002). Adaptive Filter Theory (4th ed.). Pearson Education.
Solanki, M., Cau, A., & Zedan, H. (2006). ASDL: A Wide Spectrum Language For Designing Web Services. Paper presented at the 15th International Conference on the World Wide Web (WWW’06).
Kai-Yuan, C. (1995). Base of Software Reliability Engineering. Beijing: Tsinghua University Press.

Liu, Y., Ngu, A. H., & Zeng, L. Z. (2004). QoS computation and policing in dynamic web service selection. Paper presented at the International World Wide Web Conference (WWW).

Mennie, D., & Pagurek, B. (2000). An Architecture to Support Dynamic Composition of Service Components. Paper presented at the Workshop on Component-Oriented Programming (WCOP).

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience.

Ramamoorthy, C. V., & Bastani, F. B. (1982). Software Reliability - Status and Perspectives. IEEE Transactions on Software Engineering, SE-8(4), 354–371. doi:10.1109/TSE.1982.235728

Ran, S. (2003). A model for web services discovery with QoS. ACM SIGecom Exchanges, 4.

Rinderle, S., Reichert, M., & Dadam, P. (2004). Correctness criteria for dynamic changes in workflow systems--a survey. Data & Knowledge Engineering, 50, 9–34. doi:10.1016/j.datak.2004.01.002

Salas, J., Pérez-Sorrosal, F., Patiño-Martínez, M., & Jiménez-Peris, R. (2006). WS-Replication: A Framework for Highly Available Web Services. Paper presented at the 15th International Conference on the World Wide Web (WWW'06).
Sun, H., Wang, X., Zhou, B., & Zou, P. (2003). Research and Implementation of Dynamic Web Services Composition. Paper presented at Advanced Parallel Processing Technologies (APPT).

Tartanoglu, F., Issarny, V., Romanovsky, A., & Levy, N. (2003). Coordinated forward error recovery for composite Web services. Paper presented at the IEEE Symposium on Reliable Distributed Systems (SRDS).

Tsai, W. T., Zhang, D., Chen, Y., Huang, H., Paul, R., & Liao, N. (2004). A Software Reliability Model for Web Services. Paper presented at the 8th IASTED International Conference on Software Engineering and Applications.

Verma, K., Sivashanmugam, K., Sheth, A., Patil, A., Oundhakar, S., & Miller, J. (2005). METEOR-S WSDI: A Scalable P2P Infrastructure of Registries for Semantic Publication and Discovery of Web Services. Information Technology Management, 6(1), 17–39. doi:10.1007/s10799-004-7773-4

Ye, X., & Shen, Y. (2005). A Middleware for Replicated Web Services. Paper presented at the IEEE International Conference on Web Services (ICWS '05).

Ye, X., Zhou, J., & Song, X. (2003). On reachability graphs of Petri nets. Computers & Electrical Engineering, 29(2), 263–272. doi:10.1016/S0045-7906(01)00034-9
Zeng, J., Huai, J., Sun, H., Deng, T., & Li, X. (2009). LiveMig: An Approach to Live Instance Migration in Composite Service Evolution. Paper presented at the 2009 IEEE International Conference on Web Services (ICWS).
Zeng, J., Sun, H., Liu, X., Deng, T., & Huai, J. (2010). Dynamic Evolution Mechanism for Trustworthy Software Based on Service Composition [in Chinese]. Journal of Software, 21(2), 261–276. doi:10.3724/SP.J.1001.2010.03735

Zhang, L.-J., Zhang, J., & Cai, H. (2007). Services Computing. Beijing: Tsinghua University Press.
Chapter 14
Building Web Services with Time Requirements Nuno Laranjeiro University of Coimbra, Portugal Marco Vieira University of Coimbra, Portugal Henrique Madeira University of Coimbra, Portugal
ABSTRACT

Developing web services with timing requirements is a difficult task, as existing technology does not provide standard mechanisms to support real-time execution, or even to detect and predict timing violations. However, in business-critical environments, an operation that does not conclude on due time may be completely useless, and may result in service abandonment, reputation damage, or monetary losses. This chapter presents a framework that allows deploying web services with temporal failure detection and prediction capabilities. Detection is based on timing restrictions defined at execution time, while historical data is used for failure prediction by prediction modules. Additional modules can be added to the framework to provide more advanced failure detection and prediction capabilities. The framework enables providers to easily develop and deploy time-aware web services, with the failure detection code decoupled from the application logic, and allows consumers to express their timeliness requirements.
INTRODUCTION

Web services offer a clear interface connecting providers and consumers and are increasingly considered adequate strategic vehicles for the exchange and distribution of data. Web service consumers and providers interact by means of
SOAP messages, using a protocol that, jointly with WSDL and UDDI, is the core of the web services technology (Curbera et al., 2002). The development of web services that can handle timeliness requirements is a complex task. In fact, current technology, development models, and programming tools for web services do not offer simple support to assure timeliness attributes during execution. Despite the fact that some
transactional models offer support for detecting operations that take longer than the desired time (Elmagarmid, 1992), they frequently require a huge programming effort. In reality, programmers must choose the proper middleware (which includes a transaction manager adequate to the specificities of the deployment environment), create extra code to manage the transactions, state their attributes, and implement the support for timing requirements. Transactions are well suited to typical transactional behaviors, but they are not a simple mechanism for creating and deploying time-aware services. Thus, transactions' support for timing failure detection is poor and, additionally, they provide no support for timing failure prediction. Although techniques and tools to easily create time-aware web services are currently lacking, there has been a steady increase in the number of real applications that must satisfy this kind of requirement. Frequently, programmers handle these issues by implementing improvised solutions to support timing failures (obviously, this can be costly and failure prone). Time, as a concept, has been completely missing from standard web services development models. Furthermore, relevant characteristics such as the detection and prediction of temporal failures have been completely disregarded, particularly if we consider that services are typically deployed over unreliable and open environments. In these conditions, services typically exhibit high or very unpredictable execution times. High execution times are frequently linked to the serialization procedure required in each web service invocation. In this process, a large amount of protocol information has to be sent in each message, even if the message's content is very small (i.e., the SOAP protocol involves a large quantity of data to encapsulate the application data to be transmitted). This serialization procedure is very significant, particularly considering that one web service can also act as a client of another web service, thus doubling
the end-to-end serialization effort. Unpredictable execution times are mostly associated with the use of unreliable, sometimes slow, transport channels (i.e., the Internet) for client-to-server and server-to-server communication. Such features make it difficult for programmers to deal with timeliness requirements. Considering timing requirements, two outcomes are possible for a web service execution: either the service produces an answer on time, or it does not. In either case the client must wait for the execution to finish or for the deadline to be violated (in the latter case, some timing failure detection mechanism must be developed). Nevertheless, it is frequently possible to forecast the occurrence of timing failures ahead of time. Indeed, execution history can often be used to predict, with a certain degree of confidence, whether a response will be obtained on time. This is of utmost importance not only for client applications, which can use alternative services or retry, but also for servers. These, in turn, can use historical information to adequately administer the resources allocated to each operation (e.g., an operation that is predicted to be useless can be aborted or executed in a degraded mode). In this chapter we discuss the problem of timing requirements in web services. Besides defining the concepts, we present an extensible server-side framework that offers detection and prediction functionalities (wsTFDP: Web Services Timing Failures Detection and Prediction). This framework provides a ready-to-use timing failure detection mechanism that, based on transparent code instrumentation, is able to collect historical data that can be used for prediction. The framework also provides a prediction component that implements Dijkstra's shortest path algorithm, which uses historical execution time values to determine whether execution can terminate on time. The framework can easily be extended with multiple components implementing different prediction algorithms. By using wsTFDP, developers are able to easily plug in their
preferred prediction algorithm and fine-tune it in the way that best fits the deployment environment. In this chapter we also present a concrete example of the framework's usage and an experimental campaign conducted to demonstrate the effectiveness of timing failure detection and prediction in web services. The framework is ready to be used by developers and system administrators for building web services with timing requirements. It also provides a base setup for future research on web services timing failures, by enabling the integration of different prediction algorithms. The chapter is organized as follows. The next section presents background on web services and related work on timing failures and prediction algorithms. In the follow-up sections, we present the architecture of the framework and detailed explanations of its detection and prediction modules. We then describe how developers can use the framework in practice and present an experimental demonstration. Finally, we summarize the work presented and conclude the chapter.
BACKGROUND AND RELATED WORK

The web services framework is divided into three major areas: communication protocols, service
descriptions, and service discovery. The main specifications for each area, SOAP, WSDL, and UDDI, are all XML-based. XML is now firmly established as a language that enables information and data encoding, platform independence, and internationalization (Curbera et al., 2002). SOAP (Simple Object Access Protocol) is a messaging protocol that can be used along with existing transport protocols, such as HTTP, SMTP, and XMPP. WSDL (Web Services Description Language) is used to describe a web service as a collection of communication endpoints that can exchange particular types of messages. That is, a WSDL document describes the interface of the service and provides users (e.g., the web service's clients) with a point of contact. Finally, UDDI (Universal Description, Discovery, and Integration) offers a unified way to find service providers through a centralized registry (Curbera et al., 2002). Figure 1 presents a typical web services environment. In each interaction the consumer (client) sends a SOAP request message to the provider (the server). After processing the request, the server sends a response message back to the client with the results. A web service may include several operations (in practice, each operation is a method with zero or more input parameters). In real-time systems, the correctness of the system depends not only on the logical result of the computation but also on the time at which the
Figure 1. Typical web services environment
results are produced (Stankovic, 1988). Nowadays, web services are frequently used in time-critical business and enterprise applications (Amazon, 2010), and in these environments a correct output that is not produced before a deadline may be completely useless. Real-time systems can have their time limit restrictions specified in two ways, which differ in the amount of flexibility granted to the software system regarding the fulfillment of the timing requirements. Hard real-time deadlines are those whose correctness depends on the logical result of the computation as well as on the time at which those results are produced; a failure occurs whenever a hard deadline is missed. On the other hand, soft real-time deadlines also depend on time constraints; however, when the system is unable to meet a given time limit, a failure does not necessarily occur (Pekilis & Seviora, 1997). This applies to many kinds of applications, such as videoconferencing, where some frame or packet delays (or even losses) are permissible and may even go unnoticed by the user (Sha et al., 2004). This kind of behavior may be desired or sufficient for web service clients, and must also be supported by providers. Fundamental requirements of time-critical systems are supported by two important properties: dependability and predictability (Halang, Gumzej, Colnaric, & Druzovec, 2000). Several classic dependability issues, such as availability or reliability, have been researched in service-based environments (Long, Carroll, & Park, 1990; Looker, Munro, & Jie Xu, 2005; May, 2009; Salas, Perez-Sorrosal, Patiño-Martínez, & Jiménez-Peris, 2006). However, additional aspects directly related to temporal properties, such as accurate temporal failure detection and prediction, remain to be explored. The problem of timing failure detection in database-centric applications is discussed in (Vieira, Costa, & Madeira, 2006). The authors
propose a transaction programming approach to help developers in programming database applications with time constraints. Three classes of transactions are considered concerning temporal requirements: transactions with no temporal requirements (typical ACID transactions), transactions with strict temporal requirements, and transactions with relaxed temporal requirements. The proposed approach implements these classes of transactions by allowing concurrent detection of timing failures during transaction execution. Timing failure detection can be performed at the database clients' interface, in the database server, or in a distributed manner. A performance benchmark for real-time database applications is used to validate the approach and to show the advantage of timing failure detection. The results presented show the usefulness of a new transaction programming approach aimed at supporting timing specifications for the execution of transactions. (Pekilis & Seviora, 1997) propose that the detection of temporal failures in real-time software should be done by a separate unit in charge of observing the inputs and outputs of the target software. The authors claim that automatic detection of such failures is influenced by state dependencies, which require the unit to track the target's state and the elapsed time. A black-box approach is proposed for detecting real-time failures and quality of service degradations of session-oriented real-time software. The approach combines the target software's formal behavioral specification with its response time specifications into a new model, the timepost model. The temporal failure detection unit interprets this model and uses it for target state tracking and for determining the start and stop times of timing measurements. The approach was evaluated experimentally by testing a small private branch telephone exchange capable of supporting multiple telephones and simultaneous calls. Compared to usual white-box approaches, it was found that very complex and specific timing intervals could be tracked and measured.
Several works have been proposed that focus on prediction using data-driven approaches. Among them are support vector machines (SVM), which in this context have been used mainly for software reliability prediction. An SVM is a learning system that uses a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory (Bo & Xiang, 2007). A relevant aspect of that work is its discussion of the issues that affect prediction accuracy, including whether all historical failure data should be used and what type of failure data is more appropriate in terms of prediction accuracy. The prediction accuracy of software reliability prediction models based on SVM and on artificial neural networks is also compared in this work. Local prediction schemes use only the most recent information (and ignore information carried by older data). As a result, the accuracy of local prediction schemes may be limited. Considering this, a novel prediction approach, termed the Markov–Fourier gray model, is proposed in (Shun-Feng Su, Chan-Ben Lin, & Yen-Tseng Hsu, 2002). The approach builds a gray model from a set of the most recent data, and a Fourier series is used to fit the residuals produced by this gray model. Then, Markov matrices are employed to encode possible global information, also generated from the residuals. The authors conclude that the global information encoded in the Markov matrices can provide useful information for predictions. A framework that incorporates multiple prediction algorithms to enable navigation of autonomous vehicles in real-life, on-road traffic situations is presented in (Kootbally, Madhavan, & Schlenoff, 2006). At the lower levels, the authors use estimation-theoretic short-term predictions over sensor data to predict the future location of moving objects with an associated confidence measure. At the higher levels, the authors use a
long-term situation-based probabilistic prediction using spatiotemporal relations and situation recognition. Interesting conclusions include the identification of the different time periods in which each of the two algorithms provides better estimates, demonstrating the ability to use the results of the short-term prediction algorithm to strengthen or weaken the estimates of the long-term prediction algorithm at different time periods. The importance of the time dimension in real-time systems is emphasized in (Halang et al., 2000). In fact, this relation is expressed by two fundamental user requirements: timeliness and simultaneity, which must be fulfilled even under high load conditions. The authors also advocate that predictability and dependability supplement the former two requirements, and highlight the importance of several qualitative performance criteria (over quantitative criteria), from which we would like to emphasize the following: timeliness – the ability to meet all deadlines; permanent readiness; behavioral predictability; simplicity; graceful degradation upon malfunctions; and portability. Although it is possible to find several works on timing failures and on the use of prediction algorithms in multiple circumstances, these problems still remain open in the highly dynamic web services context. In this chapter we focus precisely on the issues associated with timing failure detection and prediction in web services.
FRAMEWORK FOR DEPLOYING WEB SERVICES WITH TIMING REQUIREMENTS

The goal of the framework presented in this chapter (wsTFDP: Web Services Timing Failures Detection and Prediction) is to provide client applications with the possibility of invoking web services in a timely manner. This implies the creation of a mechanism that is able to abort, not execute, or gracefully degrade the execution of operations that are unable to match the client's
timing requirements. Besides failure detection, wsTFDP enables collecting accurate runtime data that can be used for failure prediction. To be useful in real web services development scenarios, a timing failure detection and prediction mechanism must achieve a key set of quality attributes. Thus, before describing the mechanism we should understand the key objectives:

• Effective detection: the mechanism must provide low detection latency (i.e., it must be able to detect failures on due time) and achieve a very low number of false positives (i.e., it should report only failures that are indeed temporal failures). The implementation goal, defined based on experimental knowledge of web service-based environments, was to keep the detection latency below 100 ms and the detection false-positive rate under 5%.
• High prediction accuracy: there must be a very low number of false positives during the prediction process. In practice, such a low rate means that wsTFDP only rarely forecasts a failure that, in fact, will not happen. The false-negative rate (i.e., failures that are not predicted) must be kept low as well; obviously, it is preferable to fail less when forecasting failures. Based on empirical knowledge, we aimed to offer a mechanism able to keep at least one of the mentioned rates under 5% (on average). We chose a single limit as our goal, because lowering one of these rates typically increases the other.
• Low overhead: a service that is able to detect and predict timing failures executes more work than a basic service. For our mechanism, the goal was to keep the overhead under a maximum of 100 ms in terms of response time.
• Easy to use: developers tend to avoid complexity and non-portability of code, tools, or deployment. It is important that a time-aware service does not require large changes to existing code, or developers (consumers or providers) will refrain from using the mechanism. Similarly, when creating new services, the development model must be kept as close as possible to common practices.
• Generic: the system must be generic, as that stimulates reuse in distinct environments. The goal is to provide features so that the mechanism can be used outside typical web service applications (e.g., in Remote Method Invocation (RMI) methods).
• Extensible: the last objective is to provide extension features so that a programmer can add prediction components implementing more precise prediction algorithms.
The wsTFDP framework, depicted in Figure 2, is a server-side mechanism that, at runtime, transparently detects timing failures, collects performance data, and predicts, at specific points, whether a service invocation can conclude on time. The framework is currently implemented in Java (an implementation using a different language can easily be provided) and uses a widely known Aspect Oriented Programming (AOP) (Kiczales et al., 1997) framework, AspectJ (Eclipse Foundation, 2008), to instrument web services' bytecode in a way that is completely transparent to the developer. All the logic related to timing failure detection and prediction is in fact an isolated package that can be merged into any application by using AspectJ. This procedure consists of compiling the candidate program and the wsTFDP component into a single time-aware Java application (using AspectJ's weaver). All wsTFDP logic is injected at particular points (described below) in the target program. The use of AOP involves
understanding some important concepts (SpringSource, 2010), namely:

• Aspect: a concern that cuts across multiple objects.
• Join point: a point during the execution of a program (e.g., the execution of a method or the handling of an exception).
• Advice: the action taken by an aspect at a particular join point. Types of advice include 'before', 'after', and 'around' (i.e., before and after).
• Pointcut: a predicate that matches join points. An advice is associated with a pointcut expression and runs at any join point matched by the pointcut (e.g., the execution of a method with a certain name).
Using AOP allows adding cross-cutting concerns to any program non-intrusively. For instance, it allows defining a single detection component to be used in different web service operations of a particular application. The mechanism described in this chapter uses an around advice (it enables us to have control before and after our join point) and a pointcut that basically matches all methods marked with @WebMethod that have a parameter of type TimeRestriction. This TimeRestriction parameter is defined by the wsTFDP mechanism and is used by service consumers to specify timeout values (see usage details in Section 'Practical Usage of wsTFDP'). On the service side, with this configuration, we can intercept all web service calls for which we want to detect or predict timing failures. At the moment of interception, wsTFDP takes multiple steps to determine whether execution is on time or whether a deadline violation has occurred or will predictably occur. The whole procedure is described in detail in the following paragraphs and summarized in Figure 2.
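For illustration, a pointcut/advice pair along the lines just described could look as follows in AspectJ's annotation style. The pointcut expression and surrounding scaffolding are our assumptions rather than the authors' verbatim code, and TimeRestriction is assumed to be wsTFDP's type on the classpath:

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class TimingInterceptionAspect {

    // Intercept the execution of any @WebMethod whose last parameter is a
    // TimeRestriction (wsTFDP's type carrying the client's timing values).
    @Around("execution(@javax.jws.WebMethod * *(..)) && args(.., restriction)")
    public Object aroundWebMethod(ProceedingJoinPoint jp, TimeRestriction restriction)
            throws Throwable {
        long start = System.nanoTime();
        // 'Before' part of the around advice: start detection / consult the
        // prediction manager with the restriction's timeout and confidence.
        try {
            return jp.proceed(); // run the intercepted business logic
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // 'After' part: hand the measured time to the metrics collector.
        }
    }
}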
Incoming client requests must include a desired maximum service execution time and a confidence value for prediction. Such requests are handled as follows (see also Figure 2): the service's entry point is instrumented to start a detection component (1) that waits for a response on a blocking queue (2). Control is then transferred to a prediction manager component (3), which executes the configured prediction module (4) over a set of historical metrics (collected and managed by a metrics collector (5)) and returns a predicted execution time for the client request. Based on this information, the prediction manager decides whether the application's business logic should be executed or whether, considering the client's expected deadline, it is probable that a timing violation will occur (according to the historical data and to the confidence value defined by the client). This decision essentially has two outcomes: if a timing violation is predicted, a well-known exception is placed in the blocking queue; otherwise, the execution of the service's business logic is started. In both cases an object is placed in the blocking queue, and a signal is delivered to the detection component (which is blocked on the queue). At that point, the detection component retrieves the object from the queue and delivers the result to the client application. It is important to emphasize that the detection component can run without the prediction module (by changing the framework's configuration). In this case, prediction is disabled at all points, but the detection component continues to operate normally, indicating any occurring failure on time (see Section 'Detecting Timing Failures' for a detailed description of the detection component). Service applications may include requests to other time-demanding services, such as external web services (e.g., payment gateways, business partners, etc.). So, when service execution is allowed (6), our framework transparently intercepts those calls (7) to predict, based on the client-set values and the available remaining time (and before allowing execution to proceed), the probability of occurrence of a timing failure during the external service execution.
Figure 2. wsTFDP extensible architecture
This process is equal to the one applied at the service's entry point; however, by having intermediate prediction points, it is possible, after execution has started, to eagerly inform the client that the call may violate the time limit. For instance, an event may unexpectedly delay the processing of the operation beyond the client's limit (e.g., the service may use an external service that is responding slowly). Keep also in mind that, at any time, the detection module may deliver a 'TimeExceeded' exception to the client if it detects that the desired time has expired. An important aspect is that the framework's architecture enables connecting more accurate, or better-performing, prediction modules as desired by developers. For instance, it is easy to plug in one of Weka's (Frank et al., 2010) many data analysis algorithms. Currently, the general procedure consists of adding the predictor's jar file to the project and configuring the predictor module in the wsTFDP project's descriptor. At each prediction point, the available historical data can be used in a manner similar to that of the default Dijkstra's shortest path (ShP) module implemented by the ShP predictor. The following sections detail the implemented detection and prediction modules.
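Although wsTFDP's actual plug-in API is not shown here, the contract can be as small as a single method. A hypothetical sketch of what a predictor module and a naive implementation could look like (all names are ours, for illustration only):

import java.util.List;

// Hypothetical plug-in contract: estimate the execution time (in ms) of the
// remaining part of an operation, at the confidence level chosen by the client.
interface PredictionModule {
    long predictMillis(List<Long> historicalTimesMs, double confidence);
}

// A naive example module: return the empirical quantile given by 'confidence'.
class QuantilePredictor implements PredictionModule {
    @Override
    public long predictMillis(List<Long> times, double confidence) {
        if (times.isEmpty()) return 0; // no history yet: predict optimistically
        List<Long> sorted = times.stream().sorted().toList();
        int idx = (int) Math.ceil(confidence * sorted.size()) - 1;
        return sorted.get(Math.max(0, Math.min(idx, sorted.size() - 1)));
    }
}

A module like this would be dropped in as a jar and named in the project descriptor; the prediction manager then compares the returned estimate against the client's remaining deadline and fails fast when the estimate exceeds it.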
DETECTING TIMING FAILURES

Figure 3 illustrates the internal design of the failure detection component and the sequence of events that occur at the time of interception of a web service call. The horizontal solid lines represent a thread that is in a runnable state and performing actual work; dashed lines represent a thread that is waiting for some event. At the time of interception, each thread present in the container (parent) spawns a new thread (child). This new thread is responsible for performing the actual web service work and setting the final result in its blocking queue. Right after the child thread is started, the parent attempts to retrieve the final result from the child's blocking queue. As no result is available at that point, the parent thread waits for a given time period (the timeout specified by the service consumer) for an element to become available on the queue. During this time period, no polling is done on the blocking queue (reducing the impact of the mechanism). Instead of polling, the parent waits for one of the following events to occur:

Figure 3. Temporal failure detection mechanism

• A signal to execute the object removal. This occurs when a put operation is executed on the blocking queue. The put operation has the effect of signaling any waiting thread (on that particular queue) to instantly stop waiting and proceed with the removal of the object (this object is essentially the result of executing the web service).
• The waiting time is depleted. After the time expires, the container (parent) thread leaves the waiting state and continues to execute, ignoring any results later placed in the blocking queue. The result of this process is an exception, signaling the occurrence of a timing failure, that is thrown to the service consumer.
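The parent/child protocol just described maps naturally onto java.util.concurrent. A minimal sketch of the idea (hypothetical names; in wsTFDP the equivalent logic is woven in transparently by AspectJ):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

class TimedInvocation {
    // Run 'work' in a child thread; the parent blocks on the queue for at most
    // timeoutMs. Assumes 'work' returns a non-null result.
    static Object invoke(Callable<Object> work, long timeoutMs) throws Exception {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(1);
        Thread child = new Thread(() -> {
            try {
                queue.offer(work.call());   // normal result
            } catch (Exception e) {
                queue.offer(e);             // exceptions travel as queue objects
            }
        });
        child.start();
        // Timed wait, no polling: poll() returns as soon as the child puts an
        // object on the queue, or null once the consumer's timeout expires.
        Object result = queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        if (result == null) throw new Exception("TimeExceeded"); // timing failure
        if (result instanceof Exception e) throw e; // parent re-throws child's exception
        return result;
    }
}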
We can expect two distinct results from the execution of the child thread: we may obtain a regular result (i.e., the result defined as the return type in the signature of the method) or an exception (either a declared or a runtime exception). In any case, an object is set in the blocking queue (exceptions are caught by the child thread and also set in the queue as normal objects). Whenever the parent takes an object from the blocking queue, the type of the object is verified. For a normal object, execution proceeds and the result is delivered to the consumer. However, if the object retrieved from the queue is an instance of Exception, the parent thread re-throws it. This enables us to maintain the correctness of the application (which would not be preserved if the child thread threw the exception itself).
An important aspect is that when a temporal failure occurs, the container thread throws an exception to the client. wsTFDP offers two types of exceptions (it is the responsibility of the provider to choose the exception type that best fits the business model):

• TimeExceededException: a checked exception, i.e., if the provider decides to mark a given public method with a 'throws' clause, the client is forced to handle a possible exception at compile time (it has to enclose the service invocation in a try/catch block).
• TimeExceededRuntimeException: an unchecked exception, i.e., the client does not need to explicitly handle a possible exception that may result from invoking a particular web service operation. This is true even if the provider adds a 'throws' clause to its public operation signature.
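On the provider side, choosing between the two exception types amounts to the operation's signature. A hypothetical example (Order and Receipt are illustrative domain classes; TimeRestriction and the exceptions are the framework's types):

import javax.jws.WebMethod;
import javax.jws.WebService;

@WebService
public class OrderService {

    // Checked variant: clients must handle the timing failure at compile time.
    @WebMethod
    public Receipt placeOrder(Order order, TimeRestriction restriction)
            throws TimeExceededException {
        // ... business logic, executed under wsTFDP's detection mechanism ...
        return new Receipt();
    }

    // Unchecked variant: a TimeExceededRuntimeException may still surface at
    // runtime, but callers are not forced to enclose the call in try/catch.
    @WebMethod
    public Receipt cancelOrder(Order order, TimeRestriction restriction) {
        // ... business logic ...
        return new Receipt();
    }
}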
PREDICTING TIMING FAILURES

With the goal of predicting timing failures, our mechanism performs the following steps (detailed in the next subsections):
1. Analyzes the service code and builds a graph to represent its logical structure;
2. Gathers time-related metrics in a transparent way during runtime;
3. Uses historical data to predict, with a degree of confidence chosen by the client, if a given execution will (or will not) conclude on due time.
DESCRIBING THE WEB SERVICE STRUCTURE

wsTFDP is able to analyze the logical structure of services in order to organize a graph structure for each service operation being provided to consumers (the mechanism performs this function automatically). This graph is composed of vertexes (also known as nodes) and edges that connect nodes. Each edge is essentially a unidirectional link between two nodes and has an associated cost (i.e., going from one node to another involves a predefined cost) (Dijkstra, 1959). To build this data structure we have to understand the data to be used as nodes, edges, and costs (a sketch of this structure follows the list):

• Nodes: specific instants in a service's execution where timing failures should be predicted. Natural candidates are the invocation of the target service itself and all nested service invocations (if any) performed by the service. Also, it can be important to support user-identified critical points that enable the programmer to instruct wsTFDP to add particular code parts to the graph.
• Edges: these naturally represent the available connections between nodes, and are automatically defined by wsTFDP based on a runtime analysis of each service.
• Cost: for our goals, the cost involved in travelling between two nodes is related to the execution time (in milliseconds).
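As an illustration of this structure, the sketch below shows how nodes, unidirectional edges and per-edge cost histories could be represented in Java. All class and field names here are our assumptions for illustration purposes, not the actual wsTFDP types.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative graph structure; names are assumptions, not the wsTFDP API.
class ServiceGraph {

    // A node marks an instant of the execution where prediction takes place:
    // the service entry point, a nested service call, or a user-defined point.
    static class Node {
        final String label;                           // e.g., "Service C", "DB query"
        final List<Edge> outgoing = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    // A directed edge; its cost is derived from the historical execution
    // times (in milliseconds) observed when moving between its endpoints.
    static class Edge {
        final Node target;
        final List<Long> historyMillis = new ArrayList<>(); // kept sorted
        Edge(Node target) { this.target = target; }
    }

    private final Map<String, Node> nodes = new HashMap<>();

    Node node(String label) {
        return nodes.computeIfAbsent(label, Node::new);
    }

    // Unidirectional link: travelling from 'from' to 'to' is possible, but
    // not the opposite, unless another edge is explicitly added.
    Edge link(String from, String to) {
        Edge edge = new Edge(node(to));
        node(from).outgoing.add(edge);
        return edge;
    }
}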
Figure 4. Transformation from source code into a graph structure by wsTFDP

In Figure 4 we can find an example of a simple web service that was changed to be time-aware, together with the respective graph organization (code that is specific to wsTFDP is presented in bold). The arrows linking the nodes represent unidirectional relations between those specific nodes. This means, for instance, that it is possible to move from the 'Service C' node to the 'DB query' (database query) node. However, it is not possible to move in the opposite direction, which represents exactly what the developer intended when writing this specific code. This service uses two wsTFDP operations: check and ignore. The check operation indicates that the code point is critical and must be a node on the graph and, as a consequence, a spot for predicting timing failures. The ignore operation indicates that the next service call must not be used as a node when building the graph (and, as a consequence, it will not be used for failure prediction). The developer has the responsibility to identify critical execution points, which should also compose the runtime graph. For instance, the 'Service D' call (represented by the invokeServiceD(); statement) is not part of the graph, as it was explicitly ignored by the developer. A regular profiling tool can help accomplish this task easily. The organization of the graph structure takes place at runtime, in particular when the service is being executed. To build the graph we use AOP (AspectJ, in particular (Eclipse Foundation, 2008)) during service execution. In the Java programming language, a web service can be executed by invoking a generated method that is marked with the @WebMethod JAX-WS annotation (Sun Microsystems, Inc., 2010). To build the graph, wsTFDP assumes that the service provider is using some implementation of JAX-WS, which is increasingly common (e.g., JBoss provides a JAX-WS compliant implementation, as do many other middleware vendors). In the case of wsTFDP, we instruct AspectJ to intercept calls to @WebMethod annotated operations and then use the interception
information to construct the graph that represents the service. As previously mentioned, our mechanism intercepts all external service calls (i.e., nested service calls) by default. Despite this, wsTFDP can be configured to intercept only particular (i.e., marked by the programmer) external service calls, ignoring the remaining ones. Also note that the mechanism is currently able to identify only distinct code points (i.e., checks should not be placed inside a loop), thus future work is needed to study different methods to identify code points. Similarly, the mechanism does not manage multiple graphs per operation (for instance, to support operations that spawn new threads). It can predict at the point where threads are created, but the support for operations that run tasks in parallel is out of its scope. Thus, the graph should correspond to the main thread's execution path. It is important to emphasize that wsTFDP targets simple services and composite services that are programmatically defined and built using JAX-WS. It does not target other kinds of composition frameworks or business process engines, such as BPEL (Juric, 2006). However,
the concepts presented are, in our opinion, generic enough to be integrated into web service composition frameworks, requiring only a technical effort.
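To illustrate, the sketch below outlines a time-aware service in the spirit of Figure 4. The chapter names the check and ignore operations and the TimeRestriction parameter, but not their exact syntax, so the static wsTFDP.check()/wsTFDP.ignore() calls and the stub types at the bottom are our assumptions.

import javax.jws.WebMethod;
import javax.jws.WebService;

@WebService
public class ExampleService {

    @WebMethod
    public String process(String input, TimeRestriction restriction) {
        invokeServiceC();   // nested call: intercepted and added as a graph node
        wsTFDP.check();     // assumed API: marks a critical prediction point
        queryDatabase();    // the 'DB query' node
        wsTFDP.ignore();    // assumed API: the next call is kept out of the graph
        invokeServiceD();   // not a node: explicitly ignored by the developer
        return "done";
    }

    private void invokeServiceC() { /* external web service call */ }
    private void queryDatabase()  { /* database access */ }
    private void invokeServiceD() { /* external web service call */ }
}

// Minimal stand-ins so the sketch is self-contained; in a real project these
// would come from the wsTFDP library.
class TimeRestriction { long desiredMillis; }
class wsTFDP {
    static void check()  { }
    static void ignore() { }
}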
MANAGING METRICS

At runtime, the detection component gathers time-related metrics, which are stored in the graph's edges. In each execution, wsTFDP stores information that represents how long it takes to move from a preceding node to the current node. In reality, each edge keeps a list of historical time values sorted from the shortest to the longest. This list is updated in each service execution with the detected value. Evidently, the list size must be limited, and thus the provider should configure it depending on the specificities of the services being used. A small list may be sufficient for environments that experience reduced variations in execution times. On the other hand, other kinds of environments may require more extensive lists in order to capture the normal execution time variation of the service. If the developer
wants to favor performance, then reduced lists are preferable, since the sorting algorithm required by our approach will perform faster on smaller lists (see details on the prediction procedure in the next section). Notice that during the first service executions the graph can be rather incomplete. Typically, web services can be quite complex and several calls (for instance, with distinct input parameters) may be required to explore all possible paths and thus construct a more complete graph. The more pathways are explored, the more precise the prediction will become.
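A minimal sketch of this per-edge history management is shown below, including the two removal strategies discussed in the 'Prediction Process' section that follows; the class and method names are our assumptions, not the actual wsTFDP code.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative per-edge history list; names are assumptions, not the wsTFDP API.
class EdgeHistory {

    enum RemovalStrategy { DROP_LONGEST, RANDOM }

    private final List<Long> times = new ArrayList<>(); // sorted, shortest first
    private final int maxSize;                           // configured by the provider
    private final RemovalStrategy strategy;
    private final Random random = new Random();

    EdgeHistory(int maxSize, RemovalStrategy strategy) {
        this.maxSize = maxSize;
        this.strategy = strategy;
    }

    // Called after each service execution with the observed duration (ms).
    void record(long millis) {
        if (times.size() >= maxSize) {
            if (strategy == RemovalStrategy.DROP_LONGEST) {
                times.remove(times.size() - 1);              // favors prediction accuracy
            } else {
                times.remove(random.nextInt(times.size()));  // captures typical behavior
            }
        }
        // Insert while keeping the list sorted from shortest to longest.
        int idx = Collections.binarySearch(times, millis);
        times.add(idx >= 0 ? idx : -idx - 1, millis);
    }

    // Best (fastest) value after discarding the given fraction of the fastest
    // entries (e.g., 0.05 expresses a 95% confidence degree at the client).
    long bestValue(double discardFraction) {
        if (times.isEmpty()) {
            throw new IllegalStateException("no history recorded yet");
        }
        int skip = (int) Math.floor(times.size() * discardFraction);
        return times.get(Math.min(skip, times.size() - 1));
    }
}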
PREDICTION PROCESS

At each node, our mechanism subtracts the elapsed time (counted from the moment execution started) from the amount of time specified by the consumer application (this client-set time represents the total desired execution time). Dijkstra's shortest path computation algorithm (Dijkstra, 1959) is then used to determine whether execution can terminate on time (a sketch of this computation closes this section). This algorithm uses the best (fastest) time values stored on the graph's edges to perform the calculation. Choosing the best values lets the mechanism assert timing failures with more confidence. For example, if at a particular point the best historical values indicate that it is impossible to conclude execution on due time, then it is probable that execution will, in fact, not complete on time (it never happened before). In such conditions it is reasonably safe to return a prediction exception to the consumer. When our mechanism forecasts a timing failure, there are two possible outcomes: the service stops the operation or it continues executing in a degraded mode. The former should be used in services where artificially aborting a specific operation does not cause any harm to the normal service execution. The latter mode can be useful for services that require consistency (i.e., an operation that has started must not be artificially aborted, as the application may not be able to handle this situation). These two operating modes can be quite important for providers since they may lighten the load at the server when timing failures occur. It is the responsibility of the provider to decide which mode should be used for the application being deployed. When considering a prediction process there is always some degree of confidence associated. In our mechanism, both providers and consumers can set this confidence degree in the following ways:
• At runtime, the list that maintains execution times can reach its maximum size for a particular edge; it is then indispensable to remove elements to accommodate new data. The web service provider can configure two distinct removal strategies. When accurate prediction is critical, the provider should choose to drop the longest time duration values. This will reduce the number of false positives, since the algorithm always makes use of each edge's fastest values. Alternatively, it is possible to do a random removal. In that case wsTFDP may lose accuracy (which frequently is not highly relevant to clients), but can better capture the web service's typical behavior. Note also that a small execution time list can better capture local (in time) changes, whereas longer lists capture global information more adequately. Thus, the provider should adjust the history size in the way that best reflects the environment.
• The consumer can discard a percentage of the smallest execution times that reside in the collected history (i.e., the fastest values are not used for the prediction process, but are not eliminated from the overall history). This can be quite helpful when the goal is to express a confidence value over the collected values. For instance, if the
service consumer considers that 5% of the fastest executions occurred in very specific circumstances that do not represent the regular behavior of the service and its associated resources, then those 5% of values should be ignored. This translates to a 95% confidence degree for predicting timing failures (see the next section to understand how to express this confidence value at the client).

Similarly to the detection mechanism, we provide both checked (i.e., declared) and unchecked (i.e., runtime) prediction exceptions (FailurePredictionException and FailurePredictionRuntimeException, respectively). These exceptions are hierarchically organized and extend a superclass TimeFailureException, which is also provided in checked and unchecked versions. This superclass enables clients to handle both detected and predicted failures in a single catch block. The procedure to predict and detect timing failures is fairly simple. Nevertheless, despite being a simple approach, it is able to produce very good results, which can be used to select services and manage resources, with obvious advantages for clients and servers, respectively.

Transactions with strict and relaxed temporal requirements are considered in (Vieira et al., 2006). wsTFDP provides strict timing requirements detection and prediction. We include, however, a way to relax the prediction process by including a factor that can loosen the use of historical behavior metrics. This provides a way to adjust the prediction process to better match the current environment. As proposed in (Pekilis & Seviora, 1997), the detection and prediction process is conducted by a component that remains isolated (i.e., in a separate execution thread), observing inputs, outputs, and also the execution path. As in (Bo & Xiang, 2007), we considered several issues regarding the selection and use of all available historical data or, on the other hand, the use of only part of the available data. Our mechanism provides the client with the possibility of discarding historical data elements, according to the execution goals. Additionally, the service provider can adjust the maximum size of the historical data and the data removal strategy. These techniques aim to adjust the collected data and, as proposed in (Kootbally et al., 2006; Shun-Feng Su et al., 2002), provide a way to use not only short-term (local) data, but also long-term (global) information in the prediction process. As explained, we provide features that are typically available in real-time systems, and indeed we focused on criteria that previous works indicate as important in such systems, such as timeliness, predictability, simplicity, graceful degradation upon malfunctions, and portability (Halang et al., 2000).
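To make the prediction step concrete, the sketch below runs Dijkstra's algorithm over the best historical value of each edge, reusing the ServiceGraph sketch shown earlier; again, all names are our assumptions rather than the actual wsTFDP code.

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative prediction step: the fastest remaining time ever observed is
// computed with Dijkstra's algorithm and compared against the time budget.
class TimingFailurePredictor {

    private static final class Entry {
        final ServiceGraph.Node node;
        final long dist;
        Entry(ServiceGraph.Node node, long dist) { this.node = node; this.dist = dist; }
    }

    // Fastest achievable time (ms) from 'current' to 'exit'. Assumes every
    // reachable edge has at least one recorded history value.
    static long shortestRemaining(ServiceGraph.Node current, ServiceGraph.Node exit,
                                  double discardFraction) {
        Map<ServiceGraph.Node, Long> dist = new HashMap<>();
        PriorityQueue<Entry> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a.dist, b.dist));
        dist.put(current, 0L);
        queue.add(new Entry(current, 0L));
        while (!queue.isEmpty()) {
            Entry entry = queue.poll();
            if (entry.dist > dist.getOrDefault(entry.node, Long.MAX_VALUE)) {
                continue; // stale queue entry
            }
            if (entry.node == exit) {
                return entry.dist;
            }
            for (ServiceGraph.Edge e : entry.node.outgoing) {
                long alt = entry.dist + bestOf(e, discardFraction);
                if (alt < dist.getOrDefault(e.target, Long.MAX_VALUE)) {
                    dist.put(e.target, alt);
                    queue.add(new Entry(e.target, alt));
                }
            }
        }
        return Long.MAX_VALUE; // exit unreachable (graph still incomplete)
    }

    // A failure is predicted when even the best observed path cannot meet
    // the client-specified deadline.
    static boolean predictsFailure(long elapsed, long deadline, long shortestRemaining) {
        return shortestRemaining != Long.MAX_VALUE
                && elapsed + shortestRemaining > deadline;
    }

    // Best (fastest) value of an edge after discarding the client-chosen
    // fraction of the fastest entries (0.05 => 95% confidence degree).
    private static long bestOf(ServiceGraph.Edge e, double discardFraction) {
        int skip = (int) Math.floor(e.historyMillis.size() * discardFraction);
        return e.historyMillis.get(Math.min(skip, e.historyMillis.size() - 1));
    }
}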
PRACTICAL USAGE OF WSTFDP

Creating a web service that is able to detect and predict timing failures is an easy task; it essentially consists in executing the following steps:

1. Add the wsTFDP library (or source code) to the project. The library and source code are available at (Laranjeiro & Vieira, 2009).
2. Add a TimeRestriction parameter to the web service operations for which timing failure detection and prediction is wanted. This object holds a numeric value that is set by clients to specify the desired service duration (in milliseconds).
3. Compile the project using an AOP compiler (AspectJ, in the case of our prototype). For example, when using Maven (Apache Software Foundation, 2008) as the building tool, compilation and packaging can be done by executing 'mvn package' from the command line.

As observed in Figure 4, the code changes (presented in bold) that relate to our mechanism
are quite minimal. Essentially, the provider only needs to add an extra parameter of type TimeRestriction to the web service and our framework will perform all required tasks automatically. An additional aspect is that, although wsTFDP targets web services, it can also be used to intercept other kinds of methods (the goal was to fulfill the 'to be generic' requirement presented previously), such as, for example, Remote Method Invocation (RMI) methods. In order to intercept this kind of method, the developer must mark such methods with an @Interceptable annotation (provided by the wsTFDP library). Obviously, the signature of these methods must also contain a TimeRestriction parameter. There are no special procedures required to invoke a time-aware service. The only difference between a regular service and this kind of service is, from the client standpoint, the use of an additional parameter – the TimeRestriction object. This object specifies the desired execution time and a prediction confidence value. Figure 5 shows an example of the client-side code required to call a time-aware web service. The example represents an implementation of the 'New Customer' web service, which is defined by the TPC-App performance benchmark (Transaction Processing Performance Council, 2008) and was changed to be time-aware. In this particular case, the client is setting a maximum time of 1000 ms for the service execution. Additionally, a 95% confidence factor for historical data is also being requested. The changes that are specific to wsTFDP are presented in bold.

Figure 5. Client code for invoking a time-aware service
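Since Figure 5 itself is not reproduced here, the sketch below approximates the client-side code it describes, with minimal stand-in types so that the example is self-contained; the constructor and method signatures are our assumptions based on the chapter's description.

public class NewCustomerClient {

    // Stand-ins for library and generated types; in practice these come from
    // the wsTFDP library and the generated JAX-WS client (signatures assumed).
    static class TimeRestriction {
        final long maxMillis; final double confidence;
        TimeRestriction(long maxMillis, double confidence) {
            this.maxMillis = maxMillis; this.confidence = confidence;
        }
    }
    static class TimeFailureException extends Exception { }
    static class TimeExceededException extends TimeFailureException { }
    static class FailurePredictionException extends TimeFailureException { }
    static class NewCustomerService {
        void newCustomer(String customerData, TimeRestriction r)
                throws TimeFailureException { /* remote web service call */ }
    }

    public static void main(String[] args) {
        NewCustomerService service = new NewCustomerService();

        // The client sets a 1000 ms limit and a 95% confidence factor.
        TimeRestriction restriction = new TimeRestriction(1000, 0.95);
        try {
            service.newCustomer("<customer>...</customer>", restriction);
        } catch (FailurePredictionException e) {
            // Predicted: the execution would most likely exceed the limit.
        } catch (TimeExceededException e) {
            // Detected: the desired time has actually expired.
        } catch (TimeFailureException e) {
            // Superclass: handles detected and predicted failures uniformly.
        }
    }
}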
EXPERIMENTAL DEMO

In this section we present the experimental evaluation performed to demonstrate the effectiveness of wsTFDP. The experiments try to answer the following questions:

• Can developers easily use the mechanism?
• Does the mechanism introduce a noticeable delay in services?
• How fast can the mechanism detect failures, and how does this detection latency evolve under higher loads?
• Is it able to provide low false-positive and false-negative rates during the detection and prediction process?

EXPERIMENTAL SETUP

TPC-App (Transaction Processing Performance Council, 2008) is a performance benchmark for web services and application servers, widely accepted as representative of real environments. A subset of the services specified by this benchmark (Change Payment Method, New Customer, New Product, and Product Detail) was selected for testing our mechanism. Some of the services (the Payment Method and New Customer services) include an
external service invocation that simulates a payment gateway. We used this external service to simulate what frequently happens when services are invoked over the Internet. In practice, and based on our experimental knowledge of typical service execution times, we introduced a random variable delay of 1000 to 2000 ms (uniform distribution) in this external service invocation. The experimental setup consisted in deploying the main test components onto three distinct computers connected by a Fast Ethernet network (see Table 1). In short, the three main test components are:

1. A web service provider application that provides the set of web services used in the experimental evaluation;
2. A database server on top of the Oracle 10g Database Management System (DBMS) used by the TPC-App benchmark;
3. A workload emulator that simulates the concurrent execution of business transactions by multiple clients (i.e., that performs web service invocations).

To analyze wsTFDP's effectiveness, we conducted 48 individual tests, each one using a distinct configuration. Basically, we have 4 experiments (A, B, C, and D), with 4 sets of tests each. Each test had a duration of 20 minutes and was executed 3 times. For all tests, our mechanism was configured to predict at all service invocation points (this includes not only the service entry point but all existing external service invocations). Tests were executed using different loads (2, 4,
8, and 16 simultaneous clients) and the system state was always restored before each execution, in order to keep all tests using the same starting conditions. We used nanosecond precision to measure the execution time.
PERFORMANCE OVERHEAD

Two initial experiments were conducted to assess the performance overhead. One experiment was run without wsTFDP and the other using wsTFDP (baseline experiment A and experiment B, respectively). Each of these experiments included 4 sets (corresponding to client-side configurations of 2, 4, 8, and 16 emulated business clients) of 3 tests (3 executions of a given configuration). The goal was to use different service loads to test the system and obtain significant measurements. For experiment B a high deadline value was defined for the execution; the goal was to prevent services from being stopped due to timing failures. Under these conditions we were able to evaluate the performance overhead of wsTFDP compared to the normal conditions used in experiment A. Figure 6.a) presents the observed results. The average execution time (of the 3 runs) in each of the 4 sets (2, 4, 8, and 16 clients) was calculated for experiments A and B (for all four web services). After this step, we collected the maximum, minimum, and average differences between the experiments in each distinct set. Finally, we averaged the collected differences of the four services into a single value per set. With the increase of the
Table 1. Systems used for the experiments

Node   | Software                                                                      | Hardware
Server | Windows Server 2003 R2 Enterprise x64 Edition Service Pack 2 & JBoss 4.2.2.GA | Dual Core Pentium 4, 3 GHz, 1.46 GB RAM
DBMS   | Windows Server 2003 R2 Enterprise x64 Edition Service Pack 2 & Oracle 10g     | Quad Core Intel Xeon 5300 Series, 7.84 GB RAM
Client | Windows XP Pro SP2                                                            | Dual Core Pentium 4, 3 GHz, 1.96 GB RAM
Figure 6. wsTFDP mechanism overhead a) and detection latency b) per client load
number of clients, we detected an increase in the overhead of the detection and prediction process. Despite this, the global overhead remains rather low, with an observed maximum of nearly 56 ms, fitting our initial goal of keeping it under 100 ms (for moderately loaded environments). Note that the introduced overhead is not exponential: it is approximately linear in the number of clients being emulated (while the number of clients being tested in fact grows exponentially). Additionally, this overhead is tightly associated with the complexity of running Dijkstra's algorithm over the graph. It is well known that when the graph is not fully connected, which is frequently the case in this type of application (i.e., each node is not connected to all remaining nodes), the time complexity is O((m + n) log n), where m is the number of edges and n is the number of vertices in the graph (for implementations that make use of a priority queue, as in our case) (Dijkstra, 1959).
DETECTION LATENCY

The goal of the third experiment (experiment C) was to assess the detection latency (i.e., the time
between the failure occurrence and its detection). As we were expecting very low values for these experiments, we decided to add extraneous load to the server by creating two threads in the server application, running in a continuous loop. This pushed the CPU usage of our dual core machine to 100%. Using the baseline values as reference, we configured the multiple clients to specify random service deadlines that ranged from the average execution times to twice this value. The objective was to combine services that finish on due time with services that are unable to comply with the client deadline (hence being terminated by wsTFDP and enabling us to measure the detection latency). Figure 6.b) shows the results for four sets of tests (2, 4, 8, and 16 clients) that were executed under the previously mentioned stress conditions. As expected, the latency values increased as we added load to the server. The maximum latency detected was 52 ms, thus fulfilling our initial goal. The collected results are generally very good, especially if we consider the demanding environment used in this experiment (i.e., the extra load added to the server).
Figure 7. The observed false positive rate under different client loads
FALSE-POSITIVES AND FALSE-NEGATIVES

In experiment D, we evaluated the false-positive (i.e., the number of times our mechanism forecasted a failure that did not actually take place) and false-negative (i.e., the number of times wsTFDP detected a failure that was not predicted) rates. Notice that these two metrics can be viewed as conflicting. In fact, when we configure wsTFDP to minimize the number of wrong predictions we typically increase the number of failures that can take place without being forecasted. Finding a balance between both metrics can be a difficult task, and the final decision of choosing a balanced approach or, on the other hand, favoring a particular metric belongs to the framework user. In our case, a single experiment was executed and we tried to minimize the false-positive rate, assuming that an incorrect prediction can have a higher cost than a failure that is not predicted (as the web service produces work that, if a timing failure is predicted, will be of no use to the client). Notice that, even if a timing failure is not predicted, it will certainly be detected (we observed a detection coverage of 100% in all experiments). We executed some tests to verify which configuration parameters could provide low false-positive rates. The goal was not to execute an
exhaustive configuration tuning, but simply to select a good enough configuration that provided a solid demonstration of wsTFDP. When deployed in real scenarios, users may need to execute a more thorough experiment depending on their particular requirements. The same client-side selection of deadline values was maintained (it again ranged from the observed average to twice the average of these values) and we empirically observed that, for our objectives, a confidence value of 80%, set by the client, provided the best results (we tested all lower and higher values in 5% intervals, with worse results). At the service provider, the main configuration performed was related to the management of the graph's edge values. We chose a maximum list size of 100 elements and a random element removal strategy (more details on the removal strategies can be found in Section 'Prediction Process'). In Figure 7 we can observe the detected false-positive rates for the executed tests. Using this moderate configuration tuning we were able to maintain the average false-positive rate under 5%, fulfilling our initial objectives. This is an excellent result, considering that configuration tweaking was not exhaustive. Additionally, it is interesting to observe that the false-positive rate generally decreases as the number of clients increases. This can be explained by the
fact that, under heavier loads, more operations are executed and, as a consequence, the historical values belong to a smaller time-frame that represents the current environment in a more accurate way. Considering all tests executed in experiment D, we observed a false-negative rate of 26%. As our main objective was to have a low false-positive rate (which necessarily increases the false-negative rate), these values are rather acceptable. Users that prefer more balanced results can configure the prediction mechanism by, for instance, increasing the history size, decreasing the client confidence value, or testing the multiple combinations of the different configurable parameters.
EASE OF USE

In order to verify how easy our mechanism is to use, we asked an independent developer (with about 2 years of experience in developing Java web applications and services) to add wsTFDP to our reference TPC-App implementation, thus creating a set of time-aware services. The TPC-App set of services was already using Maven as the build tool, so the developer's tasks were essentially reduced to merging the application's project descriptor (a Maven-specific configuration file) with the one provided by our mechanism, and also adding a new TimeRestriction parameter to each of the TPC-App web services. The main task performed in the merging process was the replacement of the regular Java compiler with AspectJ's weaver. The whole procedure was finished in less than 10 minutes, which is an indication that it is a fairly easy procedure.
CONCLUSION AND FUTURE RESEARCH ISSUES

Creating a detection and prediction mechanism that is generic and non-intrusive can be a quite
complex task. Typically, timing failure detection/prediction schemes are implemented and coupled to specific applications (i.e., these schemes are, in fact, part of the application itself). This can obviously have a large impact on particular quality attributes. This chapter discussed the problem of timing failure detection and prediction in web service environments and proposed a programming approach to help developers build web services with time constraints. We propose a technique that can detect and predict timing failures in a transparent way. Results collected from the execution of several experiments show that our mechanism can be easily used and does not introduce a high overhead on the target system. The mechanism is also able to perform fast failure detection and can accurately predict timing failures. wsTFDP is publicly available and ready to be used by developers. It was designed to require as little effort as possible from programmers, without deviating them much from typical web service development tasks. The capabilities introduced by wsTFDP are currently not provided by web services technology standards. The provided solution can be adequate for developers, helping to decouple application logic from a necessary timing failure detection and prediction mechanism. Future research can include studying ways of providing time-aware services without performing changes (even minor ones) to the regular interface of the service, that is, without adding extra parameters to the service interface. Moreover, it may be interesting to test the performance of more complex techniques to predict timing failures in different services. Additionally, some technical aspects can be tuned in the failure detection/prediction scheme. Regarding the detection mechanism, the thread creation model can be revised for a higher-performance solution under high service loads. Concerning the prediction mechanism, path probabilities can be added so that the prediction process can decide using extra data. Also, input content can be used
as a way to understand the execution path and, hence, positively influence the final prediction decision. Finally, there are two aspects that future work can tackle. First, it can be important to add support for prediction points inside loops and multi-graph management (to support services that spawn new execution threads). Another aspect is that it might be useful, in some circumstances, to provide a querying mechanism (running at the server) to be used by clients, as a way to understand the typical performance of a particular web service operation.
REFERENCES

Amazon. (2010). Amazon Web Services Solutions Catalog. Retrieved September 15, 2008, from http://aws.amazon.com/products/

Apache Software Foundation. (2008). Maven. Retrieved February 14, 2008, from http://maven.apache.org/

Bo, Y., & Xiang, L. (2007). A study on software reliability prediction based on support vector machines. In IEEE International Conference on Industrial Engineering and Engineering Management (pp. 1176-1180). doi:10.1109/IEEM.2007.4419377

Curbera, F., Duftler, M., Khalaf, R., Nagy, W., Mukhi, N., & Weerawarana, S. (2002). Unraveling the Web services web: An introduction to SOAP, WSDL, and UDDI. IEEE Internet Computing, 6(2), 86–93. doi:10.1109/4236.991449

Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1(1), 269–271. doi:10.1007/BF01386390
Eclipse Foundation. (2008). The AspectJ Project. Retrieved December 8, 2007, from http://www.eclipse.org/aspectj/

Elmagarmid, A. K. (1992). Database Transaction Models for Advanced Applications. Morgan Kaufmann.

Frank, E., Holmes, G., Mayo, M., Pfahringer, B., Smith, T., & Witten, I. (2010). Weka 3 - Data Mining with Open Source Machine Learning Software in Java. Retrieved July 4, 2008, from http://www.cs.waikato.ac.nz/ml/weka/

Halang, W. A., Gumzej, R., Colnaric, M., & Druzovec, M. (2000). Measuring the Performance of Real-Time Systems. The International Journal of Time-Critical Computing Systems, 18(1), 59–68. doi:10.1023/A:1008102611034

Juric, M. B. (2006). Business Process Execution Language for Web Services BPEL and BPEL4WS (2nd ed.). Packt Publishing. Retrieved from http://portal.acm.org/citation.cfm?id=1199048&coll=Portal&dl=GUIDE&CFID=26887380&CFTOKEN=61353912

Kiczales, G., Lamping, J., Mendhekar, A., Maeda, C., Lopes, C., Loingtier, J. M., & Irwin, J. (1997). Aspect-Oriented Programming. In 11th European Conference on Object-Oriented Programming. Jyväskylä, Finland.

Kootbally, Z., Madhavan, R., & Schlenoff, C. (2006). Prediction in Dynamic Environments via Identification of Critical Time Points. In Military Communications Conference (MILCOM 2006) (pp. 1-7). doi:10.1109/MILCOM.2006.302047

Laranjeiro, N., & Vieira, M. (2009, March). wsTFDP: An Extensible Framework for Timing Failures Prediction in Web Services. Retrieved from http://eden.dei.uc.pt/~cnl/papers/2010-pdsc-wsTFDP.zip
Long, D. D. E., Carroll, J. L., & Park, C. J. (1990). A Study of the Reliability of Internet Sites. University of California at Santa Cruz. Retrieved from http://portal.acm.org/citation.cfm?id=902727&coll=Portal&dl=GUIDE&CFID=24761187&CFTOKEN=74702970
Shun-Feng Su, Chan-Ben Lin, & Yen-Tseng Hsu. (2002). A high precision global prediction approach based on local prediction approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 32(4), 416-425. doi:10.1109/TSMCC.2002.806745
Looker, N., Munro, M., & Jie Xu. (2005). Increasing Web Service Dependability Through Consensus Voting. In 29th Annual International Computer Software and Applications Conference (COMPSAC) (Vol. 2, pp. 66-69). doi:10.1109/COMPSAC.2005.88
SpringSource. (2010). Aspect Oriented Programming with Spring. Retrieved July 6, 2008, from http://static.springframework.org/spring/docs/2.5.x/reference/aop.html
May, N. (2009). A Redundancy Protocol for Service-Oriented Architectures. In Service-Oriented Computing – ICSOC 2008 Workshops (Vol. 5472, pp. 211-220). Springer-Verlag. doi:10.1007/978-3-642-01247-1_22

Pekilis, B., & Seviora, R. (1997). Detection of response time failures of real-time software. In The Eighth International Symposium on Software Reliability Engineering (pp. 38-47). doi:10.1109/ISSRE.1997.630846

Salas, J., Perez-Sorrosal, F., Patiño-Martínez, M., & Jiménez-Peris, R. (2006). WS-Replication: A framework for highly available web services. In Proceedings of the 15th International Conference on World Wide Web (pp. 357-366).

Sha, L., Abdelzaher, T., Årzén, K., Cervin, A., Baker, T., Burns, A., Buttazzo, G., et al. (2004). Real Time Scheduling Theory: A Historical Perspective. Real-Time Systems, 28(2), 101–155. doi:10.1023/B:TIME.0000045315.61234.1e
Stankovic, J. A. (1988). Misconceptions about real-time computing: A serious problem for next-generation systems. Computer, 21(10), 10–19. doi:10.1109/2.7053

Sun Microsystems, Inc. (2010). jax-ws: JAX-WS Reference Implementation. Retrieved February 14, 2008, from https://jax-ws.dev.java.net/

Transaction Processing Performance Council. (2008). TPC Benchmark™ App (Application Server) Standard Specification, Version 1.3. Retrieved July 5, 2008, from http://www.tpc.org/tpc_app/

Vieira, M., Costa, A. C., & Madeira, H. (2006). Towards Timely ACID Transactions in DBMS. In 12th Pacific Rim International Symposium on Dependable Computing (PRDC 2006) (pp. 381-382). doi:10.1109/PRDC.2006.63
ADDITIONAL READING

Abbott, R. K., & Garcia-Molina, H. (1992). Scheduling real-time transactions: A performance evaluation. ACM Transactions on Database Systems (TODS), 17(3), 560. doi:10.1145/132271.132276
Andler, S. F., Hansson, J., Eriksson, J., Mellin, J., Berndtsson, M., & Eftring, B. (1996). DeeDS towards a distributed and active real-time database system. SIGMOD Record, 25(1), 51. doi:10.1145/381854.381881
Halang, W. A., Gumzej, R., Colnaric, M., & Druzovec, M. (2000). Measuring the Performance of Real-Time Systems. The International Journal of Time-Critical Computing Systems, 18(1), 59–68. doi:.doi:10.1023/A:1008102611034
Aussagues, C., & David, V. (1998). Guaranteeing timeliness in safety critical real-time systems. In 15th Workshop on Distributed Computer Control Systems (DCCS) (pp. 9-11). Presented at the 15th Workshop on Distributed Computer Control Systems (DCCS).
Haritsa, J. R., Carey, M. J., & Livny, M. (1990). On being optimistic about real-time constraints. In Proceedings of the ninth ACM SIGACTSIGMOD-SIGART symposium on Principles of database systems (pp. 331–343).
Bestavros, A. (1996). Advances in real-time database systems research. SIGMOD Record, 25(1), 3–7. doi:10.1145/381854.381860
Haritsa, J. R., Carey, M. J., & Livny, M. (1992). Data access scheduling in firm real-time database systems. Real-Time Systems, 4(3), 203–241. doi:10.1007/BF00365312
Casimiro, A., Vieira, M., & Madeira, H. (2007). Middleware Support for Time-Elastic Database Applications. In Supplemental Volume of the 2007 International Conference on Dependable Systems and Networks (pp. 406–407).
Harmon, M., Baker, T., & Whalley, D. (1992). A retargetable technique for predicting execution time. In Real-Time Systems Symposium, 1992 (pp. 68-77). Presented at the Real-Time Systems Symposium, 1992.
Chan-gun Lee, Mok, A., & Konana, P. (2007). Monitoring of Timing Constraints with Confidence Threshold Requirements. IEEE Transactions on Computers, 56(7), 977-991. doi:10.1109/TC.2007.1026
Huang, J., Stankovic, J., Towsley, D., & Ramamritham, K. (1990). Real-time transaction processing: design, implementation and performance evaluation. University of Massachusetts COINS TR, 90–43.
Chen, X., Mohapatra, P., & Chen, H. (2001). An admission control scheme for predictable server response time for web accesses. In Proceedings of the 10th international conference on World Wide Web (pp. 545-554). Hong Kong, Hong Kong: ACM. doi:10.1145/371920.372156
Joseph, M. (1996). Real-time Systems: Specification, Verification, and Analysis. Prentice Hall.
Cristian, F., & Fetzer, C. (1999). The Timed Asynchronous Distributed System Model. IEEE Transactions on Parallel and Distributed Systems, 10(6), 642–657. doi:10.1109/71.774912 Gergeleit, M., Mock, M., Nett, E., & Reumann, J. (1997). Integrating time-aware CORBA objects into O-O real-time computations. In ObjectOriented Real-Time Dependable Systems, 1997. Proceedings of Third International Workshop on (pp. 83-90). doi:10.1109/WORDS.1997.609929
Kao, B., & Garcia-Molina, H. (2008). An Overview of Real-Time Database Systems.

Kim, Y. K., & Son, S. H. (1996). Supporting predictability in real-time database systems. In RTAS (p. 38).

Lam, K. Y., & Kuo, T. W. (2001). Real-Time Database Systems: Architecture and Techniques. Springer Netherlands.

Lin, K. J. (1989). Consistency issues in real-time database systems. In Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences, 1989. Vol. II: Software Track (pp. 654–661).
337
Building Web Services with Time Requirements
Molina-Jimenez, C., & Shrivastava, S. (2006). Maintaining Consistency between Loosely Coupled Services in the Presence of Timing Constraints and Validation Errors. In 4th European Conference on Web Services (ECOWS) (pp. 148-160).

Ozsoyoglu, G., & Snodgrass, R. T. (1995). Temporal and real-time databases: A survey. IEEE Transactions on Knowledge and Data Engineering, 7(4), 513–532. doi:10.1109/69.404027

Pang, H. H., Carey, M. J., & Livny, M. (1995). Multiclass query scheduling in real-time database systems. IEEE Transactions on Knowledge and Data Engineering, 7(4), 533–551. doi:10.1109/69.404028

Ramamritham, K. (1993). Real-time databases. Distributed and Parallel Databases, 1(2), 199–226. doi:10.1007/BF01264051

Singhal, M. (1988). Issues and approaches to design of real-time database systems. SIGMOD Record, 17(1), 33. doi:10.1145/44203.44205

Stankovic, J. A., Son, S. H., & Hansson, J. (1999). Misconceptions about real-time databases. Computer, 32(6), 29–36. doi:10.1109/2.769440

Taubenfeld, G. (2006). Computing in the Presence of Timing Failures. In 26th IEEE International Conference on Distributed Computing Systems (ICDCS 2006) (p. 16). doi:10.1109/ICDCS.2006.21

Tokuda, H., Nakajima, T., & Rao, P. (n.d.). Real-Time Mach: Towards a Predictable Real-Time System.

Verissimo, P., & Casimiro, A. (2002). The timely computing base model and architecture. IEEE Transactions on Computers, 51(8), 916–930. doi:10.1109/TC.2002.1024739
KEY TERMS AND DEFINITIONS

SOAP Web Services: Self-contained and interoperable application components that communicate using open protocols (e.g., SOAP – Simple Object Access Protocol). Web services can be discovered using UDDI (Universal Description, Discovery and Integration) and make use of a description language that describes the service interface, including available operations and parameters.

Real-Time Computing: Hardware or software systems that operate under time constraints. That is, the correctness of the service delivered by the system depends on the time at which the results are produced.

Failure: An event that occurs when the delivered service deviates from correct service. A service fails either because it does not comply with the functional specification, or because this specification did not adequately describe the system function.

Timing Failure: The time of arrival or the duration of the information delivered at the service interface (i.e., the timing of service delivery) deviates from implementing the system function.

Aspect Oriented Programming: Complements Object-Oriented Programming (OOP) by providing another way of thinking about program structure. The key unit of modularity in OOP is the class, whereas in AOP the unit of modularity is the aspect. Aspects enable the modularization of concerns, such as transaction management, that cut across multiple types and objects. Such concerns are often termed crosscutting concerns in the AOP literature.

Bytecode Instrumentation: A technique for manipulating compiled application files, with the goal of inserting calls to existing functions or completely new code. While the original source code remains unaltered, its corresponding bytecode is changed according to the new functions to be executed. Frequently the goal is to add
a functionality that does not exist in the original application and should repeat itself in multiple locations of the application (e.g., application profiling).

Dijkstra's Shortest Path Computation Algorithm: An algorithm conceived by Edsger Dijkstra in 1959. It is a graph search algorithm that solves the single-source shortest path problem for a graph with nonnegative edge path costs, producing a shortest path tree. This algorithm is often used in routing.
Chapter 15
Dependability and Security on Wireless Self-Organized Networks:
Properties, Requirements, Approaches and Future Directions Michele Nogueira Universidade Federal do Paraná, Brazil Aldri Santos Universidade Federal do Paraná, Brazil Guy Pujolle Sorbonne Universités, France
ABSTRACT

Wireless communication technologies are improved every day, increasing people's dependence on distributed systems. Such dependence increases the necessity of guaranteeing dependable and secure services, particularly for applications related to the commercial, financial and medical domains. However, in the wireless self-organized network context, simultaneously providing reliability and security is a demanding task due to the network characteristics. This chapter provides an overview of survivability concepts, reviews security threats in wireless self-organized networks (WSONs) and describes existing solutions for survivable service computing in the wireless network context. Finally, this chapter presents conclusions and future directions.
DOI: 10.4018/978-1-60960-794-4.ch015

INTRODUCTION

Improvements in wireless networking have highlighted the importance of distributed systems in our everyday lives. Network access becomes more and more ubiquitous through portable devices and wireless communications, making people dependent on them. This rising dependence calls for simultaneously high levels of reliability, availability and security in those systems, par-
ticularly in commercial, financial, medical and social transactions supported by pervasive and wireless distributed systems.

Wireless self-organized networks (WSONs), which include ad hoc, mesh and sensor networks, are examples of pervasive distributed systems (Figueiredo, 2010). These networks are composed of heterogeneous portable devices communicating among themselves in a wireless multi-hop manner. Wireless self-organized networks can autonomously adapt to changes in their environment, such as device position, traffic pattern and interference. Each device can dynamically reconfigure its topology, coverage and channel allocation in accordance with those changes. Further, no centralized entity controls the network, requiring a decentralized management approach.

Security is crucial to WSONs, particularly for security-sensitive applications in the military, homeland security, financial and health care domains. Security threats take advantage of protocol faults and network characteristics. These networks pose nontrivial challenges to security design due to their characteristics, such as the shared wireless medium, highly dynamic network topology, multi-hop communication and low physical protection of portable devices. Further, the absence of central entities increases the complexity of security management operations, such as access control, node authentication and cryptographic key distribution.

In general, existing security solutions for WSONs employ preventive or reactive security mechanisms, detecting intrusions and thwarting attacks by cryptography, authentication and access control mechanisms (Hu, 2003; Lou, 2004; Marti, 2000; Papadimitratos, 2003). Each security mechanism addresses specific issues and has limitations in coping with different types of attacks and intrusions. Preventive defenses, for example, are vulnerable to malicious devices that already participate in network operations, whereas reactive defenses work efficiently only against well-known attacks or intrusions. Due to these limitations, researchers have developed intrusion tolerant
solutions, as a third defense line, to mitigate the impact of attacks and intrusions through fault-tolerance techniques, typically redundancy and recovery mechanisms. However, security solutions still remain focused on a single issue or layer of the protocol stack, and are therefore ineffective at ensuring the essential services of WSONs. Network characteristics and the constraints on the defense lines reinforce the necessity of new approaches to guarantee integrity, confidentiality, authentication and, particularly, availability of network services. Such requirements motivate the design of survivable network services for WSONs. Survivability is the capability of a network to support essential services even in the face of attacks and intrusions (Laprie, 2004).

This chapter presents a general discussion about survivability in wireless self-organized networks, its concepts and properties. Its goal lies in emphasizing open issues, survivability requirements for WSONs and their effects regarding, particularly, network characteristics. Further, this work surveys the main solutions that have applied survivability concepts to WSONs, such as network management architectures, routing protocols and key management systems. Finally, future trends are emphasized.
SECURITY THREATS

WSONs are susceptible to many threats. Network characteristics, such as dynamic topology, decentralized network architecture, shared wireless medium, low physical security of nodes and multi-hop communication, result in different vulnerabilities and make it difficult to maintain essential network services (Aad, 2004). The decentralized architecture, for example, requires cooperation among nodes through one-hop or multi-hop connectivity without guarantees that all nodes will cooperate as expected. The network autonomy, the highly dynamic topology and the lack of access control facilitate the participation of malicious or selfish nodes in network operations.
Further, wireless communication is vulnerable to interference and interception, and the low physical security of nodes increases the possibility of their being tampered with. Threats on WSONs are categorized as attacks or intrusions. An attack is any action that exploits a weakness of the network in order to compromise the integrity, confidentiality, availability and non-repudiation of information or network services. An intrusion also exploits weaknesses of the network, but it results from a successful attack. These two threats are classified as malicious faults in the dependability domain (Laprie, 2004), consisting of malicious or selfish actions that intend to alter the functioning of a network or its nodes. Malicious faults result in errors, meaning intentional deviations from the correct service operation in order to achieve malicious fault goals. In particular, malicious fault goals are:

• to disrupt or halt services, causing denial of service;
• to access or alter confidential information;
• to improperly modify the behavior of services and protocols.
Attackers are malicious entities (humans, nodes, services or software) that produce attacks or intrusions. Attackers attempt to exceed any authority they might have, seeking to achieve malicious fault goals. They take advantage of vulnerabilities produced by network characteristics, or weaknesses in network protocols, software or hardware. Examples of attackers include hackers, vandals, malicious software, and malicious or selfish nodes. Three main classifications for attacks exist. The first one is based on the attack means, being categorized as passive or active. Passive attacks intend to steal information and eavesdrop on the communication without provoking any disruption of data or services. Examples of passive attacks are eavesdropping, traffic analysis and traffic monitoring. Active attacks on the other hand
involve service interruption, data modification or fabrication, causing errors, network overload, or blocking nodes from effectively using network services. Examples of active attacks include jamming, impersonating, modification, denial of service (DoS), and message replay. Attacks can also be classified as external or internal according to the participation of the attacker in the network. External attacks are produced by attackers that are not legally part of the network, whereas internal attacks are generated by nodes participating in the network operations. An attacker is not legally part of the network if it does not have the cryptographic material necessary to be authenticated or to participate in specific services such as routing. Finally, attacks can be categorized based on which layers of the network protocol stack are affected. Figure 1 summarizes the main attacks on WSONs according to network layers. However, we emphasize that some of them are considered multi-layer attacks, such as DoS and Sybil attacks, acting on more than one layer. Some of these attacks are also classified as Byzantine attacks or misbehaviors. In these attacks, attackers have full control over a number of authenticated nodes and behave arbitrarily to disrupt the network (Awerbuch, 2005). Blackhole, wormhole and rushing attacks are examples of Byzantine attacks. Further, nodes can also present a selfish behavior, in which they take advantage of their participation in the network but refuse to expend their own resources to cooperate with network operations (Awerbuch, 2004).
Preventive, Reactive and Tolerant Security Mechanisms

A security mechanism is a process designed to detect, prevent or recover the network from attacks or intrusions. It can be independent of any particular protocol layer or be implemented in a specific one. In WSONs, security mechanisms follow two defense lines: preventive and reac-
Figure 1. Attacks by network layers
tive. The former provides mechanisms to avoid attacks, such as firewalls and cryptography. The latter consists in taking actions on demand to detect misbehaviors and react against intrusions, using, for example, intrusion detection systems (IDS) (Huang, 2003) or reputation systems (Buchegger, 2008). Reputation systems address malicious and selfish behavior by using nodes that track their neighbors' behavior and exchange this information with others in order to compute a reputation value about
their neighbors. Reputation values are then used by nodes to decide with whom to cooperate and which nodes to avoid. Nodes with a good reputation are favored. Despite the efforts to improve the preventive and reactive defense lines, these two defenses are not sufficient to ward off all attacks and intrusions (Deswarte, 2006). Preventive defenses are vulnerable to internal attacks. Nodes participating legally in the network can perform network
operations, as well as encryption and decryption, allowing malicious nodes to harm the integrity, confidentiality, availability and non-repudiation of data and services. Reactive defenses work efficiently only against well-known intrusions. Intrusion detection systems, for example, require extensive evidence gathering and comprehensive analysis in order to detect intrusions based on anomalies or predetermined intrusion patterns. Anomalies are deviations of the network operation from a behavior considered normal. Such behaviors are defined by baselines, which are difficult to determine in practice due to the dynamically changing topology and volatile physical environment of self-organizing wireless networks. Predetermined patterns are characteristics of known attacks and intrusions used for their detection. These patterns must be constantly updated to assure the efficiency of the IDS. Moreover, unknown or new attacks and intrusions will not be detected. Due to these limitations on both defense lines, research groups have built security mechanisms towards a third defense line called intrusion tolerance (IT) (Deswarte, 2006). This defense line complements the other ones, and its goal is to mitigate the effects of malicious or selfish actions by the use of fault tolerance mechanisms in the security domain (Deswarte, 2006; Sterbenz, 2002).

Figure 2. Classification of defense lines
Intrusion tolerance emerged with Fraga and Powell's initiative (Fraga, 1985); however, the development of such systems only gained more attention in the last decade with the MAFTIA (Malicious- and Accidental-Fault Tolerance for Internet Applications) (Randell, 2010) and OASIS (Organically Assured and Survivable Information System) (Duren, 2004) projects. Examples of fault tolerance mechanisms are data redundancy and replication. Figure 2 illustrates the organization of these three defense lines.
Survivability Concepts

Dependability is the ability of a system (node, network or service) to avoid failures that are more severe and frequent than is acceptable. A failure is an event that occurs as a consequence of errors – deviations from the correct service operation. Dependability is an integrating concept that encompasses the following main attributes: availability, reliability, safety, integrity and maintainability. These attributes represent, respectively, the system's readiness for correct service, the continuity of correct service, the absence of severe consequences on the user and the environment, the nonexistence of improper system alteration, and the ability to undergo modifications and repairs.
Security is a well-developed domain with its own terms and concepts. Unlike dependability, it is a combination of the following attributes: confidentiality, integrity, availability, non-repudiability, auditability and authenticity. These attributes represent the prevention of the unauthorized disclosure, deletion, withholding or amendment of data and information. Laprie et al. have analyzed the relationship among the attributes of both domains, supporting the integration between dependability and security (Laprie, 2004). In this work, survivability represents the new field of study resulting from the integration of security and dependability. Survivability allows systems to survive, limit damage, recover and operate robustly, particularly in the face of attacks and intrusions (Ellison, 1997). It is the capability of a system to provide essential services in the face of malicious threats, and to recover compromised services in a timely manner after the occurrence of intrusions. Survivability is important due to the characteristics of attacks and intrusions. These threats present different conditions and features in comparison with failures and accidents, since they have intentional human causes. They are controlled by an intelligent adversary's mind, making it hard to forecast when they will happen, as well as their effects. New attacks and intrusions have a high probability of occurring, exploiting different vulnerabilities of the system. Hence, attacks and intrusions can only be efficiently treated when analyzed separately, requiring attributes, such as those proposed by survivability, to address known forms of malicious faults and create defenses against new ones. Requirements and key properties have been defined for survivability. Requirements are detailed in the next section, whereas survivability properties are summarized as resistance, recognition, recovery and adaptability (Sterbenz, 2002). Resistance is the capability of a system to repel attacks. Firewalls and cryptography are examples of mechanisms used to achieve it. Recognition is the
system's capability of detecting attacks and evaluating the extent of the damage. Examples of recognition mechanisms are intrusion detection systems using techniques such as pattern matching and internal system integrity verification. Recovery is the capability of restoring disrupted information or functionality within time constraints, limiting the damage and maintaining essential services. Conventional strategies applied for achieving recovery are data replication and redundancy. Finally, adaptability is the system's capability of adapting to emerging threats and quickly incorporating lessons learned from failures (Ellison, 1997; Sterbenz, 2002). Examples of adaptation techniques are topology control by radio power management, active networking and cognitive radio technology. Active networking technology is intended to allow the dynamic selection of MAC or network layer parameters, and the dynamic negotiation of algorithms and entire protocols based on application requirements or the communication environment (Sterbenz, 2002). Figure 3 illustrates the interaction among these key properties.
Survivability Requirements for WSONs

Survivability requirements can vary substantially depending on system scope, criticality, and the consequences of failure and service interruption (Ellison, 1997; Linger, 1998). Self-organizing wireless networks introduce diverse functions, operations and services influenced by the context and applications. In a critical situation where parts of the system are compromised by attacks or intrusions, priority is given to maintaining network connectivity at three levels: link layer, routing and end-to-end communication. Hence, to reach network survivability, some requirements must be provided. Requirements are the functions or features required to improve the network's capability of delivering essential services in the face of attacks or intrusions, and of recovering services.
Figure 3. Survivability key properties
Survivability requirements for WSONs are categorized in two groups: those related to essential services and those related to network characteristics. The first group is composed of requirements resulting from essential service characteristics. These requirements are summarized as:

• Heterogeneity – a survivability approach must consider the heterogeneity of nodes, communication technologies and node resource capacities;
• Self-configuration – the network must be able to change dynamically the parameter values of connections, nodes and protocols, as well as of security mechanisms, such as cryptographic key length, firewall rules and reputation system thresholds;
• Self-adaptation – the network must be able to adjust itself in response to mobility, the environment and the requirements of activities, such as the Quality of Service (QoS) level;
• Efficiency – a survivable approach should support the efficient use of node and network resources, such as node energy and network bandwidth, when a malicious fault is suspected or happening;
• Access control – mechanisms must control the access of nodes to the network, as well as monitor their activities;
• Protection – security mechanisms need to be managed and combined in order to protect the communication at all layers of the protocol stack;
• Integrity, confidentiality, authentication and non-repudiation – these security principles must be assured for communication;
• Redundancy – the network must tolerate and mitigate attacks by means of intrusion tolerance techniques, such as duplicated protocols or cryptographic operations, the simultaneous use of multiple routes, and others;
• Robustness – the network must continue providing services during eventual disconnections and along partial segments of paths.
The second group includes survivability requirements resulting from network characteristics, such as:
Figure 4. Survivability requirements
• Decentralization – survivability must be provided by a decentralized approach in order to avoid central points of attack;
• Self-organization – mechanisms for supporting survivability must be self-organized, without requiring human intervention in the face of changes in network conditions;
• Scalability – mechanisms to provide survivability must consider the variability of the total number of nodes and the dynamic topology;
• Self-management – survivable mechanisms must guarantee network functionality and efficiency under all network conditions;
• Self-diagnosis – the network must monitor itself and find faulty, unavailable, misbehaving or malicious nodes;
• Self-healing – survivable approaches must prevent connectivity disruptions and recover the network from problems that might have happened. They must also find alternative ways of using resources and reconfiguring nodes, the network or protocols to keep them in normal operation;
• Self-optimization – mechanisms must optimize the use of network resources, minimizing latency and maintaining the quality of service.
In Figure 4, each essential service, such as link-layer connectivity, routing and communication, is associated with one of three layers: the link, network and application layers, respectively. Separately, these layers are not sufficient to achieve a completely survivable system, due to the characteristics of attacks and intrusions. Hence, a cross-layer approach can make security mechanisms more robust, resistant and survivable. The routing layer, for example, can use energy or bandwidth information provided by the link layer in order to make better choices and to be more adaptive. The routing layer can also inform the other layers of detected attacks so that they can start an alert procedure. In summary, the survivability mechanisms existing at the different layers must mutually support one another. Based on these considerations and on the survivability key properties presented, we have identified three view planes for survivable systems, as illustrated in Figure 5. In the first one (key properties), we have the properties that must
Figure 5. Planes of view for survivability
be achieved by the system. In the second one (requirements), we highlight the requirements that survivable systems need to meet. Finally, in the third one (protocol layers), we emphasize that all network layers need to be addressed by the system. We argue that a holistic survivable system must consider these three planes.
SURVIVABLE APPROACHES

Architectures

In the last few years, research interest in survivability has increased. Initially addressed in the military area, the first survivability architectures were proposed in order to improve both the security and the dependability of information systems, distributed services and storage systems in the Internet domain (Ghosh, 1999; John, 2002; Keromytis, 2003; Kreidl, 2004; Medhi, 2000; Wang, 2003; Wylie, 2000). Notwithstanding the importance of all these architectures to the development of survivability, we emphasize the Willow (John, 2002), SITAR (Wang, 2003) and SABER (Keromytis, 2003) architectures due to their completeness in terms of survivability properties. Further, these architectures exemplify, respectively, the centralized, partially distributed and fully distributed architectures. The Willow architecture (John, 2002) is designed to enhance the survivability of critical information systems. It proposes the merging of different mechanisms aiming to avoid, eliminate and tolerate faults. All of these mechanisms are based on a reconfiguration approach in which nodes of the network can together monitor and respond to faults. Each node and network operation is monitored continuously. However, the analysis of their operation is performed by central nodes, called servers, restricting the efficiency of the architecture. SITAR (Wang, 2003) is a survivable architecture for distributed services whose goal is to provide a minimal level of service despite the presence of active attacks. This architecture is composed of different components, such as proxy servers, monitors, an audit control module and an adaptive regeneration module. These components are transparent to the clients and servers of the service, and each component has a backup in order to guarantee its operation. The architecture controls
all requests and responses, and can be centralized or partially distributed. The SABER architecture (Keromytis, 2003) also integrates different mechanisms to improve the survivability of Internet services. SABER proposes a multi-layer approach in order to block, evade and react to a variety of attacks in an automated and coordinated fashion. The SABER architecture is composed of a DoS-resistant module, IDS and anomaly detection, a process migration mechanism and an automated soft-patching system. All of these components are controlled by a coordination infrastructure, which provides the communication and correlation among the components in a decentralized fashion. Although various survivable architectures exist, few of them were developed for self-organizing wireless networks. Aura and Maki (Aura, 2002), for example, proposed a distributed architecture towards survivable access control in ad hoc networks. Survivability is achieved by creating secure groups of nodes, managing their membership and proving group membership. Security operations are based on public key certificates, the whole architecture being based on cryptography. Groups are formed to grant access rights to nodes. The survivability of this scheme is thus reached through the existence of multiple groups and their independence: if a group ceases to exist, another group can execute the access control operations. Although the authors claim to propose an architecture, the solution is a specific survivable scheme for access control, not presenting a set of rules, concepts or models. Moreover, we identified that only the resistance and recovery survivability properties are reached by this scheme. A survivable architecture for wireless sensor networks (WSNs) was proposed by Tipper et al. (Qian, 2007). The architecture aims to provide critical services in spite of physical and network-based security attacks, accidents or failures. However, the architecture is limited to identifying a set of requirements related to security and survivability,
such as energy efficiency, reliability, availability, integrity, confidentiality and authentication.
Routing

This section describes some initiatives on building survivable routing for WSONs. Although these initiatives do not present proposals with all the survivability properties, their characteristics are more closely correlated to survivability goals than just the preventive or reactive defense lines. We focus on propositions that aggregate more than one defense line and apply at least one tolerance technique, such as redundancy. We categorize the initiatives into two main groups: route discovery and data forwarding. The former comprises approaches that try to make the route discovery phase of routing protocols more resistant and tolerant to attacks and intrusions. The latter is composed of initiatives specialized in data forwarding that use preventive or reactive security mechanisms together with a tolerance technique.
Route Discovery

Routing is essential for the operation of WSONs. Many routing protocols have been proposed in the literature, including proactive (table-driven), reactive (demand-driven) and hybrid solutions. Most of them have assumed these networks to be a trusted environment in which nodes can trust and cooperate with each other. However, these networks are vulnerable to attacks and intrusions, as discussed in this chapter. Due to vulnerabilities in the routing service, secure routing protocols have been proposed (Hu, 2003; Hu, 2005), such as SRP (Secure Routing Protocol), SAODV (Secure Ad hoc On-Demand Distance Vector) and SAR (Security-Aware Routing). These protocols are mostly based on authentication and encryption algorithms, and are insufficient to ward off all intruders and attacks. Therefore, some research groups have built intrusion-tolerant routing approaches, such as
TIARA (Techniques for Intrusion-resistant Ad Hoc Routing Algorithms) (Ramanujan, 2003), BFTR (Best-Effort Fault Tolerant Routing) (Xue, 2004), ODSBR (An On-Demand Secure Byzantine Routing Protocol) (Awerbuch, 2005) and BA (Boudriga's Approach) (Boudriga, 2005). TIARA defines a set of design techniques to mitigate the impact of Denial of Service (DoS) attacks and can be applied to routing protocols to allow acceptable network operation in the presence of these attacks. The main techniques established by TIARA are: flow-based route access control (FLAC), distributed wireless firewall, multipath routing, flow monitoring, source-initiated flow routing, fast authentication, the use of sequence numbers and referral-based resource allocation. For its effective implementation, TIARA should be adapted to a routing protocol, and it is incorporated more easily into on-demand protocols, such as DSR and AODV. In the FLAC technique, a distributed wireless firewall and limited resource allocation are applied together to control packet flows and to prevent attacks based on resource overload. Each node participating in the ad hoc network contains an access control list, where authorized flows are defined. A threshold is defined for allocating a limited amount of network resources to a given flow. Many routes are discovered and maintained, but only one route is chosen for data forwarding. The flow monitoring technique checks for network failures by sending periodic control messages, called flow status packets. If a path failure is identified, an alternative path found in the discovery phase is selected. The authentication process in TIARA consists of placing the path label of the packet in a secret position. Each node can define a different position for the label within the packet, this being its authentication information. Best-effort fault-tolerant routing (BFTR) is a source routing algorithm that exploits the path redundancy of ad hoc networks. Its goal is to maintain the packet routing service with a high delivery ratio and low overhead in the presence of misbehaving nodes. BFTR
never attempts to conclude whether the path or any node along it is good or bad. It uses existing statistics to choose the most feasible path, such as the one with the highest packet delivery ratio in the immediate past. By means of these statistics and the receiver's feedback, different types of attacks, such as packet dropping, corruption or misrouting, can be detected indistinctly. BFTR relies on DSR flooding to retrieve a set of paths between source and destination nodes whenever necessary, and it initially chooses the shortest path to send packets. If a route failure is reported, the protocol discards the current routing path and proceeds with the next shortest path in the route cache. The algorithm considers that the expected behavior of any good node is to deliver packets correctly with a high delivery ratio. Hence, a good path consists of nodes with high delivery ratios; any path with a low delivery ratio is discarded and replaced by the next shortest path. BFTR requires no security support from intermediate nodes. The source and destination nodes of connections are assumed to be well behaved. A previous trust relationship between end nodes is required, making authentication between them possible during data communication. ODSBR is a routing protocol that provides a correct routing service even in the presence of Byzantine attacks (Holmer, 2008). ODSBR operates in three sequential phases: (i) least-weight route discovery, (ii) Byzantine fault localization and (iii) link weight management. The first phase is based on double secure flooding and aims to find the lowest-cost paths. Double flooding means that the route discovery protocol floods the network with both route request and route response messages in order to ensure path setup. In this phase, cryptographic operations guarantee secure authentication and digital signatures. The second phase discovers faulty links on the paths by means of an adaptive probing technique. This technique uses periodic secure acknowledgments from intermediate nodes along the route, and the integrity of the packets is assured by cryptography. The last phase of the ODSBR
protocol manages the weight assigned to a faulty link. Each faulty link has a weight used to identify bad links; this information is stored in a weight list and used by the first phase of the protocol. Results have shown the good performance of ODSBR in many scenarios and for different metrics. However, some important points are not evaluated or well defined. For example, ODSBR assumes the use of RSA cryptography and digital signatures without considering open issues such as public key distribution, node pair key initialization or the interaction among nodes to guarantee authenticity. These operations are essential for proper ODSBR functionality and influence the results. Moreover, it also relies on acknowledgments that cannot be assured due to mobility and the dynamic topology. Boudriga et al. (Boudriga, 2005) propose a new approach for building intrusion-tolerant MANETs. It consists of a multi-level trust model and a network layer mechanism for resource allocation and recovery. The multi-level trust model assumes that the network is divided into two virtual sets: the resource's domain and the user's domain. Each resource assigns a unique trust level to each type of activity that it is involved with and each location where it appears. Based on this trust level and on the activity, users or applications allocate resources through a distributed scheme. It allocates available resources attempting to maximize usage and minimize costs. For each application, only a fraction of a resource is allocated at a given node. Intrusion tolerance is reached through a distributed firewall mechanism, a technique for detecting and recovering from intruder-induced path failures, a trust relation between all nodes, IPsec-based packet authentication, and a wireless router module that enables survivability mechanisms against DoS attacks. The distributed firewall aims to protect the MANET against flooding attacks, and each node maintains a firewall table containing the list of all packets passing through it and successfully accepted by their destination. After a handshake between the sender and the receiver of a related
flow, the entries in the firewall table are maintained automatically and refreshed when failures, intrusion occurrences or other abnormal behaviors are detected. Based on those entries, a node can forbid any flood of spurious traffic. Three parameters are managed by the nodes to detect anomalies: the packet loss rate, the duplicate packet rate and the authentication failure rate.
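Before moving on, the best-effort path selection idea behind BFTR described above can be summarized in a few lines. The sketch below is our own illustration, not the published algorithm: the path representation, the feedback-fed counters and the acceptance threshold are assumptions made for the example.

```python
def choose_path(route_cache, stats, min_ratio=0.8):
    """Pick the shortest path whose observed delivery ratio is acceptable.

    route_cache: paths (tuples of node ids) from DSR-style route discovery.
    stats: maps a path to (delivered, sent) counters fed by receiver feedback.
    """
    for path in sorted(route_cache, key=len):       # shortest paths first
        delivered, sent = stats.get(path, (0, 0))
        if sent == 0 or delivered / sent >= min_ratio:
            return path    # untried paths get the benefit of the doubt
    return None            # no acceptable path; trigger route rediscovery

cache = [("s", "a", "d"), ("s", "b", "c", "d")]
stats = {("s", "a", "d"): (2, 10)}    # 20% delivery: suspected bad path
assert choose_path(cache, stats) == ("s", "b", "c", "d")
```

Note that, as in BFTR, no node is ever declared malicious; a path is simply abandoned when its observed delivery ratio drops.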
Data Forwarding

Some works have proposed secure routing mechanisms to defend against more than one attack (Khalil, 2005; Just, 2003; Kargl, 2004; Djenouri, 2007). Although those protocols ensure the correctness of route discovery, they cannot guarantee the secure and undisrupted delivery of data. Intelligent attackers can easily gain unauthorized access to the network, follow the rules of route discovery, place themselves on a route, and later redirect, drop or modify traffic, or inject data packets. In a nutshell, an adversary can hide its malicious behavior for a period of time and then attack unexpectedly, complicating its detection. For these reasons, mechanisms providing data confidentiality, data availability and data integrity are necessary to guarantee secure data forwarding. Several such mechanisms have been proposed. Lightweight cryptographic mechanisms such as Message Authentication Codes (MACs) (Krawczyk, 1997), for example, are used for data integrity, as sketched below. Nuglets (Buttyan, 2003), Friends and Foes (Miranda, 2003), Sprite (Zhong, 2003) and others propose mechanisms to stimulate node participation in data forwarding, trying to guarantee data availability. CORE (Michiardi, 2002) and CONFIDANT (Buchegger, 2002) are examples of reputation systems that provide information to distinguish between trustworthy and bad nodes. This information also encourages nodes to participate in the network in a trustworthy manner.
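As an illustration of the lightweight integrity mechanism just mentioned, the following sketch uses Python's standard hmac module to tag and verify forwarded packets. The shared pairwise key is assumed to have been established beforehand, for example by one of the key management schemes discussed later.

```python
import hmac
import hashlib

shared_key = b"pairwise key established out of band"  # assumption for the example

def tag(message: bytes) -> bytes:
    # HMAC-SHA256 over the payload; appended to each forwarded packet.
    return hmac.new(shared_key, message, hashlib.sha256).digest()

def verify(message: bytes, received_tag: bytes) -> bool:
    # compare_digest avoids timing side channels during comparison.
    return hmac.compare_digest(tag(message), received_tag)

packet = b"sensor reading: 21.5C"
t = tag(packet)
assert verify(packet, t)                       # intact packet accepted
assert not verify(packet + b"tampered", t)     # modified packet rejected
```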
Some solutions for providing data confidentiality and data availability have applied techniques such as redundancy and message protection to be more resilient to attacks. In SPREAD (Lou, 2004), SMT (Papadimitratos, 2003) and SDMP (Choudhury, 2006), for example, the message is divided into multiple pieces by a message division algorithm. These pieces are simultaneously sent from the source to the destination over multiple paths. In (Berman, 2006), a cross-layer approach is investigated to improve data confidentiality and data availability, using directional antennas and intelligent multipath routing with data redundancy. The Secure Protocol for Reliable Data Delivery (SPREAD) scheme uses several techniques to enhance data confidentiality and data availability. Initially, messages are split into multiple pieces by the source node using the threshold secret sharing scheme (Shamir, 1979). Each piece is encrypted and sent out via multiple independent paths. Encryption between neighboring nodes with a different key is assumed, as well as the existence of an efficient key management scheme. SPREAD focuses on three main operations: dividing the message, selecting multiple paths and allocating the message pieces to paths. Messages are split by the threshold secret sharing algorithm and each piece is allocated to a selected path, aiming to minimize the probability of harm. SPREAD selects multiple independent paths taking into account security factors, such as the probability of a path being compromised. The goal of SPREAD is to achieve an optimal share allocation, in which an attacker would have to compromise all the paths to recover the message.
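The threshold secret sharing scheme at the heart of SPREAD can be sketched compactly. The following is an illustrative (t, n) implementation of Shamir's scheme over a prime field, not SPREAD's actual code; the prime and the share encoding are our own choices, and the secret must be smaller than the prime.

```python
import random

PRIME = 2**127 - 1   # a Mersenne prime; secrets must be < PRIME

def split_secret(secret: int, n_shares: int, threshold: int):
    """Split `secret` into n shares; any `threshold` of them reconstruct it."""
    # Random polynomial of degree threshold-1 with constant term = secret.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(threshold - 1)]
    def f(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n_shares + 1)]

def recover_secret(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # Modular inverse via Fermat's little theorem (PRIME is prime).
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

# Example: a 16-byte message key split into 5 shares; any 3 recover it,
# while 2 or fewer reveal nothing about the key.
secret = int.from_bytes(b"sixteen byte key", "big")
shares = split_secret(secret, n_shares=5, threshold=3)
assert recover_secret(shares[:3]) == secret
```

Sending each share over a different node-disjoint path is what forces the attacker to compromise multiple paths at once.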
The goal of the Secure Message Transmission (SMT) protocol is to ensure data confidentiality, data integrity and data availability, safeguarding the end-to-end transmission against the malicious behavior of intermediary nodes. SMT exploits four main characteristics: an end-to-end secure feedback mechanism, dispersion of the transmitted data, simultaneous usage of multiple paths, and adaptation to changing network conditions. It requires a security association (SA) between the two end communicating nodes, so no link encryption is needed. This trust relationship is indispensable for providing the data integrity and authentication of end nodes necessary for any secure communication scheme. The two end nodes make use of a set of node-disjoint paths, called the Active Path Set (APS), which is a subset of all existing paths between them. The data message is broken into several small pieces by the information dispersal scheme. Data redundancy is added to allow recovery, and it is also divided into pieces. All pieces are sent through the different routes in the APS, statistically enhancing the confidentiality and availability of the exchanged messages. At the destination, the dispersed message is successfully reconstructed only if a sufficient number of pieces are received. Each piece carries a Message Authentication Code (MAC), allowing its integrity verification by the destination. The destination validates the incoming pieces and acknowledges the successfully received ones through feedback to the source. The feedback mechanism is also protected by cryptography and is dispersed to provide fault tolerance. Each path in the APS has a reliability rating calculated from the number of successful and unsuccessful transmissions on this path. SMT uses this rating to manage the paths in the APS, trying to determine and maintain a maximally secure path set, and adjusting its parameters to remain effective and efficient.
The Secured Data based Multi Path (SDMP) protocol also exploits multiple paths between network nodes to increase robustness and data confidentiality. The protocol assumes Wired Equivalent Privacy (WEP) link encryption/decryption of all frames between neighboring nodes, which provides link layer confidentiality and authentication. SDMP can work with any routing protocol that provides topology discovery and supports multipath routing. SDMP distinguishes between two types of path: signaling and data. The signaling type requires only one path of the path set between the source and destination nodes, leaving the other paths available for data transmission. The protocol divides the message into pieces using the Diversity Coding approach (Ayanoglu, 1993). Each piece has a unique identifier, and the pieces are combined in pairs through an XOR operation involving a random integer number. Each pair is sent along a different path. All information necessary for message reconstruction at the destination is sent over the signaling path. Unless the attacker can gain access to all of the transmitted parts, the probability of message reconstruction is low. That is, to compromise the confidentiality of the original message, the attacker must get within eavesdropping range of the source/destination, or simultaneously listen to the used paths and decrypt the WEP encryption of each transmitted part. However, it is possible to deduce parts of the original message from only a few of the transmitted pieces, especially since one piece of the original message is always sent in its original form on one of the paths.
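The XOR-pairing idea behind SDMP's use of Diversity Coding can be illustrated as follows. This is a simplified sketch under our own assumptions: a simple chain pairing of consecutive pieces, and a dictionary standing in for the signaling-path metadata. It also reproduces the weakness noted above, since one piece travels in its original form.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def sdmp_style_encode(message: bytes, n_paths: int):
    """Split a message and XOR-combine consecutive pieces (illustrative)."""
    size = -(-len(message) // n_paths)                       # ceiling division
    pieces = [message[i:i + size].ljust(size, b"\0")
              for i in range(0, len(message), size)]
    units = [xor(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
    units.append(pieces[-1])          # last piece is sent in clear (weakness)
    signaling = {"size": size, "len": len(message)}   # sent on signaling path
    return units, signaling

def sdmp_style_decode(units, signaling):
    pieces = [units[-1]]              # start from the piece sent in clear
    for u in reversed(units[:-1]):    # unwind the XOR chain backwards
        pieces.insert(0, xor(u, pieces[0]))
    return b"".join(pieces)[: signaling["len"]]

msg = b"confidential payload routed over disjoint paths"
units, sig = sdmp_style_encode(msg, n_paths=4)   # one unit per data path
assert sdmp_style_decode(units, sig) == msg
```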
In contrast to the previous solutions, a cross-layer approach is investigated in (Berman, 2006). The solution uses directional antennas and intelligent multipath routing to enhance end-to-end data confidentiality and data availability. Unlike an omnidirectional antenna, which transmits or receives radio waves uniformly in all directions, a directional antenna transmits or receives radio waves in one particular direction. Directional antennas make eavesdropping more difficult and reduce the areas covered by packet transmissions, minimizing the overlap of message pieces sent over multiple paths. Thus, the use of directional antennas is justified by the reduction in the likelihood that an adversary is able to simultaneously gather all of the message pieces at the source or destination nodes. A self-adaptive transmission power control mechanism is used together with the directional antennas to reduce the message interception probability. This mechanism allows the transmitter to employ only enough transmission power to reach the intended receiver, minimizing the radiation pattern of a given radio transmission and the possibility of an attacker intercepting the message. The transmission power is adjusted dynamically depending on the type of data packet exchanged between neighboring nodes. Multipath routing is also used: messages are divided based on the threshold secret sharing algorithm, and the shares are then sent over multiple node-disjoint paths. Two intelligent routing schemes are proposed to reduce the message interception probability. The former minimizes the physical distance of hops and the latter minimizes the path-set correlation factor.
Public-Key Infrastructures

Many approaches for public key distribution have been proposed towards survivability (Capkun, 2003; Zhou, 1999). The first proposals adapt traditional key management systems to the characteristics of self-organizing wireless networks. They rely on the existence of a certificate authority (CA) that issues digital certificates binding an identity to its public key. In the threshold CA scheme (Zhou, 1999), for instance, Zhou et al. designed a fault-tolerant CA, employing both secret sharing and threshold cryptography techniques to reach redundancy. However, the system must have a central trust entity to bootstrap the key management service. Further, it is still vulnerable to attacks when the number of compromised nodes is greater than a threshold. Hence, to improve on these weaknesses, many other PKIs have been proposed (Merwe, 2007). Self-organized key management (PGP-like) (Capkun, 2003) is another initiative towards survivability. It modifies Pretty Good Privacy (PGP) so that the CA functionality is fully distributed. All nodes have the same role, being responsible for generating their own key pairs and issuing certificates for the public keys of nodes they trust. The public key validation process consists of finding a chain
of certificates in the local repositories of a node. This chain is a sequence of public key certificates that starts with a certificate produced by the node wanting to validate the key and ends with the certificate of the key itself. Although PGP-like seems better suited to self-organization, it is vulnerable to some attacks, such as Sybil, lack of cooperation and masquerading. Some PKI proposals employ the concept of clusters or multicast communication. In a nutshell, they create groups of nodes, generating keys per group in order to provide access control. These proposals present characteristics, such as resiliency, fault tolerance or scalability, that can improve survivability (Wu, 2007). However, they focus mainly on efficiency, neither dealing with a complete survivable system nor reaching all the survivability requirements and properties. Despite the previous initiatives, none of them has been designed with survivability in mind. To the best of our knowledge, only a few works have proposed survivable key management systems, such as that of Chorzempa et al. (Chorzempa, 2007). In that work, a survivable and efficient key management system for wireless sensor networks is presented, focusing on robustness and recovery. Methods for distributing, maintaining and recovering session keys are defined to operate even in the case of compromised nodes. However, only the recovery property is handled, without considering the resistance, recognition and adaptability properties together.
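The PGP-like certificate chain search described above is essentially a path search over a web of trust. The sketch below is an illustrative breadth-first search over an abstract repository; the data structure and names are assumptions made for the example, not part of the cited scheme.

```python
from collections import deque

def find_certificate_chain(issued, start, target):
    """BFS over a merged local certificate repository.

    `issued` maps a node to the set of nodes whose public keys it has
    certified. A returned chain starts at the validating node and ends
    at the key being validated, as in PGP-like schemes.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        chain = queue.popleft()
        if chain[-1] == target:
            return chain
        for nxt in issued.get(chain[-1], ()):   # certificates signed by chain[-1]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(chain + [nxt])
    return None   # no chain found: the key cannot be validated locally

repo = {"u": {"a", "b"}, "a": {"c"}, "c": {"v"}}
assert find_certificate_chain(repo, "u", "v") == ["u", "a", "c", "v"]
```

Note that the Sybil and masquerading attacks mentioned above work precisely because any node can populate such a repository with certificates for fabricated identities.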
CONCLUSION AND FUTURE TRENDS

The use of wireless technologies and self-organized networks has increased and, consequently, security and reliability issues have become more and more important. Traditional security defense lines are sufficient neither for such networks nor for new applications in the medical, financial and commercial domains.
Since wireless self-organized networks (WSONs) present different characteristics and properties, new approaches are required. This chapter introduced survivability concepts and their correlation with the preventive, reactive and tolerance defense lines in the wireless network context. We detailed the key properties of survivability, namely resistance, recognition, recovery and adaptability, and we analyzed the survivability requirements for WSONs. Those requirements comprise self-organization, self-control, self-configuration, self-management, access control, protection, authentication, scalability, redundancy and others. This chapter also overviewed existing survivable initiatives, categorizing them into three groups: architectures, routing services and cryptography management services. Further, those initiatives were analyzed considering the survivability requirements and properties. Based on this investigation, we can conclude that (i) security solutions for WSONs still apply only a small set of preventive and reactive techniques; (ii) solutions focus either on specific attacks or on only one layer of the protocol stack; (iii) the adaptability property is almost unexplored; and (iv) requirements such as heterogeneity, efficiency, robustness and self-management are not yet reached. Based on this conclusion, we argue that survivable service computing in WSONs will be achieved (even in the presence of attacks or intrusions) by means of self-adaptation and cooperation among the three defense lines: preventive, reactive and tolerance. Further, survivable solutions should consider multiple layers and multiple attacks. Some limitations can be identified in the current state of the art related to survivability in WSONs. First, existing survivable architectures for WSONs are conceptual and generic, providing guidelines for reaching survivable services. Since the main goal of research is always to develop solutions that are as practical as possible, a survivable architecture should be exploited towards the definition of a framework, its functions, and the implementation of an API (Application Program Interface) in order to allow its practical use. Further, the framework
must consider differences between case studies and must be flexible, allowing the addition of new functions when necessary. Another limitation of existing solutions is related to metrics for survivability evaluation. How to measure survivability is, in fact, a topic that generates much discussion in this field. Hence, the definition of metrics specifically for this purpose is required. We still envisage a long path of discussion on this topic, considering the characteristics of self-organizing wireless networks. Survivable solutions must also deal with communication issues related to the diffusion of information, for example. Due to the decentralized nature of WSONs, we believe that solutions must be proposed to minimize the impact on performance and improve efficiency. Further, solutions considering survivability requirements and goals must be proposed; in particular, the network situation must be addressed. Creating a communication strategy that adapts or changes the techniques used according to the network situation is a good direction for research. The development of adaptive solutions, algorithms and protocols considering survivability aspects is expected to support network survivability.
REFERENCES

Aad, I., Hubaux, J.-P., & Knightly, E. W. (2004). Denial of service resilience in ad hoc networks. In Proceedings of the ACM Annual International Conference on Mobile Computing and Networking (MobiCom) (pp. 202-215), New York, NY, USA. ACM Press.

Akyildiz, I. F., & Wang, X. (2005). A survey on wireless mesh networks. IEEE Communications Magazine, 43(9), 23–30. doi:10.1109/MCOM.2005.1509968
Aura, T., & Maki, S. (2002). Towards a survivable security architecture for ad-hoc networks. In Proceedings of the International Workshop on Security Protocols (pp. 63–73), London, UK. Springer-Verlag.

Awerbuch, B., Curtmola, R., Holmer, D., Nita-Rotaru, C., & Herbert, R. (2004). Mitigating byzantine attacks in ad hoc wireless networks. Technical report. Center for Networking and Distributed Systems, Computer Science Department, Johns Hopkins University.

Awerbuch, B., Curtmola, R., Holmer, D., Nita-Rotaru, C., & Herbert, R. (2005). On the survivability of routing protocols in ad hoc wireless networks. In Proceedings of the First International Conference on Security and Privacy for Emerging Areas in Communications Networks (pp. 327-338), Washington, DC, USA. IEEE Computer Society.

Ayanoglu, E., Chih-Lin, I., Gitlin, R. D., & Mazo, J. E. (1993). Diversity coding for transparent self-healing and fault-tolerant communication networks. IEEE Transactions on Communications, 41(11), 1677–1686. doi:10.1109/26.241748

Berman, V., & Mukherjee, B. (2006). Data security in MANETs using multipath routing and directional transmission. In Proceedings of the IEEE International Conference on Communications (ICC) (pp. 2322-2328), v. 5. IEEE Computer Society.

Bessani, A. N., Sousa, P., Correia, M., Neves, N. F., & Verissimo, P. (2008). The CRUTIAL way of critical infrastructure protection. IEEE Security & Privacy, 6(6).

Boudriga, N. A., & Obaidat, M. S. (2005). Fault and intrusion tolerance in wireless ad hoc networks. In Proceedings of the IEEE Wireless Communications and Networking Conference (pp. 2281-2286), v. 4. IEEE Computer Society.
Buchegger, S., & Le Boudec, J. Y. (2002). Performance analysis of the CONFIDANT protocol: Cooperation of nodes - fairness in dynamic ad-hoc networks. In Proceedings of the ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc), Lausanne, CH. IEEE Computer Society.

Buchegger, S., Mundinger, J., & Le Boudec, J. Y. (2008). Reputation systems for self-organized networks. IEEE Technology and Society Magazine, 27(1), 41–47. doi:10.1109/MTS.2008.918039

Buttyan, L., & Hubaux, J. P. (2003). Stimulating cooperation in self-organizing mobile ad hoc networks. Mobile Networks and Applications, 8(5), 579–592. doi:10.1023/A:1025146013151

Capkun, S., Buttyan, L., & Hubaux, J. P. (2003). Self-organized public-key management for mobile ad hoc networks. IEEE Transactions on Mobile Computing, 2(1), 52–64. doi:10.1109/TMC.2003.1195151

Chorzempa, M., Park, J. M., & Eltoweissy, M. (2007). Key management for long-lived sensor networks in hostile environments. Computer Communications, 30(9), 1964–1979. doi:10.1016/j.comcom.2007.02.022

Choudhury, R., Yang, X., Ramanathan, R., & Vaidya, N. H. (2006). On designing MAC protocols for wireless networks using directional antennas. IEEE Transactions on Mobile Computing, 5(5), 477–491. doi:10.1109/TMC.2006.69

Deswarte, Y., & Powell, D. (2006). Internet security: an intrusion-tolerance approach. Proceedings of the IEEE, 94(2), 432–441. doi:10.1109/JPROC.2005.862320

Djenouri, D., & Badache, N. (2007). Struggling against selfishness and blackhole attacks in MANETs. Wireless Communications and Mobile Computing.
Duren, M. (2004). Organically assured and survivable information systems (OASIS). Technical report, Air Force Research Laboratory. USA: Wetstone Technologies.

Ellison, R., Fisher, D., Linger, R., Lipson, H., Longstaff, T., & Mead, N. (1997). Survivable network systems: an emerging discipline. Technical report. Pittsburgh, PA: Software Engineering Institute, Carnegie Mellon University.

Figueiredo, C. M. S., & Loureiro, A. A. (2010). On the design of self-organizing ad hoc networks. In Designing solutions-based ubiquitous and pervasive computing: New issues and trends (pp. 248–262). Hershey, PA: IGI Global. doi:10.4018/978-1-61520-843-2.ch013

Fraga, J., & Powell, D. (1985). A fault- and intrusion-tolerant file system. In Proceedings of the Third International Conference on Computer Security (pp. 203-218).

Ghosh, A. K., & Voas, J. M. (1999). Inoculating software for survivability. Communications of the ACM, 42(7), 38–44. doi:10.1145/306549.306563

Holmer, D., Nita-Rotaru, C., & Hubens, H. (2008). ODSBR: an on-demand secure byzantine resilient routing protocol for wireless ad hoc networks. ACM Transactions on Information and System Security, 10(4), 1–35. ACM Press.

Hu, Y., Johnson, D., & Perrig, A. (2003). SEAD: secure efficient distance vector routing for mobile wireless ad hoc networks. Ad Hoc Networks, 1, 175–192. doi:10.1016/S1570-8705(03)00019-2

Hu, Y., Johnson, D., & Perrig, A. (2005). Ariadne: a secure on demand routing protocol for ad hoc networks. Wireless Networks, 11(1-2), 21–38. doi:10.1007/s11276-004-4744-y

Huang, Y., & Lee, W. (2003). A cooperative intrusion detection system for ad hoc networks. In Proceedings of the First ACM Workshop on Security of Ad Hoc and Sensor Networks (pp. 135-147). ACM Press.
John, K., Dennis, C., Alexander, H., Antonio, W., Jonathan, C., Premkumar, H., & Michael, D. (2002). The Willow architecture: comprehensive survivability for large-scale distributed applications. In Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks. IEEE Computer Society.

Just, M., Kranakis, E., & Wan, T. (2003). Resisting malicious packet dropping in wireless ad hoc networks. In Proceedings of the International Conference on Ad-Hoc Networks and Wireless (ADHOC-NOW) (pp. 151-163).

Kargl, F., Klenk, A., Schlott, S., & Weber, M. (2004). Advanced detection of selfish or malicious nodes in ad hoc networks. In Proceedings of the First European Workshop on Security in Ad Hoc and Sensor Networks (ESAS) (pp. 152-165). Springer.

Keromytis, A. D., Parekh, J., Gross, P. N., Kaiser, G., Misra, V., Nieh, J., et al. (2003). A holistic approach to service survivability. In Proceedings of the ACM Workshop on Survivable and Self-Regenerative Systems (pp. 11-22). ACM Press.

Khalil, I., Bagchi, S., & Nita-Rotaru, C. (2005). DICAS: detection, diagnosis and isolation of control attacks in sensor networks. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SECURECOMM) (pp. 89-100). IEEE Computer Society.

Krawczyk, H., Bellare, M., & Canetti, R. (1997). HMAC: Keyed-hashing for message authentication. Request for Comments (RFC 2104). Internet Engineering Task Force.

Kreidl, O. P., & Frazier, T. M. (2004). Feedback control applied to survivability: a host-based autonomic defense system. IEEE Transactions on Reliability, 53(1), 148–166. doi:10.1109/TR.2004.824833
Laprie, J. C., Randell, B., Avizienis, A., & Landwehr, C. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33. doi:10.1109/TDSC.2004.2

Linger, R. C., Mead, N. R., & Lipson, H. F. (1998). Requirements definition for survivable network systems. In Proceedings of the Third International Conference on Requirements Engineering (pp. 1-14). IEEE Computer Society.

Lou, W., Liu, W., & Fang, Y. (2004). SPREAD: enhancing data confidentiality in mobile ad hoc networks. In Proceedings of the IEEE INFOCOM (pp. 2404-2411). IEEE Computer Society.

Marti, S., Giuli, T. J., Lai, K., & Baker, M. (2000). Mitigating routing misbehavior in mobile ad hoc networks. In Proceedings of the ACM Annual International Conference on Mobile Computing and Networking (MobiCom) (pp. 255-265). ACM Press.

Maughan, D., Schertler, M., Schneider, M., & Turner, J. (1998). Internet security association and key management protocol (ISAKMP). Request for Comments (RFC 2408). Internet Engineering Task Force.

Medhi, D., & Tipper, D. (2000). Multi-layered network survivability – models, analysis, architecture, framework and implementation: an overview. In Proceedings of the DARPA Information Survivability Conference and Exposition (DISCEX) (pp. 173–186).

Merwe, J. V., Dawoud, D., & McDonald, S. (2007). A survey on peer-to-peer key management for mobile ad hoc networks. ACM Computing Surveys, 39(1), 1–45. doi:10.1145/1216370.1216371

Michiardi, P., & Molva, R. (2002). CORE: a collaborative reputation mechanism to enforce node cooperation in mobile ad hoc networks. In Proceedings of the IFIP TC6/TC11 Joint Working Conference on Communications and Multimedia Security (pp. 107–121). Kluwer.
Miranda, H., & Rodrigues, L. (2003). Friends and Foes: Preventing selfishness in open mobile ad hoc networks. In Proceedings of the IEEE International Conference on Distributed Computing Systems Workshop (ICDCSW) (pp. 440). IEEE Computer Society.

Papadimitratos, P., & Haas, Z. J. (2003). Secure data transmission in mobile ad hoc networks. In Proceedings of the ACM Workshop on Wireless Security (WiSe) (pp. 41–50). ACM Press.

Qian, Y., Lu, K., & Tipper, D. (2007). A design for secure and survivable wireless sensor networks. IEEE Wireless Communications, 14(5), 30–37. doi:10.1109/MWC.2007.4396940

Ramanujan, R., Kudige, S., & Nguyen, T. (2003). Techniques for intrusion resistant ad hoc routing algorithms TIARA. In Proceedings of the DARPA Information Survivability Conference and Exposition (DISCEX) (pp. 98). IEEE Computer Society.

Randell, B., Stroud, R., Verissimo, P., Neves, N., O'Halloran, C., Creese, S., et al. (2010). Malicious- and accidental-fault tolerance for internet applications. Retrieved June 30, 2010, from www.laas.fr/TSF/cabernet/maftia/

Shamir, A. (1979). How to share a secret. Communications of the ACM, 22(11), 612–613. doi:10.1145/359168.359176

Sterbenz, J. P. G., Krishnan, R., Hain, R., Levin, D., Jackson, A. W., & Zao, J. (2002). Survivable mobile wireless networks: issues, challenges, and research directions. In Proceedings of the ACM Workshop on Wireless Security (WiSe) (pp. 31–40). ACM Press.

Wang, F., & Uppalli, R. (2003). SITAR: a scalable intrusion-tolerant architecture for distributed services. In Proceedings of the DARPA Information Survivability Conference and Exposition (DISCEX) (pp. 153–155).

Wu, B., Wu, J., Fernandez, E. B., Ilyas, M., & Magliveras, S. (2007). Secure and efficient key management in mobile ad hoc networks. Journal of Network and Computer Applications, 30(3), 937–954. doi:10.1016/j.jnca.2005.07.008

Wylie, J. J., Bigrigg, M. W., Strunk, J. D., Ganger, G. R., Kiliccote, H., & Khosla, P. K. (2000). Survivable information storage systems. IEEE Computer, 33(8), 61–68.

Xue, Y., & Nahrstedt, K. (2004). Providing fault-tolerant ad hoc routing service in adversarial environments. Wireless Personal Communications: An International Journal, 29(3-4), 367–388. doi:10.1023/B:WIRE.0000047071.75971.cd

Zhong, S., Chen, J., & Yang, Y. R. (2003). Sprite: a simple, cheat-proof, credit-based system for mobile ad-hoc networks. In Proceedings of the IEEE INFOCOM (pp. 1987-1997). IEEE Computer Society.

Zhou, L., & Haas, Z. J. (1999). Securing ad hoc networks. IEEE Network, 13(6), 24–30. doi:10.1109/65.806983
Section 4
Security
Chapter 16
Engineering Secure Web Services Douglas Rodrigues Universidade de São Paulo, Brazil Julio Cezar Estrella Universidade de São Paulo, Brazil Francisco José Monaco Universidade de São Paulo, Brazil
Kalinka Regina Lucas Jaquie Castelo Branco Universidade de São Paulo, Brazil Nuno Antunes Universidade de Coimbra, Portugal Marco Vieira Universidade de Coimbra, Portugal
ABSTRACT

Web services are key components in the implementation of Service Oriented Architectures (SOA), which must satisfy proper security requirements in order to be able to support critical business processes. Research works show that a large number of web services are deployed with significant security flaws, ranging from code vulnerabilities to the incorrect use of security standards and protocols. This chapter discusses state of the art techniques and tools for the deployment of secure web services, including standards and protocols for the deployment of secure services, and security assessment approaches. The chapter also discusses how relevant security aspects can be correlated into practical engineering approaches.
INTRODUCTION

The increasing use of Service-Oriented Architectures (SOA) in critical applications demands dependable and cost-effective techniques to ensure high security. Web services (WS), the cornerstone of current SOA technology, are widely used for linking suppliers and clients in different sectors such as
banking and financial services, transportation, and manufacturing, to name a few. However, engineering secure web services is a non-trivial task, as several studies show that a large number of WS implementations are deployed with security flaws that range from code vulnerabilities to the inadequate use of standards and protocols. Engineering secure web services requires developers to clearly identify and understand security requirements. To implement those requirements, adequate security standards and protocols have to be applied. While essential WS standards such
as XML (eXtensible Markup Language), SOAP (Simple Object Access Protocol), UDDI (Universal Description, Discovery and Integration) and WSDL (Web Service Definition Language) address the basic concepts of interoperable services, the design of secure WS requires complementary rules to be added. Two well-known examples are the OASIS (Organization for the Advancement of Structured Information Standards) standards WS-Security and SAML (Security Assertion Markup Language). The former aims at SOAP message security and provides integrity and confidentiality features. The latter focuses on exchanging security information. Developers must therefore understand the main security specifications for Web services, which include cryptographic algorithms and techniques that implement digital signatures (e.g., WS-Security, WS-SecureConversation, XML-Signature, XML-Encryption, XACML (eXtensible Access Control Markup Language) and SAML). Applying adequate security standards and protocols is not sufficient for guaranteeing secure web services. In fact, software design and coding defects are a major source of vulnerabilities and can put at stake any security countermeasures. For example, interface and communication faults related to problems in the interaction among software components/modules are particularly relevant in service-oriented environments, as services must provide a secure interface to the client applications, even in the presence of malicious inputs. WS developers need to be aware of techniques and tools that help them to assess how secure a service is, such as black-box testing (e.g., robustness testing and penetration testing) and white-box analysis (e.g., code inspection and static code analysis). This chapter reviews the existing standards, protocols, and tools for developing secure web services, presenting also the most frequent attacks performed against web services and the countermeasures that could be used to avoid them. Additionally, the chapter presents several
techniques and tools for assessing the security of web services, which allow checking the effectiveness of the underlying security mechanisms and coding practices.
SECURITY STANDARDS AND PROTOCOLS FOR WEB SERVICES

Enabling information security on the Internet is a mandatory step for fostering business on the Web, especially if we consider systems based on Web services and SOA architectures. In their native form, Web services do not take security requirements into account; in most cases, these are superficially met by applying security standards in the context of XML-based SOAP messages. Multi-hop message routing between multiple Web services is commonly used to achieve scalability and also to bridge different protocols. Some technologies, such as TLS/SSL (Transport Layer Security/Secure Sockets Layer), were initially developed to guarantee confidentiality between two parties (Dierks and Allen, 1999), (Freier et al., 1996), but they do not provide end-to-end security. To address this challenge, diverse security principles must be applied to different contexts, taking into account both point-to-point and end-to-end settings, as well as the associated considerations concerning the privacy of user information shared in this process. To enable security in this new environment, novel mechanisms have to be put on top of the ones already available at the transport and network layers of the TCP/IP stack. Standards such as XML, SOAP, UDDI and WSDL address the basics of interoperable services, but for secure Web services and SOA other rules must be added and approved (currently, a de facto security standard for SOA architectures is not available).
Security at the Network Layer

The standard method for providing the privacy, integrity and authenticity of information transferred
across IP networks is the IPSec protocol (Kent & Seo, 2005). Other Internet security protocols such as SSL and TLS operate from the transport layer (layer 4) to the application layer (layer 5) of the TCP/IP stack. IPSec is more flexible, as it can be used to protect both the TCP and UDP protocols, but it adds complexity and overall processing overhead, as one cannot rely on TCP to manage reliability and fragmentation. IPSec is an extension of the IP protocol that aims to be the standard method for providing privacy (increasing the reliability of the information), data integrity (ensuring that the content that reaches the destination is the same as at the origin) and authenticity (ensuring that information is authentic and users are who they say they are) when transferring information over IP networks. IPSec combines several different technologies, including:

• The Diffie-Hellman mechanism for key exchange (see the sketch after this list);
• Public-key cryptography to sign the Diffie-Hellman key exchange, ensuring the identity of both parties and preventing attacks like man-in-the-middle (in which the attacker is seen as the other party in each direction of the communication);
• Encryption algorithms for large volumes of data, such as DES (Data Encryption Standard);
• Keyed hash algorithms producing fixed-size digests, such as HMAC combined with traditional hash algorithms like MD5 or SHA, for packet authentication, and digital certificates signed by a Certification Authority (Thayer et al., 1998).
Another important aspect is that IPSec is the protocol for Internet tunneling, encryption and authentication. Regarding the unit that is protected, two modes are possible: in transport mode, protection falls on the payload only, while
in tunnel mode the entire IP packet, including the IP header, is protected. In transport mode, only the payload is encrypted and there are no changes in routing, since the IP header is neither modified nor encrypted. On the other hand, when the authentication header is used, IP addresses cannot be translated, as this invalidates the hash value (Kent and Atkinson, 1998), (Kent, 2005). The transport and application layers are always secured by the hash, so they cannot undergo any change. Transport mode is used for host-to-host communications. In tunnel mode, the entire IP packet is encrypted and a new IP packet is used to encapsulate and distribute it. Tunneling is used for network-to-network communications (secure tunnels between routers) or for host-to-network and host-to-host communications over the Internet. Although IPSec tries to ensure security at the network layer, providing mechanisms for data encryption and authentication, there is no end-to-end security guarantee in the context of SOAP messages. The security mechanisms provided by IPSec need to be combined with other mechanisms to effectively secure the transmission of SOAP messages.
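The difference between the two modes amounts to what gets protected and where the new header goes, which the following conceptual sketch captures. It is a schematic illustration only: real ESP processing involves an SPI, sequence numbers, padding and an integrity trailer, none of which are modeled here.

```python
def esp_transport(ip_header: bytes, payload: bytes, protect) -> bytes:
    """Transport mode: only the payload is protected; the original IP
    header stays in the clear, so routing is unchanged."""
    return ip_header + protect(payload)

def esp_tunnel(original_packet: bytes, outer_ip_header: bytes, protect) -> bytes:
    """Tunnel mode: the entire original packet, header included, is
    protected and wrapped in a new outer IP header."""
    return outer_ip_header + protect(original_packet)

# `protect` stands in for IPSec's encrypt-and-authenticate transform.
protect = lambda data: b"[ESP]" + data + b"[/ESP]"   # placeholder only
inner = b"IPHDR" + b"payload"
print(esp_transport(b"IPHDR", b"payload", protect))  # header visible
print(esp_tunnel(inner, b"OUTERHDR", protect))       # inner header hidden
```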
Security at the Transport Layer

The transport layer has specific security technologies that are very useful for ensuring the security of the communication at the layers below the application layer (O'Neill, 2003). The most used of these technologies is SSL (Secure Sockets Layer) (Freier et al., 1996), which was developed by Netscape to ensure confidentiality and authentication in HTTP (Hypertext Transfer Protocol) interactions, with the algorithms used being negotiated between the two participants of the communication. TLS (Transport Layer Security) (Dierks and Allen, 1999), in turn, is an extended version of SSL, adopted as an Internet standard and widely used in most Web browsers and e-commerce applications. In this section, the SSL and TLS protocols
are referred to simply as SSL, since they operate in basically the same way. SSL is composed of two main layers, as illustrated in Figure 1. The first layer is the SSL Record protocol, which implements a secure channel by encrypting and authenticating messages transmitted by any connection-oriented protocol. The second layer consists of the SSL Handshake, SSL Change Cipher and SSL Alert protocols, which establish and maintain an SSL session (i.e., a secure channel between a client and a server (Coulouris et al., 2005)). The SSL Record protocol transports data at the application level in a transparent manner between two processes, ensuring confidentiality, authenticity and integrity. The goal of the SSL Handshake protocol is to establish an SSL session, starting with an exchange of messages in plain text that reports the options and parameters necessary for the encryption and authentication process, allowing the client and the server to authenticate each other and negotiate the different parameters needed for a secure session (i.e., the common cipher suites to be used for authentication, session key exchange, hashing and data encryption). This allows a secure channel to be completely configured, providing encrypted and authenticated communication in
both directions (Coulouris et al., 2005). Briefly, the handshake operation starts with non-encrypted communication in the earliest exchanges, followed by the use of public key cryptography and, finally, after the establishment of a shared secret key, a switch to symmetric key encryption. Each of these exchanges is optional and must be preceded by a negotiation. Finally, the SSL Change Cipher protocol allows dynamic updates of the cipher suites used in a connection, while the SSL Alert protocol is used to send alert messages to the participants of the communication (Ravi et al., 2004). Technologies such as SSL are designed to provide point-to-point security, where at each step of the communication route the message is decrypted and a new SSL connection is created. Thus, SSL does not ensure end-to-end security, which is mandatory in a web services environment. In fact, SOAP messages can travel through several Web service intermediaries before reaching the final recipient. Therefore, if encryption is used only at the transport layer, the information will be accessible to the intermediaries through which the messages travel, creating security gaps between one secure session and another (Mashood and Wikramanayake, 2007).
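For completeness, the sketch below shows how the SSL/TLS handshake just described looks from an application programmer's point of view, using Python's standard ssl module. The host name is a placeholder, and certificate verification follows the library's defaults.

```python
import socket
import ssl

# create_default_context() enables certificate verification and sane
# protocol/cipher defaults; the handshake happens inside wrap_socket().
context = ssl.create_default_context()

with socket.create_connection(("example.com", 443)) as raw:
    with context.wrap_socket(raw, server_hostname="example.com") as tls:
        print(tls.version())   # negotiated protocol, e.g. 'TLSv1.3'
        print(tls.cipher())    # negotiated cipher suite
        tls.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
        print(tls.recv(1024)[:60])   # first bytes of the encrypted reply
```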
Figure 1. SSL protocol stack
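To make the negotiation concrete, the following sketch uses the standard Java JSSE API to open an SSL/TLS connection and inspect the session parameters agreed during the handshake. The host name is a placeholder; this is an illustration, not part of the specifications discussed here.

```java
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class SslHandshakeDemo {
    public static void main(String[] args) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        // "example.com" is a placeholder host; 443 is the usual HTTPS port.
        try (SSLSocket socket = (SSLSocket) factory.createSocket("example.com", 443)) {
            // Triggers the SSL/TLS handshake explicitly (it would otherwise
            // happen lazily on the first read or write).
            socket.startHandshake();
            // The session exposes the parameters negotiated during the
            // handshake: protocol version and the agreed cipher suite.
            System.out.println("Protocol:     " + socket.getSession().getProtocol());
            System.out.println("Cipher suite: " + socket.getSession().getCipherSuite());
        }
    }
}
```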
The SOAP protocol does not implement any security mechanisms, relying on other networking mechanisms for these functions. SSL could be a solution but, as it does not provide end-to-end security and data integrity, it is not applicable in many situations. Thus, new specifications and security mechanisms have been proposed, including XML Signature, XML Encryption and WS-Security.
Security at the Application Layer
XML Encryption

Standardized by the W3C, XML Encryption (W3C, 2002) defines a way to encrypt data in a structured manner and represent the result as an XML document, thus addressing confidentiality requirements. XML Encryption aims to provide security for end-to-end applications that need to exchange data in XML format in a secure way (Siddiqui, 2002). It provides a feature set that is not included in SSL/TLS and allows encryption of data at different levels of granularity (i.e., it allows selecting the pieces of data that have to be encrypted). This feature is useful for encrypting specific elements of an XML document while allowing the rest of it to retain its original content. Unlike SSL, which encrypts entire groups of data in order to transport them, XML Encryption transforms only the data that really has security requirements. This is advantageous because encryption takes time and computational resources and should therefore be used judiciously (Nagappan et al., 2003). Another advantage of XML Encryption is that it allows the establishment of secure sessions involving more than one application, which is not possible with SSL. Furthermore, in addition to XML data, XML Encryption can be used to encrypt other data types.
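As a hypothetical illustration of this granularity, the fragment below (held in a Java text block purely for presentation) sketches a purchase order in which only the credit card element has been replaced by an EncryptedData element, while the rest of the document stays readable. The element names and cipher value are invented for the example; the EncryptedData structure follows the W3C XML Encryption syntax.

```java
public class EncryptedOrderExample {
    // Hypothetical document: only the credit card data was encrypted;
    // the remaining elements keep their original, readable content.
    static final String ORDER = """
        <Order xmlns="urn:example:orders">
          <Item>Book</Item>
          <Quantity>2</Quantity>
          <EncryptedData xmlns="http://www.w3.org/2001/04/xmlenc#"
                         Type="http://www.w3.org/2001/04/xmlenc#Element">
            <EncryptionMethod
                Algorithm="http://www.w3.org/2001/04/xmlenc#aes128-cbc"/>
            <CipherData>
              <CipherValue>A23B45C56...</CipherValue>
            </CipherData>
          </EncryptedData>
        </Order>
        """;
}
```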
XML Signature

XML Signature is a W3C standard (W3C, 2008b) that specifies a process for generating and validating digital signatures expressed in XML, ensuring the integrity and authenticity not only of XML documents but also of any other type of digital document (Mogollon, 2008). An important property of XML Signature is the ability to sign only specific portions of an XML document (Yue-Sheng et al., 2009). This property is useful because it allows an XML document to receive additions or changes in its other parts during its lifetime without invalidating the signed part. If, instead, the signature is applied to the whole XML document, then any modification of the data invalidates the document.
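The JDK's XML Digital Signature API (JSR 105) can express exactly this partial signing: the Reference URI selects the fragment whose digest is covered by the signature. The sketch below signs only the element carrying Id="order" inside a larger document; the Id value is hypothetical, key handling is omitted, and this is a minimal sketch rather than a complete implementation.

```java
import java.security.KeyPair;
import java.util.Collections;
import javax.xml.crypto.dsig.*;
import javax.xml.crypto.dsig.dom.DOMSignContext;
import javax.xml.crypto.dsig.keyinfo.KeyInfo;
import javax.xml.crypto.dsig.keyinfo.KeyInfoFactory;
import javax.xml.crypto.dsig.spec.C14NMethodParameterSpec;
import org.w3c.dom.Document;

public class PartialSigner {
    // Signs only the element identified by Id="order"; the rest of the
    // document can later change without invalidating the signature.
    // Assumes the Id attribute was registered as an XML ID beforehand
    // (e.g., via Element.setIdAttribute).
    public static void sign(Document doc, KeyPair keys) throws Exception {
        XMLSignatureFactory fac = XMLSignatureFactory.getInstance("DOM");
        // The URI "#order" restricts the digest to one fragment of the document.
        Reference ref = fac.newReference("#order",
                fac.newDigestMethod(DigestMethod.SHA256, null));
        SignedInfo si = fac.newSignedInfo(
                fac.newCanonicalizationMethod(CanonicalizationMethod.EXCLUSIVE,
                        (C14NMethodParameterSpec) null),
                // RSA-SHA256 identified by its URI for portability across JDKs.
                fac.newSignatureMethod(
                        "http://www.w3.org/2001/04/xmldsig-more#rsa-sha256", null),
                Collections.singletonList(ref));
        KeyInfoFactory kif = fac.getKeyInfoFactory();
        KeyInfo ki = kif.newKeyInfo(Collections.singletonList(
                kif.newKeyValue(keys.getPublic())));
        // Append the <Signature> element under the document root.
        fac.newXMLSignature(si, ki)
           .sign(new DOMSignContext(keys.getPrivate(), doc.getDocumentElement()));
    }
}
```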
WS-Security

WS-Security was first proposed by IBM and Microsoft and is currently standardized by OASIS (OASIS, 2006a), with the goal of providing a set of extensions to SOAP messages that allow implementing secure Web services. WS-Security thus defines methods for embedding security into SOAP messages, covering, for example, credential exchange, message integrity and confidentiality (Jensen et al., 2007). In practice, the goal of WS-Security is to ensure end-to-end security at the message level, regarding three main aspects (Chou and Yurov, 2005):

• Confidentiality of the message: the SOAP message can be fully or partially encrypted using the XML Encryption specification, and it must contain information related to the cryptography performed.
• Message integrity: the SOAP message can be signed using XML Signature and will contain the information related to the signature.
• Credentials security: security credentials with authentication information may be included in the SOAP message. These credentials are also known as access authentication, identification or security tokens.
Security credentials are defined in WS-Security and embody the mechanisms and methods used for authentication and authorization. Passwords, for example, are a type of unsigned security credential, while an X.509 certificate is an example of a security credential that is digitally signed by a specific authority. The main security credentials are (Mogollon, 2008):

• UsernameToken: defines how username and password information is included in the SOAP message.
• BinarySecurityToken: defines binary security credentials, such as X.509 certificates; credentials in formats other than XML require a special encoding format for inclusion in the SOAP message.
• XML Tokens: a set of credentials that defines how XML security credentials can be incorporated into the SOAP message header.
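For instance, a SOAP message carrying a UsernameToken places the credential inside the wsse:Security header block, roughly as sketched below (held in a Java string for illustration). The user name and password are invented, and real deployments should prefer digested passwords or stronger token types over plain text.

```java
public class UsernameTokenExample {
    // Sketch of a SOAP 1.1 envelope with a WS-Security UsernameToken header.
    static final String SOAP_MESSAGE = """
        <soap:Envelope
            xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
            xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">
          <soap:Header>
            <wsse:Security>
              <wsse:UsernameToken>
                <wsse:Username>alice</wsse:Username>
                <wsse:Password>secret</wsse:Password>
              </wsse:UsernameToken>
            </wsse:Security>
          </soap:Header>
          <soap:Body>
            <!-- application payload -->
          </soap:Body>
        </soap:Envelope>
        """;
}
```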
WS-Policy

WS-Policy is a W3C standard (W3C, 2007c) that provides a general-purpose model for describing policies through a flexible and extensible grammar, which allows specifying a range of requirements and capabilities of the Web services environment. WS-Policy thus allows the organizations involved in an interaction to state their communication choices by means of a representation of the service policy (Sidharth and Liu, 2008). In the WS-Policy specification, a policy is defined as a set of policy alternatives, and each alternative is a set of policy assertions. Some of these policy assertions represent traditional requirements and capabilities, such as protocol selection and authentication method. WS-Policy describes a way to represent a policy so as to make policy assertions interoperable, thereby allowing entities to process policies in a standardized manner. The root element of the typical form is called Policy. The ExactlyOne element defines a set of policy alternatives, and the All element describes a single policy alternative (Mihindukulasooriya, 2008).
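A minimal policy in this typical form might look as follows (again embedded in a Java string for presentation). The sp: assertions are simplified placeholders standing in for WS-SecurityPolicy assertions; a real policy would nest further parameters inside them.

```java
public class PolicyExample {
    // Typical form: Policy > ExactlyOne (set of alternatives) > All (one alternative).
    // The two alternatives below say, roughly: use transport-level security
    // OR protect the message with asymmetric cryptography.
    static final String POLICY = """
        <wsp:Policy xmlns:wsp="http://www.w3.org/ns/ws-policy"
                    xmlns:sp="http://docs.oasis-open.org/ws-sx/ws-securitypolicy/200702">
          <wsp:ExactlyOne>
            <wsp:All>
              <sp:TransportBinding/>
            </wsp:All>
            <wsp:All>
              <sp:AsymmetricBinding/>
            </wsp:All>
          </wsp:ExactlyOne>
        </wsp:Policy>
        """;
}
```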
WS-SecurityPolicy

WS-SecurityPolicy is an OASIS standard (OASIS, 2007a) that provides policy assertions related to security. These assertions can be used with WS-Policy and in conjunction with WS-Security, WS-SecureConversation and WS-Trust. They allow specifying the requirements, the capabilities and the limitations of secure Web services implementations (Merrill and Grimshaw, 2009).

Other Security Specifications for Web Services

WS-Security was the first security specification for Web services to be released by IBM and Microsoft (Microsoft, 2002). As shown in Figure 2, the specifications are built from the bottom up. SOAP is the basis of a protocol stack meant to be independent of the transport layer. WS-Security sits on top of SOAP, as it provides a means for encrypting and digitally signing SOAP messages using XML Encryption and XML Signature. Looking at this stack of specifications from the top down, each specification depends on its predecessors, thus creating a context for increased security in Web services. The following items briefly explain the other specifications not introduced earlier in the chapter (Microsoft, 2002):

Figure 2. Security specifications for Web services

• WS-Trust: an OASIS standard (OASIS, 2007c) that defines how trust relationships are established, allowing Web services to interoperate in a secure manner.
• WS-Privacy: a proposal, not yet standardized, that aims to allow communicating the privacy policies set by the organizations deploying Web services (Nordbotten, 2009).
• WS-SecureConversation: an OASIS standard (OASIS, 2007d) that provides mechanisms for establishing and identifying a security context (i.e., it allows creating sessions in which several SOAP messages can be exchanged without the need to access individual authentication and authorization data each time) (Holgersson and Söderström, 2005).
• WS-Federation: a proposal under standardization by OASIS (OASIS, 2008b) that aims to describe how to manage and broker trust relationships in a heterogeneous federated environment, including support for federated identities. Federation in this case involves an intermediary between different security specifications (O'Neill, 2003).
• WS-Authorization: a proposal, not yet standardized, that aims to describe how the access policies for a Web service are specified and managed (Nordbotten, 2009).
Developing Secure Applications

The security mechanisms described so far cannot mitigate attacks targeting the Web services code itself. Attackers are increasingly moving their focus to Web services code, so code developed without regard to security concerns represents a major risk to an infrastructure. Vulnerabilities like SQL Injection and XPath Injection are particularly relevant in the Web services context (Vieira, Antunes, & Madeira, 2009). Therefore, developers need to apply the best coding practices to avoid this kind of vulnerability (Stuttard and Pinto, 2007).

Input validation consists of forcing the input parameters of a Web service operation to be within the corresponding valid domain, or interrupting the execution when a value outside of the domain is provided. This can be achieved by using filtering strategies and is considered a good practice that can avoid many problems in Web services code. However, input validation is frequently not enough, because the domain of an input may itself allow the existence of vulnerabilities. In the case of SQL Injection vulnerabilities, the quote is the character used as a string delimiter in most SQL statements, and so it can be used to perform a SQL Injection attack. In some cases, however, the domain of a string input must allow the presence of quotes; we then cannot exclude all the values that contain quotes, meaning that additional security must be delegated to the database statement itself.

Single and double quote characters appear in the majority of SQL Injection attacks. Thus, some programming languages provide mechanisms for escaping these characters in such a way that they can be used within an SQL expression rather than delimiting values in the statement. This kind of technique has two main problems. First, it can be circumvented in some situations by using more elaborate injection techniques, such as combining quotes (') and slashes (\). Second, the introduction of escape characters increases the length of the string and can thus cause data truncation when the resulting string is longer than allowed by the database.

Prepared statements (also named parameterized queries) are the easiest and most efficient way of avoiding SQL Injection vulnerabilities. When a prepared statement is created (or prepared), its structure is sent to the database. The variable parts of the query are marked using question marks (?) or labels. Afterwards, each time the query needs to be executed, the values must be bound to the corresponding variable parts. No matter what the content of the data is, the expression will always be used as a value; consequently, it is impossible to modify the structure of the query. To help ensure the correct usage of the data, many languages allow typed bindings.
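A minimal JDBC sketch contrasting the two approaches is given below; the table and column names are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class UserDao {
    // VULNERABLE: the input is concatenated into the SQL text, so an input
    // such as  ' OR '1'='1  changes the structure of the query itself.
    static ResultSet findUserUnsafe(Connection con, String name) throws SQLException {
        Statement st = con.createStatement();
        return st.executeQuery("SELECT * FROM users WHERE name = '" + name + "'");
    }

    // SAFE: the query structure is fixed when the statement is prepared;
    // the bound value is always treated as data, never as SQL syntax.
    static ResultSet findUserSafe(Connection con, String name) throws SQLException {
        PreparedStatement ps =
                con.prepareStatement("SELECT * FROM users WHERE name = ?");
        ps.setString(1, name); // typed binding: the driver encodes the value as needed
        return ps.executeQuery();
    }
}
```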
ATTACKS IN WEB SERVICES

Most security issues raised in recent years are related to Web applications. This is due to two key aspects: the complexity of Web-based systems (which are currently used for electronic commerce, Internet banking, government applications, etc.) and the use of software development cycles that do not take into account the specificities of security attributes. There are many types of security attacks in the Information Technology (IT) area. In SOA applications, we can find the same types of attack found in the IT area in general; however, there are also some specific attacks that exploit vulnerabilities in the protocols. Most of the work in intrusion detection has addressed the security aspects of services at the network layer, since the integrity of this layer is vital for the continuity of operations. However, attacks at the application level are gaining great relevance, especially with the popularity of Web-based systems. The most common vulnerability at the application layer is code injection (OWASP, 2010): this vulnerability was in the sixth position of the OWASP ranking in 2004, but moved to the second position in 2006 and then to the first in 2010. Despite the many surveys about vulnerabilities available in the literature, Web service developers still lack knowledge about the security checks and the types of protection that should be put in place. In this section we present the most important types of attacks in SOA. For each attack type we describe an abstract attack methodology, the
potential impact, and the countermeasures that can be applied. Attacks are organized in a way that allows exploring all kinds of vulnerabilities, namely: denial of service, brute force, spoofing, flooding and injection (including malicious input).
Denial of Service Attacks

Denial of service (DoS) is an attack that exploits a vulnerability to exhaust the resources of the machine. Since the evaluation of an integrity function is a resource-intensive task once the whole XML document is loaded into memory, a DoS attack can be conducted by using deep and large XML fragments. The main DoS attacks in Web services are:

• Oversize Payload: this attack aims at harming service availability by exhausting the resources of the service (memory, processes, network, etc.) and is typically conducted by submitting very large request messages (Jensen et al., 2007). When a SOAP message is processed, it is read, parsed and transformed into a memory object; obviously, the amount of CPU processing and memory allocation depends on the size of this message. The countermeasures against this type of attack consist of implementing restrictions on the XML infoset or checking the size of the messages before processing, and dropping the messages that exceed the established threshold (see the sketch after this list).
• Coercive Parsing: SOAP message parsing becomes vulnerable when a large number of namespaces is allowed. Attackers can exploit this to cause a denial of service by using a continuous sequence of opening tags (based on complex and nested XML documents), forcing the parser to allocate a huge amount of resources. This attack is difficult to prevent, as XML does not limit the number of namespace declarations (Jensen et al., 2007).
• Oversize Cryptography: the WS-Security protocol provides security features to SOA applications. However, the flexibility of this protocol, which can use cryptography in the header and in the body of a SOAP message, allows attackers to use an encrypted key chain where each key is necessary to decrypt a subsequent block. This type of attack consumes a large amount of CPU and memory. To prevent it, the usage of WS-Security elements must be restricted: the approach consists of accepting only the security elements explicitly required by the security policy, specified using the WS-SecurityPolicy protocol (Jensen et al., 2007), (Lindstrom, 2004).
• Attack Obfuscation: similarly to the Oversize Cryptography attack, this attack is possible due to the characteristics of WS-Security. In practice, it exploits the confidentiality of unreliable data: by encrypting a message it may be possible to hide an Oversize Payload attack (or another kind of attack) in the message body, which is quite hard to detect. A potential countermeasure is to perform message validation (stepwise decryption and validation) on the decrypted content, reducing memory consumption and enabling early detection of malicious messages (Lai et al., 2008), (Jensen et al., 2007).
• Unvalidated Redirects and Forwards: this kind of vulnerability is related to the access to phishing or malware sites, which can result in a denial of service attack or in unauthorized access to the client's resources. Attackers use this vulnerability to redirect clients (victims) to unauthorized pages. A potential countermeasure consists of disabling redirects and forwards (OWASP, 2010).
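As a rough sketch of the size-threshold countermeasure mentioned for the Oversize Payload attack, the servlet filter below rejects requests whose declared body size exceeds a limit before the SOAP stack parses them. The threshold value and the use of the javax.servlet API in front of the service endpoint are illustrative assumptions, not part of any of the specifications above.

```java
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class MessageSizeFilter implements Filter {
    // Illustrative threshold: reject SOAP requests larger than 1 MiB.
    private static final int MAX_BODY_BYTES = 1024 * 1024;

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) req;
        // Drop the message before it is parsed into a memory object.
        // (A hardened filter would also cap chunked bodies with no declared length.)
        if (http.getContentLength() > MAX_BODY_BYTES) {
            ((HttpServletResponse) res)
                    .sendError(HttpServletResponse.SC_REQUEST_ENTITY_TOO_LARGE);
            return;
        }
        chain.doFilter(req, res);
    }

    @Override public void init(FilterConfig cfg) { }
    @Override public void destroy() { }
}
```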
Brute Force Attacks

Brute force is an attack type that is based on a vast amount of seemingly legitimate transactions, which typically exhaust the target system. In the Web services context, the attacks that best represent this class are:

• Insecure Cryptographic Storage: modern cryptography provides a set of mathematical and computational techniques that meet different requirements of information security. However, the deployment of cryptographic mechanisms requires care, as a flawed deployment may compromise the security of the solution. It is usual to find systems that are only concerned with protecting information in transit and that ignore the protection of data during storage; in this case, a vulnerability in the server may be sufficient to maliciously access the information. A key vulnerability is related to the protection of the cryptographic keys used. Examples of countermeasures that must be applied are: information should be classified and coded (especially information that needs to satisfy confidentiality requirements); storing keys should be avoided and keys should be managed appropriately; proven cryptographic algorithms should be used. This is essential for managing secure information (PCI, 2009), (Knudsen, 2005), (Pramstaller et al., 2005), (OWASP, 2010).
• Broken Authentication and Session Management: vulnerabilities in the authentication and session management processes allow illegitimate access to the application, either by providing valid credentials (from another user) or by hijacking a session already in progress. The problem becomes much more critical if the compromised account has administrative privileges. Many Web applications submit access credentials over the HTTP protocol without any protection; this information can easily be intercepted and used for malicious user authentication (OWASP, 2010).
Spoofing Attacks

Spoofing is an attack type in which the attacker successfully masquerades his identity by falsifying data, thereby gaining an illegitimate advantage. In Web services there are several attacks in this class:

• SOAPAction Spoofing: the use of the HTTP header does not require XML processing, so the SOAPAction field in the HTTP header is used to identify the service operation. This enables attacks known as man-in-the-middle, which are possible because different values can be put in the header and in the body of a SOAP message. Another attack allowed by this SOAPAction vulnerability is the spoofing attack that is executed by the Web service client and tries to bypass an HTTP gateway (Mashood and Wikramanayake, 2007), (Jensen et al., 2007). Determining the operation based only on the SOAP body content is a good countermeasure for this kind of attack.
• WSDL Scanning: this attack can be compared to the SQL Injection attack. It provides a scanning of all the WSDL operations: since the WSDL document is openly available, an external client can learn all the internal operations and invoke them. A countermeasure is to provide external clients with a separate WSDL that contains only the external operations, preventing a malicious user from learning about the omitted operations (Jensen et al., 2007).
• Insufficient Transport Layer Protection: most applications fail to encrypt network traffic. Usually, they use SSL/TLS during authentication, but not elsewhere, exposing all transmitted data and session IDs to interception. This vulnerability allows a man-in-the-middle attack and access to confidential information. To defend against this attack, SSL/TLS should be configured to embrace the entire site (OWASP, 2010).
• WS-Spoofing: this attack forces the BPEL engine to run a full process followed by a complete rollback. Using WS-Addressing for an asynchronous Web service, a malicious user can supply an arbitrary invalid endpoint URL as the callback endpoint reference, leading to a workload problem and CPU consumption. Used as a flooding attack, this causes heavy load on the BPEL engine. As a countermeasure, the system should use a secure way to validate the caller endpoint (Jensen et al., 2007).
• Workflow Engine Hijacking: this kind of attack also uses WS-Addressing, but the attacker points to the URL of the target system. In practice, the target system (the Web service) receives a heavy amount of requests, resulting in a denial of service. As a countermeasure, the system could use a secure way to validate the caller endpoint (Jensen et al., 2007).
• Metadata Spoofing: all the interface information of a Web service application is exchanged as metadata. This metadata is exchanged using the HTTP protocol, which allows spoofing it. WSDL spoofing and security policy spoofing are the most representative attacks of this class: in the first, the problem is the modification of the network endpoints (allowing a man-in-the-middle attack), while the second allows the security policy to be changed. As a countermeasure, all the metadata should be carefully checked for authenticity (although the mechanisms for securing metadata are not yet standardized) (Lai et al., 2008), (Jensen et al., 2007), (Lindstrom, 2004).
• Security Misconfiguration: security depends on having a secure configuration defined for the application and for the entire Web service platform; the entire application stack must have a secure configuration. A vulnerable configuration allows an attacker to use an automated scanner to detect missing patches and to use default account information to hijack the target system. To defend the system against this kind of attack, a strong network architecture must be provided, which enforces the separation of components and facilitates the implementation of good security configurations. Automatic and periodic scanning must be performed to detect misconfigurations or missing patches (OWASP, 2010).
Flooding Attacks

Flooding is an attack that attempts to cause a failure in a system by providing more input than the entity can process. The attacks that best represent this class are:

• Instantiation Flood: BPEL creates a process instance when a new message arrives, and the process immediately starts its execution based on the received message. This allows a kind of attack that creates new processes in order to increase the load on the engine. Fending off such flooding attacks can only be achieved by identifying and rejecting semantically invalid requests. This kind of attack is considered non-trivial to avoid, as it is really difficult to accurately separate invalid messages from valid ones (Jensen et al., 2007), (Lindstrom, 2004).
• Indirect Flooding: this attack uses the BPEL engine as an intermediary to conduct a flooding attack on the target system by making a lot of requests (calls). In this case, the BPEL engine will undergo a heavy load itself and also cause an equally heavy load on the target system. This kind of attack cannot be fended off using WS-Security, since the connection between the target system and the BPEL engine is a trusted connection (Lai et al., 2008), (Jensen et al., 2007), (Lindstrom, 2004). Fending off such attacks requires the identification and rejection of attack messages (i.e., of semantically invalid requests).
• BPEL State Deviation: this attack consists of using messages that are correct regarding their structure but are not properly correlated to any existing process. These messages are discarded within the BPEL engine, but they may cause a heavy load due to the many ports that can be requested. A countermeasure that can be used against this kind of attack is a specific firewall that rejects both correlation-invalid and state-invalid messages (Knap and Mlýnková, 2009), (Jensen et al., 2007).
Injection Attacks

Injection attacks (including malicious input) consist of the insertion, injection or input of malicious code into an application (or into a data field of the application). In the case of code injection, forging a malicious HTTP request may allow inserting executable code (e.g., commands, SQL, LDAP and others) into the application parameters (Cannings et al., 2008). The relevance of such attacks can be seen in the number of vulnerable applications and the number of observed incidents in recent years (OWASP, 2010). The most relevant injection attacks are:

• XML Injection: the goal of the XML Injection attack class is to modify the XML structure of a SOAP message by inserting a new XML fragment containing XML tags. This may lead to unauthorized access to protected resources. A countermeasure is to apply a strict schema validation to the SOAP message, aiming at discovering (and rejecting) invalid data types and messages (Jensen et al., 2007), (Knap and Mlýnková, 2009).
• SQL Injection: SQL Injection is the most common vulnerability in Web applications nowadays (OWASP, 2010). This attack consists of the injection of SQL code and parameters into the input parameters of the application, with the aim of altering the SQL commands submitted to the database. In many cases, this allows attackers to perform any operation on the database, limited only by the privileges of the database user that performs the access. However, as administrative accounts are in many cases directly used by Web services, there are often no restrictions on what can be done in practice. Moreover, with a little elaboration, attackers can interact with the operating system, access other existing databases, read/write data files and execute arbitrary commands. Preventing SQL Injection attacks consists of considering all input information as malicious: before processing, this information needs to be validated to guarantee that only known good values are accepted.
• Cross Site Scripting (XSS): this attack allows using a vulnerable application to carry malicious code, usually written in JavaScript, to the browser of another user. Given the relationship of trust established between the browser and the server, the browser assumes that the code received is legitimate and therefore allows access to sensitive information such as the session identifier. With this, a malicious user can hijack the session of the person attacked (OWASP, 2010), (Uto and Melo, 2009), (Stuttard and Pinto, 2007), (Howard et al., 2005), (Fogie et al., 2007). This vulnerability occurs when a Web application does not validate the information received from external entities (users or other applications) and includes that information in dynamically generated pages. Depending on how the malicious payload reaches the victim, XSS is divided into three subclasses:
◦ Reflected XSS: in this class, the code is sent via the URL in the HTTP header or as part of the request, exploiting a parameter that is displayed on the resulting page without filtering. It usually requires the user to be tricked into clicking on a link specially built with malicious content.
◦ Stored XSS: stored (or persistent) XSS is based on malicious code that is stored by the application, typically in a database, and displayed to all users who access that resource. It is the most dangerous type because it can affect a larger number of users and does not require the user to follow a malicious link.
◦ DOM-based XSS: in this class, the payload is executed as a modified result of the DOM in the victim's browser used by the original client-side script (the client-side code runs in an unexpected manner). This kind of XSS does not rely on server-side embedding, as it takes advantage of insecure references to and use (in client-side code) of DOM objects that are not fully controlled by the page provided by the server.
As a countermeasure, all information provided by users should be regarded as malicious and conveniently validated and filtered to make sure that it complies with valid values. It is important to mention that this approach is superior to the use of blacklists, as it is hardly possible to enumerate all the possible harmful entries. In addition, restricting the size of the input fields is also an approach for preventing XSS.
• Cross Site Request Forgery: this attack takes advantage of a user session already established in the vulnerable application to perform automated actions without the knowledge and consent of the victim. Examples of operations that can be performed range from the simple closing of the session to the transfer of funds in a banking application. This attack is also known as CSRF, XSRF, Session Riding and One-Click attack. The main countermeasure against CSRF attacks is to use information related to each session in the URLs and to check it when a request is made; this prevents attackers from knowing in advance the format of the URL (which is necessary to mount the attack) (Uto and Melo, 2009), (Stuttard and Pinto, 2007), (Howard et al., 2005), (Fogie et al., 2007).
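A small sketch of the whitelist-validation and output-encoding countermeasures discussed above follows; the validation pattern and the set of escaped characters are illustrative, not exhaustive.

```java
import java.util.regex.Pattern;

public class InputSanitizer {
    // Whitelist validation: accept only values matching a known-good pattern
    // (here, a hypothetical user-name format), instead of blacklisting bad ones.
    private static final Pattern USER_NAME = Pattern.compile("[A-Za-z0-9_]{1,32}");

    static boolean isValidUserName(String input) {
        return input != null && USER_NAME.matcher(input).matches();
    }

    // Output encoding: neutralize the characters that let injected markup
    // execute in the browser when the value is echoed into an HTML page.
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```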
ASSESSING THE SECURITY OF WEB SERVICES

Several studies (Vieira et al., 2009), (Fogie et al., 2007), (Jensen et al., 2007) show that a large number of Web services are deployed with security flaws that range from code vulnerabilities (e.g., code injection vulnerabilities) to the incorrect use of security standards and protocols. To minimize security issues, developers should search Web applications and services for vulnerabilities, for which there are two main approaches: white-box analysis and black-box testing. Other techniques, generically named gray-box, combine black-box and white-box testing to achieve better results.
“White-Box” Analysis

White-box analysis consists of the examination of the Web service code without executing it. This can be done in one of two ways: manually, during reviews and inspections, or automatically, by using automated analysis tools.

Inspections, initially proposed by Michael Fagan in the mid 1970s (Fagan, 1976), are a technique that consists of the manual analysis of documents, including source code, in search of problems. It is a formal technique based on a well-defined set of steps that have to be carefully undertaken. The main advantage of inspections is that they allow uncovering problems in the early phases of development (where the cost of fixing a problem is lower). An inspection requires several experts, each one having a well-defined role, namely: author (author of the document under inspection), moderator (in charge of coordinating the inspection process), reader (responsible for reading and presenting his interpretation of the document during the inspection meeting), note keeper (in charge of taking notes during the inspection meeting), and inspectors (all the members of the team, including the ones mentioned before). During the inspection process, this team has to perform the following generic steps:

1. Planning: starts when the author delivers the document to be inspected to the moderator. The moderator analyzes the document and decides whether it is ready to undergo the inspection process; if not, the document is immediately returned to the author for improvement. This step also includes selecting the remaining members of the inspection team.
2. Overview: after delivering the document (and other documents needed to understand it) to the experts, the author presents in detail the goal and structure of the document to be inspected. By the end of this meeting the inspection team must be familiar with the job to be performed.
3. Preparation: the inspectors individually analyze the document in order to prepare themselves for the inspection meeting.
4. Inspection meeting: in this meeting the reader reads and explains his interpretation of the document to the remaining inspectors. Discussion is allowed to clarify the interpretation and to disclose existing issues. The outcome of the inspection meeting is a list of issues that need to be fixed and one of three verdicts: accept, minor corrections, or re-inspection.
5. Revision: the author modifies the document following the recommendations of the inspection team.
6. Follow-up: the moderator checks whether all the problems detected by the inspectors were adequately fixed. The moderator may decide to conduct another inspection meeting if there were considerable changes in the document or if the changes made by the author differ from the recommendations of the experts.
A code inspection is the process by which a programmer delivers the code to his peers and they systematically examine it, searching for programming mistakes that can introduce bugs. A security inspection is an inspection specifically targeted at finding security vulnerabilities. Inspections are the most effective way of making sure that a service has a minimal number of vulnerabilities (Curphey et al., 2002) and are a crucial procedure when developing software for critical systems. Nevertheless, they are usually very long and expensive, and they require inspectors to have a deep knowledge of Web security. A less expensive alternative to code inspections is code reviews (Freedman & Weinberg, 2000). Code reviews are a simplified version of code inspections and can be considered when analyzing less critical services. Reviews are also a manual approach, but they do not include the
formal inspection meeting. The reviewers perform the code review individually, and the moderator is in charge of filtering and merging the outcomes from the several experts. Regarding the roles and the remaining steps, reviews are very similar to inspections. Although also a very effective approach, reviewing is still quite expensive. Code walkthroughs (Freedman & Weinberg, 2000) are an informal approach that consists of manually analyzing the code by following the code paths as determined by predefined input conditions. In practice, the developer, in conjunction with other experts, simulates the code execution, in a way similar to debugging. Although less formal, walkthroughs are also effective at detecting security issues, as long as the input conditions are adequately chosen. However, they still impose the cost of having more than one expert manually analyzing the code.

The solution to reduce the cost of white-box analysis is to rely on automated tools, such as static code analyzers. In fact, the use of automated code analysis tools is seen as an easier and faster way to find bugs and vulnerabilities in Web applications. Static code analysis tools vet software code, either in source or binary form, in an attempt to identify common implementation-level bugs (Stuttard and Pinto, 2007). The analysis performed by existing tools varies depending on their sophistication, ranging from tools that consider only individual statements and declarations to others that consider dependencies between lines of code. Among other usages (e.g., model checking and data flow analysis), these tools provide an automatic way of highlighting possible coding errors. The main problem of this approach is that exhaustive source code analysis may be difficult and cannot find many security flaws, due to the complexity of the code and the lack of a dynamic (runtime) view. Some well-known tools include FindBugs (University of Maryland, 2009), Yasca (Scovetta, 2008), IntelliJ IDEA (JetBrains, 2009), Fortify 360 (Fortify Software, 2008) and Pixy (Jovanovic, Kruegel, & Kirda, 2006).
“Black-Box” Testing

Black-box testing consists of the analysis of the program execution from an external point of view. In short, it consists of exercising the software and comparing the execution outcome with the expected result. Testing is probably the most used technique for the verification and validation of software. There are several levels at which black-box testing can be applied, ranging from unit testing to integration and system testing. The testing approach can also be more formalized (based on models and well-defined test specifications) or less formalized (e.g., informal “smoke testing”). The test specification should define the coverage criteria (i.e., the criteria that guide the definition of the tests in terms of what is expected to be covered) and should be elaborated before development. Test-driven development (TDD) (Beck, 2003) is an agile software development technique based on predefined test cases that define desired improvements or new functions (i.e., automated unit tests that specify code requirements and that are implemented before writing the code itself). TDD began in 1999, but is nowadays getting a lot of attention from software engineers (Newkirk & Vorontsov, 2004). Development is conducted in short iterations in which the code necessary to pass the tests is developed; code refactoring is performed to accommodate changes and improve code quality. Test-driven development is particularly suitable for Web services, as these are based on well-defined interfaces that are quite appropriate for unit testing. The tests specify the requirements and contain assertions that can be true or false; running the tests allows developers to quickly validate the expected behavior as code development evolves. A large number of unit testing frameworks are available for developers to create and automatically run sets of test cases, e.g., JUnit (http://junit.org/), CppUnit (http://sourceforge.net/projects/cppunit/), and JUnitEE (http://www.junitee.org/).
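As a toy example of such a unit test, the JUnit 4 fragment below exercises a hypothetical calculator web service through a client stub; the CalculatorService class and its add operation are assumptions made only for illustration.

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class CalculatorServiceTest {
    // Hypothetical client stub for a deployed "calculator" web service.
    private final CalculatorService service = new CalculatorService();

    @Test
    public void addReturnsSumOfOperands() {
        // The assertion encodes the requirement; written before the code,
        // it fails until the service behaves as specified (TDD style).
        assertEquals(5, service.add(2, 3));
    }

    @Test
    public void addHandlesNegativeOperands() {
        assertEquals(-1, service.add(2, -3));
    }
}
```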
Robustness testing is a specific form of black-box testing. Its goal is to characterize the behavior of a system in the presence of erroneous input conditions. Although it is not directly related to benchmarking (as there is no standard procedure meant to compare different systems/components concerning robustness), authors usually refer to robustness testing as robustness benchmarking. This way, as proposed by Mukherjee and Siewiorek (1997), a robustness benchmark is essentially a suite of robustness tests or stimuli. A robustness benchmark stimulates the system in a way that triggers internal errors, and thereby exposes both programming and design errors in the error detection or recovery mechanisms (systems can be differentiated according to the number of errors uncovered). Web services robustness testing is based on erroneous call parameters (Vieira, Laranjeiro, & Madeira, 2007): the robustness tests consist of combinations of exceptional and acceptable input values for the parameters of Web service operations, which can be generated by applying a set of predefined rules according to the data type of each parameter.

Penetration testing, a specialization of robustness testing, consists of the analysis of the program execution in the presence of malicious inputs, searching for potential vulnerabilities. In this approach the tester does not know the internals of the Web application and uses fuzzing techniques over the HTTP requests (Stuttard and Pinto, 2007). The tester needs no knowledge of the implementation details and tests the inputs of the application from the user's point of view. The number of tests can reach hundreds or even thousands for each vulnerability type; penetration testing tools provide an automatic way to search for vulnerabilities, avoiding the repetitive and tedious task of running all these tests by hand.
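The following sketch illustrates the rule-based generation of exceptional (and, for penetration testing, malicious) parameter values per data type. The specific rules shown are illustrative only; approaches such as Vieira, Laranjeiro, and Madeira (2007) define much richer rule sets.

```java
import java.util.Arrays;
import java.util.List;

public class RobustnessValues {
    // Rule set for integer parameters: boundaries and values just outside
    // the typical valid domain.
    static List<Integer> intTestValues() {
        return Arrays.asList(0, 1, -1, Integer.MIN_VALUE, Integer.MAX_VALUE);
    }

    // Rule set for string parameters: null, empty, very long, and strings
    // containing characters that often break parsing or query construction.
    static List<String> stringTestValues() {
        return Arrays.asList(
                null,
                "",
                "a".repeat(100_000),   // oversized input
                "'",                   // SQL string delimiter
                "<tag>",               // markup fragment
                "\u0000"               // non-printable character
        );
    }
}
```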
The most common automated security testing tools used for Web applications are generally referred to as Web security scanners (or Web vulnerability scanners), and they are often regarded as an easy way to test applications against vulnerabilities. These scanners have a predefined set of test cases that are adapted to the application to be tested, saving the user from defining all the tests to be done. In practice, the user only needs to configure the scanner and let it test the application; once the test is completed, the scanner reports the vulnerabilities detected (if any). Most of these scanners are commercial tools, but there are also some free application scanners, often of limited use since they lack most of the functionality of their commercial counterparts. Two very popular free security scanners that support Web services security assessment are Foundstone WSDigger (Foundstone, Inc., 2005) and WSFuzzer (OWASP, 2008). Regarding commercial scanners, three brands lead the market: HP WebInspect (HP, 2008), IBM Rational AppScan (IBM, 2008) and Acunetix Web Vulnerability Scanner (Acunetix, 2008).
“Gray-Box” Testing

The main limitation of black-box approaches is that vulnerability detection is limited by the output of the application. On the other hand, white-box analysis does not take into account the runtime view of the code. Gray-box approaches overcome these limitations by monitoring the internal behavior of the application during the execution of the tests, trying to find anomalies caused by vulnerabilities present in the code.

Dynamic program analysis consists of analyzing the behavior of the software while executing it (Stuttard and Pinto, 2007). The idea is that by analyzing the internal behavior of the code in the presence of realistic inputs it is possible to identify bugs and vulnerabilities. Obviously, the effectiveness of dynamic analysis depends strongly on the input values (similarly to testing), but it takes advantage of the observation of the source code (similarly to static analysis). To improve the effectiveness of dynamic program analysis, the program must be executed with sufficient test inputs; code coverage analyzers help guarantee an adequate coverage of the source code (Doliner, 2010), (Atlassian, 2010).

Runtime anomaly detection tools can be used both for attack detection and for vulnerability detection. One of these tools is AMNESIA (Analysis and Monitoring for NEutralizing SQL-Injection Attacks) (Halfond & Orso, 2005), which combines static analysis and runtime monitoring to detect and avoid SQL Injection attacks. Static analysis is used to analyze the source code of a given Web application, building a model of the legitimate queries that the application can generate; when a query that violates the model is detected, it is classified as an attack and is prevented from accessing the database. The problem is that the model built during the static code analysis may be incomplete and unrealistic, because it lacks a dynamic view of the runtime behavior of the application.

Source code or bytecode instrumentation is a technique that can be used to develop runtime anomaly detection tools. Laranjeiro et al. (2009) propose an anomaly detection approach to secure Web services against SQL and XPath Injection attacks: the Web service is instrumented in such a way that all the SQL/XPath commands used are intercepted before being issued to the data source. The approach consists of two phases. First, in the learning phase, the approach learns the regular patterns of the queries being issued; then, at runtime, the commands are compared with the previously learned patterns in order to detect and abort potentially harmful requests. The limitation of this technique is that it does not include the generation of the requests to be used during the learning phase, and so it requires the user to exercise the Web service during learning. In Antunes et al. (2009) the runtime anomaly detection approach of Laranjeiro et al. (2009) was enhanced with an automated workload and attack load generation technique, making it able to detect SQL and XPath Injection vulnerabilities in Web services. This way, after the instrumentation
of the Web service, a workload is generated using information about the domains of the parameters of the Web service operations. Learning takes place while executing the workload to exercise the Web service. Afterwards, an attack load is generated and used to attack the Web service; vulnerabilities are detected by comparing the incoming commands during the attacks with the valid set of commands previously learned.
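A much-simplified sketch of this learn-then-detect scheme follows; it is inspired by, and does not reproduce, the cited approaches. Commands are normalized by replacing literals with placeholders (a regular expression stands in for the proper parser a real tool would use), accepted patterns are collected during the learning phase, and any command that does not match a learned pattern is rejected at runtime.

```java
import java.util.HashSet;
import java.util.Set;

public class QueryAnomalyDetector {
    private final Set<String> learnedPatterns = new HashSet<>();
    private boolean learning = true;

    // Normalize a command by replacing string and numeric literals with
    // placeholders, so queries differing only in data values map to the
    // same structural pattern.
    static String normalize(String sql) {
        return sql.replaceAll("'[^']*'", "?")    // string literals
                  .replaceAll("\\b\\d+\\b", "?") // numeric literals
                  .toLowerCase();
    }

    // Learning phase: record the structural patterns of legitimate queries.
    public void observe(String sql) {
        if (learning) {
            learnedPatterns.add(normalize(sql));
        }
    }

    public void stopLearning() {
        learning = false;
    }

    // Detection phase: a command whose structure was never seen during
    // learning is flagged as a potential injection and should be aborted.
    public boolean isAnomalous(String sql) {
        return !learnedPatterns.contains(normalize(sql));
    }
}
```

An injected quote changes the literal structure of the statement: a query such as SELECT * FROM t WHERE name = 'a' OR '1'='1' normalizes to a pattern that was never learned from legitimate traffic, so it is flagged.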
CONCLUSION AND FUTURE RESEARCH ISSUES

The aforementioned security aspects, specifications, threats and testing methodologies can be correlated into approaches for engineering secure applications (as exemplified in Table 1). Due to the inherent characteristics of the protocols that make up a service-oriented architecture, security becomes a critical issue that needs to be addressed by specific approaches. On the other hand, evaluating the impact that the addition of security mechanisms has on the performance of SOA systems is also an important aspect, since the trade-off between the two requirements is usually a non-trivial balance in the face of technological restrictions. Improper use of the current Web services security specifications, however, can further degrade a SOA system, both in terms of security and in terms of performance. To deal with security vulnerabilities in Web services, it is necessary to rely on existing security testing tools, as addressed in chapters 7 and 11, and to combine the security techniques already established in network and application protocols with Web service-specific mechanisms, including the definition and evaluation of security policies based on traditional security techniques, cryptographic algorithms and digital signatures.
ACKNOWLEDGMENT

The authors acknowledge INCT-SEC, CNPq 573963/2008-8, FAPESP 08/57870-926, 09/06355-0 and 06/55207-5.
Table 1. Security and test approaches to detect vulnerabilities

Attack group: Denial of Service
Assessment: Perform black-box tests that attempt to cause denial of service by exploiting existing vulnerabilities.
Countermeasures: Disable redirects and forwards and implement restrictions.

Attack group: Brute Force
Assessment: Apply black-box stress tests that attempt to break authentication. Conduct white-box analysis to verify how keys are stored and to assess the strength of the cryptographic algorithms being applied.
Countermeasures: Classify and code the information, avoiding key storage and using proven cryptographic algorithms.

Attack group: Spoofing
Assessment: Apply tests based on metadata mutation. The idea is to break authenticity and then check whether this is adequately handled by the system.
Countermeasures: Ensure that all the metadata contents can be carefully checked for authenticity.

Attack group: Flooding
Assessment: Using robustness testing, flood the system by issuing semantically invalid requests (this can also be seen as a specific form of stress testing).
Countermeasures: Identify and reject semantically invalid requests. This kind of attack is considered non-trivial to avoid, as it is really difficult to identify an invalid message.

Attack group: Injection
Assessment: Perform penetration testing and static code analysis over the service code to detect erroneous or missing input validation or harmful usage of user inputs.
Countermeasures: Consider all information provided by users as malicious. Guarantee that inputs comply with the valid values for the field or parameter, and make sure that these inputs are not used in the structure of commands, queries, etc.
REFERENCES

W3C (2002). XML Encryption syntax and processing. Retrieved October 22, 2010, from http://www.w3.org/tr/xmlenc-core/.

W3C (2007c). Web services policy 1.5 - Framework. Retrieved October 22, 2010, from http://www.w3.org/tr/ws-policy/.

W3C (2008b). XML Signature syntax and processing (second edition). Retrieved October 22, 2010, from http://www.w3.org/tr/xmldsig-core/.

Antunes, N., Laranjeiro, N., Vieira, M., & Madeira, H. (2009). Effective detection of SQL/XPath injection vulnerabilities in Web services. In 2009 IEEE International Conference on Services Computing (pp. 260-267). Bangalore, India: IEEE Computer Society.

Atlassian (2010). Clover - Code coverage analysis. Retrieved October 22, 2010, from http://www.atlassian.com/software/clover/

Beck, K. (2003). Test-driven development: By example. Addison-Wesley Professional.

Cannings, R., Dwivedi, H., & Lackey, Z. (2008). Hacking exposed Web 2.0: Web 2.0 security secrets and solutions. New York, NY, USA: McGraw-Hill, Inc.

Chou, D. C., & Yurov, K. (2005). Security development in Web services environment. Computer Standards & Interfaces, 27(3), 233–240. doi:10.1016/S0920-5489(04)00099-6

Coulouris, G., Dollimore, J., & Kindberg, T. (2005). Distributed systems: Concepts and design (4th ed.). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.

Curphey, M., Endler, D., Hau, W., Taylor, S., Smith, T., Russell, A., McKenna, G., et al. (2002). A guide to building secure Web applications. The Open Web Application Security Project, 1.
Dierks, T., & Allen, C. (1999). The TLS protocol – Version 1.0. IETF RFC 2246.

Doliner, M. (2010). Cobertura. Retrieved October 22, 2010, from http://cobertura.sourceforge.net/

Dustdar, S., & Schreiner, W. (2005). A survey on Web services composition. International Journal of Web and Grid Services, 1(1), 1–30. Geneva: Inderscience Publishers. doi:10.1504/IJWGS.2005.007545

Fagan, M. E. (1976). Design and code inspections to reduce errors in program development. IBM Systems Journal, 15(3), 182–211. doi:10.1147/sj.153.0182

Fogie, S., Grossman, J., Hansen, R., Rager, A., & Petkov, P. D. (2007). XSS attacks: Cross site scripting exploits and defense. Syngress Publishing.

Fortify Software. (2008). Fortify 360 software security assurance. Retrieved October 22, 2010, from http://www.fortify.com/products/fortify-360/

Foundstone, Inc. (2005). Foundstone WSDigger. Foundstone Free Tools. Retrieved October 22, 2010, from http://www.foundstone.com/us/resources/proddesc/wsdigger.htm

Freedman, D. P., & Weinberg, G. M. (2000). Handbook of walkthroughs, inspections, and technical reviews: Evaluating programs, projects, and products. Dorset House Publishing Co., Inc. Retrieved October 22, 2010, from http://portal.acm.org/citation.cfm?id=556043#

Freier, A. O., Karlton, P., & Kocher, P. C. (1996). The SSL protocol version 3.0. Internet Draft.

Halfond, W. G., & Orso, A. (2005). AMNESIA: Analysis and monitoring for NEutralizing SQL-injection attacks. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (p. 183).
Holgersson, J., & Söderström, E. (2005). Web service security - Vulnerabilities and threats within the context of WS-Security (pp. 138–146).

Howard, M., LeBlanc, D., & Viega, J. (2005). 19 deadly sins of software security: Programming flaws and how to fix them. New York, NY, USA: McGraw-Hill, Inc.

HP. (2008). HP WebInspect. Retrieved October 22, 2010, from https://h10078.www1.hp.com/cda/hpms/display/main/hpms_content.jsp?zn=bto&cp=1-11-201200%5E9570_4000_100__

IBM. (2008). IBM Rational AppScan. Retrieved October 22, 2010, from http://www-01.ibm.com/software/awdtools/appscan/

Jensen, M., Gruschka, N., Herkenhoner, R., & Luttenberger, N. (2007). SOA and Web services: New technologies, new standards - New attacks. In ECOWS '07: Proceedings of the Fifth European Conference on Web Services (pp. 35–44). Washington, DC, USA: IEEE Computer Society.

JetBrains. (2009). IntelliJ IDEA. Retrieved October 22, 2010, from http://www.jetbrains.com/idea/free_java_ide.html

Jovanovic, N., Kruegel, C., & Kirda, E. (2006). Pixy: A static analysis tool for detecting Web application vulnerabilities (short paper). In IEEE Symposium on Security and Privacy (pp. 258-263). Berkeley/Oakland, California: IEEE Computer Society.

Keidl, M., & Kemper, A. (2004). Towards context-aware adaptable Web services. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers, New York, NY, USA.

Kent, S. (2005). RFC 4302 - IP authentication header. Internet Engineering Task Force (IETF). Retrieved October 22, 2010, from http://www.ietf.org/rfc/rfc4302.txt.
Kent, S., & Atkinson, R. (1998). RFC 2402 - IP authentication header. Internet Engineering Task Force (IETF). Retrieved October 22, 2010, from http://www.ietf.org/rfc/rfc2402.txt.

Kent, S., & Seo, K. (2005). Security architecture for the Internet protocol. Network Working Group - Request for Comments 4301, December 2005. Retrieved October 22, 2010, from http://tools.ietf.org/html/rfc4301.

Knap, T., & Mlýnková, I. (2009). Towards more secure Web services: Pitfalls of various approaches to XML Signature verification process. In ICWS '09: Proceedings of the 2009 IEEE International Conference on Web Services (pp. 543–550). Washington, DC, USA: IEEE Computer Society.

Knudsen, L. (2005). SMASH - A cryptographic hash function. In Fast Software Encryption: 12th International Workshop, FSE 2005, volume 3557 of Lecture Notes in Computer Science (pp. 228-242). Springer.

Lai, J., Wu, J., Chen, S., Wu, C., & Yang, C. (2008). Designing a taxonomy of Web attacks. In Proceedings of the 2008 International Conference on Convergence and Hybrid Information Technology, ICHIT (pp. 278-282). Washington, DC: IEEE Computer Society.

Laranjeiro, N., Vieira, M., & Madeira, H. (2009). Protecting database centric Web services against SQL/XPath injection attacks. In Database and Expert Systems Applications (pp. 271–278).

Lindstrom, P. (2004). Attacking and defending Web services. A Spire research report.

Mashood, M., & Wikramanayake, G. (2007). Architecting secure Web services through policies. In International Conference on Industrial and Information Systems, ICIIS 2007 (pp. 5-10).

Merrill, D., & Grimshaw, A. (2009). Profiles for conveying the secure communication requirements of Web services. Concurrency and Computation, 21(8), 991–1011. doi:10.1002/cpe.1403
Microsoft. (2002). Security in a Web services world: A proposed architecture and roadmap. Retrieved October 22, 2010, from http://msdn.microsoft.com/en-us/library/ms977312.aspx.

Mihindukulasooriya, N. (2008). Understanding WS-Security policy language. WSO2. Retrieved October 22, 2010, from http://wso2.org/library/3132.

Mirkovic, J., & Reiher, P. (2004). A taxonomy of DDoS attack and DDoS defense mechanisms. SIGCOMM Computer Communication Review, 34(2), 39–53. doi:10.1145/997150.997156

Mogollon, M. (2008). Cryptography and security services: Mechanisms and applications. IGI Global. doi:10.4018/978-1-59904-837-6

Mukherjee, A., & Siewiorek, D. P. (1997). Measuring software dependability by robustness benchmarking. IEEE Transactions on Software Engineering, 23(6). doi:10.1109/32.601075

Nagappan, R., Skoczylas, R., & Sriganesh, R. P. (2003). Developing Java Web services. New York, NY, USA: John Wiley & Sons, Inc.

Newkirk, J. W., & Vorontsov, A. A. (2004). Test-driven development in Microsoft .NET. Microsoft Press. Retrieved October 22, 2010, from http://portal.acm.org/citation.cfm?id=983793#

Nordbotten, N. A. (2009). XML and Web services security standards. IEEE Communications Surveys & Tutorials, 11(3), 4–21.

O'Neill, M. (2003). Web services security. New York, NY, USA: McGraw-Hill, Inc.

OASIS. (2006a). Web services security (WSS) TC. Retrieved October 22, 2010, from http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wss.

OASIS. (2007a). WS-SecurityPolicy 1.2. Retrieved October 22, 2010, from http://docs.oasis-open.org/ws-sx/ws-securitypolicy/v1.2/ws-securitypolicy.html.
OASIS. (2007c). WS-Trust 1.3. Retrieved October 22, 2010, from http://docs.oasis-open.org/ws-sx/ws-trust/200512.

OASIS. (2007d). WS-SecureConversation 1.3. Retrieved October 22, 2010, from http://docs.oasis-open.org/ws-sx/ws-secureconversation/v1.3/ws-secureconversation.html.

OASIS. (2008b). Web services federation (WSFED) TC. Retrieved October 22, 2010, from http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsfed.

Ort, E. (2005). Service-oriented architecture and Web services: Concepts, technologies, and tools. Sun Microsystems. Retrieved October 22, 2010, from http://java.sun.com/developer/technicalArticles/WebServices/soa2/

OWASP. (2008). OWASP WSFuzzer project. Retrieved October 22, 2010, from http://www.owasp.org/index.php/Category:OWASP_WSFuzzer_Project

OWASP. (2010). OWASP top 10 2010. Retrieved October 22, 2010, from http://www.owasp.org/index.php/Top_10_2010-Main.

PCI. (2009). Payment Card Industry (PCI) data security standard - Requirements and security assessment procedures - Version 1.2.1. PCI Security Standards Council.

Pramstaller, N., Rechberger, C., & Rijmen, V. (2005). Breaking a new hash function design strategy called SMASH. In Selected Areas in Cryptography, 12th International Workshop, SAC 2005, volume 3897 of Lecture Notes in Computer Science (pp. 234-244). Springer.

Ravi, S., Raghunathan, A., Kocher, P., & Hattangady, S. (2004). Security in embedded systems: Design challenges. ACM Transactions on Embedded Computing Systems, 3(3), 461–491. doi:10.1145/1015047.1015049
Scovetta, M. (2008). Yasca: Yet another source code analyzer. Retrieved October 22, 2010, from http://www.yasca.org

Siddiqui, B. (2002). Exploring XML Encryption, part 1. IBM Corporation. Retrieved October 22, 2010, from http://www.ibm.com/developerworks/xml/library/x-encrypt/.

Sidharth, N., & Liu, J. (2008). Intrusion resistant SOAP messaging with IAPF. In APSCC '08: Proceedings of the 2008 IEEE Asia-Pacific Services Computing Conference (pp. 856–862). Washington, DC, USA: IEEE Computer Society.

Stuttard, D., & Pinto, M. (2007). The Web application hacker's handbook: Discovering and exploiting security flaws. John Wiley & Sons, Inc.

Thayer, R., Doraswamy, N., & Glenn, R. (1998). RFC 2411 - IP security document roadmap. National Institute of Standards and Technology (NIST). Retrieved October 22, 2010, from http://csrc.nist.gov/archive/ipsec/papers/rfc2411-roadmap.txt

University of Maryland. (2009). FindBugs™ - Find bugs in Java programs. Retrieved October 22, 2010, from http://findbugs.sourceforge.net/
380
Uto, N., & Melo, S. P. (2009). Vulnerabilidades em aplicações Web e mecanismos de proteção. In Minicursos SBSeg 2009 (pp. 237–283). IX Simpósio Brasileiro em Segurança da Informação e de Sistemas Computacionais. Vieira, M., Antunes, N., & Madeira, H. (2009). Using web security scanners to detect vulnerabilities in web services. In IEEE/IFIP International Conference on Dependable Systems & Networks, 2009. DSN ‘09. (pp. 566-571). Presented at the IEEE/IFIP International Conference on Dependable Systems & Networks, 2009. DSN ‘09. Vieira, M., Laranjeiro, N., & Madeira, H. (2007). Assessing robustness of web-services infrastructures. In 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2007. DSN’07 (pp. 131–136). Yue-Sheng, G., Bao-Jian, Z., & Wu, X. (2009). Research and realization of Web services security based on XML Signature. International Conference on Networking and Digital Society, v. 2, p. 116–118.
381
Chapter 17
Approaches to Functional, Structural and Security SOA Testing Cesare Bartolini Consiglio Nazionale delle Ricerche, Italy Antonia Bertolino Consiglio Nazionale delle Ricerche, Italy Francesca Lonetti Consiglio Nazionale delle Ricerche, Italy Eda Marchetti Consiglio Nazionale delle Ricerche, Italy

DOI: 10.4018/978-1-60960-794-4.ch017
ABSTRACT

In this chapter, we provide an overview of recently proposed approaches and tools for functional and structural testing of SOA services. Typically, these two classes of approaches have been considered separately. However, since they focus on different perspectives, they are generally non-conflicting and could be used in a complementary way. Accordingly, we make an attempt at such a combination, briefly showing the approach and some preliminary results of the experimentation. The combined approach provides encouraging results from the point of view of the achievements and the degree of automation obtained. A very important concern in designing and developing web services is security. In this chapter we also discuss the security testing challenges and the currently proposed solutions.
INTRODUCTION

Traditional testing approaches are divided into two major classes: functional and structural. Functional approaches provide the ability to verify the proper behaviour of services in order to assess
and validate their functionality. They treat the applications under test as black boxes, focusing on the externally visible behaviour but ignoring the internal structure. Structural approaches, also known as white-box testing, are on the other hand a well-known, valuable complement to functional ones.
Coverage information can provide an indication of the thoroughness of the executed test cases, and can help identify test cases, additional to the functional ones, which exercise unexecuted paths and hence might help detect further faults. The same division is still valid considering SOA: functional approaches are applied in SOA testing either to show the conformance to a user-provided specification, or according to their fault detection ability, assessed on fault models. In this chapter, we overview some of the existing proposals that derive functional test cases using formal specifications, contracts, or the WSDL language. Concerning structural testing of SOA, two different points of view can be identified: coverage measured at the level of a service composition (orchestration or choreography), and coverage of a single service. Generally, validation of service orchestrations is based on the Business Process Execution Language description, considered as an extended control flow diagram. Classical techniques of structural coverage (e.g., control flow or dataflow) can be used to guide test generation or to assess test coverage, so as to take into consideration the peculiarities of the Business Process Execution Language. Other proposals are instead based on formal specifications of the workflows, e.g., Petri Nets and Finite State Processes, used for verifying specific service properties. Considering a service choreography, existing research focuses, among others, on service modeling, process flow modeling, violation detection of properties such as atomicity and resource constraints, and XML-based test derivation. While there are several approaches for structural testing of service compositions, there are few proposals for deriving structural coverage measures of the invoked services. The reason is that independent web services usually provide just an interface, enough to invoke them and develop some general (black-box) tests, but insufficient for a tester to develop an adequate understanding of the integration quality between the application and the independent web services.
We describe an approach that addresses this deficit by “whitening” service testing through the addition of an intermediate coverage service. In this chapter, we first provide a survey of some proposed approaches and tools for supporting SOA functional and structural testing. Then, we propose an example of application of a selected black-box approach and a selected white-box approach to a case study, for comparative purposes. In SOA services, another important aspect that must be carefully checked is security, since functional and structural testing, albeit successfully executed, do not prevent security weaknesses. Different testing strategies, usually divided into passive and active mechanisms, can be used to provide evidence on security-related issues, i.e., that an application meets its requirements in the presence of hostile and malicious inputs. We will survey the most commonly adopted methodologies and techniques, such as fuzz testing, injection, and web services security extensions. An important facet of security information management in web applications is the control of accesses. We will hence include testing methodologies exploiting the specification of access control policies by means of policy languages, such as the eXtensible Access Control Markup Language (XACML) or Role-Based Access Control (RBAC). We will overview current proposals for access control policy testing, dealing with the classification of the possible policy faults and the development of the corresponding fault model, the application of standard or ad hoc test coverage criteria to measure the adequacy of a test suite, and the automated generation of test cases using, for example, change-impact analysis, random heuristics, or model-based approaches.
Functional Testing

The functional testing of Web Services is a critical issue both in research and in the IT industry. Nowadays, business and social welfare increasingly depend on the proper functioning of
services delivered over the Net. Therefore, WSs need to be thoroughly tested before deployment. In this section we will provide an overview of the recent proposals and tools for the functional testing of Web Services. As a common guideline, this kind of testing relies on the functionality provided by the Web Services. Commonly, the available (formal) specification or XML Schema datatypes allow the generation of test cases for boundary-value analysis, equivalence class testing, or random testing. The derived test suite could have different purposes, such as to prove the conformance to a user-provided specification, to show the fault detection ability assessed on fault models, or to verify the interoperability by exploiting the invocation syntax defined in an associated WSDL (WS Description Language) document (WSDL, 2007). In this last case, the formalized WSDL description of service operations and of their input and output parameters can be taken as a reference for black-box testing at the service interface. Considering conformance testing, current proposals include the Coyote framework (Tsai, Paul, Song, & Cao, 2002), which requires the user to build test cases as Message Sequence Charts (Harel & Thiagarajan, 2004). A contract specification of the services under test is also necessary. In (Jiang, Hou, Shan, Zhang, & Xie, 2005) the authors focused on the fault detection ability of the test data, which is assessed using mutation on contracts. A simpler approach is proposed in (Siblini & Mansour, 2005), where mutations are defined on the WSDL language. However, the mutation operations adopted in both approaches do not correspond to any widely accepted fault model. Another approach, testing the conformance of a WS instance to an access protocol described by a UML 2.0 Protocol State Machine (PSM), is presented in (Bertolino, Frantzen, Polini, & Tretmans, 2006) and (Bertolino & Polini, 2005). This protocol defines how the service provided by a component can be accessed by a client through
its ports and interfaces. The PSM is translated into a Symbolic Transition System (STS), on which existing formal testing theory and tools can be applied for conformance evaluation. For instance, in (Frantzen & Tretmans, 2006), STSs are used to specify the behavior of communicating WS ports, and test data are generated to check the conformity of the actual implementation to such a specification. The advantage of this approach is that it uses a standard notation (UML 2.0 PSM) for the protocol specification. Based on the use of WSDL and XML Schema datatype information, the WSTD-Gen tool, developed by (Li, Zhu, Zhang, & Mitsumori, 2009), generates test cases by exploiting the information extracted from the WSDL specifications together with user knowledge. The tool also allows customizing the data types and selecting the test generation rules. The peculiarity of this approach is that it focuses on proving the non-existence of known errors in the system according to a given fault model, so erroneous test data are generated intentionally. WSDL-based test data generation is considered in other proposed approaches, such as (Ma, Du, Zhang, Hu, & Cai, 2008; Sneed & Huang, 2006; Bai, Dong, Tsai, & Chen, 2005; Heckel & Mariani, 2005), which mainly focus on the definition of test cases for testing a single web service operation. Some commercial tools for testing WSDL descriptions of Web services are also available, such as soapUI (Eviware, n.d.), TestMaker (PushToTest, n.d.), and Parasoft (Parasoft, n.d.). In particular, soapUI is acclaimed on its distribution site as the most used tool for WS testing. It can automatically produce a skeleton of a WS test case and provide support for its execution and result analysis. The tester’s job is certainly greatly eased by the use of this or similar tools; however, the produced test cases are incomplete, lacking the input parameter values and the expected outputs. Moreover, soapUI can also measure the coverage of WS operations, but again the generation of diverse test messages for adequately exercising
operations and data combinations is left to the human testers. From our study of the literature, we find it somewhat surprising that WS test automation has not been pushed further as of today. Indeed, the XML-based syntax of WSDL documents could support fully automated WS test generation by means of traditional syntax-based testing approaches. In this direction, we have recently proposed an approach (Bartolini, Bertolino, Marchetti, & Polini, 2009) providing a sort of “turn-key” generation of WS test suites. This approach relies on a practical yet powerful tool for the fully automated generation of WS test inputs. The key idea is to combine the coverage of WS operations (as provided by soapUI) with well-established strategies for data-driven test input generation. The resulting tool, called WS-TAXI, is obtained by integrating two existing tools: soapUI, presented above, and TAXI (Bertolino, Gao, Marchetti, & Polini, 2007), an application for the automated derivation of XML instances from an XML Schema (the interested reader can refer to (Bertolino, Gao, Marchetti, & Polini, 2007) for details on the TAXI implementation). The original notion at the basis of WS-TAXI, in comparison with soapUI and other existing WS test tools, is the inclusion of a systematic strategy for test generation based on well-established principles of the testing discipline, such as equivalence partitioning, boundary analysis, and combinatorial testing. The test cases generated by the approaches or tools surveyed so far are scoped for testing a single operation at a time, and possible dependencies among the test cases are not considered. The combination of different test cases according to behavioral dependencies is considered in (Bai, Dong, Tsai, & Chen, 2005). In particular, the use of specifications containing behavioral information is typical of semantic web service testing and ontology-based test data generation. The use of semantic models such as OWL-S for test
data generation is proposed, for instance, in (Bai, Lee, Tsai, & Chen, 2008), while ontology-based test data generation is proposed in (Wang, Bai, Li, & Huang, 2007).
Structural Testing

As we discussed in the Introduction, structural testing of SOA applications is most often devoted to the validation of Web Service Compositions (WSC). Validation of WSCs has been addressed by several authors suggesting structural coverage testing of a WSC specification. In several approaches, it is assumed that the WSC is provided in BPEL (OASIS WSBPEL, 2007), the Business Process Execution Language, a standard for programming WSCs. In (Yuan, Li, & Sun, 2006), the BPEL description is abstracted as an extended control flow diagram; paths over this diagram can be used to guide test generation or to assess test coverage. A major issue in this approach comes from the parallelism in BPEL, which results in a much more complex control flow and a very high number of paths. To cope with this problem, some proposals make several simplifying assumptions on BPEL and consider only a subset of the language. In (García-Fanjul, Tuya, & de la Riva, 2006) a transformation is proposed from BPEL to Promela (similarly to (Cao, Ying, & Du, 2006)). The resulting abstract model is used to generate tests guided by structural coverage criteria (e.g., transition coverage). Similar complexity explosion problems may be encountered in such methods, since the number of states and transitions of the target model can be very high. Different approaches for conformance testing that exploit model-checking techniques are available. These works usually focus on the generation of test cases from counterexamples given by the model checker. Considering in particular the usage of SPIN, the work of (García-Fanjul et al., 2006) generates test case specifications for compositions given in BPEL by systematically applying
the transition coverage criterion. (Zheng, Zhou, & Krause, 2007) transforms each BPEL activity into automata, which are subsequently transformed into Promela, the input format of the SPIN model checker. Similarly, (Fu, Bultan, & Su, 2004) uses the SPIN model checker to verify BPEL, but without translating directly to Promela as in (García-Fanjul, Tuya, & de la Riva, 2006): in this case the BPEL is first translated into guard conditions, which are then transformed into Promela. When a formal model of the WSC and of the required properties is provided, a formal proof can be carried out. For instance, Petri Nets can be built from workflows (Narayanan & McIlraith, 2003), from BPEL processes (Yang, Tan, Yong, Liu, & Yu, 2006), or with other approaches (Pistore, Marconi, Bertoli, & Traverso, 2005), to verify properties such as reachability. In (Foster, Uchitel, Magee, & Kremer, 2003), the workflow is specified in BPEL and an additional functional specification is provided as a set of Message Sequence Charts (MSC) (Harel & Thiagarajan, 2004). These specifications are translated into the Finite State Processes (FSP) notation. Model checking is performed to detect execution scenarios allowed in the MSC description but not executable in the workflow, and vice versa. The complexity of the involved models and model-checking algorithms is the main concern that in practice makes these approaches hardly applicable to real-world WSCs. The above investigations use models of the composition behavior and of properties or scenarios expressing the user expectations (MSCs or state-based properties such as the absence of deadlock): faults manifest themselves as discrepancies between the system behavior and these expectations. An interesting characterization is proposed in (Weiss, Esfandiari, & Luo, 2007), where failures of WSCs are considered as interactions between WSs, similarly to feature interactions in telecommunication services. Interactions are classified, for instance, as goal conflicts or as resource contention.
Orthogonal approaches are based on the transformation of BPEL into the Intermediate Format Language (IF), as presented for instance in (Lallali, Zaidi, & Cavalli, 2008). In these approaches the test cases are generated using the TestGen-IF tool (Cavalli, Montes De Oca, Mallouli, & Lallali, 2008). More recently, WSOFT (Web Service composition Online Testing Framework) (Cao, Felix, & Castane, 2010) has provided a framework that combines test execution and the debugging of test cases, with a focus on unit testing. Specifically, the WSOFT approach tests a composition in which all partners are simulated by the framework itself. Timing constraints and synchronous time delays are also considered; in particular, timing constraints are expressed by the TEFSM (Timed Extended Finite State Machine) formal specification. The same authors present in (Cao, Felix, Castane, & Berrada, 2010) a similar framework that automatically generates and executes tests “online” for conformance testing of a composition of Web services described in BPEL. The framework considers unit testing and is based on a timed modeling of the BPEL specification. Considering specifically the verification of the data transformations involved in WSC execution, few proposals are available. From the modeling point of view, this lack has been outlined in (Marconi, Pistore, & Traverso, 2006), where a model representing the data transformations performed during the execution of a WSC is proposed. This model can be used together with behavioral specifications to automate the WSC process. In (Bartolini, Bertolino, Marchetti, & Parissis, 2008) the authors focus on testing based on data-related models and outline future research issues on the perspectives opened by dataflow-based validation of WSs. In particular, the authors discuss two possible approaches based on dataflow modeling and analysis for the validation of WSCs: a methodology using a dataflow model derived from the requirements
specification, and a technique based on a dataflow model extracted from the BPEL description of the composition. However, in the context of Service Oriented Architectures (SOAs), where independent web services can be composed with other services to provide richer functionality, interoperability testing becomes a major challenge. Independent web services usually provide just an interface, enough to invoke them and develop some general functional (black-box) tests, but insufficient for a tester to develop an adequate understanding of the integration quality between the application and the independent web services. To address this gap, in (Bartolini, Bertolino, Elbaum, & Marchetti, 2009) the authors proposed a “whitening” approach to make web services more transparent through the addition of an intermediate coverage service. The approach, named Service Oriented Coverage Testing (SOCT), provides a tester with feedback about how a whitened service, called a Testable Service, is exercised.
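To make the SOCT idea more concrete, the following is a minimal sketch of what a coverage probe inside a Testable Service might look like. The class name, probe identifiers, endpoint URL, and wire format are our own illustrative assumptions, not the actual SOCT/TCov implementation.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative coverage probe: the instrumented service calls hit() at the
 *  start of each code block; counters are periodically flushed to TCov. */
public final class CoverageProbe {

    // Hypothetical TCov endpoint; the real SOCT/TCov interface may differ.
    private static final String TCOV_URL = "http://tcov.example.org/report";
    private static final Map<String, Integer> hits = new ConcurrentHashMap<>();

    private CoverageProbe() { }

    /** Inserted by instrumentation, e.g. CoverageProbe.hit("bestFare:block_17"). */
    public static void hit(String probeId) {
        hits.merge(probeId, 1, Integer::sum);
    }

    /** Sends the collected counters to the TCov service as key=value lines. */
    public static void flush(String serviceId) throws Exception {
        StringBuilder body = new StringBuilder("service=" + serviceId + "\n");
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            body.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        HttpURLConnection con = (HttpURLConnection) new URL(TCOV_URL).openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(body.toString().getBytes(StandardCharsets.UTF_8));
        }
        con.getResponseCode(); // force the request; error handling omitted
        hits.clear();
    }
}

Note how the provider controls the granularity simply by deciding where hit() calls are placed, which is exactly the property that lets SOCT avoid disclosing sensitive structural detail.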
COMBINING FUNCTIONAL AND STRUCTURAL APPROACHES

In this section, we apply some of the approaches presented in this chapter, in order to evaluate a combined methodology against a sample web service. Since several approaches cover different steps of the testing process, it is possible to merge them without conflict, obtaining the benefits of each. Specifically, the experiment proposed here explores the coverage analysis of a web service after choosing a functional technique for generating the test suite.
Applied Approaches

In this section we provide some additional details about the applied testing techniques. In particular, we consider the already introduced WS-TAXI (Bartolini, Bertolino, Marchetti, & Polini, 2009)
and SOCT (Bartolini, Bertolino, Elbaum, & Marchetti, 2009) approaches, and apply them to a common case study. As said, the purpose of SOCT is to mitigate the limitations of on-line testing by enabling the application of traditional white-box testing techniques to SOA applications. The testing scenario envisioned for SOCT is depicted in Figure 1.

Figure 1. Overview of the SOCT paradigm

The traditional actors in SOA testing (Canfora & Di Penta, 2009) are the service provider, who can test their service before deployment, and the service integrator, who has to test the orchestrated services. The TCov service (Bartolini, Bertolino, Elbaum, & Marchetti, 2009) sits between these two parties: the TCov provider can be assumed to be a trusted party that delivers coverage information about a service as the service is exercised during testing. To realize the SOCT scenario, the provided services must be instrumented (callout 1 in Figure 1) to enable the collection of coverage data, not differently from how instrumentation is normally performed for traditional white-box testing. As the developer invokes the services during on-line testing (callout 2), coverage measures are collected from the services and sent (callout 3) to TCov, which is then responsible for processing the information and making it available to the developer as a service (callout 4). It is obvious that any means to reveal more of the structure of a service will increase testability.
However, SOCT is designed not to reveal anything that would damage an industrial asset of the provider of the instrumented services, because the granularity of the coverage information is a provider’s choice.

For generating the test suite, the experiments were carried out with the aid of WS-TAXI, based on the TAXI tool. TAXI can generate conforming XML instances from an XML Schema. To generate the instances, it uses a modified version of the Category Partition (CP) algorithm (Offutt & Xu, 2004) to find all possible structures for the elements by adopting a systematic black-box criterion. CP splits the functional specifications of the service into categories and further into choices. Categories represent the parameters and conditions which are relevant to the algorithm for testing purposes, while choices are the possible values (either valid or invalid) which can be assigned to each category. CP also includes constraints to avoid redundant tests in the test suite. In particular, TAXI processes the schema by applying the following rules:

• choice elements are processed by generating instances with every possible child; multiple choice elements produce a combinatorial number of instances. This ensures that the set of sub-schemas represents all possible structures derivable from choice;
• element occurrences are analyzed, and their constraints determined, from the XML Schema definition; boundary values for minOccurs and maxOccurs are defined;
• all elements are processed by generating a random sequence of all the child elements for the instance; this sequence is then used when assigning values to each element.
Exploiting the information collected so far and the structure of the (sub)schema, TAXI derives a set of intermediate instances by combining the
occurrence values assigned to each element. The final instances are derived from the intermediate ones by assigning values to the various elements. Two approaches can be adopted: values can either be picked randomly from those stored in an associated database, or generated randomly (e.g., a random sequence of characters for a string element) if the database contains no value for an element. Since the number of instances with different structures could be huge, in the current implementation TAXI only selects one value per element for each instance.
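As an illustration of the structural step just described, the following toy sketch (not TAXI code; the two-element schema and element names are hypothetical) combines the boundary occurrence values of each element into the set of intermediate instance structures.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of TAXI's structural step: for each element the boundary
 *  occurrence counts (minOccurs, maxOccurs) are selected and all their
 *  combinations are produced as "intermediate instances". */
public class OccurrenceCombinations {

    public static void main(String[] args) {
        // element name -> {minOccurs, maxOccurs}, as read from the XML Schema
        Map<String, int[]> elements = new LinkedHashMap<>();
        elements.put("returnDate", new int[]{0, 1});
        elements.put("fareClass",  new int[]{0, 3});

        List<Map<String, Integer>> instances = new ArrayList<>();
        instances.add(new LinkedHashMap<>());
        for (Map.Entry<String, int[]> e : elements.entrySet()) {
            List<Map<String, Integer>> next = new ArrayList<>();
            // boundary values only: minOccurs and maxOccurs
            for (int occ : new int[]{e.getValue()[0], e.getValue()[1]}) {
                for (Map<String, Integer> partial : instances) {
                    Map<String, Integer> extended = new LinkedHashMap<>(partial);
                    extended.put(e.getKey(), occ);
                    next.add(extended);
                }
            }
            instances = next;
        }
        // Each combination is a distinct structure; values are filled in later.
        instances.forEach(System.out::println);
        // -> {returnDate=0, fareClass=0}, {returnDate=1, fareClass=0}, ...
    }
}

The final value-assignment step (from the database, or random generation) would then be applied once per structure, matching the one-value-per-element policy mentioned above.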
Testbed

The testbed selected for this experimentation was a modified version of WorldTravel (WorldTravel, 2008). WorldTravel is a web service developed at the Georgia Institute of Technology, specifically to be used as a testbed for research on web services. From a functional point of view, WorldTravel provides search facilities for flights within the United States and Canada. The software is written in Java and runs on an Apache Geronimo application server with the Tomcat servlet container. Although it is designed to run both on Windows and Linux operating systems, only the Linux setup was used for these tests. To retrieve flight information, WorldTravel relies on a MySQL database, which in the current experimentation was running on the same computer as the Geronimo server. The database contains much more data than the service actually uses. As previously mentioned, the WorldTravel software was slightly modified. First, some bugs had to be fixed. Second, some improvements were made with respect to the original version: changes included additional checks on the input data, resulting in error messages if wrong input is provided. Finally, the code was instrumented for applying SOCT. Probes were introduced for two different types of coverage analysis: block coverage and branch coverage.
An additional piece of software used for this experimentation is TCov. Technically, it is a simple web service developed in PHP, running on an Apache web server with a MySQL database, which collects coverage data for the instrumented web service and allows the user to retrieve those data in summarized reports. In the setup used for this work, TCov resides on a different server from the one running WorldTravel. Since the point of view of this experimentation is that of the service integrator, i.e., a developer who creates a service composition, another service was created that uses WorldTravel as a backend. This service, which will be referred to as SO (standing for Service Orchestration) from now on, simulates the behavior of a travel agency, with the purpose of allowing customers to check flight availability according to the trip, the dates, and the desired fare class and/or carrier. The experiments have been run against this service, which uses WorldTravel and is set up to take advantage of the TCov instrumentation to evaluate the coverage reached.
TEST SUITE GENERATION

The first step consisted in developing the set of test cases to run on the web service. To do this, the web service tester can normally rely on two types of information provided by the web service provider: the functional specification and the WSDL interface. The former contains detailed information on what the web service expects as input data and how these data should be combined. The latter is a machine-readable specification which, among other things, contains the exact structure of the input and output SOAP messages. Although the two pieces of information are quite similar, there are important differences: the functional specification can be expressed at various degrees of detail, but is usually defined at a higher level, while the WSDL interface contains
structural details (ranges, patterns, and so on) which are normally not highlighted in the specification. This difference is reflected in the two different types of functional testing approach which can be applied for test suite generation: one based on the specification, and the other based on the WSDL interface. This led to two different test suites generated for SO, one created through the Category Partition (CP) algorithm, the other using the WS-TAXI tool on the WSDL interface. In the following, the two test suites will be referred to as TS1 and TS2, respectively.

SO receives queries with specific attributes (one-way or round trip, dates, acceptable fare classes, and desired airlines), and finds the best fares among the flights meeting the search criteria. SO has a single operation, FareResponse bestFare(FareRequest), which takes the following input data as part of the FareRequest object:

• the originating airport (string)
• the destination airport (string)
• the date of the flight to the destination (dateTime)
• [0..1] the date for the return flight (dateTime), if needed for a round-trip flight
• [0..3] the requested fare classes (enum: First, Business, Economy)
• [0..*] the requested airlines

The data returned in the FareResponse object contains zero or one flight, which includes:

• [1..2] trips, each containing the flight name, date, origin and destination
• the fare class (string)
• the cost of the flight (int)
The generation of TS1, as mentioned, is based on the aforementioned Category Partition algorithm. For each component of the FareRequest object, the following situations were considered: all possible valid structures, missing element, null element, structural error, syntactic error.
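A minimal sketch of how such a suite can be assembled is shown below. The categories and choices are a reduced, illustrative subset of our own invention (the actual TS1 uses the full specification and yields 31 tests): valid choices are combined combinatorially, and each invalid choice contributes one additional test.

import java.util.ArrayList;
import java.util.List;

/** Sketch of the TS1 construction principle; category and choice names are
 *  illustrative, not the chapter's actual ones. */
public class CategoryPartition {

    record Category(String name, List<String> valid, List<String> invalid) { }

    public static void main(String[] args) {
        List<Category> categories = List.of(
            new Category("tripType",   List.of("one-way", "round-trip"),
                                       List.of()),
            new Category("returnDate", List.of("absent", "valid date"),
                                       List.of("empty element", "bad format",
                                               "before onward flight")),
            new Category("fareClass",  List.of("First", "Business", "Economy"),
                                       List.of("unknown class")));

        List<List<String>> tests = new ArrayList<>();
        // combinatorial composition of all valid choices
        combine(categories, 0, new ArrayList<>(), tests);
        // one additional test per invalid choice (other categories held valid)
        for (Category c : categories)
            for (String bad : c.invalid()) {
                List<String> t = new ArrayList<>();
                for (Category o : categories)
                    t.add(o == c ? bad : o.valid().get(0));
                tests.add(t);
            }
        System.out.println(tests.size() + " tests"); // 2*2*3 + 4 = 16 here
    }

    static void combine(List<Category> cs, int i, List<String> acc,
                        List<List<String>> out) {
        if (i == cs.size()) { out.add(new ArrayList<>(acc)); return; }
        for (String v : cs.get(i).valid()) {
            acc.add(v);
            combine(cs, i + 1, acc, out);
            acc.remove(acc.size() - 1);
        }
    }
}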
For example, for the date of the return flight, the following cases were taken into consideration: zero elements, one element with a correct date, one empty element, one element with a date in the wrong format, and one element with a date prior to the onward flight. The test suite was generated by making a combinatorial composition of all valid inputs, and by adding single test cases for invalid inputs. TS1 consists of a total of 31 tests.

TS2, on the other hand, has been generated using WS-TAXI. WS-TAXI is a set of tools which produce and execute a test suite using only the WSDL as an input. The process is the following:

1. the aforementioned soapUI tool is used to generate SOAP envelopes for the invocations of all operations in the WSDL. The non-commercial version of soapUI was used, but only to generate stubs of SOAP calls for the operations declared in a WSDL file. soapUI fills the contents of the envelope with placeholders for data, without assigning significant values. For example, string values are filled with random Latin words, while occurrences are managed simply by introducing a comment in the envelope notifying the minimum and maximum number of elements allowed at that spot, leaving to the user the endeavor of selecting a specific number of occurrences. Since the purpose of TS2 is to exploit all possible structures for the SOAP calls, this is clearly insufficient, so the envelopes generated by soapUI are completely stripped of their contents, which will be replaced in the following steps. Different software capable of generating SOAP envelopes from WSDL files, such as Altova XMLSpy (Altova, n.d.), could be used as an alternative;
2. a simple script is used to extract from the WSDL interface all information related to the data structure of the messages used in the operations. A WSDL generally has a section which contains or refers to an XML Schema,
and the WSDL messages (more precisely, their parts) contain elements which are described therein. This step produces a separate XSD file named after the WSDL (but obviously with a different extension);
3. the XSD file generated in step 2 is fed as an input to the TAXI tool. For these experiments, the database described previously was populated with the allowed origin and destination airports and the possible carriers; dates were generated randomly, while the fare classes (first, business, and economy) were stored in an enumeration in the schema (TAXI manages enumerations by assigning random values among those allowed). A total of 200 possible instances were generated for the FareRequest element;
4. a simple script takes all the XML instances and fills the envelopes created with soapUI with each of them, producing an equal number of SOAP input messages;
5. last, each of the SOAP messages generated in the previous step is sent to the web service, and the response (if it is received and does not time out) is compared against an oracle for correctness (a sketch of these last two steps is given below).

This process generates a test suite which normally is bigger than the one created with the CP algorithm, but the benefit is that most of the process is automated (except for step 1, since to the best of our knowledge there is no way to run soapUI from a command line, which would allow inserting it into a batch execution), and it requires much less handiwork from the tester. Additionally, there is a greater variability in the values with respect to TS1, which might be relevant in situations where the behavior of the service is data-dependent.
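The following sketch illustrates what steps 4 and 5 could look like. The endpoint URL, file locations, placeholder token, and SOAPAction value are illustrative assumptions, and the oracle check is reduced to recording the response status.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Sketch of steps 4-5: each TAXI-generated XML instance is wrapped into the
 *  stripped soapUI envelope and posted to the service under test. */
public class SendTestSuite {
    public static void main(String[] args) throws IOException, InterruptedException {
        String envelope = Files.readString(Path.of("bestFare-envelope.xml"));
        HttpClient client = HttpClient.newHttpClient();
        try (Stream<Path> instances = Files.list(Path.of("instances"))) {
            for (Path p : (Iterable<Path>) instances::iterator) {
                // hypothetical marker left where the envelope body was stripped
                String soap = envelope.replace("<!--BODY-->", Files.readString(p));
                HttpRequest req = HttpRequest.newBuilder()
                        .uri(URI.create("http://localhost:8080/so/bestFare"))
                        .header("Content-Type", "text/xml; charset=utf-8")
                        .header("SOAPAction", "\"bestFare\"")
                        .POST(HttpRequest.BodyPublishers.ofString(soap))
                        .build();
                HttpResponse<String> resp =
                        client.send(req, HttpResponse.BodyHandlers.ofString());
                // a real oracle would compare resp.body() against expected output
                System.out.println(p.getFileName() + " -> " + resp.statusCode());
            }
        }
    }
}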
TEST SUITES COMPARISON

The two methodologies for building TS1 and TS2 differ considerably in several respects:
• the effort required to build them: TS1 requires more work on the tester’s side, while TS2 is mostly automatic;
• the number of instances generated: TS1 is strictly dependent on the CP algorithm and on the granularity of the functional specification (the more detailed and close to the implementation details the specification is, the bigger the number of tests will be), whereas TS2 can produce an arbitrary number of instances (generally a lot more than TS1). At its limit, if the specification is extremely detailed, the two test suites would be made up of the same tests;
• the data variability: TS2 is limited only by the size of the database used by TAXI, while the building of TS1 is not data-driven;
• the structure of the invocations: TS1 also contains non-allowed data structures, which is normally not possible using WS-TAXI;
• the time required for executing the test suite: since TS2 can be much larger, it can take much longer to run all tests on the service.
Obviously, the question that arises is: which one is to be preferred? The answer is not easy, because some of the previous points favor TS1 while others would suggest TS2.

Figure 2. Coverage measures for TS1
The approach chosen to compare the two test suites was to execute each of them in turn and measure the coverage using TCov. Specifically, the incremental coverage was tested, meaning that a single testing session for each test suite was run, and the coverage reached was measured after each test. This is in line with the way TCov is supposed to be used, where a tester would execute tests which use all the features needed from the service and then measure how much of the service these features exercise. The results are shown in Figures 2 and 3. Both figures display block and branch coverage: the former figure shows the coverage measures for TS1, the latter those for TS2. The results highlight some important considerations. First off, it can be noted that both test suites eventually reach similar results. This is not unexpected, because both test suites in the end are built using some version of the CP algorithm, and the behavior of the service is not data-dependent. TS1 reaches its maximum coverage at the 31st test, measuring 78.125% block coverage and 68.235% branch coverage. TS2, instead, reaches its maximum at the 65th test, with 78.125% block coverage and 69.412% branch coverage. As expected, TS2 contains a lot more tests which do not exercise new parts of the web service, so
many tests are redundant. However, this is not a problem, as the test suite is built automatically. An issue which was not expected at first was that TS2 reached a (albeit slightly) higher branch coverage than TS1. This was considered worthy of closer examination. After some analysis of the detailed coverage results, and comparing those with the source code of WorldTravel, it was discovered that there was a set of input values which TS1 did not exercise whereas TS2 did. This was due to the degree of detail of the functional specification (which did not lead to that specific test in TS1), along with the higher variability in the values offered by TS2. Because this depends on how the test suites are built, and specifically on a missing piece of information in the functional specification, additional runs of each test suite would produce exactly the same coverage. However, the difference is very minor, resulting only in a single probe exercised by TS2 and not by TS1.

Figure 3. Coverage measures for TS2
Security Testing

Security aspects are highly critical in designing and developing web services. It is possible to distinguish at least two kinds of strategies for protecting the communication
among web services: security at the transport level and security at the message level. Enforcing security at the transport level means that the authenticity, integrity, and confidentiality of the message (e.g., the SOAP message) are completely delegated to the lower-level protocols that transport the message itself (e.g., HTTP + TLS/SSL) from the sender to the receiver. Such protocols use public-key techniques to authenticate both end points and agree on a symmetric key, which is then used to encrypt packets over the (transport) connection. Since SOAP messages may carry vital business information, their integrity and confidentiality need to be preserved, and exchanging SOAP messages in a meaningful and secured manner remains a challenging part of system integration. Unfortunately, those messages are prone to attacks based on on-the-fly modification of SOAP messages (XML rewriting attacks or XML injection) that can lead to several consequences, such as unauthorized access, disclosure of information, or identity theft. Message-level security within SOAP and web services is addressed in many standards, such as WS-Security (OASIS Standard Specification Web Services Security, 2006), which provides mechanisms to ensure end-to-end security and allows protecting sensitive parts of a SOAP message by means of XML Encryption (Imamura, Dillaway, & Simon, 2002) and XML Signature (Bartel, Boyer, Fox, LaMacchia, & Simon, 2008).
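As an illustration of message-level protection, the sketch below applies an enveloped XML Signature to a request document using the standard Java XML digital signature API (Java 11 or later for the RSA-SHA256 constant). Key management, certificate handling, and the WS-Security header packaging are deliberately omitted; the throwaway in-memory key pair and the file name are for illustration only.

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.util.Collections;
import javax.xml.crypto.dsig.*;
import javax.xml.crypto.dsig.dom.DOMSignContext;
import javax.xml.crypto.dsig.spec.C14NMethodParameterSpec;
import javax.xml.crypto.dsig.spec.TransformParameterSpec;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Minimal sketch of an enveloped XML Signature over an XML payload. */
public class SignMessage {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse("request.xml");

        XMLSignatureFactory fac = XMLSignatureFactory.getInstance("DOM");
        // Reference "" = the whole document, with an enveloped transform
        Reference ref = fac.newReference("",
                fac.newDigestMethod(DigestMethod.SHA256, null),
                Collections.singletonList(fac.newTransform(Transform.ENVELOPED,
                        (TransformParameterSpec) null)), null, null);
        SignedInfo si = fac.newSignedInfo(
                fac.newCanonicalizationMethod(CanonicalizationMethod.EXCLUSIVE,
                        (C14NMethodParameterSpec) null),
                fac.newSignatureMethod(SignatureMethod.RSA_SHA256, null),
                Collections.singletonList(ref));

        // Throwaway key pair, illustration only; real code uses a managed key.
        KeyPair kp = KeyPairGenerator.getInstance("RSA").generateKeyPair();
        DOMSignContext ctx = new DOMSignContext(kp.getPrivate(),
                doc.getDocumentElement());
        fac.newXMLSignature(si, null).sign(ctx); // appends a <Signature> element
        // Serializing the signed document back to a stream is omitted here.
    }
}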
The activity of fault detection is an important aspect of (web) service security. Indeed, most breaches are caused when a system component is used in an unexpected manner. Improperly tested code, executed in a way that the developer did not intend, is often the primary culprit for security vulnerabilities. Robustness and other related attributes of web services can be assessed through the testing phase, first by analyzing the WSDL document to know which faults could affect the robustness quality attribute of web services, and second by using fault-based testing techniques to detect such faults (Hanna & Munro, 2008). Focusing in particular on testing aspects, the different strategies and approaches that have been developed over the years can be divided into passive and active mechanisms. Passive mechanisms consist of observing and analyzing the messages that the component under test exchanges with its environment (Benharref, Dssouli, Serhani, & Glitho, 2009). In this approach, which has been especially used for fault management in networks (Lee, Netravali, Sabnani, Sugla, & John, 1997), the observer can be either on-line or off-line: an on-line observer collects and checks the exchanged input/output in real time, while an off-line observer uses the log files generated by the component itself. Recently, passive testing has been proposed as a good approach for checking whether a system respects its security policy, as in (Mallouli, Bessayah, Cavalli, & Benameur, 2008; Bhargavan, Fournet, & Gordon, 2008). In this case, formal languages have been used in order to give a verdict about the system's conformity with its security requirements. Active testing is based on the generation and application of specific test cases in order to detect faults. All these techniques have the purpose of providing evidence on security aspects, i.e.,
that an application meets its requirements in the presence of hostile and malicious inputs. Like functional testing, security testing relies on what is assumed to be the correct behaviour of the system, and on non-functional requirements. However, the complexity of (web) security testing is greater than that of functional testing, and the variety of aspects that should be taken into consideration during a testing phase implies the use of a variety of techniques and tools. An important role in software security is played by negative testing (Lyndsay, 2003), i.e., test executions attempting to show that the application does something that it is not supposed to do (Nyman, 2008). Negative tests can discover significant failures, produce strategic information about the model adopted for test case derivation, and provide overall confidence in the quality and security level of the system. Other commonly adopted methodologies and techniques include (Wong & Grzelak, 2006):

• fuzz testing;
• injection;
• policy-based testing.

We will briefly describe validation based on them in the following subsections.
Fuzz Testing

The word fuzzing conventionally refers to a black-box software testing method for identifying vulnerabilities in data handling. It involves generating semivalid data and submitting them in defined input fields or parameters (files, network protocols, API calls, and other targets) in an attempt to break the program and find bugs. Usually the term “fuzz” means feeding a program a set of inputs derived from random data, and then systematically identifying the failures that arise (Sutton, Greene, & Amini, 2007). Semivalid data are correct enough to keep parsers from immediately dismissing them, but
still invalid enough to cause problems. Fuzzing is useful in identifying the presence of common vulnerabilities in data handling, and the results of fuzzing can be exploited by an attacker to crash or hijack the program through a vulnerable data field. Yet, fuzzing covers a significant portion of negative test cases without forcing the tester to deal with each specific test case for a given boundary condition. Sutton et al. (2007) present a survey of fuzzing techniques and tools. In particular, fuzzing approaches can be divided into: data generation; environment variable and argument fuzzing; web application and server fuzzing; file format fuzzing (Kim, Choi, Lee, & Lee, 2008); network protocol fuzzing; web browser fuzzing; and in-memory fuzzing. In the specific area of web service security, fuzzing inputs can often be generated by programmatically analyzing the WSDL or sample SOAP requests and making modifications to the structure and content of valid requests. File fuzzing is also used for detecting XML data vulnerabilities. In particular, two main testing methods, generation and mutation, are adopted (Oehlert, 2005). XML-based fuzzing techniques using the first approach parse the XML schema and the XML documents to generate the semivalid data. The second way to obtain malformed data is to start with a known set of good data and mutate it in specific places. The authors of (Choi, Kim, & Lee, 2007) propose a methodology and an automated tool performing efficient fuzz testing for text files, such as XML or HTML files, by considering the types of values in tags. This tool, named TAFT (Tag-Aware text file Fuzz testing Tool), generates text files, extracts and analyzes their tags, builds semivalid data from random data according to the types of the tag values, inserts these data into the tag values, and then automatically executes the target software system on the fault-inserted files, observing the resulting fault states.
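A minimal sketch of the mutation approach applied to SOAP requests is given below. The payload list, file names, and mutation policy (one mutated leaf value per generated request) are illustrative choices of ours, not those of the cited tools.

import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Sketch of mutation-based fuzzing: start from a known-good SOAP request and
 *  replace one field value at a time with a hostile payload, keeping the XML
 *  well formed ("semivalid"). */
public class SoapMutationFuzzer {
    static final List<String> PAYLOADS = List.of(
            "' OR '1'='1",                    // SQL injection probe
            "]]><!--",                        // markup-confusion probe
            "A".repeat(100_000),              // oversized field
            "-1", "999999999999999999999");   // out-of-range numbers

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document base = dbf.newDocumentBuilder().parse("valid-request.xml");
        NodeList elements = base.getElementsByTagName("*");
        int n = 0;
        for (int i = 0; i < elements.getLength(); i++) {
            Node field = elements.item(i);
            // mutate only leaf elements carrying a single text value
            if (field.getChildNodes().getLength() != 1
                    || field.getFirstChild().getNodeType() != Node.TEXT_NODE)
                continue;
            String original = field.getTextContent();
            for (String payload : PAYLOADS) {
                field.setTextContent(payload);
                TransformerFactory.newInstance().newTransformer().transform(
                        new DOMSource(base),
                        new StreamResult("mutant-" + (n++) + ".xml"));
                field.setTextContent(original); // restore before next mutation
            }
        }
        System.out.println(n + " semivalid requests generated");
    }
}

Because each mutant differs from a valid request in exactly one field, a failure observed while replaying it can be attributed directly to the handling of that field.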
Injection

In general, web services can interact with a variety of systems, and for this reason they must be resistant to injection attacks when external systems are accessed or invoked. The most prevalent injection vulnerabilities include SQL injection, command injection, LDAP injection, XPath injection, and code injection. There are undoubtedly many other forms of injection, and the tester should be aware of these when testing different subsystems. Most injection vulnerabilities can be easily and quickly identified by the process of fuzzing, due to the presence of meta-characters in the various language syntaxes. Recently, SQL injection attacks have become a major threat to web applications. SQL injection occurs when a database is queried with an SQL statement which contains some user-influenced inputs that are outside the intended parameter range. If this occurs, an attacker may be able to gain control of the database and execute malicious SQL or scripts. In the following, we review some solutions to mitigate the risk posed by SQL injection vulnerabilities. The authors of (Huang, Yu, Hang, Tsai, Lee, & Kuo, 2004) secure potential vulnerabilities by combining static analysis with runtime monitoring. Their solution, WebSSARI, statically analyzes source code, finds potential vulnerabilities, and inserts runtime guards into the source code. The authors of (Fu, Lu, Peltsverger, Chen, Qian, & Tao, 2007) present a static analysis framework (called SAFELI) for discovering SQL injection vulnerabilities at compile time. SAFELI analyzes bytecode and relies on string analysis. It employs a new string analysis technique able to handle hybrid constraints that involve boolean, integer, and string variables; most popular string operations can be handled. The authors of (Halfond & Orso, 2005) secure vulnerable SQL statements by combining static analysis with statement generation and runtime monitoring. Their solution, AMNESIA, uses a model-based approach to detect illegal queries before they are
executed on the database. It analyzes a vulnerable SQL statement, generates a generalized structural model for the statement, and allows or denies each statement based on a runtime check against the statically built model.
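To make the vulnerability class concrete, the following fragment contrasts a query built by string concatenation with its parameterized counterpart; the table and column names are invented for illustration, and this is not code from the cited tools.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Illustration of the SQL injection vulnerability class discussed above. */
public class FlightDao {

    // VULNERABLE: origin = "x' OR '1'='1" changes the query's structure
    ResultSet searchUnsafe(Connection c, String origin) throws SQLException {
        Statement st = c.createStatement();
        return st.executeQuery(
                "SELECT * FROM flights WHERE origin = '" + origin + "'");
    }

    // SAFE: the user-influenced input can never become SQL syntax
    ResultSet searchSafe(Connection c, String origin) throws SQLException {
        PreparedStatement ps =
                c.prepareStatement("SELECT * FROM flights WHERE origin = ?");
        ps.setString(1, origin);
        return ps.executeQuery();
    }
}

Injection testing essentially probes whether a service behaves like the first method: a fuzzed input containing SQL meta-characters that alters the response reveals that user data is reaching the query as syntax rather than as a bound parameter.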
Policy-Based Testing

An important aspect in the security of modern information management systems is the control of accesses. Data and resources must be protected against unauthorized, malicious, or improper usage or modification. For this purpose, several standards have been introduced that guarantee authentication and authorization. Among them, the most popular are the Security Assertion Markup Language (SAML) (OASIS Assertions and Protocols for the OASIS Security Assertion Markup Language, 2005) and the eXtensible Access Control Markup Language (XACML) (OASIS eXtensible Access Control Markup Language, 2005). Access control mechanisms verify which users or processes have access to which resources in a system. To facilitate managing and maintaining access control, access control policies are increasingly written in specification languages such as XACML (OASIS eXtensible Access Control Markup Language, 2005), a platform-independent XML-based policy specification language, or RBAC (Role-Based Access Control) (Ferraiolo, Sandhu, Gavrila, Kuhn, & Chandramouli, 2001). Whenever a user requests access to a resource, that request is passed to a software component called the Policy Decision Point (PDP). A PDP evaluates the request against the specified access control policies and permits or denies the request accordingly. Policy-based testing is the testing process that ensures the correctness of policy specifications and implementations. By observing the execution of a policy implementation with a test input (i.e., an access request), testers may identify faults in policy specifications or implementations, and validate whether the corresponding output (i.e., the access decision) is as intended. Although
policy testing mechanisms vary, because there is no single standard way to specify or implement access control policies, in general the main goals of policy testing are to ensure the correctness of the policy specifications, and the conformance between the policy specifications and their implementations. The authors of (Hu, Martin, Hwang, & Xie, 2007) classify recent approaches to XACML policy specification testing into the following main categories:

• Fault models and mutation testing: there exist various basic fault models for different types of policies. Martin and Xie (Martin & Xie, 2007a) propose a fault model to describe simple faults in XACML policies. They categorize faults broadly as syntactic and semantic faults. Syntactic faults are the result of simple typos; for example, in XACML, an XML schema definition (XSD) can be used to check for obvious syntactic flaws. Semantic faults involve the logical constructs of the policy language. Based on the fault model, mutation operators are described in (Martin & Xie, 2007a) to emulate syntactic and semantic faults in policies. The authors of (Le Traon, Mouelhi, & Baudry, 2007) design mutation operators on a given Organization Based Access Control (OrBAC) model. They consider semantic and syntactic mutation operators similar to the preceding ones, in addition to other mutation operators including rule deletion, rule addition, and role change based on a role hierarchy.
• Testing criteria: testing criteria are used to determine whether sufficient testing has been conducted and it can be stopped, and to measure the degree of adequacy or sufficiency of a test suite. Among testing criteria for policy testing, there are structural coverage criteria and fault coverage criteria (Martin, Xie, & Yu, 2006). The former are defined based on observing whether each individual policy element has been evaluated when a test suite (set of requests) is evaluated by a PDP. The latter are defined based on observing whether each (seeded) potential fault is detected by the test suite.
• Test generation: to test access control policies, policy testers can manually generate test inputs (i.e., requests) to achieve high structural policy coverage and fault coverage (i.e., fault-detection capability). To reduce manual effort, automated test generation approaches can be used. These approaches for test case generation starting from XACML policies deal with random heuristics and test suite reduction (Martin, Xie, & Yu, 2006), policy-values-based approaches (Martin & Xie, 2006), and change impact analysis (Martin & Xie, 2007b). In particular, the Targen tool proposed in (Martin & Xie, 2006) derives the set of requests satisfying all the possible combinations of truth values of the attribute id-value pairs found in specific sections of an XACML policy. The Cirg tool proposed in (Martin & Xie, 2007b) is able to exploit change impact analysis for test case generation starting from the policy specification; in particular, it integrates the Margrave tool (Fisler, Krishnamurthi, Meyerovich, & Tschantz, 2005), which performs change impact analysis, so as to reach high policy structural coverage. Test generation for model-based policies derives abstract test cases directly from models, such as FSMs, that specify model-based policies. The authors of (Le Traon, Mouelhi, & Baudry, 2007) propose test generation techniques to cover rules specified in policies based on the OrBAC model (Kalam, Baida, Balbiani, Benferhat, Cuppens, Deswarte, Miège, Saurel, & Trouessin, 2003). The same authors in (Mouelhi, Le Traon, & Baudry, 2009) present a new automated approach for selecting and adapting existing functional test cases for security policy testing. The method includes a three-step technique based on mutation applied to security policies, and on Aspect-Oriented Programming (AOP) for automatically transforming functional test cases into security policy test cases.

All the existing solutions for XACML test case derivation are based on the policy specification. Policies, or more generally meta-models of security policies, provide a model that specifies the access scheme for the various actors accessing the resources of the system. This has proven effective; however, for the purpose of test case generation a second important model composes the access control mechanism: the standard format representing all the possible compliant requests. An approach fully exploiting the potential of this second model has been implemented in the X-CREATE framework (Bertolino, Lonetti, & Marchetti, 2010). Differently from existing solutions, which are based only on the policy specification, this framework exploits the special characteristic of XACML access control systems of having a unique and once-for-all specified structure of the input requests: the XACML Context Schema, which establishes the rules to which access requests must conform. This schema is used in the X-CREATE framework for deriving a universally valid conforming test suite that can then be customized to any specific policy.

Finally, we also address some works dealing with policy implementation testing. In particular, the authors of (Li, Hwang, & Xie, 2008) propose an approach to detect defects in XACML implementations by observing the behaviours of different XACML implementations (policy evaluation engines or PDPs) for the same test inputs, whereas
(Liu, Chen, Hwang, & Xie, 2008) concerns the performance of XACML request processing and presents a scheme for efficient XACML policy implementation.
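To fix the terminology used above, the following toy sketch mimics a PDP evaluating requests against a rule set with a first-applicable combining strategy. It is a deliberately simplified stand-in for a real XACML engine (no targets, conditions, obligations, or the full set of combining algorithms), and the subjects, resources, and rules are invented.

import java.util.List;

/** Toy policy decision point illustrating the request/decision vocabulary. */
public class ToyPdp {
    record Request(String subject, String resource, String action) { }
    record Rule(String subject, String resource, String action, boolean permit) { }

    static final List<Rule> POLICY = List.of(
            new Rule("doctor", "record", "read",  true),
            new Rule("doctor", "record", "write", true),
            new Rule("nurse",  "record", "read",  true));

    /** First-applicable combining: the first matching rule decides. */
    static String evaluate(Request r) {
        for (Rule rule : POLICY)
            if (rule.subject().equals(r.subject())
                    && rule.resource().equals(r.resource())
                    && rule.action().equals(r.action()))
                return rule.permit() ? "Permit" : "Deny";
        return "NotApplicable"; // no rule matched
    }

    public static void main(String[] args) {
        System.out.println(evaluate(new Request("nurse", "record", "write")));
        // -> NotApplicable: a test request exposing that the policy has no
        //    explicit rule for this case, the kind of gap that structural
        //    coverage criteria and generated request suites aim to reveal
    }
}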
CONCLUSION AND FUTURE RESEARCH ISSUES

In this chapter we overviewed recent approaches for the functional and structural testing of web services and web service compositions. We also provided an example of a combined testing strategy. In particular, we have integrated, within an automated environment, the execution of a test plan based on a functional strategy with the evaluation of its coverage measure. To the best of our knowledge, it is the first attempt at combining functional and structural strategies in a SOA environment. The results obtained showed that it is possible to achieve a coverage measure even when starting from a specification-based test plan. As expected, high coverage can be achieved both by generating the test suite by hand and automatically, but the latter requires less time and effort and is more exhaustive. This first attempt paves the way to a new field of research on the integration of different testing strategies. Finally, we also surveyed approaches to security-based validation. Due to the pervasiveness of web services in business and social life, they should guarantee the privacy and confidentiality of the information they carry. Thus, in the world of Web 2.0, where the distribution of on-line services and applications has reached a high granularity, security testing constitutes, alongside functional and structural approaches, a third complementary and essential facet of SOA validation. In particular, since most services involve money transactions, relevant personal information, and other critical contexts, they need to be tested thoroughly, and possibly fast, for quick deployment.
For this reason, an essential issue for future research is the creation of an automated testing process, covering both the generation of the test suite and the steps that the tester must perform to execute the tests. This is especially true in service compositions where, since a high number of applications is involved, a single point of failure could make the whole system collapse. Considering specifically the functional approaches, most of them are still strongly constrained by the need for a formal specification model to be provided along with the service. In many circumstances, companies do not provide such detail, and testers are often required to derive it from their own experience. In this situation, looser approaches based on interfaces (e.g., WSDL) are more suited, yet less effective. A possible field for future research could be the development of strategies located halfway between the two extremes. This way, the specification-based part of the approach would drive test case generation, whereas the interface-based part would drive test case execution. In particular, a special case of this combined vision could be to use access control policies as a refinement of the functional specification. This would produce test cases that, before being executed, must be compliant with the specified access policy constraints. The approach presented in this chapter of combining different methodologies is also an example of a direction in which research should progress in the future. Combined methodologies not only help speed up the testing process by allowing simultaneous analysis from multiple perspectives, but can also bring added value by taking the best of the individual approaches. In this chapter we presented a combination of functional and structural testing, but clearly it is not the only possibility. We plan to investigate different combination strategies, in particular considering the benefits of combining security testing with the other ones.
ACKNOWLEDGMENT

This chapter provides an overview of work that has been partially supported by the European Project FP7 IP 216287: TAS3 and by the Italian MIUR PRIN 2007 Project D-ASAP.
REFERENCES

Altova. (n.d.). XML Spy. Retrieved from http://www.altova.com/products/xmlspy/xml_editor.html

Bai, X., Dong, W., Tsai, W. T., & Chen, Y. (2005). WSDL-Based Automatic Test Case Generation for Web Services Testing. Proceedings of the IEEE International Workshop on Service-Oriented System Engineering (SOSE), Beijing, 207-212.

Bai, X., Lee, S., Tsai, W. T., & Chen, Y. (2008). Ontology-based test modeling and partition testing of web services. Proceedings of the 2008 IEEE International Conference on Web Services, IEEE Computer Society, September, Beijing, China, 465-472.

Bartel, M., Boyer, J., Fox, B., LaMacchia, B., & Simon, E. (2008). W3C Recommendation XML Signature Syntax and Processing. Retrieved from http://www.w3.org/TR/xmldsig-core/

Bartolini, C., Bertolino, A., Elbaum, S. C., & Marchetti, E. (2009). Whitening SOA testing. Proceedings of the 7th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT International Symposium on Foundations of Software Engineering, Amsterdam, The Netherlands, August 24-28, 161-170.

Bartolini, C., Bertolino, A., Marchetti, E., & Parissis, I. (2008). Data Flow-based Validation of Web Services Compositions: Perspectives and Examples. In R. de Lemos, F. Di Giandomenico, H. Muccini, C. Gacek, & M. Vieira (Eds.), Architecting Dependable Systems. Springer, LNCS 5135, 298-325.
Bartolini, C., Bertolino, A., Marchetti, E., & Polini, A. (2009). WS-TAXI: A WSDL-based testing tool for Web Services. Proceedings of the 2nd International Conference on Software Testing, Verification, and Validation (ICST 2009), Denver, Colorado, USA, April 1-4, 2009.

Benharref, A., Dssouli, R., Serhani, M., & Glitho, R. (2009). Efficient traces’ collection mechanisms for passive testing of Web Services. Information and Software Technology, 51(23), 362–374. doi:10.1016/j.infsof.2008.04.007

Bertolino, A., Frantzen, L., Polini, A., & Tretmans, J. (2006). Audition of Web Services for Testing Conformance to Open Specified Protocols. In R. Reussner, J. Stafford, & C. Szyperski (Eds.), Architecting Systems with Trustworthy Components (pp. 1-25). LNCS 3938. Springer-Verlag.

Bertolino, A., Gao, J., Marchetti, E., & Polini, A. (2007). Automatic Test Data Generation for XML Schema-based Partition Testing. Proceedings of the Second International Workshop on Automation of Software Test, International Conference on Software Engineering, IEEE Computer Society, May 20-26, 2007, Washington.

Bertolino, A., Lonetti, F., & Marchetti, E. (2010). Systematic XACML request generation for testing purposes. Accepted for publication at the 36th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA), 1-3 September 2010, Lille, France.

Bertolino, A., & Polini, A. (2005). The audition framework for testing web services interoperability. 31st EUROMICRO International Conference on Software Engineering and Advanced Applications, 134-142.

Bhargavan, K., Fournet, C., & Gordon, A. (2008). Verifying policy-based web services security. In Proc. of the 11th ACM Conference on Computer and Communications Security, 268-277.
Canfora, G., & Di Penta, M. (2009). Service-oriented architectures testing: A survey. Software Engineering Journal, 78–105. doi:10.1007/978-3-540-95888-8_4
Foster, H., Uchitel, S., Magee, J., & Kramer, J. (2003). Model-based verification of web service compositions. In Proceedings of ASE (pp. 152–163). IEEE Computer Society.
Cao, H., Ying, S., & Du, D. (2006). Towards Model-based Verification of BPEL with Model Checking. In Proceedings of the Sixth International Conference on Computer and Information Technology (CIT 2006), 20-22 September 2006, Seoul, Korea, 190-194.
Frantzen, L., & Tretmans, J. (2006). Towards Model-Based Testing of Web Services. International Workshop on Web Services - Modeling and Testing (WS-MaTe 2006), Palermo, Italy, June 9th, 67-82.

Fu, X., Bultan, T., & Su, J. (2004). Analysis of Interacting BPEL Web Services. Proceedings of the International Conference on World Wide Web, New York, USA, May 17-22.
Cao, T. D., Felix, P., & Castane, R. (2010). WSOTF: An Automatic Testing Tool for Web Services Composition. Proceedings of the Fifth International Conference on Internet and Web Applications and Services, 7-12.

Cao, T. D., Felix, P., Castane, R., & Berrada, I. (2010). Online Testing Framework for Web Services. Proceedings of the Third International Conference on Software Testing, Verification and Validation, 363-372.

Cavalli, A., Montes De Oca, E., Mallouli, W., & Lallali, M. (2008). Two Complementary Tools for the Formal Testing of Distributed Systems with Time Constraints. Proceedings of the 12th IEEE International Symposium on Distributed Simulation and Real Time Applications, Canada, Oct 27-29.
Fu, X., Lu, X., Peltsverger, B., Chen, S., Qian, K., & Tao, L. (2007). A static analysis framework for detecting SQL injection vulnerabilities. In Proc. of COMPSAC, 87-96.

García-Fanjul, J., Tuya, J., & de la Riva, C. (2006). Generating Test Cases Specifications for BPEL Compositions of Web Services Using SPIN. Proceedings of the International Workshop on Web Services Modeling and Testing (WS-MaTe 2006).

Halfond, W., & Orso, A. (2005). AMNESIA: Analysis and monitoring for neutralizing SQL injection attacks. In Proc. of the 20th IEEE/ACM International Conference on Automated Software Engineering, 174-183.
Choi, Y., Kim, H., & Lee, D. (2007). Tag-aware text file fuzz testing for security of a software system. In Proc. of the International Conference on Convergence Information Technology, 2254-259.

Hanna, S., & Munro, M. (2008). Fault-Based Web Services Testing. In Proc. of the Fifth International Conference on Information Technology: New Generations, 471-476.
Eviware. (n.d.). SoapUI. Retrieved from http://www.soapui.org/
Harel, D., & Thiagarajan, P. (2004). Message sequence charts. UML for Real, Springer, 77-105.
Ferraiolo, D., Sandhu, R., Gavrila, S., Kuhn, D., & Chandramouli, R. (2001). Proposed NIST standard for role-based access control. ACM Transactions on Information and System Security, 4(3), 224–274. doi:10.1145/501978.501980
Heckel, R., & Mariani, L. (2005, April 2 - 10). Automatic conformance testing of web services. Fundamental Approaches to Software Engineering. LNCS, 3442, 34–48.
Fisler, K., Krishnamurthi, S., Meyerovich, L. A., & Tschantz, M. C. (2005). Verification and change-impact analysis of access-control policies. In Proc. of ICSE 2005, 196-205.
Hu, V. C., Martin, E., Hwang, J., & Xie, T. (2007). Conformance checking of access control policies specified in XACML. In Proc. of the 31st Annual International Conference on Computer Software and Applications, 275-280.
Huang, Y., Yu, F., Hang, C., Tsai, C., Lee, D., & Kuo, S. (2004). Securing web application code by static analysis and runtime protection. In Proc. of 13th International Conference on World Wide Web, 40-52.
Li, Z. J., Zhu, J., Zhang, L. J., & Mitsumori, N. (2009). Towards a Practical and Effective Method for Web Services Test Case Generation. Proceedings of the ICSE Workshop on Automation of Software Test (AST), May, 106-114.
Imamura, T., Dillaway, B., & Simon, E. (2002). W3C Recommendation XML Encryption Syntax and Processing. Retrieved from http://www.w3.org/TR/xmlenc-core/
Liu, A. X., Chen, F., Hwang, J., & Xie, T. (2008). XEngine: A fast and scalable XACML policy evaluation engine. In Proc. of the International Conference on Measurement and Modeling of Computer Systems, 265-276.
Jiang, Y., Hou, S. S., Shan, J. H., Zhang, L., & Xie, B. (2005). Contract-Based Mutation for Testing Components. IEEE International Conference on Software Maintenance.

Kalam, A. A. E., Baida, R., Balbiani, P., Benferhat, S., Cuppens, F., Deswarte, Y., Miège, A., Saurel, C., & Trouessin, G. (2003). Organization based access control. In Proc. of the 4th IEEE International Workshop on Policies for Distributed Systems and Networks, 120-131.

Kim, H., Choi, Y., Lee, D., & Lee, D. (2008). Practical Security Testing using File Fuzzing. In Proc. of the 10th International Conference on Advanced Communication Technology, 2, 1304-1307.

Lallali, M., Zaidi, F., & Cavalli, A. (2008). Transforming BPEL into Intermediate Format Language for Web Services Composition Testing. Proceedings of the 4th IEEE International Conference on Next Generation Web Services Practices, October.

Le Traon, Y., Mouelhi, T., & Baudry, B. (2007). Testing security policies: Going beyond functional testing. In Proc. of the 18th IEEE International Symposium on Software Reliability, 93-102.

Lee, D., Netravali, A., Sabnani, K., Sugla, B., & John, A. (1997). Passive testing and applications to network management. In Proc. of the International Conference on Network Protocols, 113-122.

Li, N., Hwang, J., & Xie, T. (2008). Multiple-implementation testing for XACML implementations. In Proc. of the Workshop on Testing, Analysis and Verification of Web Software, 27-33.
Lyndsay, J. (2003). A positive view of negative testing. Retrieved from http://www.workroomproductions.com/papers.html

Ma, C., Du, C., Zhang, T., Hu, F., & Cai, X. (2008). WSDL-based automated test data generation for web service. 2008 International Conference on Computer Science and Software Engineering, Wuhan, China, 731-737.

Mallouli, W., Bessayah, F., Cavalli, A., & Benameur, A. (2008). Security Rules Specification and Analysis Based on Passive Testing. In Proc. of the IEEE Global Telecommunications Conference, 1-6.

Marconi, A., Pistore, M., & Traverso, P. (2006). Specifying data-flow requirements for the automated composition of web services. Proceedings of the Fourth IEEE International Conference on Software Engineering and Formal Methods (SEFM 2006), 11-15 September 2006, Pune, India, 147-156.

Martin, E., & Xie, T. (2006). Automated test generation for access control policies. In Supplemental Proc. of ISSRE.

Martin, E., & Xie, T. (2007a). A fault model and mutation testing of access control policies. In Proc. of the 16th International Conference on World Wide Web, 667-676.

Martin, E., & Xie, T. (2007b). Automated test generation for access control policies via change-impact analysis. In Proc. of SESS 2007, 5-11.
Martin, E., Xie, T., & Yu, T. (2006). Defining and measuring policy coverage in testing access control policies. In Proc. of ICICS, 139-158.

Mouelhi, T., Le Traon, Y., & Baudry, B. (2009). Transforming and selecting functional test cases for security policy testing. In Proc. of the International Conference on Software Testing Verification and Validation, 171-180.

Narayanan, S., & McIlraith, S. (2003). Analysis and Simulation of Web Services. Journal of Computer Networks, 42(5), 675–693. doi:10.1016/S1389-1286(03)00228-7

Nyman, J. (2008). Positive and Negative Testing. GlobalTester, TechQA, 15(5). Retrieved from http://www.sqatester.com/methodology/PositiveandNegativeTesting.htm

OASIS. (2005). Assertions and Protocols for the OASIS Security Assertion Markup Language (SAML) V2.0. Retrieved from http://docs.oasis-open.org/security/saml/v2.0/

OASIS. (2005). eXtensible Access Control Markup Language (XACML) version 2.0. Retrieved from http://docs.oasis-open.org/xacml/2.0/access_control-xacml-2.0-core-spec-os.pdf

OASIS. (2006). Standard Specification Web Services Security: SOAP Message Security 1.1 (WS-Security 2004). Retrieved from http://www.oasis-open.org/committees/download.php/16790/wss-v1.1-spec-os-SOAPMessageSecurity.pdf

OASIS. (2007). WSBPEL: Web Services Business Process Execution Language Version 2.0. Retrieved from http://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.pdf

Oehlert, P. (2005). Violating assumptions with fuzzing. IEEE Security and Privacy, 3(2), 58–62. doi:10.1109/MSP.2005.55
Offutt, J., & Xu, W. (2004). Generating test cases for web services using data perturbation. SIGSOFT Software Engineering Notes, 29(5), 1–10. doi:10.1145/1022494.1022529

Parasoft. (n.d.). Retrieved from http://www.parasoft.com/jsp/home.jsp

Pistore, M., Marconi, A., Bertoli, P., & Traverso, P. (2005). Automated Composition of Web Services by Planning at the Knowledge Level. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1252-1259.

PushToTest. (n.d.). TestMaker. Retrieved from http://www.pushtotest.com/

Sneed, H. M., & Huang, S. (2006). WSDLTest - a tool for testing web services. Proceedings of the IEEE International Symposium on Web Site Evolution, Philadelphia, PA, USA, IEEE Computer Society, 12-21.

Sutton, M., Greene, A., & Amini, P. (2007). Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley.

Tsai, W. T., Bai, X., Paul, R., Feng, K., & Yu, L. (2002). Scenario-Based Modeling and Its Applications. IEEE WORDS.

Tsai, W. T., Paul, R., Song, W., & Cao, Z. (2002). Coyote: An XML-based Framework for Web Services Testing. 7th IEEE International Symposium on High Assurance Systems Engineering (HASE 2002).

Wang, Y., Bai, X., Li, J., & Huang, R. (2007). Ontology-based test case generation for testing web services. Proceedings of the Eighth International Symposium on Autonomous Decentralized Systems, Sedona, AZ, USA, March 2007, IEEE Computer Society, 43–50.

Weiss, M., Esfandiari, B., & Luo, Y. (2007). Towards a classification of web service feature interactions. Computer Networks, 51(2), 359–381. doi:10.1016/j.comnet.2006.08.003
Wong, C., & Grzelak, D. (2006). A Web Services Security Testing Framework. SIFT: Information Security Services. Retrieved from http://www.sift.com.au/assets/downloads/SIFT-Web-Services-Security-Testing-Framework-v1-00.pdf

WorldTravel. (2008). Retrieved from http://www.cc.gatech.edu/systems/projects/WorldTravel

WSDL. (2007). Web Services Description Language (WSDL) Version 2.0. Retrieved from http://www.w3.org/TR/wsdl20/

Yan, J., Li, Z., Yuan, Y., Sun, W., & Zhang, J. (2006). BPEL4WS Unit Testing: Test Case Generation Using a Concurrent Path Analysis Approach. Proceedings of the 17th International Symposium on Software Reliability Engineering (ISSRE 2006), 7-10 November, Raleigh, North Carolina, USA, 75-84.

Yang, Y., Tan, Q. P., Yong, X., Liu, F., & Yu, J. (2006). Transform BPEL Workflow into Hierarchical CP-Nets to Make Tool Support for Verification. Proceedings of Frontiers of WWW Research and Development - APWeb 2006, 8th Asia-Pacific Web Conference (pp. 275-284), Harbin, China, January 16-18, 2006.

Yuan, Y., Li, Z., & Sun, W. (2006). A Graph-Search Based Approach to BPEL4WS Test Generation. In Proceedings of the International Conference on Software Engineering Advances (ICSEA 2006), October 28 - November 2, Papeete, Tahiti, French Polynesia.
Zheng, Y., Zhou, J., & Krause, P. (2007). An Automatic Test Case Generation Framework for Web Services. Journal of Software, 2(3), September.
KEY TERMS AND DEFINITIONS

SOA Testing: The set of testing approaches and methodologies focused on the verification and validation of SOA-specific aspects.

Structural Testing: Also called white-box testing; requires complete access to the object’s structure and internal data, which means visibility of the source code.

Functional Testing: Also called black-box testing; relies on the input/output behaviour of the system.

Security Testing: Verification of security-related aspects of the application.
ENDNOTE

1. The modified version of WorldTravel can be requested from the authors by an email to [email protected].
Chapter 18

Detecting Vulnerabilities in Web Services: Can Developers Rely on Existing Tools?

Nuno Antunes, University of Coimbra, Portugal
Marco Vieira, University of Coimbra, Portugal

DOI: 10.4018/978-1-60960-794-4.ch018
ABSTRACT

Although web services are becoming business-critical components, they are often deployed with software bugs that can be maliciously exploited. Numerous developers are not specialized in security, and the common time-to-market constraints limit in-depth testing for vulnerabilities. In this context, vulnerability detection tools have a very important role in helping developers to produce less vulnerable code. However, developers usually select a tool to use and rely on its results without knowing its real effectiveness. This chapter presents two case studies on the effectiveness of several well-known vulnerability detection tools and discusses their strengths and limitations. Based on lessons learned, the chapter also proposes a benchmarking technique that can be used to select the tool that best fits a specific scenario. The main goal is to provide web service developers with information on how much they can rely on widely used vulnerability detection tools and on how to select the most adequate tool.
INTRODUCTION

Ranging from on-line stores to large corporations, web services are increasingly becoming a strategic vehicle for data exchange and content distribution (Chappell & Jewell, 2002). As addressed in
Chapter 15, web services are so widely exposed that any existing security vulnerability will most probably be uncovered and exploited by hackers. Moreover, hackers are moving their focus to applications’ code, often improperly implemented, searching for vulnerabilities by exploring applications’ inputs with specially tampered values. These values can take advantage of existing vulnerabilities, representing considerable danger to the application’s owner, for instance by giving an attacker access to read, modify, or destroy reserved resources. To prevent vulnerabilities, developers must apply best coding practices, perform security reviews, execute penetration testing, use code vulnerability detectors, etc. Still, many times developers focus on the implementation of functionalities and on satisfying the customer’s requirements, and disregard security aspects. Additionally, most developers are not security specialists, and the common time-to-market constraints limit an in-depth search for vulnerabilities. In this context, vulnerability detection tools have a very important role in helping developers to produce less vulnerable code. Although there are several techniques for vulnerability detection in web applications (see Chapters 7 and 15), in practice there are two main approaches to test web services for vulnerabilities (Stuttard & Pinto, 2007):

• Static code analysis: a “white-box” approach that consists of the analysis of the web application source code. This can be done manually or by using automatic tools. The problem is that exhaustive source code analysis may be difficult and may not find all security flaws due to the complexity of the code.

• Penetration testing: a “black-box” approach that consists of the analysis of the web application execution in search for vulnerabilities. In this approach, the scanner (either a human or a software tool) does not know the internals of the web application and it uses fuzzing techniques over the web HTTP requests (a minimal sketch of this idea follows the list).
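To make the second approach concrete, the sketch below shows, in Java, the core loop of a black-box probe: it submits a few classic SQL Injection payloads to one input of a SOAP operation and applies a crude error-based heuristic to the responses. The endpoint, the envelope template, and the heuristic strings are illustrative assumptions only; real scanners derive the requests from the WSDL and use far richer detection logic.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Minimal black-box probe: sends tampered values to one web service input. */
public class SqlInjectionProbe {

    // Hypothetical endpoint and SOAP template; a real scanner derives these from the WSDL.
    private static final String ENDPOINT = "http://localhost:8080/services/PhoneDir";
    private static final String TEMPLATE =
        "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
      + "<soap:Body><getEntry><name>%s</name></getEntry></soap:Body></soap:Envelope>";

    // Classic fuzzing payloads; real tools use far larger, mutation-based sets.
    private static final List<String> PAYLOADS =
        List.of("' OR '1'='1", "'; DROP TABLE users--", "\" OR \"\"=\"");

    public static void main(String[] args) throws IOException {
        for (String payload : PAYLOADS) {
            String response = post(String.format(TEMPLATE, payload));
            // Crude heuristic: a database error in the response suggests the value
            // reached the SQL engine unescaped (a candidate vulnerability, not proof).
            if (response.contains("SQLException") || response.contains("syntax error")) {
                System.out.println("Possible SQL Injection with payload: " + payload);
            }
        }
    }

    private static String post(String body) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(ENDPOINT).openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        con.setDoOutput(true);
        con.getOutputStream().write(body.getBytes(StandardCharsets.UTF_8));
        InputStream in = con.getResponseCode() < 400 ? con.getInputStream() : con.getErrorStream();
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}
```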
Due to time constraints or resource limitations, developers frequently have to select a specific tool
from the large set of tools available and strongly rely on that tool to detect potential security problems in the code being developed. The problem is that this is usually done without really knowing how good each tool is. This way, developers urgently need a practical approach that helps them compare alternative tools concerning their ability to detect vulnerabilities. This chapter introduces some penetration testing and static code analysis tools that are widely used to detect vulnerabilities in web services. The strengths and limitations of both techniques are discussed in the context of the results of two case studies, conducted to understand the effectiveness of the existing tools in real scenarios. Based on the lessons learned, we then present a benchmarking technique that can be applied to select the best tools for specific development settings (i.e., according to the objectives of the developers). The main goal is to provide web service developers with information on how much they can rely on the existing vulnerability detection tools and on how to select and use those tools to obtain the most benefit possible.
VULNERABILITY DETECTION TECHNIQUES AND TOOLS

Penetration testing and static code analysis are the two techniques most used by web service developers to detect security vulnerabilities in their code (Stuttard & Pinto, 2007). Penetration testing consists of stressing the application from the point of view of an attacker (“black-box” approach) using specific malicious inputs. On the other hand, static code analysis is a “white-box” approach that consists of analyzing the source code of the application (without executing it) looking for potential vulnerabilities (among other types of software defects). Both penetration testing and static code analysis can be performed manually or automatically. However, automated tools are the
typical choice as, compared to manual tests and inspection, execution time and cost are considerably lower. Penetration testing tools are also referred to as web vulnerability scanners. Web vulnerability scanners provide an automatic way of searching for vulnerabilities, avoiding the repetitive and tedious task of doing hundreds or even thousands of tests by hand for each vulnerability type. Most of these scanners are commercial tools, but there are also some free application scanners (often with limited use, since they lack most of the functionalities of their commercial counterparts). Two very popular free security scanners that support web services testing are Foundstone WSDigger (Foundstone, Inc., 2005) and WSFuzzer (OWASP Foundation, 2008). WSDigger is a free open source tool developed by Foundstone that executes automated penetration testing in web services. This tool contains sample attack plugins for SQL Injection, cross-site scripting (XSS), and XPath Injection. WSFuzzer is a free open source program that mainly targets HTTP-based SOAP services. This tool was created by automating real-world manual SOAP penetration testing work. The main problem of both tools is that, in fact, they do not detect vulnerabilities: they attack the web service and log the responses, leaving to the user the task of examining those logs and identifying the vulnerabilities. This requires the user to be an “expert” in security and to spend a huge amount of time examining all the results. As for commercial scanners, three brands lead the market:

• HP WebInspect: this tool “performs web application security testing and assessment for today’s complex web applications, built on emerging Web 2.0 technologies” (HP, 2008). This tool includes pioneering assessment technology, including simultaneous crawl and audit (SCA) and concurrent application scanning. It is a broad application that can be applied for penetration testing in web-based applications.

• IBM Rational AppScan: this “is a leading suite of automated Web application security and compliance assessment tools that scan for common application vulnerabilities” (IBM, 2008). This tool is suitable for users ranging from non-security experts to advanced users that can develop extensions for customized scanning environments. IBM Rational AppScan can be used for penetration testing in web applications, including web services.

• Acunetix Web Vulnerability Scanner: this “is an automated web application security testing tool that audits web applications by checking for exploitable hacking vulnerabilities” (Acunetix, 2008). Acunetix WVS can be used to execute penetration testing in web applications or web services and is quite simple to use and configure.

Static code analyzers examine the code without actually executing it (Livshits & Lam, 2005). The analysis performed by existing tools varies depending on their sophistication, ranging from tools that consider only individual statements and declarations to others that consider the complete code. Among other usages (e.g., model checking and data flow analysis), these tools provide an automatic way of highlighting possible coding errors (a sketch of a typical flagged pattern follows the list below). The following paragraphs briefly introduce some of the most used and well-known static code analyzers, including both commercial and free tools:

• FindBugs: open source tool that “uses static analysis to look for bugs in Java code” (University of Maryland, 2009). FindBugs is composed of various detectors, each one specialized in a specific pattern of bugs. The detectors use heuristics to search the bytecode of Java applications for these patterns and classify them according to categories and priorities.

• Yasca (Yet Another Source Code Analyzer): “framework for conducting source code analyses” (Scovetta, 2008) in a wide range of programming languages, including Java. Yasca is a free tool that includes two components: the first is a framework for conducting source code analyses and the second is an implementation of that framework that allows integration with other static code analyzers (e.g., FindBugs, PMD, and Jlint).

• Fortify 360: suite of tools for vulnerability detection commercialized by Fortify Software (Fortify Software, 2008). The module Fortify Source Code Analyzer performs static code analysis. According to Fortify, it is able to identify the root cause of potentially exploitable security vulnerabilities in source code. It supports scanning of a wide variety of programming languages and platforms, and can be used in several integrated development environments.

• IntelliJ IDEA: commercial tool that provides a powerful IDE for Java development and includes “inspection gadgets” plug-ins with automated code inspection functionalities (JetBrains, 2009). IntelliJ IDEA is able to detect security issues in Java source code.
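What these analyzers actually search for is easy to illustrate. The hypothetical JDBC fragment below contrasts the code pattern typically reported as a SQL Injection vulnerability (concatenating an input into a query string) with the parameterized variant that the same detectors normally accept; the table and column names are made up for the example.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class PaymentDao {

    /** Flagged by typical analyzers: user input concatenated into the SQL string. */
    public double unsafeBalance(Connection con, String customerId) throws SQLException {
        Statement st = con.createStatement();
        // "customerId" flows straight into the query; a value like "' OR '1'='1"
        // would change the meaning of the command.
        ResultSet rs = st.executeQuery(
            "SELECT balance FROM accounts WHERE customer_id = '" + customerId + "'");
        return rs.next() ? rs.getDouble(1) : 0.0;
    }

    /** Usually accepted: the input is bound as a parameter, never parsed as SQL. */
    public double safeBalance(Connection con, String customerId) throws SQLException {
        PreparedStatement ps =
            con.prepareStatement("SELECT balance FROM accounts WHERE customer_id = ?");
        ps.setString(1, customerId);
        ResultSet rs = ps.executeQuery();
        return rs.next() ? rs.getDouble(1) : 0.0;
    }
}
```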
UNDERSTANDING THE EFFECTIVENESS OF VULNERABILITY DETECTION TOOLS

In this section we present two case studies that demonstrate the effectiveness of vulnerability detection tools, focusing on the two most used techniques: penetration testing and static code analysis. Both case studies consisted of four steps:

1. Preparation: select the vulnerability detection tools and the web services being tested.
2. Execution: use the tools to scan or analyze the services to identify potential vulnerabilities.
3. Verification: perform manual analysis to confirm the existing vulnerabilities and discard false positives (i.e., vulnerabilities incorrectly reported by the tools).
4. Analysis: analyze the results obtained and systematize the lessons learned.

In both studies, vulnerability detection tools are characterized using two key measures of interest: coverage and false positive rate. The first portrays the percentage of existing vulnerabilities that are detected by a given tool, while the second represents the number of reported vulnerabilities that in fact do not exist. In practice, the goal of the two case studies is to answer the question: can developers rely on existing vulnerability detection tools?
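Once the verification step has labeled each report, both measures reduce to simple set arithmetic. A minimal sketch, using made-up vulnerability identifiers:

```java
import java.util.HashSet;
import java.util.Set;

public class ToolMeasures {
    public static void main(String[] args) {
        // Hypothetical labels produced by the manual verification step.
        Set<String> reported = Set.of("v1", "v2", "v3", "v4");       // what the tool claimed
        Set<String> existing = Set.of("v1", "v2", "v5", "v6", "v7"); // confirmed real

        Set<String> truePositives = new HashSet<>(reported);
        truePositives.retainAll(existing); // reported AND real

        double coverage = (double) truePositives.size() / existing.size();
        int falsePositives = reported.size() - truePositives.size();

        System.out.printf("coverage = %.0f%%, false positives = %d%n",
                100 * coverage, falsePositives);
        // -> coverage = 40%, false positives = 2
    }
}
```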
CASE STUDY #1: APPLYING PENETRATION TESTING TO PUBLIC WEB SERVICES

The first case study consisted of detecting vulnerabilities in a set of 300 publicly available web services. Obviously, as in this scenario it is not possible to have access to the source code, the experiment was limited to penetration testing. Four commercial and well-known penetration testing tools were used, including two different versions of a specific brand: HP WebInspect, IBM Rational AppScan, and Acunetix Web Vulnerability Scanner (these tools are frequently referred to by vendors as vulnerability scanners). For the results presentation we have decided not to mention the brands and versions of the tools, to assure neutrality and because commercial licenses do not allow, in general, the publication of tool evaluation results. This way, the tools are referred to in the rest of this section as VS1.1, VS1.2, VS2, and VS3 (in no particular order). Vulnerability scanners
VS1.1 and VS1.2 refer to the two versions of the same product (VS1.2 is the newer version and is built on top of VS1.1). The web services selection was as random as possible. The initial step was to identify a large set of web services. The first source was a web site that lists publicly available web services (http://xmethods.net/). From this list we selected a first set of 450 web services. Then we used the Web Services Search Engine (http://webservices.seekda.com/) to discover additional services. seekda is a portal that enables searching for public web services based on the services’ description. To discover services we used a large set of generic keywords, ranging from popular tags (e.g., business, tourism, commercial, university) to queries with countries (e.g., country:AR, country:PT, country:US) and keywords related to company names (e.g., Oracle, Sun, Microsoft, Google, Acunetix). The resulting list included 6180 web services. As it was not possible to test such a large set of services (scanning some of the services takes more than 15 minutes, which makes testing 6180 an impractical quest), we decided to randomly select 300 services from this list (there is no particular reason for having selected 300 services, besides the time we had available for the experiments). However, many of the services selected had to be discarded for different reasons, namely: invalid/malformed WSDL (117 services discarded), unable to retrieve WSDL (30 services discarded), no methods found (20 services discarded), authentication required (40 services discarded), unhandled exception (19 services discarded), communication errors (50 services discarded), scanning problems (24 services could not be tested by the scanners; scanning terminated without reporting any specific error or warning message), and excessive (more than two hours) testing duration (15 services discarded due to network and application server constraints). The services discarded were replaced by others randomly selected from the initial list. Due to space restrictions, the final set of web services tested is not included in this chapter (it is available at (Antunes, Vieira, & Madeira, 2008a) together with additional details and results).
Overall Results Analysis

Table 1 presents the overall results of the study. For each penetration testing tool, the table presents the total number of vulnerabilities and the number of services in which those vulnerabilities were found. As shown, the penetration testing tools identified six different types of vulnerabilities (see (Stuttard & Pinto, 2007) for details on these vulnerabilities). As we can see in Table 1, different penetration testing tools report different types of vulnerabilities. This is a first indicator that tools implement different forms of penetration tests and that the results from different tools may be difficult to compare.

Table 1. Overall results for penetration testing over public web services

Vulnerability Types | VS1.1 (#Vuln. / #WS) | VS1.2 (#Vuln. / #WS) | VS2 (#Vuln. / #WS) | VS3 (#Vuln. / #WS)
SQL Injection | 217 / 38 | 225 / 38 | 25 / 5 | 35 / 11
XPath Injection | 10 / 1 | 10 / 1 | 0 / 0 | 0 / 0
Code Execution | 1 / 1 | 1 / 1 | 0 / 0 | 0 / 0
Possible Parameter Based Buffer Overflow | 0 / 0 | 0 / 0 | 0 / 0 | 4 / 3
Possible Username or Password Disclosure | 0 / 0 | 0 / 0 | 0 / 0 | 47 / 3
Possible Server Path Disclosure | 0 / 0 | 0 / 0 | 0 / 0 | 17 / 5
Total | 228 / 40 | 236 / 40 | 25 / 5 | 103 / 22

Some additional observations are:

• Tools VS1.1 and VS1.2 (two different versions of the same brand) are the only ones that detected XPath Injection vulnerabilities. An important aspect is that, when compared to SQL Injection, the number of XPath-related vulnerabilities is quite small. In fact, XPath vulnerabilities were detected in a single service, indicating that most web services make use of a database instead of XML documents to store information.

• Tools VS1.1 and VS1.2 detected a code execution vulnerability. This is a particularly critical vulnerability that allows attackers to execute code on the server. After discovering this vulnerability we performed some manual tests that confirmed the possibility of executing operating system commands (e.g., ‘cat /etc/passwd’, ‘ls -la’) and getting the corresponding answer in a readable format.

• VS3 was the only one identifying vulnerabilities related to buffer overflow, username and password disclosure, and server path disclosure.

• SQL Injection is the only type of vulnerability that was detected by all four tools. However, different tools reported different vulnerabilities in different web services. In fact, the number of SQL Injection vulnerabilities reported by VS1.1 and VS1.2 is much higher than the number detected by VS2 and VS3.

False Positives Analysis

The results presented so far do not consider false positives (i.e., situations where tools detected a vulnerability that in reality does not exist). However, it is well known that false positives are very difficult to avoid. This way, we decided to
manually confirm the existence (or not) of each vulnerability detected. Confirming the existence of a vulnerability without having access to the source code is quite difficult. Thus, we defined a set of rules and corresponding checks to classify the vulnerabilities detected by the penetration testing tools into three groups: a) false positives, b) confirmed vulnerabilities, and c) doubtful. Detected vulnerabilities were classified as false positives if meeting one of the following cases:

• For SQL Injection vulnerabilities, if the error/answer obtained is related to an application robustness problem and not to a SQL command (e.g., a NumberFormatException).

• The error/value in the web service response is not caused by the elements “injected” by the scanner. In other words, the same problem occurs when the service is executed with valid inputs.

• For path and username/password disclosure, the information returned by the service is equal to the information submitted by the client (e.g., the vulnerability scanner) when invoking the web service. In other words, there is no information disclosure.

Detected vulnerabilities were classified as confirmed vulnerabilities if satisfying one of the following conditions:

• For SQL Injection vulnerabilities, if it is possible to observe that the SQL command executed was invalidated by the values “injected” by the scanner (or manually). This is possible if the SQL command or part of it is included in the web service response (e.g., stack trace).

• For SQL Injection vulnerabilities, if the “injected” values lead to exceptions raised by the database server.

• If it is possible to access unauthorized services or web pages (e.g., by breaking the authentication process using SQL Injection).

• For code execution, if it is possible to execute operating system commands or any other type of code (e.g., Java, Perl).

• For path disclosure and username/password disclosure, if it is possible to observe the location of folders and files in the server or the username/password being used.

• For XPath Injection, if the “injected” values lead to exceptions raised by the XPath parser.

• For buffer overflow, if the server does not answer the request or raises an exception specifically related to buffer overflow.
If none of these rules can be applied then there is no way to confirm whether a vulnerability really exists or not. These cases were classified as doubtful. Figure 1 shows the results for SQL Injection vulnerabilities.
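Rules of this kind can be read as a small decision procedure. The sketch below encodes a few of them for SQL Injection reports; the string checks are naive stand-ins for the manual analysis that was actually performed, and the response texts in the usage example are hypothetical.

```java
public class VulnerabilityTriage {

    enum Verdict { FALSE_POSITIVE, CONFIRMED, DOUBTFUL }

    /**
     * Applies a subset of the classification rules to one SQL Injection report.
     * @param injectedResponse  service response to the "injected" request
     * @param cleanResponse     response to the same request with valid inputs
     */
    static Verdict classify(String injectedResponse, String cleanResponse) {
        // Rule: robustness problem, not SQL (e.g., NumberFormatException) -> false positive.
        if (injectedResponse.contains("NumberFormatException")) {
            return Verdict.FALSE_POSITIVE;
        }
        // Rule: the same problem occurs with valid inputs -> not caused by the injection.
        if (injectedResponse.equals(cleanResponse)) {
            return Verdict.FALSE_POSITIVE;
        }
        // Rule: database exception, or SQL text echoed back in the response -> confirmed.
        if (injectedResponse.contains("SQLException")
                || injectedResponse.contains("SELECT ")) {
            return Verdict.CONFIRMED;
        }
        return Verdict.DOUBTFUL; // none of the rules applied
    }

    public static void main(String[] args) {
        System.out.println(classify("java.sql.SQLException: unterminated string", "OK"));
        System.out.println(classify("NumberFormatException: for input string", "OK"));
    }
}
```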
As shown, the number of vulnerabilities that we were not able to confirm (doubtful cases) is low for VS1.1, VS1.2, and VS3 (always less than 15%), but considerably high for VS2 (32%). This means that the false positive results are relatively accurate for the first three tools, but it is an optimistic figure (zero false positives) for scanner VS2. Obviously, we can also read the false positive results shown in Figure 1 as a range, going from an optimistic value (confirmed false positives) to a pessimistic value (confirmed false positives + doubtful cases). The number of (confirmed) false-positives is high for scanners VS1.1 and VS1.2, and is also high for VS3, in relative terms. Scanner VS2 shows zero confirmed false positives, but it detected a fair percentage (8 out of 25) of vulnerabilities that were classified as doubtful, thus a pessimistic interpretation of results is that 8 out of 25 vulnerabilities may be false positives. Obviously, the low number of vulnerabilities detected by VS2 and VS3 (25 and 35 respectively) also limits the absolute number of false positives.
Figure 1. False positives observed for SQL Injection in the public web services
Table 2 presents the false positive results for the other vulnerabilities. In this case, we were able to confirm the existence (or inexistence) of all vulnerabilities and no doubts remained. An interesting aspect is that all XPath Injection and Code Execution vulnerabilities were confirmed. On the other hand, all vulnerabilities related to username and password disclosure were in fact false positives (in all cases the username/password information returned is equal to the one sent by the scanner). Due to the large percentage of SQL Injection vulnerabilities observed, we decided to look at them in more detail. Figure 2 presents the SQL Injection vulnerabilities intersections after removing the false positives. The doubtful situations
were in this case considered as existing vulnerabilities (i.e., an optimistic assumption from the point of view of the penetration testing tools’ effectiveness). The areas of the circles are roughly proportional to the number of vulnerabilities detected by the respective tool. The same does not happen with the intersection areas, as it would be impossible to represent them graphically. Results clearly show that, even if we manually remove the false positives, the four tools report different vulnerabilities. An interesting result is that three vulnerabilities were detected by VS1.1 and were not detected by VS1.2 (the newer version of the scanner). The reverse also happens for 15 vulnerabilities, which is expected, as a newer version is anticipated to detect more vulnerabilities than an older one (but that should happen without missing any of the vulnerabilities identified by the older version, which was not the case). These results caught our attention and we tried to identify the reasons. After analyzing the detailed results we concluded that all of these 18 vulnerabilities are in the group of the doubtful ones (maybe they are really false positives, but we were not able to demonstrate that), preventing us from drawing a definitive conclusion.

Table 2. False positives in the public web services

Vulnerability | Penetration Testing Tools | Confirmed | F. P.
XPath Injection | VS1.1 & VS1.2 | 10 | 0
Code Execution | VS1.1 & VS1.2 | 1 | 0
Buffer Overflow | VS3 | 1 | 3
Username/Password Disclosure | VS3 | 0 | 47
Server Path Disclosure | VS3 | 16 | 1

Figure 2. SQL Injection vulnerabilities without false positives in the public web services
Coverage Analysis

A key aspect is to understand the coverage of the vulnerabilities detected. Coverage compares the number of vulnerabilities detected against the total number of vulnerabilities. Obviously, in our case it is impossible to know how many vulnerabilities were not disclosed by any of the tools (we do not have access to the source code). Thus, it is not possible to calculate the coverage. However, it is still possible to make a relative comparison based on the data available. In practice, we know the total number of vulnerabilities detected (which corresponds to the union of the vulnerabilities detected by the four penetration testing tools after removing the false positives) and the number of vulnerabilities detected by each individual penetration testing tool. Based on this information it is possible to get an optimistic coverage estimation for each tool (i.e., the real coverage will be lower than the value presented). Obviously, this is relevant only for SQL Injection vulnerabilities as it is the only type that is detected by all the penetration testing tools. Table 3 presents the coverage results. As shown, 149 different SQL Injection vulnerabilities were detected (as before, we decided to include the doubtful situations as existing vulnerabilities). Each scanner detected a subgroup of these vulnerabilities, resulting in partial detection coverage. VS1.1 and VS1.2 present quite good results. On
the other hand, the coverage of VS2 and VS3 is very low.
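In code, this relative estimation amounts to dividing each tool’s confirmed detections by the union of all confirmed detections. A minimal sketch with placeholder identifiers:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RelativeCoverage {
    public static void main(String[] args) {
        // Hypothetical confirmed SQL Injection findings per scanner.
        Map<String, Set<String>> byTool = Map.of(
            "VS1.1", Set.of("a", "b", "c"),
            "VS1.2", Set.of("a", "b", "c", "d"),
            "VS2",   Set.of("a"),
            "VS3",   Set.of("b", "e"));

        // Union of everything any tool found; the true total can only be larger,
        // so each percentage below is an optimistic estimate.
        Set<String> union = new HashSet<>();
        byTool.values().forEach(union::addAll);

        byTool.forEach((tool, found) -> System.out.printf("%s: %.1f%%%n",
                tool, 100.0 * found.size() / union.size()));
    }
}
```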
Table 3. Coverage for SQL Injection in the public web services

Scanner | # SQL Injection Vulnerabilities | Coverage %
VS1.1 | 130 | 87.2%
VS1.2 | 142 | 95.3%
VS2 | 25 | 16.8%
VS3 | 26 | 17.4%
Total | 149 | 100.0%

CASE STUDY #2: APPLYING PENETRATION TESTING AND STATIC CODE ANALYSIS TO HOME-IMPLEMENTED WEB SERVICES

The second case study consisted of detecting vulnerabilities in a set of eight home-implemented web services providing a total of 28 operations (see Table 4). As we have access to the source code of the services tested, in this scenario we could apply both penetration testing and static code analysis. Thus, besides the penetration testing tools used in the previous case study (i.e., HP WebInspect, IBM Rational AppScan, and Acunetix Web Vulnerability Scanner), three widely used static code analyzers that provide the capability of detecting vulnerabilities in Java applications’ source or bytecode have been considered, namely: FindBugs, Yasca, and IntelliJ IDEA. Note that we have selected analyzers that focus on the Java language, as this language is nowadays largely used for the development of web applications, but any other language could have been considered. In the rest of this section we refer to these tools as VS1.2, VS2, VS3, SA1, SA2, and SA3, in no particular order (for the penetration testing tools we use the same acronyms as in the previous study). Regarding the tools’ configuration, three aspects should be emphasized:
• Before running a penetration testing tool over a given service, the underlying database was restored to a predefined state. This avoids the cumulative effect of previous tests and guarantees that all the tools started the service testing in a consistent state (a sketch of this step follows the list).

• If allowed by the penetration testing tool, information about the domain of each parameter was provided. If the tool requires the user to set an exemplar invocation per operation, the exemplar respected the input domains of the operation. All the tools in this situation used the same exemplar.

• The static analyzers were configured to fully analyze the services’ code. For the analyzers that use binary code, the deployment-ready version was used.
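The first of these rules can be automated with plain JDBC, as in the sketch below; the connection URL and the reset script are hypothetical, and any DBMS-level snapshot/restore mechanism would serve the same purpose.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class DatabaseReset {

    /** Replays a reset script so every tool starts scanning from the same state. */
    public static void restore(String jdbcUrl, Path resetScript)
            throws SQLException, IOException {
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             Statement st = con.createStatement()) {
            // Assumed convention: one SQL statement per line of the script.
            for (String sql : Files.readAllLines(resetScript)) {
                if (!sql.isBlank()) {
                    st.execute(sql);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        restore("jdbc:hsqldb:mem:tpcapp", Path.of("reset-state.sql")); // hypothetical
        // ... launch the penetration testing tool against the service here ...
    }
}
```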
Four of the web services tested implement a subset of the services specified by the standard TPC-App performance benchmark (Transaction Processing Performance Council, 2008). Four other services have been adapted from code publicly available on the Internet (Exhedra Solutions, Inc., 2009). These eight services are implemented in Java and use a relational database to store data and SQL commands to manage it. Table 4 characterizes the web services (the source code can be found at (Antunes, Vieira, & Madeira, 2008b)),
including the number of operations per service (#Op), the total lines of code (LoC) per service, the average number of lines of code per operation (LoC/Op), and the average cyclomatic complexity (Lyu, 1996) of the operations (Avg. C.). These indicators were calculated using SourceMonitor (Campwood Software, 2008) and due to space constraints we do not discuss this characterization (the information provided is quite intuitive).
Table 4. Home-implemented web services characterization

Group | Service | Short Description | #Op | LoC | LoC/Op | Avg. C.
TPC-App | ProductDetail | Get details about a product | 1 | 105 | 105.0 | 6.0
TPC-App | NewProducts | Add new product to the database | 1 | 136 | 136.0 | 6.0
TPC-App | NewCustomer | Add new customer to the database | 1 | 184 | 184.0 | 9.0
TPC-App | ChangePaymentMethod | Change customer’s payment method | 1 | 97 | 97.0 | 11.0
Public-Code | JamesSmith | Manages personal data about students | 5 | 270 | 54.0 | 6.0
Public-Code | PhoneDir | Phone book | 5 | 132 | 26.4 | 2.8
Public-Code | Bank | Manages bank operations | 5 | 175 | 35.0 | 3.4
Public-Code | Bank3 | Manages bank operations (different from Bank service) | 6 | 377 | 62.8 | 9.0

Web Services Manual Inspection

To perform a correct evaluation of the services it is essential to correctly identify the existing vulnerabilities. This way, a team of three developers, with two or more years of experience in security of database centric web applications, was invited to review the source code looking for vulnerabilities (false positives were eliminated by cross-checking the vulnerabilities identified by different people). A key aspect is that different tools report vulnerabilities in different ways. In fact, for penetration testing tools (that identify vulnerabilities based on the web service responses) a vulnerability is counted for each vulnerable parameter that allows attacking the service. On the other hand, for static analysis tools (that vet services code looking for possible security issues) a vulnerability is counted for each vulnerable line in the service code. Thus,
we asked the security experts to identify both the parameters and the source code lines prone to attacks. Table 5 presents the summary of the vulnerabilities detected by the security experts. The results show a total of 61 inputs and 28 lines of code prone to SQL Injection attacks (the only type of vulnerability identified). Due to space reasons we do not detail these results; interested readers can find them at (Antunes et al., 2008b). Note that, for the results analysis, we assume that the services have no more vulnerabilities than the union of the vulnerabilities detected by the security experts and by the tools used (obviously, excluding the false positives).
Table 5. Vulnerabilities found in the home-implemented services’ code

Service | #Vulnerable Inputs | #Vulnerable Lines
ProductDetail | 0 | 0
NewProducts | 1 | 1
NewCustomer | 15 | 2
ChangePaymentMethod | 2 | 1
JamesSmith | 20 | 5
PhoneDir | 6 | 4
Bank | 4 | 3
Bank3 | 13 | 12
Total | 61 | 28

Penetration Testing Results

Figure 3.a) shows the results for the penetration testing tools (the number of vulnerable inputs detected by the security experts is shown in the last column of the graphic). As we can see, the different tools reported a different number of vulnerabilities and none of the tools detected more than 51% of the vulnerabilities detected by the security experts. Excluding the false positives, penetration testing tools did not detect any vulnerability that was not previously detected by the security team. In fact, VS1.2 identified the highest number of vulnerabilities (50.8% of the total), but it was also the scanner with the highest number of false positives (it detected 5 vulnerabilities that, in fact, do not exist). The intersection of the vulnerable inputs detected by the different tools and by the security experts is illustrated in Figure 3.b). As we can see, different penetration testing tools detected different vulnerabilities and, from the set of vulnerabilities detected by the experts, we observe that VS1.2 missed only one (in red in Figure 3.b)). This is interesting, considering that all the other scanners detected it, although they presented lower coverage than VS1.2 (the difference is related to the way the scanners interpret the web service response). Additionally, only 5 vulnerabilities were detected by all the penetration testing tools, but this number is, obviously, limited by the low coverage of VS3.
Static Code Analysis Results

Figure 4.a shows the number of vulnerable lines identified by the static code analyzers (the number of vulnerabilities detected by the security experts is shown in the last column of the graphic). As shown, SA2 detected the highest number of vulnerabilities, with 100% coverage, but identified 10 false positives, which represent 26.3% of the vulnerabilities it reported. The coverage of SA3 (39.3%) is very low when compared to the other two tools. The high rate of false positives is, in fact, a problem shared by all the static analyzers used, as all reported more than 23% false positives. These tools detect certain code patterns that usually indicate vulnerabilities, but the problem is that many times they report vulnerabilities that do not exist. Figure 4.b illustrates the intersection of vulnerable lines detected by the different tools. Here it is visible that different analyzers detected different vulnerabilities. SA2 detected exactly the same 28 vulnerabilities identified by the security experts. A key observation is that only 9 out of the 28
Detecting Vulnerabilities in Web Services
existing vulnerabilities were detected by all the static analyzers, but this is obviously limited by the coverage of SA3.
Comparing Penetration Testing with Static Analysis

As mentioned before, different tools report vulnerabilities in different ways (i.e., vulnerable parameters vs. vulnerable lines), which means that it is not possible to compare penetration testing
tools and static code analysis tools in terms of the absolute number of vulnerabilities detected. This way, in this section we compare the two detection approaches based on the coverage and false-positive rates. It is important to emphasize that this comparison cannot be generalized, as we tested a limited set of tools and used a limited set of web services. In other words, the coverage and false-positive results presented are only valid in the context of the experiments performed in the present work. Nevertheless, these results provide an interesting indication of the usefulness of the two approaches and of the effectiveness of some existing tools.

Figure 3. Results for penetration testing in the home-implemented web services

Figure 4. Results for static code analysis in the home-implemented web services

Figure 5 compares the coverage and the false positives for the tools tested in the present work. Results show that the coverage of static code analysis tools is typically much higher than that of penetration testing tools (this is due to the fact that static analysis processes the entire code, while penetration testing is dependent on the coverage of the workload, which is frequently quite low). False positives are a problem for both approaches, but have more impact in the case of static analysis. A key observation is that different tools implementing the same approach frequently report different vulnerabilities in the same piece of code. Although the results of this study cannot be generalized,
they highlight the strengths and limitations of both approaches and suggest that future research on this topic is of utmost importance. Finally, results show that web services programmers should be very careful when selecting a vulnerability detection approach and when interpreting the results provided by automated tools.
Selecting Vulnerability Detection Tools

The low coverage and the high number of false positives observed in the case studies presented before highlight the limitations of vulnerability detection tools. This way, developers urgently need a practical approach that helps them compare alternative tools concerning their ability to detect vulnerabilities.
Figure 5. Comparison of penetration testing with static analysis
Computer benchmarks are standard tools that allow evaluating and comparing different systems or components according to specific characteristics (e.g., performance, robustness, dependability, etc.) (Gray, 1992). We believe that a benchmarking approach is a powerful way to evaluate and compare vulnerability detection tools. Our proposal to benchmark vulnerability detection tools is mainly inspired by measurement-based techniques. The basic idea is to exercise the tools under benchmarking using web services code with and without vulnerabilities and, based on the detected vulnerabilities, calculate a small set of measures that portray the detection capabilities of the tools. In this context, the main components of such a benchmark are (a skeleton illustrating them in code follows the list):

• Workload: represents the work that a tool must perform during the benchmark execution. In practice, it consists of a set of services (with and without security vulnerabilities) that are used to exercise the vulnerability detection tools during the benchmarking process.

• Measures: characterize the effectiveness of the tools under benchmarking in detecting the vulnerabilities that exist in the workload services. The measures must be easy to understand and must allow the comparison among different tools.

• Procedure: describes the procedure and rules that must be followed during the benchmark execution.
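Read as code, the three components suggest a thin harness: the workload is a set of service versions with known vulnerability labels, the procedure runs each tool over every version, and the measures are computed from the accumulated counts. A hypothetical skeleton (names and types are ours, not part of the benchmark specification):

```java
import java.util.List;
import java.util.Set;

public class BenchmarkHarness {

    /** One workload item: a service version plus its known vulnerabilities. */
    record WorkloadItem(String serviceCode, Set<String> knownVulnerabilities) {}

    /** Adapter each tool under benchmarking must implement. */
    interface ToolUnderBenchmark {
        Set<String> detect(String serviceCode);
    }

    /** Procedure: run the tool over every workload item and aggregate counts. */
    static int[] run(ToolUnderBenchmark tool, List<WorkloadItem> workload) {
        int truePositives = 0, falsePositives = 0, known = 0;
        for (WorkloadItem item : workload) {
            Set<String> reported = tool.detect(item.serviceCode());
            for (String v : reported) {
                if (item.knownVulnerabilities().contains(v)) truePositives++;
                else falsePositives++;
            }
            known += item.knownVulnerabilities().size();
        }
        // The three counts are the raw inputs to the benchmark measures.
        return new int[] { truePositives, falsePositives, known };
    }
}
```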
Due to the high diversity of vulnerability types and detection approaches, the definition of a benchmark that can be used for all vulnerability detection tools is an unattainable goal. This way, as recommended in (Gray, 1992), a benchmark must be specifically targeted to a particular domain. The benchmark presented in this section is available at (Antunes & Vieira, 2009) and focuses on penetration testing and static analysis tools,
able to detect SQL Injection vulnerabilities in web services.
Workload

The workload defines the work that has to be done by the vulnerability detection tool during the benchmark execution. In practice, the workload includes the code that is used to exercise the vulnerability detection capabilities of the tools under benchmarking. It is mainly influenced by two factors:

• The types of vulnerabilities (e.g., SQL Injection, XPath Injection, file execution) detected by the tools under benchmarking, which define the types of vulnerabilities that must exist in the workload.

• The vulnerability detection approaches (e.g., penetration testing, static analysis), which specify the approaches used by the tools under benchmarking to detect vulnerabilities.
In order to define a representative workload we adapted code from three standard benchmarks developed by the Transaction Processing Performance Council, namely: TPC-App, TPC-C, and TPC-W (see details on these benchmarks at (Transaction Processing Performance Council, 2009)). TPC-App is a performance benchmark for web services infrastructures and specifies a set of web services accepted as representative of real environments. TPC-C is a performance benchmark for transactional systems and specifies a set of transactions that include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses. Finally, TPC-W is a benchmark for web-based transactional systems. The business represented by TPC-W is a retail store over the Internet where several clients access the website to browse, search, and process orders. Although TPC-C and TPC-W do not define the transactions
in the form of web services, they can easily be implemented and deployed as such. A key issue is that the workload needs to include realistic SQL Injection vulnerabilities. Although feasible, the injection of artificial vulnerabilities (Fonseca, Vieira, & Madeira, 2007) introduces complexity and may suffer from representativeness issues. When possible, the option should be to consider code with real vulnerabilities, (inadvertently) introduced by the developers during the coding process. This way, for the present work we invited an external developer to implement the TPC-App web services (without disclosing the objective of the implementation in order not to influence the final result) and successfully searched the Web for publicly available implementations of TPC-C and TPC-W, which were adapted to the form of web services by the same external developer (this adaptation consisted basically of the encapsulation of the transactions as web services, without modifying the functional structure of the code). Obviously, this was a risky choice as there was some probability of getting code without vulnerabilities. However, as expected, the final code included several SQL Injection vulnerabilities (see Table 6), which is representative of the current situation in real web services development, as shown in the case studies presented before. To better understand the existing vulnerabilities, we invited a team of three external experts, with two or more years of experience in security of
Table 6. Vulnerabilities found in the workload

Benchmark   Service Name           Vulnerable Inputs   Vulnerable Queries   LOC    Avg. C.
TPC-App     ProductDetail          0                   0                    121    5
TPC-App     NewProducts            15                  1                    103    4.5
TPC-App     NewCustomer            1                   4                    205    5.6
TPC-App     ChangePaymentMethod    2                   1                    99     5
TPC-C       Delivery               2                   7                    227    21
TPC-C       NewOrder               3                   5                    331    33
TPC-C       OrderStatus            4                   5                    209    13
TPC-C       Payment                6                   11                   327    25
TPC-C       StockLevel             2                   2                    80     4
TPC-W       AdminUpdate            2                   1                    81     5
TPC-W       CreateNewCustomer      11                  4                    163    3
TPC-W       CreateShoppingCart     0                   0                    207    2.67
TPC-W       DoAuthorSearch         1                   1                    44     3
TPC-W       DoSubjectSearch        1                   1                    45     3
TPC-W       DoTitleSearch          1                   1                    45     3
TPC-W       GetBestSellers         1                   1                    62     3
TPC-W       GetCustomer            1                   1                    46     4
TPC-W       GetMostRecentOrder     1                   1                    129    6
TPC-W       GetNewProducts         1                   1                    50     3
TPC-W       GetPassword            1                   1                    40     2
TPC-W       GetUsername            0                   0                    40     2
            Total                  56                  49                   2654   -
Table 6 presents a summary of the vulnerabilities detected by the security experts, the total number of lines of code (LOC) per service, and the average cyclomatic complexity (Avg. C.) (Lyu, 1996) of the code, calculated using SourceMonitor. The results show a total of 56 vulnerable inputs and 49 vulnerable SQL queries in the set of services considered. To exercise the tools under benchmarking in a more exhaustive and realistic manner, we decided to generate additional versions of the web services. The first step consisted of creating a new version of each service with all the known vulnerabilities fixed. Then we generated several versions of each service, each having only one vulnerable SQL query. This way, for each web service we have one version without known vulnerabilities, one version with N vulnerabilities, and N versions with one vulnerable SQL query each. This accounts for a total of 80 different versions, with 158 vulnerable inputs and 87 vulnerable queries, which we believe is enough to exercise vulnerability detection tools (see the Properties Validation subsection for a discussion of the benchmark properties).
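To make the defect pattern concrete, the following sketch (a hypothetical illustration in the spirit of the TPC-W GetPassword service, not the actual workload code; the table and column names are assumptions) contrasts a query built by string concatenation, which is vulnerable to SQL Injection, with its parameterized counterpart:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical illustration only; NOT the actual workload code.
public final class PasswordDao {

    // Vulnerable version: the input is concatenated into the SQL string, so a
    // value such as ' OR '1'='1 changes the structure of the query (SQL Injection).
    public static String getPassword(Connection con, String username) throws SQLException {
        String sql = "SELECT c_passwd FROM customer WHERE c_uname = '" + username + "'";
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            return rs.next() ? rs.getString(1) : null;
        }
    }

    // Fixed version: a PreparedStatement binds the input as data, never as SQL text.
    public static String getPasswordSafe(Connection con, String username) throws SQLException {
        String sql = "SELECT c_passwd FROM customer WHERE c_uname = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, username);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}

The "fixed" versions of the services mentioned above correspond to the second form, in which user inputs can no longer change the structure of the query.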
Measures

The measures in a benchmark for vulnerability detection tools must be understood as results that are useful to characterize the tools in a relative fashion (e.g., to compare two alternative tools). A key difficulty in defining the benchmark measures is that different vulnerability detection tools report vulnerabilities in different ways. In fact, penetration-testing tools (which identify vulnerabilities based on the application responses) report vulnerable inputs, while static analysis tools (which vet the code looking for possible security issues) report vulnerable lines of code. Due to this dichotomy, it is very difficult (or even impossible) to compare the effectiveness of tools that implement different vulnerability detection approaches based on the number of vulnerabilities reported for the same piece of code. Our proposal is therefore to characterize vulnerability detection tools using the F-Measure (Van Rijsbergen, 1979), which is largely independent of the way vulnerabilities are counted. It represents the harmonic mean of two widely used measures, precision and recall, which, in the context of vulnerability detection, can be defined as follows.

Precision is the ratio of correctly detected vulnerabilities to the number of all detected vulnerabilities:

    precision = TP / (TP + FP)    (1)
Recall is the ratio of correctly detected vulnerabilities to the number of known vulnerabilities:

    recall = TP / TV    (2)
where:

• TP (true positives) is the number of true vulnerabilities detected (i.e., vulnerabilities that, in fact, exist in the code);
• FP (false positives) is the number of reported vulnerabilities that, in fact, do not exist in the code;
• TV (true vulnerabilities) is the total number of vulnerabilities that exist in the workload code.
Assuming an equal weight for precision and recall, the formula for the F-Measure is:

    F-Measure = (2 × precision × recall) / (precision + recall)    (3)
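As a minimal sketch of how these measures can be computed (assuming the TP and FP counts have already been obtained by matching a tool's report against the known vulnerabilities), consider the following:

// Minimal sketch: computes precision, recall, and F-Measure from raw counts.
// Assumes TP and FP have already been derived by comparing a tool's report
// with the known vulnerabilities in the workload.
public final class BenchmarkMeasures {

    public static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    public static double recall(int tp, int tv) {
        return tv == 0 ? 0.0 : (double) tp / tv;
    }

    public static double fMeasure(double precision, double recall) {
        return precision + recall == 0 ? 0.0
                : 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Illustrative numbers only: a tool reporting 30 vulnerabilities,
        // 27 of them real, against a workload with 49 known vulnerable queries.
        double p = precision(27, 3);
        double r = recall(27, 49);
        System.out.printf("precision=%.3f recall=%.3f F-Measure=%.3f%n",
                p, r, fMeasure(p, r));
    }
}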
The three measures can be used to establish different rankings, depending on the purposes of the benchmark user. Note that these are proven metrics, typically used to portray the effectiveness of many computer systems, particularly information retrieval systems. Thus, these measures should be easily understood by most users.
Procedure

The benchmarking procedure consists of four steps:

1. Preparation: select the vulnerability detection tools to be benchmarked.
2. Execution: use the tools under benchmarking to detect vulnerabilities in the workload code.
3. Measures calculation: analyze the vulnerabilities reported by the tools and calculate the measures.
4. Ranking and selection: rank the tools under benchmarking using the F-Measure, precision, and recall measures. Based on the preferred ranking, select the most effective tool (or tools).

After executing the benchmark it is necessary to compare the vulnerabilities reported by each tool with the ones that effectively exist in the workload code. Vulnerabilities correctly detected are counted as true positives, and vulnerabilities reported but that do not exist in the code are counted as false positives. This is the information needed to calculate the precision and recall of the tool and, consequently, the F-Measure.
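A small sketch of this matching step follows (assuming, for illustration only, that both the tool report and the known-vulnerability list identify each vulnerability by a simple string such as service name plus query identifier; this representation is hypothetical, not the benchmark's actual format):

import java.util.Set;

// Minimal sketch of step 3: derive TP and FP by comparing the set of
// vulnerabilities a tool reported with the set known to exist.
public final class ReportMatcher {

    // Returns {TP, FP}: reported items also in the known set are true
    // positives; reported items not in the known set are false positives.
    public static int[] match(Set<String> reported, Set<String> known) {
        int tp = 0, fp = 0;
        for (String v : reported) {
            if (known.contains(v)) tp++; else fp++;
        }
        return new int[] { tp, fp };
    }

    public static void main(String[] args) {
        // Hypothetical identifiers, for illustration only.
        Set<String> known = Set.of("GetPassword:query1", "Payment:query3");
        Set<String> reported = Set.of("GetPassword:query1", "Delivery:query2");
        int[] counts = match(reported, known);
        System.out.println("TP=" + counts[0] + " FP=" + counts[1]); // TP=1 FP=1
    }
}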
Benchmarking Example

In this benchmarking example we consider the same penetration testing and static code analysis tools (and corresponding configurations) used in the case study presented in Section 3.2. Again, for the presentation of the results we do not mention the brands of the tools, to assure neutrality. The tools are thus referred to in the rest of this section as VS1.2, VS2, VS3, SA1, SA2, and SA3 (using the same acronyms as in the previous case studies).
Benchmarking Results

The tools under benchmarking were run over the workload code (Step 2: Execution). The vulnerabilities detected were then compared with the existing ones and used to calculate the benchmark measures (Step 3: Measures calculation). Table 7 presents the overall benchmarking results (F-Measure, precision, and recall). As we can see, the SA2 tool is the one that presents the highest F-Measure. Additionally, another static code analysis tool, SA1, presents better results than the penetration-testing tools. SA3 and VS3 are the tools with the lowest F-Measure.
Table 7. Benchmarking results

Tool     F-Measure   Precision   Recall
SA1        0.691       0.923      0.552
SA2        0.780       0.640      1
SA3        0.204       0.325      0.149
VS1.2      0.378       0.455      0.323
VS2        0.297       0.388      0.241
VS3        0.037       1          0.019
The benchmark measures can easily be used to rank the tools under benchmarking (Step 4: Ranking and selection) according to three criteria: precision (focus on the balance between true positives and false positives), recall (focus on the true positive rate), and F-Measure (focus on the balance between precision and recall). Table 8 presents possible rankings for the tools. We divide the ranking in two parts, according to the approach used to report vulnerabilities (inputs or queries), as defining a single ranking for tools that search for vulnerabilities in different ways may not be meaningful (although the benchmark measures do allow such a ranking). Tools presented in the same cell are ranked in the same position due to the similarity of their results. As we can see from the analysis of Table 8, SA2 seems to be the most effective tool considering both the F-Measure and recall. However, the most effective tool when we consider precision is SA1, with SA2 being the second best. VS3 seems to be the least effective tool in terms of F-Measure and recall. However, it has very good precision (in fact, it reported no false positives, but detected only 3 of the existing vulnerabilities). Excluding SA3, static analysis seems to be a better option than penetration testing. The following sections discuss the benchmark results in more detail.
Table 8. Benchmarked tools ranking

Criteria                1st      2nd      3rd
Inputs     F-Measure    VS1.2    VS2      VS3
           Precision    VS3      VS1.2    VS2
           Recall       VS1.2    VS2      VS3
Queries    F-Measure    SA2      SA1      SA3
           Precision    SA1      SA2      SA3
           Recall       SA2      SA1      SA3

Results for Penetration Testing

Figure 6 shows the vulnerabilities reported by the penetration testing tools (the last bar in the graph presents the total number of vulnerabilities in the workload code). As we can see, the different tools reported different numbers of vulnerabilities.
Figure 6. Benchmarking results for the penetration testing tools
An important observation is that all the tools detected less than 35% of the known vulnerabilities. VS1.2 identified the highest number of vulnerabilities (≈32% of the total). However, it was also the scanner with the highest number of false positives. The very low number of vulnerabilities detected by VS3 can be partially explained by the fact that this tool does not allow the user to provide any information about input domains, nor does it accept an example request. This means that the tool generates a completely random workload that probably cannot exercise the parts of the code that require specific input values.
Results for Static Analysis

Figure 7 shows the number of vulnerable SQL queries identified by the static analyzers. SA2 detected the highest number of vulnerabilities, achieving 100% true positives (an excellent result), but it also reported 49 false positives, which represent ≈36% of the vulnerabilities it identified. The high rate of false positives is, in fact, a problem shared by SA3, for which more than ≈67% of the reported vulnerabilities were false positives. The reason is that these tools detect certain patterns that usually indicate vulnerabilities, but many times they flag vulnerabilities that do not exist, due to the intrinsic limitations of analyzing only the static profile of the code.

Figure 7. Benchmarking results for static analysis
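To illustrate the kind of pattern that leads to such false positives, consider the following sketch (a hypothetical illustration, not taken from the workload): a static analyzer that flags every query assembled by string concatenation would report this method, even though the concatenated value is a hard-coded constant and can never carry an attack:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical false-positive pattern: the query is built by concatenation,
// which matches the "tainted SQL" pattern many static analyzers look for,
// but the concatenated value is a constant, so no injection is possible.
public final class StockReport {

    private static final String STATUS = "IN_STOCK"; // constant, not user input

    public static int countInStock(Connection con) throws SQLException {
        String sql = "SELECT COUNT(*) FROM item WHERE status = '" + STATUS + "'";
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            rs.next();
            return rs.getInt(1);
        }
    }
}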
Properties Validation

When proposing a new benchmark it is of the utmost importance to validate and discuss its representativeness, portability, repeatability, non-intrusiveness, and simplicity of use (Gray, 1992). The representativeness of the workload greatly influences the representativeness of the benchmark. As mentioned before, we are aware of the limitations of the workload code, as it may not be representative of all the SQL Injection vulnerability patterns found in web services. Our thesis is that the workload should be good enough to allow the comparison of vulnerability detection tools. In fact, what matters is that the benchmark results accurately portray the tools' effectiveness in a relative manner. Comparing the benchmarking results with the effectiveness of the tools under benchmarking in different scenarios allows checking whether the benchmark accurately portrays the effectiveness of vulnerability detection tools in a relative manner. This way, we decided to compare
the benchmarking results with the results of the case study presented in Section 3.2. Table 9 presents a summary of the F-Measure, precision, and recall metrics based on the vulnerabilities detected in that case study. As expected, the measures are not equal to the ones reported by the benchmark (see Table 7). This is normal, as this set of services has different code characteristics and different SQL Injection vulnerabilities. Table 10 presents the tools ranking obtained using this data (tools in the same cell are ranked in the same position). Comparing this ranking with the one obtained using the benchmark measures (see Table 8), we can observe the following: 1) the ranking based on the F-Measure is precisely the same; 2) the ranking based on precision differs only in the positions of VS2 and VS1.2; and 3) the ranking based on recall is the same. This suggests that the tools' ranking derived from the benchmarking campaign adequately portrays the relative effectiveness of the tools. However, to prove this property and improve the benchmark representativeness, more vulnerable web services need to be added to the workload. Note that progressively improving/modifying the workload is also a way of preventing vendors from tuning their tools to the benchmark. Regarding portability, the benchmark seems to be quite portable: we were able to successfully benchmark three penetration testers and three static code analyzers, provided by different vendors and with very different functional characteristics. The benchmark is portable because it is not based on the implementation details of any specific tool. The proposed benchmark must also report similar results when used more than once over the same tool. To check repeatability we executed the benchmark two more times for VS1.2 and SA2 (the penetration tester and the static code analyzer with the highest F-Measure). Table 11 presents the results of the three executions.
Table 9. Results for third-party web services

Tool     TP    FP    F-Measure    Precision    Recall
SA1      23     7      79.3%        76.7%       82.1%
SA2      28    10      84.9%        73.7%      100%
SA3      11     4      51.2%        73.3%       39.3%
VS1.2    31     5      63.9%        86.1%       50.8%
VS2      22     1      52.4%        95.7%       36.1%
VS3       6     0      17.9%       100%          9.8%
Table 10. Ranking based on third-party services

Criteria                1st      2nd      3rd
Inputs     F-Measure    VS1.2    VS2      VS3
           Precision    VS3      VS2      VS1.2
           Recall       VS1.2    VS2      VS3
Queries    F-Measure    SA2      SA1      SA3
           Precision    SA1      SA2      SA3
           Recall       SA2      SA1      SA3
Table 11. Repeatability results

                      VS1.2                        SA2
            Run 0    Run 1    Run 2     Run 0    Run 1    Run 2
F-Measure   0.378    0.381    0.378     0.78     0.78     0.78
Precision   0.455    0.452    0.455     0.64     0.64     0.64
Recall      0.323    0.329    0.323     1        1        1
As we can see, the results for SA2 are always the same. This was expected, as static code analyzers analyze the code in a deterministic manner, which removes variance from the results. On the other hand, some small variations can be observed for VS1.2. However, these variations are always under 0.01, which suggests that the benchmark is quite repeatable. The benchmark does not require any changes to the benchmarked tools, which guarantees the non-intrusiveness property. This is possible because the measures portray the tools' effectiveness from the point of view of the service they provide (i.e., the vulnerabilities reported) and not based on their internal behavior. The proposed benchmark is also quite simple to use (in part because most steps are automatic). In fact, we were able to run the benchmark for all the tools in about 6 man-days, which corresponds to an average of 0.75 man-days per benchmarking experiment. Running the benchmark only requires executing the tools and comparing the reported vulnerabilities with the ones that effectively exist.
CONCLUSION AND FUTURE RESEARCH ISSUES

In this chapter we presented two case studies on the effectiveness of penetration testing and static analysis in the detection of vulnerabilities in web services. Results show that selecting a vulnerability detection tool is a very difficult task and that, in general, the effectiveness of those tools is
quite low. Another key observation is that a large number of vulnerabilities was observed in the web services tested (including publicly available ones), confirming that many services are deployed without proper security testing. Based on the observations of the case studies, we argue that developers urgently need a benchmarking approach that helps them select the most adequate vulnerability detection tool. Thus, the chapter presented an approach to benchmark the effectiveness of SQL Injection vulnerability detection tools for web services. Several tools have been benchmarked, including both commercial and open-source tools. The results show that the proposed benchmark can easily be used to assess and compare penetration testers and static code analyzers. A key remark is that developers must be careful when relying on vulnerability detection tools to secure their code. In practice, developers need to improve development practices to build more secure web services and must select the detection tools that best fit their specific scenarios. No tool provides a complete solution, and multiple vulnerability detection tools and approaches should be considered before deploying business-critical web services. Future research on the improvement of the existing methodologies is still needed. In the case of static code analyzers, it is necessary to reduce the number of false positives identified. For penetration testing tools, it is necessary to investigate techniques to increase the coverage of the vulnerabilities detected. Techniques that combine static code
analysis with penetration testing are also important to improve the quality of the results obtained. Another open research topic is the analysis of the scalability of these tools in terms of the cost of testing a large number of web services. It is also necessary to understand how the complexity of the web services impacts the time necessary to perform the tests. Concerning benchmarking, an open issue is the development of methodologies to inject representative vulnerabilities in order to generate different workloads from any set of web services.
REFERENCES

Acunetix. (2008). Acunetix Web Vulnerability Scanner. Retrieved September 29, 2008, from http://www.acunetix.com/vulnerability-scanner/

Antunes, N., & Vieira, M. (2009). A benchmark for SQL Injection vulnerability detection tools for Web Services. Retrieved from http://eden.dei.uc.pt/~mvieira/

Antunes, N., Vieira, M., & Madeira, H. (2008a). Web Services Vulnerabilities. Retrieved from http://eden.dei.uc.pt/~mvieira/

Antunes, N., Vieira, M., & Madeira, H. (2008b). Penetration Testing vs Static Code Analysis. Retrieved from http://eden.dei.uc.pt/~mvieira/

Campwood Software. (2008). SourceMonitor Version 2.5. Retrieved from http://www.campwoodsw.com/sourcemonitor.html

Chappell, D. A., & Jewell, T. (2002). Java Web Services. Sebastopol, CA: O’Reilly & Associates, Inc.

Exhedra Solutions, Inc. (2009). Planet Source Code. Retrieved from http://www.planet-source-code.com/

Fonseca, J., Vieira, M., & Madeira, H. (2007). Testing and Comparing Web Vulnerability Scanning Tools for SQL Injection and XSS Attacks. In 13th Pacific Rim International Symposium on Dependable Computing (PRDC 2007) (pp. 365-372).

Fortify Software. (2008). Fortify 360 Software Security Assurance. Retrieved June 8, 2010, from http://www.fortify.com/products/fortify-360/

Foundstone, Inc. (2005). Foundstone WSDigger. Foundstone Free Tools. Retrieved September 29, 2008, from http://www.foundstone.com/us/resources/proddesc/wsdigger.htm

Gray, J. (1992). Benchmark Handbook: For Database and Transaction Processing Systems. San Francisco, CA: Morgan Kaufmann Publishers Inc.

HP. (2008). HP WebInspect. Retrieved September 29, 2008, from https://h10078.www1.hp.com/cda/hpms/display/main/hpms_content.jsp?zn=bto&cp=1-11-201200%5E9570_4000_100__

IBM. (2008). IBM Rational AppScan. Retrieved September 29, 2008, from http://www-01.ibm.com/software/awdtools/appscan/

JetBrains. (2009). IntelliJ IDEA. Retrieved from http://www.jetbrains.com/idea/free_java_ide.html

Livshits, V. B., & Lam, M. S. (2005). Finding security vulnerabilities in Java applications with static analysis. In Proceedings of the 14th Conference on USENIX Security Symposium - Volume 14 (p. 18).

Lyu, M. R. (Ed.). (1996). Handbook of software reliability engineering. Hightstown, NJ: McGraw-Hill, Inc.

OWASP Foundation. (2008). OWASP WSFuzzer Project. Retrieved September 29, 2008, from http://www.owasp.org/index.php/Category:OWASP_WSFuzzer_Project

Scovetta, M. (2008). Yet Another Source Code Analyzer. Retrieved October 8, 2008, from http://www.yasca.org

Stuttard, D., & Pinto, M. (2007). The web application hacker’s handbook: Discovering and exploiting security flaws. Wiley Publishing, Inc.

Transaction Processing Performance Council. (2008, February 28). TPC Benchmark™ App (Application Server) Standard Specification, Version 1.3. Retrieved December 7, 2008, from http://www.tpc.org/tpc_app/

Transaction Processing Performance Council. (2009). Transaction Processing Performance Council. Retrieved from http://www.tpc.org/

University of Maryland. (2009). FindBugs™ - Find Bugs in Java Programs. Retrieved March 12, 2009, from http://findbugs.sourceforge.net/

Van Rijsbergen, C. J. (1979). Information retrieval. London: Butterworths.
ADDITIONAL READING

Antunes, N., Laranjeiro, N., Vieira, M., & Madeira, H. (2009). Effective Detection of SQL/XPath Injection Vulnerabilities in Web Services. In IEEE International Conference on Services Computing (pp. 260-267). Presented at the 2009 IEEE International Conference on Services Computing, Bangalore, India: IEEE Computer Society.

Antunes, N., & Vieira, M. (2009a). Comparing the Effectiveness of Penetration Testing and Static Code Analysis on the Detection of SQL Injection Vulnerabilities in Web Services. In 15th IEEE Pacific Rim International Symposium on Dependable Computing (pp. 301-306). Presented at the 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing, Shanghai, China: IEEE Computer Society.
Antunes, N., & Vieira, M. (2009b). Detecting SQL Injection Vulnerabilities in Web Services. In Fourth Latin-American Symposium on Dependable Computing (pp. 17-24). Presented at the Fourth Latin-American Symposium on Dependable Computing, Joao Pessoa, Brazil: IEEE Computer Society. Arkin, B., Stender, S., & McGraw, G. (2005). Software penetration testing. IEEE Security & Privacy, 3(1), 84–87. doi:10.1109/MSP.2005.23 Ayewah, N., Hovemeyer, D., Morgenthaler, J. D., Penix, J., & Pugh, W. (2008). Using static analysis to find bugs. IEEE Software, 25(5), 22–29. doi:10.1109/MS.2008.130 Boyd, S. W., & Keromytis, A. D. (2004). SQLrand: Preventing SQL injection attacks. In Applied Cryptography and Network Security (pp. 292–302). Buehrer, G., Weide, B. W., & Sivilotti, P. A. (2005). Using parse tree validation to prevent SQL injection attacks. In Proceedings of the 5th international workshop on Software engineering and middleware (p. 113). Christensen, E., Curbera, F., Meredith, G., & Weerawarana, S. (2001, March 15). Web Service Definition Language (WSDL) 1.1. World Wide Web Consortium (W3C). Retrieved October 18, 2007, from http://www.w3.org/TR/wsdl Christey, S., & Martin, R. A. (2006). Vulnerability type distributions in CVE. V1. 0, 10, 04. Curbera, F., Duftler, M., Khalaf, R., Nagy, W., Mukhi, N., & Weerawarana, S. (2002). Unraveling the Web services web: an introduction to SOAP, WSDL, and UDDI. IEEE Internet Computing, 6(2), 86–93. doi:10.1109/4236.991449
Fonseca, J., Vieira, M., & Madeira, H. (2009). Vulnerability & attack injection for web applications. In IEEE/IFIP International Conference on Dependable Systems & Networks, 2009. DSN ‘09 (pp. 93-102). Presented at the IEEE/IFIP International Conference on Dependable Systems & Networks, 2009. DSN ‘09. Halfond, W. G., & Orso, A. (2005). AMNESIA: analysis and monitoring for NEutralizing SQL-injection attacks. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (p. 183). Huang, Y., Huang, S., Lin, T., & Tsai, C. (2003). Web application security assessment by fault injection and behavior monitoring. In Proceedings of the 12th international conference on World Wide Web (pp. 148-159). Budapest, Hungary: ACM. Kiczales, G. J., Lamping, J. O., Lopes, C. V., Hugunin, J. J., Hilsdale, E. A., & Boyapati, C. (2002). Aspect-oriented programming. Google Patents. Laranjeiro, N., Vieira, M., & Madeira, H. (2009a). Improving Web Services Robustness. In 2009 IEEE International Conference on Web Services (pp. 397–404). Laranjeiro, N., Vieira, M., & Madeira, H. (2009b). Protecting Database Centric Web Services against SQL/XPath Injection Attacks. In Database and Expert Systems Applications (pp. 271–278). Lindstrom, P. (2004). Attacking and defending web service. A Spire Research Report. Madeira, H., Kanoun, K., Arlat, J., Crouzet, Y., Johanson, A., & Lindström, R. (2000). Preliminary Dependability Benchmark Framework. DBench Project, IST, 25425. Singhal, A., Winograd, T., & Scarfone, K. (2007). Guide to Secure Web Services: Recommendations of the National Institute of Standards and Technology. Report, National Institute of Standards and Technology, US Department of Commerce, 800–95.
Stock, A., Williams, J., & Wichers, D. (2007). OWASP top 10. OWASP Foundation, July. Valeur, F., Mutz, D., & Vigna, G. (2005). A learning-based approach to the detection of SQL attacks. Intrusion and Malware Detection and Vulnerability Assessment, 123–140. Vieira, M., Laranjeiro, N., & Madeira, H. (2007a). Assessing robustness of web-services infrastructures. In 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2007. DSN’07 (pp. 131–136). Vieira, M., Laranjeiro, N., & Madeira, H. (2007b). Benchmarking the robustness of web services. In 13th Pacific Rim International Symposium on Dependable Computing, 2007. PRDC 2007 (pp. 322–329). Wagner, S., Jürjens, J., Koller, C., & Trischberger, P. (2005). Comparing bug finding tools with reviews and tests. Testing of Communicating Systems, 40–55. Zanero, S., Carettoni, L., & Zanchetta, M. (2005). Automatic Detection of Web Application Security Flaws. Black Hat Briefings.
KEY TERMS AND DEFINITIONS

Benchmark: Standard tool that allows evaluating and comparing different systems, components, or tools.
F-Measure: Metric that represents the harmonic mean of precision and recall.
False Positive: Vulnerability reported by a tool that, in fact, does not exist.
Penetration Testing: “Black-box” approach that consists of analyzing the execution of the application in search of vulnerabilities.
Precision: Ratio of correctly detected vulnerabilities to the number of all detected vulnerabilities.
Recall: Ratio of correctly detected vulnerabilities to the number of known vulnerabilities.
Static Code Analysis: “White-box” approach that consists of analyzing the source code of the web application. This can be done manually or by using automatic code analysis tools.
Vulnerability Detection Coverage: Metric that portrays the percentage of existing vulnerabilities that are detected by a given tool.
Vulnerability: Software weakness that can be maliciously exploited by attackers to gain access to proprietary data and resources.
Web Services: Self-contained and interoperable application components that communicate using open protocols (e.g., SOAP, the Simple Object Access Protocol).
Compilation of References
Aad, I., Hubaux, J.-P., & Knightly, E. W. (2004). Denial of service resilience in ad hoc networks. In Proceedings of the ACM Annual International Conference on Mobile Computing and Networking (MobiCom) (pp. 202-215), New York, NY, USA, 2004. ACM Press. Aalst, W. d., & Hee, K. v. (2002). Workflow Management: Models, Methods, and Systems. Cambridge, MA; London, England: The MIT Press. Aalst, W. M. P. d. (2000). Workflow Verification: Finding Control-Flow Errors Using Petri-Net-Based Techniques. In Business Process Management: Models, Techniques, and Empirical Studies (Vol. 1806, pp. 161–183). Berlin: Springer-Verlag. Abadie, J., & Carpentier, J. (1969). Generalization of the Wolfe reduced gradient method to the case of nonlinear constraints. In Fletcher, R. (Ed.), Optimization. New York, USA: Academic Press. Acunetix. (2008). Acunetix Web Vulnerability Scanner. Retrieved September 29, 2008, from http://www.acunetix.com/vulnerability-scanner/ Akyildiz, I. F., & Wang, X. (2005). A survey on wireless mesh networks. IEEE Communications Magazine, 43(9), 23–30. doi:10.1109/MCOM.2005.1509968 Alrifai, M., & Risse, T. (2009). Combining global optimization with local selection for efficient QoS-aware service composition. In Proceedings of the 18th International Conference on World Wide Web (WWW ‘09) (pp. 881-890). New York, NY: ACM. Altova. XML Spy. Retrieved from http://www.altova.com/products/xmlspy/xml_editor.html
Amazon, Auto Scaling and load balance. Retrieved October 24, 2010, from http://aws.amazon.com/autoscaling/ Amazon, EC2 SLA. Retrieved October 24, 2010, from http://aws.amazon.com/ec2-sla/ Amazon. (2010). Amazon Web Services Solutions Catalog. Retrieved September 15, 2008, from http://aws.amazon.com/products/ AmberPoint, Inc. (2003). Managing Exceptions in Web Services Environments, An AmberPoint Whitepaper. Retrieved from http://www.amberpoint.com. Andrews, T., Curbera, F., Dholakia, H., et al. (2003). Business process execution language for web services (BPEL4WS) 1.1. Andrieux, A., Czajkowski, K., Dan, A., Keahey, K., Ludwig, H., Nakata, T., Pruyne, J., Rofrano, J., Tuecke, S., & Xu, M. (2007). Web Services Agreement Specification (WS-Agreement). OGF proposed recommendation (GFD.107). Andrieux, A., Czajkowski, K., Dan, A., Ludwig, H., Nakata, T., Pruyne, J., et al. (2007). Web Service Agreement Specification (WS-Agreement). Retrieved from http://www.ogf.org/documents/GFD.107.pdf Anselone, P. M. (1960). Persistence of an Effect of a Success in a Bernoulli Sequence. Journal of the Society for Industrial and Applied Mathematics, 8(2), 272–279. doi:10.1137/0108015 Answers to Questions Relative to High Tension Transmission. (1904). Transactions of the American Institute of Electrical Engineers, XXIII, 571–604. doi:10.1109/TAIEE.1904.4764484
Amazon Web Services. Retrieved October 24, 2010, from http://aws.amazon.com/about-aws/
Antunes, N., & Vieira, M. (2009). A benchmark for SQL Injection vulnerability detection tools for Web Services. Retrieved from http://eden.dei.uc.pt/~mvieira/ Antunes, N., Laranjeiro, N., Vieira, M., & Madeira, H. (2009). Effective Detection of SQL/XPath Injection Vulnerabilities in Web Services. In IEEE International Conference on Services Computing (pp. 260-267). Presented at the 2009 IEEE International Conference on Services Computing, Bangalore, India: IEEE Computer Society. Antunes, N., Vieira, M., & Madeira, H. (2008a). Web Services Vulnerabilities. Retrieved from http://eden.dei.uc.pt/~mvieira/ Antunes, N., Vieira, M., & Madeira, H. (2008b). Penetration Testing vs Static Code Analysis. Retrieved from http://eden.dei.uc.pt/~mvieira/ Apache Software Foundation. (2008). Maven. Retrieved February 14, 2008, from http://maven.apache.org/ Ardagna, D., & Pernici, B. (2006). Dynamic Web service composition with QoS constraints. International Journal of Business Process Integration and Management, 1(3), 233–243. doi:10.1504/IJBPIM.2006.012622 Ardagna, D., & Pernici, B. (2007). Adaptive service composition in flexible processes. IEEE Transactions on Software Engineering, 33(6), 369–384. doi:10.1109/TSE.2007.1011 Artaiam, N., & Senivongse, T. (2008). Enhancing server-side QoS monitoring for Web Services. In Proceedings of the International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing. IEEE Computer Society.
Avizienis, A., Randell, B., & Landwehr, C. (2004). Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33. doi:10.1109/TDSC.2004.2 Avizienis, A., Laprie, J.-C., & Randell, B. (2001). Fundamental Concepts of Dependability. LAAS-CNRS, Technical Report N01145. Avizienis, A., Laprie, J. C., & Randell, B. (2004), Dependability and its threats: A taxonomy, IFIP World Computer Congress (WCC-2004), Toulouse, France (pp 91-120). Avriel, M. (2003). Nonlinear Programming: Analysis and Methods. Dover Publishing. Awerbuch, B., Curtmola, R., Holmer, D., Nita-Rotaru, C., & Herbert, R. (2004). Mitigating byzantine attacks in ad hoc wireless networks. Technical report. Center for Networking and Distributed Systems, Computer Science Department, Johns Hopkins University. Awerbuch, B., Curtmola, R., Holmer, D., Nita-Rotaru, C., & Herbert, R. (2005). On the survivability of routing protocols in ad hoc wireless networks. In Proceedings of the First International Conference on Security and Privacy for Emerging Areas in Communications Networks (pp. 327-338),Washington, DC, USA, 2005. IEEE Computer Society. AWS S3. Service Level Agreement. Retrieved 03 28, 2010, from AWS: http://aws.amazon.com/s3-sla/ AWS. EC2 Service Level Agreement. Retrieved 03 28, 2010, from AWS: http://aws.amazon.com/ec2-sla/
Atlassian. (2010). Clover - Code Coverage Analysis. Retrieved October 22, 2010, from http://www.atlassian.com/software/clover/
Ayanoglu, E., Chih-Lin, I., Gitlin, R. D., & Mazo, J. E. (1993). Diversity coding for transparent self-healing and fault-tolerant communication networks. IEEE Transactions on Communications, 41(11), 1677–1686. doi:10.1109/26.241748
Aura, T., & Maki, S. (2002). Towards a survivable security architecture for ad-hoc networks. In Proceedings of the International Workshop on Security Protocols (pp. 63–73), London, UK, 2002. Springer-Verlag.
Badidi, E., Esmahi, L., Adel Serhani, M., & Elkoutbi, M. (2006). WS-QoSM: A Broker-based Architecture for Web Services QoS Management (pp. 1–5). Innovations in Information Technology.
Avizienis, A. (1997). Toward Systematic Design of FaultTolerant Systems. IEEE Transactions on Computers, 30(4), 51–58.
Bai, X., Dong, W., Tsai, W. T., & Chen, Y. (2005). WSDLBased Automatic Test Case Generation for Web Services Testing, Proceding of the IEEE International Workshop on Service-Oriented System Engineering (SOSE), Beijing, 207-212.
Bai, X., Lee, S., Tsai, W. T., & Chen, Y. (2008). Ontology-based test modeling and partition testing of web services. Proceedings of the 2008 IEEE International Conference on Web Services. IEEE Computer Society. September, Beijing, China, 465–472. Baresi, L., Di Nitto, E., & Ghezzi, C. (2006). Toward open-world software: Issue and challenges. Computer, 39, 36–43. doi:10.1109/MC.2006.362 Baresi, L., & Guinea, S. (2006). Towards Dynamic Web Services. Paper presented at the Proceedings of the 28th International Conference on Software Engineering (ICSE). Baresi, L., Heckel, R., Thone, S., & Varro, D. (2003). Modeling and Validation of Service-Oriented Architectures: Application vs. Style. The fourth joint meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering. Barlow, R. E., & Proschan, F. (1967). Mathematical Theory of Reliability. New York, NY: John Wiley & Sons. Barlow, R. E. (2002). Mathematical reliability theory: from the beginning to the present time (Lindqvist, B. H., & Doksum, K. A., Eds.). Mathematical and Statistical Methods in Reliability, World Scientific Publishing. Barrett, P. A., Hilborne, A. M., Veríssimo, P., Rodrigues, L., Bond, P. G., Seaton, D. T., et al. (1990). The Delta-4 Extra Performance Architecture (XPA). Paper presented at the 20th International Symposium on Fault-Tolerant Computing (FTCS). Bartel, M., Boyer, J., Fox, B., LaMacchia, B., & Simon, E. (2008). W3C Recommendation XML Signature Syntax and Processing. Retrieved from http://www.w3.org/TR/xmldsig-core/ Bartolini, C., Bertolino, A., Lonetti, F., & Marchetti, E. (2011). Approaches to functional, structural and security SOA testing. Chapter 17 in this book. Bartolini, C., Bertolino, A., Elbaum, S. C., & Marchetti, E. (2009). Whitening SOA testing. Proceedings of the 7th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT International Symposium on Foundations of Software Engineering. Amsterdam, The Netherlands, August 24-28, 161-170.
Bartolini, C., Bertolino, A., Marchetti, E., & Parissis, I. (2008). Data Flow-based Validation of Web Services Compositions: Perspectives and Examples. In R. de Lemos, F. Di Giandomenico, H. Muccini, C. Gacek, & M. Vieira (Eds.), Architecting Dependable Systems V (LNCS 5135, pp. 298-325). Springer, 2008. Bartolini, C., Bertolino, A., Marchetti, E., & Polini, A. (2009). WS-TAXI: a WSDL-based testing tool for Web Services. Proceeding of 2nd International Conference on Software Testing, Verification, and Validation ICST 2009. Denver, Colorado USA, 2009, April 1-4. Basharin, G. P., Langville, A. N., & Naumov, V. A. (2004). The Life and Work of A. A. Markov. Linear Algebra and its Applications. Special Issue on the Conference on the Numerical Solution of Markov Chains, 386, 3-26. Basin, D., Doser, J., & Lodderstedt, T. (2006). Model driven security: From UML models to access control infrastructures. [TOSEM]. ACM Transactions on Software Engineering and Methodology, 15(1). Battré, D., Hovestadt, M., Kao, O., Keller, A., & Voss, K. (2007). Planning-based scheduling for SLA-awareness and grid integration. PlanSIG, (pp. 1). Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (1993). Nonlinear programming: theory and algorithms. New York, NY: John Wiley & Sons, Inc. Beauche, S., & Poizat, P. (2008). Automated Service Composition with Adaptive Planning. [LNCS]. Lecture Notes in Computer Science, 5364, 530–537. doi:10.1007/978-3-540-89652-4_42 Beck, K. (2003). Test-driven development: by example. Addison-Wesley Professional. Ben Lakhal, N., Kobayashi, T., & Yokota, H. (2005). A failure-aware model for estimating and analyzing the efficiency of web services compositions. 11th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2005), Changsha, Hunan, China (pp. 114-124). Benharref, A., Dssouli, R., Serhani, M., & Glitho, R. (2009). Efficient traces’ collection mechanisms for passive testing of Web Services. Information and Software Technology, 51(23), 362–374. doi:10.1016/j.infsof.2008.04.007
Bennani, M. N., & Menascé, D. A. (2005). Resource allocation for autonomic data centers using analytic performance models. In K. Schwan & Y. Wang (Eds.), ICAC’05: The Second IEEE International Conference on Autonomic Computing, 229-240. Berbner, R., Spahn, M., Repp, N., Heckmann, O., & Steinmetz, R. (2010). WSQoSX – A QoS Architecture for Web Service Workflows. [LNCS]. Lecture Notes in Computer Science, 4749, 623–624. doi:10.1007/978-3540-74974-5_59 Berbner, R., Spahn, M., Repp, N., Heckmann, O., & Steinmetz, R. (2006). Heuristics for QoS-aware web service composition. Proceedings of the IEEE International Conference on Web Services, 72-82. Berman, V., & Mukherjee, B. (2006). Data security in MANETs using multipath routing and directional transmission. In Proceedings of the IEEE International Conference on Communications (ICC) (pp. 2322-2328), v. 5. IEEE Computer Society. Bertolino, A. (2010). Can your software be validated to accept candies from strangers? Keynote at Spice Conference’10, 18-20 May 2010 - Pisa, Italy. Bertolino, A., & Polini, A. (2009). SOA test governance: Enabling service integration testing across organization and technology borders. In WebTest ’09: Proc. IEEE ICST Workshops, (pp. 277–286)., Washington, DC, USA. IEEE CS. Bertolino, A., & Polini, A. (2005). The audition framework for testing web services interoperability. 31st EUROMICRO International Conference on Software Engineering and Advanced Applications, 134-142. Bertolino, A., De Angelis, G., Frantzen, L., & Polini, A. (2008). The PLASTIC framework and tools for testing service-oriented applications. In (De Lucia & Ferrucci, 2009), (pp. 106–139). Bertolino, A., De Angelis, G., Frantzen, L., & Polini, A. (2008). Model-based Generation of Testbeds for Web Services. In Proc. of the 20th IFIP Int. Conference on Testing of Communicating Systems (TESTCOM 2008), LNCS. Springer Verlag. – to appear.
430
Bertolino, A., De Angelis, G., & Polini, A. (2007). A QoS Test-bed Generator for Web Services. In Proc. of ICWE 2007, number 4607 in LNCS. Springer. Bertolino, A., Frantzen, L., Polini, A., & Tretmans, J. (2006). Audition of Web Services for Testing Conformance to Open Specified Protocols. In R. Reussner and J. Stafford and C. Szyperski (Ed). Architecting Systems with Trustworthy Components. (pp. 1-25). LNCS 3938. Springer-Verlag. Bertolino, A., Gao, J., Marchetti, E., & Polini, A. (2007). Automatic Test Data Generation for XML Schema-based Partition Testing. Proceedings of the Second international Workshop on Automation of Software Test International Conference on Software Engineering. IEEE Computer Society. May 20 - 26, 2007. Washington. Bertolino, A., Lonetti, F., & Marchetti, E. (2010). Systematic XACML request generation for testing purposes. Accepted for publication to the 36th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA). 1-3 September 2010. Lille, France. Bessani, A.N., Sousa, P., Correia, M., Neves, N.F., & Verissimo, P. (2008). The CRUTIAL way of critical infrastructure protection. IEEE Security & Privacy, 6(6). Bhargavan, K., Fournet, C., & Gordon, A. (2008). Verifying policy-based web services security In Proc. of the 11th ACM conference on Computer and communications security, 268-277. Bhoj, P., Singhal, S., & Chutani, S. (2001). SLA Management in federated environments. Computer Networks, 35, 5–24. doi:10.1016/S1389-1286(00)00149-3 Bianco, P., & Kotermanski, R. Merson, & Summa, P. (2007). Evaluating a service-oriented architecture. Software Engineering Institute Technical Report CMU/ SEI-2007-TR-015. Bianculli, D., & Ghezzi, C. (2008). Dynamo-AOP user manual. PLASTIC EU Project. Binmore, K. (1991). Fun and games: a text on game theory. Lexington, MA: D. C. Heath. Birman, K., Renesse, R. v., & Vogels, W. (2004). Adding High Availability and Autonomic Behavior to Web Services. Paper presented at the International Conference on Software Engineering (ICSE 2004).
Compilation of References
Birnbaum, Z. W., Esary, J. D., & Saunders, S. C. (1961). Multi-component systems and structures and their reliability. Technometrics, 3(1). doi:10.2307/1266477 Blischke, W. R., & Murthy, D. N. P. (Eds.). (2003). Case Studies in Reliability and Maintenance. Hoboken, New Jersey: John Wiley & Sons. Blythe, J., Deelman, E., & Gil, Y. (2004). Automatically Composed Workflows for Grid Environments. IEEE Intelligent Systems, 16–23. doi:10.1109/MIS.2004.24 Bo, Y., & Xiang, L. (2007). A study on software reliability prediction based on support vector machines. In IEEE International Conference on Industrial Engineering and Engineering Management (pp. 1176-1180). Presented at the IEEE International Conference on Industrial Engineering and Engineering Management. doi:10.1109/ IEEM.2007.4419377 Bolch, G., Greiner, S., Meer, H., & Trivedi, K. S. (2006). Queueing Networks and Markov Chains - Modeling and Performance Evaluation with Computer Science Applications. John Wiley & Sons. Bonell, M. (1996). The UNIDROIT Principles of International Commercial Contracts and the Principles of European Contract Law: Similar Rules for the Same Purpose (pp. 229–246). Uniform Law Review. Boniface, M., Phillips, S., Sanchez-Macian, A., & Surridge, M. (2009). Dynamic service provisioning using GRIA SLAs. Service-Oriented Computing-ICSOC 2007 Workshops, (pp. 56-67). Vienna, Austria.
Brandic, I., Music, D., & Dustdar, S. (2009). Service Mediation and Negotiation Bootstrapping as First Achievements Towards Self-adaptable Grid and Cloud Services. In Grids and Service-Oriented Architectures for Service Level Agreements. P. Wieder, R. Yahyapour, and W. Ziegler (eds.), Springer, New York, USA. Brandic, I., Venugopal, S., Mattess, M., & Buyya, R. (2008). Towards a Meta-negotiation Architecture for SLA-Aware Grid Services. International Workshop on Service-Oriented Engineering and Optimization, (pp. 17). Bangalore, India. Bridgman, P. W. (1922). Dimensional analysis. New Haven: Yale University Press. Brown, M. (2010). Optimizing Apache Server Performance. Retrieved May 20, 2010, from http://www.serverwatch.com/tutorials/article.php/3436911 Browne, P. (2009). JBoss Drools Business Rules. Packt Publishing. Bruneo, D., Distefano, S., Longo, F., & Scarpa, M. (2010). QoS assessment of WS-BPEL processes through non-Markovian Stochastic Petri Nets. 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), Atlanta, USA (pp. 1-12). Buchegger, S., Mundinger, J., & Le Boudec, J. Y. (2008). Reputation systems for self-organized networks. IEEE Technology and Society Magazine, 27(1), 41–47. doi:10.1109/MTS.2008.918039
Bouchenak, S. (2010). Automated Control for SLA-Aware Elastic Clouds, Proceedings of the Fifth International Workshop on Feedback Control Implementation and Design in Computing Systems and Network. Paris, France (pp. 27-28).
Buchegger, S., & Le Boudec, J. Y. (2002). Performance analysis of the CONFIDANT protocol: Cooperation of nodes - fairness in dynamic ad-hoc networks. In Proceedings of the ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc), Lausanne, CH, 2002. IEEE Computer Society.
Boudriga, N. A., & Obaidat, M. S. (2005). Fault and intrusion tolerance in wireless ad hoc networks. In Proceedings of the IEEE Wireless Communications and Networking Conference (pp. 2281-2286), v. 4. IEEE Computer Society.
Buco, M. J., Chang, R. N., Luan, L. Z., Ward, C., Wolf, J. L., & Yu, P. S. (2004). Utility computing SLA management based upon business objectives. IBM Systems Journal, 43(1), 159–178. doi:10.1147/sj.431.0159
Box, G. E. P., Luceno, A., & Del Carmen PaniaguaQuinones, M. (2009). Statistical Control by Monitoring and Adjustment. Broché.
Budhiraja, N., Marzullo, K., & Schneider, F. B. (1992). Primary-backup protocols: Lower bounds and optimal implementations. Paper presented at the of the Third IFIP Conference on Dependable Computing for Critical Applications.
BPT homepage. (2010). Retrieved from http://bptesting.sourceforge.net/
Buttyan, L., & Hubaux, J. P. (2003). Stimulating cooperation in self-organizing mobile ad hoc networks. Mobile Network, 8(5), 579–592. doi:10.1023/A:1025146013151
Cannings, R., Dwivedi, H., & Lackey, Z. (2008). Hacking exposed Web 2.0: Web 2.0 security secrets and solutions. 1. New York, NY, USA: McGraw-Hill, Inc.
Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J., & Brandic, I. (2009). Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6), 599–616. doi:10.1016/j.future.2008.12.001
Cao, H., Ying, S., & Du, D. (2006). Towards Model-based Verification of BPEL with Model Checking. In Proceeding of Sixth International Conference on Computer and Information Technology (CIT 2006). 20-22 September 2006, Seoul, Korea, 190-194.
Buyya, R., & Alexida, D. (2001). A case for economy Grid architecture for service oriented Grid computing. In Proceedings of the 10th International Heterogeneous Computing Workshop(HCW). San Francisco, CA.
Cao, T. D., Felix, P., & Castane, R. (2010). WSOTF: An Automatic Testing Tool for Web Services Composition. Proceedings of the Fifth International Conference on Internet and Web Applications and Services, 7-12.
Buyya, R., Pandey, S., & Vecchiola, C. (2009). Cloudbus Toolkit for Market-Oriented Cloud Computing, In Proceedings of the 1st International Conference on Cloud Computing (CloudCom 2009, Springer, Germany). Beijing, China.
Cao, T. D., Felix, P., Castane, R., & Berrada, I. (2010). Online Testing Framework for Web Services, Proceeding of the Third International Conference on Software Testing, Verification and Validation, 363-372.
Buyya, R., Ranjan, R., & Calheiros, R. N. (2009). Modeling and Simulation of Scalable Cloud Computing Environments and the CloudSim Toolkit: Challenges and Opportunities. In Proceedings of the 7th High Performance Computing and Simulation Conference (HPCS 2009), ISBN: 978-1-4244-4907-1, IEEE Press, New York, USA, Leipzig, Germany. Campwood Software. (2008). SourceMonitor Version 2.5. Retrieved from http://www.campwoodsw.com/sourcemonitor.html Canfora, G., Di Penta, M., Esposito, R., & Villani, M. L. (2008). A framework for QoS-aware binding and re-binding of composite web services. Journal of Systems and Software, 81(10), 1754–1769. doi:10.1016/j.jss.2007.12.792 Canfora, G., & Di Penta, M. (2009). Service-oriented architectures testing: A survey. Software Engineering Journal, 78–105. doi:10.1007/978-3-540-95888-8_4 Canfora, G., Di Penta, M., Esposito, R., & Villani, M. L. (2005). An approach for QoS-aware service composition based on genetic algorithms. GECCO ‘05: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, 1069-1075.
Capkun, S., Buttyan, L., & Hubaux, J. P. (2003). Selforganized public-key management for mobile ad hoc networks. IEEE Transactions on Mobile Computing, 2(1), 52–64. doi:10.1109/TMC.2003.1195151 Cardellini, V., Casalicchio, E., Grassi, V., & Lo Presti, F. (2007). Flow-based service selection for Web service composition supporting multiple QoS classes. Proceedings of the IEEE International Conference on Web Services, 743-750. Cardoso, J., Sheth, A., Miller, J., Arnold, J., & Kochut, K. K. (2004). Quality of Service for workflows and web service processes. Journal of Web Semantics, 1(3), 743–750. doi:10.1016/j.websem.2004.03.001 Cardoso, J. (2002). Quality of Service and Semantic Composition of Workflows. Unpublished PHD thesis, University of Georgia. Carmichael, F. (2005). A guide to game theory. Harlow, England: Pearson Education Limited. Casati, F., Ilnicki, S., Jin, L.-J., Krishnamoorthy, V., & Shan, M.-C. (2000). eFlow: A Platform for Developing and Managing Composite e-Services. Paper presented at the Academia/Industry Working Conference on Research Challenges (AIWORC’00).
Cavalli, A., Montes De Oca, E., Mallouli, W., & Lallali, M. (2008). Two Complementary Tools for the Formal Testing of Distributed Systems with Time Constraints. Proceeding of 12th IEEE International Symposium on Distributed Simulation and Real Time Applications, Canada, Oct 27-29. Chankong, V., & Haimes, Y. Y. (1983). Multiobjective decision making: theory and method. Amsterdam, The Netherlands: North-Holland. Chappell, D. A., & Jewell, T. (2002). Java Web Services. Sebastopol, CA: O’Reilly & Associates, Inc. Chase, J. S., Anderson, D. C., Thakar, P. N., Vahdat, A. M., & Doyle, R. P. (2001). Managing Energy and Server Resources in Hosting Centers. The 18th ACM Symposium on Operating Systems Principles (SOSP’01), New York, NY. Chen, X., Chen, H., & Mohapatra, P. (2003). Aces: An Efficient Admission Control Scheme for QoS-aware Web Servers. Computer Communications, 26(14). doi:10.1016/S0140-3664(02)00259-1 Chen, J., Soundararajan, G., & Amza, C. (2006). Autonomic Provisioning of Backend Databases in Dynamic Content Web Servers. The 3rd IEEE International Conference on Autonomic Computing (ICAC 2006), Dublin, Ireland. Chen, Y., & Romanovsky, A. (2008, Jan/Feb). Improving the Dependability of Web Services Integration. IT Professional: Technology Solutions for the Enterprise, 20-26. Chen, Y., Romanovsky, A., Gorbenko, A., Kharchenko, V., Mamutov, S., & Tarasyuk, O. (2009). Benchmarking Dependability of a System Biology Application. Proceedings of the 14th IEEE Int. Conference on Engineering of Complex Computer Systems (ICECCS’2009), 146-153. Cheng, A., Esparza, J., & Palsberg, J. (1993). Complexity Results for 1-safe Nets. Foundations of Software Technology and Theoretical Computer Science, 761, 326–337. Cheng, S.-W., & Garlan, D. (2007). Handling uncertainty in autonomic systems. In Proceedings of the International Workshop on Living with Uncertainties (IWLU’07), co-located with the 22nd International Conference on Automated Software Engineering (ASE’07).
Chiu, D. K. W., Karlapalem, K., Li, Q., & Kafeza, E. (2002). Workflow view based e-contracts in a cross-organizational e-services environment. Distributed and Parallel Databases, 12(2–3), 193–216. doi:10.1023/A:1016503218569 Choi, Y., Kim, H., & Lee, D. (2007). Tag-aware text file fuzz testing for security of a software System. In Proc. of the International Conference on Convergence Information Technology, 2254-259. Chorzempa, M., Park, J. M., & Eltoweissy, M. (2007). Key management for long-lived sensor networks in hostile environments. Computer Communications, 30(9), 1964–1979. doi:10.1016/j.comcom.2007.02.022 Chothia, T., & Kleijn, J. (2007). Q-Automata: Modeling the Resource Usage of Concurrent Components. Electronic Notes in Theoretical Computer Science, 175, 153–167. doi:10.1016/j.entcs.2007.03.009 Chou, D. C., & Yurov, K. (2005). Security development in Web services environment. Computer Standards & Interfaces, 27(3), 233–240. doi:10.1016/S09205489(04)00099-6 Choudhury, R., Yang, X., Ramanathan, R., & Vaidya, N. H. (2006). On designing MAC protocols for wireless networks using directional antennas. IEEE Transactions on Mobile Computing, 5(5), 477–491. doi:10.1109/ TMC.2006.69 Chow, R., Golle, P., Jakobsson, M., Shi, E., Staddon, J., Masuoka, R., & Molina, J. (2009). Controlling data in the cloud: outsourcing computation without outsourcing control. In CCSW ’09: Proceedings of the 2009 ACM workshop on Cloud computing security, (pp. 85–90)., New York, NY, USA. ACM. Chu, X., Nadiminti, K., Jin, Ch., Venugopal, S., & Buyya, R. (2002). Aneka: Next-Generation Enterprise Grid Platform for e-Science and e-Business Applications. In Proceedings of the 3rd IEEE International Conference on e-Science and Grid Computing, (pp. 10-13). Bangalore, India. Chung, L., Nixon, B. A., Yu, E., & Mylopoulos, J. (2000). Non-functional requirements in software engineering. Springer. Chung, L., & Subramanian, N. (2003). Adaptive System/ Software Architecture. Journal of Systems Architecture.
Ciotti, F. (April 2007). WS-Guard - Enhancing UDDI Registries with on-line Testing Capabilities. Master’s thesis, Department of Computer Science, University of Pisa. Colombo, E., Francalanci, C., & Pernici, B. (2002) Modeling coordination and control in cross-organizational workflows. In Proceedings of DOA/CoopIS/ODBASE, 91–106 Comuzzi, M., Theilmann, W., Zacco, G., Rathfelder, C., Kotsokalis, C., & Winkler, U. (2009). A Framework for Multi-level SLA Management. The eighth International Conference on Service Oriented Computing (ICSOC). Coulouris, G., Dollimore, J., & Kindberg, T. (2005). Distributed systems: Concepts and design (4th ed.). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc. Cox, D. R. (1989). Quality and Reliability: Some Recent Developments and a Historical Perspective. The Journal of the Operational Research Society, 41(2), 95–101. Curbera, F., Duftler, M., Khalaf, R., Nagy, W., Mukhi, N., & Weerawarana, S. (2002). Unraveling the Web services web: an introduction to SOAP, WSDL, and UDDI. Internet Computing, IEEE, 6(2), 86–93..doi:10.1109/4236.991449 Curphey, M., Endler, D., Hau, W., Taylor, S., Smith, T., Russell, A., McKenna, G., et al. (2002). A guide to building secure Web applications. The Open Web Application Security Project, 1. Dan, A., Ludwig, H., & Kearney, R. (2004). CREMONA: an architecture and library for creation and monitoring of WS-Agreements. In Proceedings of the Second International Conference on Service-Oriented Computing, (pp. 65-74). NY, USA. Dan, A., Ludwig, H., & Pacifici, G. (2003). Web Services Differentiation with Service Level Agreement. Retrieved from http://www.ibm.com/developerworks/library/wsslafram/ Davis, M. D. (1983). Game theory: a nontechnical introduction. New York, NY: Basic Books Inc. De Angelis, F., De Angelis, G., & Polini, A. (2010). A counter-example testing approach for orchestrated services. In Proc. of the 3rd International Conference on Software Testing, Verification and Validation (ICST 2010), (pp. 373–382). Paris, France. IEEE Computer Society.
de Lemos, R. (2005). Architecting Web services applications for improving availability. In R. de Lemos. De Lucia, A., & Ferrucci, F. (Eds.). (2009). Software Engineering, International Summer Schools, ISSSE 2006-2008, Salerno, Italy, Revised Tutorial Lectures, volume 5413 of Lecture Notes in Computer Science. Springer. De Nicola, R., Ferrari, G., Montanari, U., Pugliese, R., & Tuosto, E. (2003). A Formal Basis for Reasoning on Programmable QoS. Lecture Notes in Computer Science, 2772, 436–479. De Nicola, R., Ferrari, G., Montanari, U., Pugliese, R., & Tuosto, E. (2005). A Process Calculus for QoS-Aware Applications. Lecture Notes in Computer Science, 3454, 33–48. doi:10.1007/11417019_3 De Souza e Silva, E., & Gail, H. R. (1992). Performability analysis of computer systems: from model specification to solution. Performance Evaluation, 14(3-4), 157–196. DeLine, R. (1999). A catalog of techniques for resolving packaging mismatch. In Proceedings of the 5th Symposium on Software Reusability (SSR’99) (pp. 44-53). Deswarte, Y., & Powell, D. (2006). Internet security: an intrusion-tolerance approach. Proceedings of the IEEE, 94(2), 432–441. doi:10.1109/JPROC.2005.862320 Dhillon, B. S. (2007). Applied reliability and quality: fundamentals, methods and applications. Berlin, Heidelberg: Springer-Verlag. Diao, Y., Gandhi, N., Hellerstein, J., Parekh, S., & Tilbury, D. (2002). Using MIMO Feedback Control to Enforce Policies for Interrelated Metrics with Application to the Apache Web Server. Network Operations and Management Symposium (NOMS). Diao, Y., Hellerstein, J. L., Parekh, S., Shaikh, H., & Surendra, M. (2006). Controlling Quality of Service in Multi-Tier Web Applications. The 26th International Conference on Distributed Computing Systems (ICDCS 2006), Lisbon, Portugal. Dierks, T., & Allen, C. (1999). The TLS protocol – Version 1.0. IETF RFC 2246.
Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1(1), 269–271. doi:10.1007/BF01386390
Eclipse Foundation. (2008). The AspectJ Project. Retrieved December 8, 2007, from http://www.eclipse.org/aspectj/
Dinesh, V. (2004). Supporting service level agreements on IP networks. Proceedings of the IEEE, 92(9), 1382–1388.
Einhorn, S. J., & Thiess, F. B. (1957). Intermittence as a stochastic process. NYU-RCA Working Conference on Theory of Reliability.
Djenouri, D., & Badache, N. (2007). Struggling against selfishness and blackhole attacks in MANETs. Wireless Communications and Mobile Computing.
Dobson, G., & Sanchez-Macian, A. (2006). Towards unified QoS/SLA ontologies. In Proceedings of the IEEE Services Computing Workshops (pp. 169–174).
Doliner, M. (2010). Cobertura. Retrieved October 22, 2010, from http://cobertura.sourceforge.net/
Dong, J., Paul, R., & Zhang, L.-J. (2008). High assurance service-oriented architecture. IEEE Computer, 41(8), 22–23.
Drucker, P. (1992). The new society of organizations. Harvard Business Review.
Dubey, V., & Menascé, D. A. (2010). Utility-based optimal service selection for business processes in service oriented architectures. In ICWS ’10: Proceedings of the 2010 IEEE International Conference on Web Services (pp. 542–550).
Duraes, J., Vieira, M., & Madeira, H. (2004). Dependability benchmarking of Web servers. In M. Heisel et al. (Eds.), Proceedings of the 23rd Int. Conf. on Computer Safety, Reliability and Security (SAFECOMP’04), LNCS 3219 (pp. 297–310). Springer-Verlag.
Duren, M. (2004). Organically assured and survivable information systems (OASIS). Technical report, Air Force Research Laboratory. USA: Wetstone Technologies.
Dustdar, S., & Schreiner, W. (2005). A survey on Web services composition. International Journal of Web and Grid Services, 1(1), 1–30. Geneva: Inderscience Publishers. doi:10.1504/IJWGS.2005.007545
Ebeling, C. E. (2005). An Introduction to Reliability and Maintainability Engineering. Waveland Press.
Ellison, R., Fisher, D., Linger, R., Lipson, H., Longstaff, T., & Mead, N. (1997). Survivable network systems: An emerging discipline. Technical report. Pittsburgh, PA: Software Engineering Institute, Carnegie Mellon University.
Elmagarmid, A. K. (1992). Database Transaction Models for Advanced Applications. Morgan Kaufmann.
Elnikety, S., Tracey, J., Nahum, E., & Zwaenepoel, W. (2004). A method for transparent admission control and request scheduling in e-commerce web sites. The 13th International Conference on World Wide Web (WWW 2004), New York, NY.
Epstein, B., & Sobel, M. (1953). Life testing. Journal of the American Statistical Association, 48(263), 486–502. doi:10.2307/2281004
Ericson, C. A., II. (1999). Fault tree analysis - a history. In Proc. of the 17th International Systems Safety Conference.
Erl, T., Karmarkar, A., Walmsley, P., Haas, H., Yalcinalp, U., & Liu, C. K. (2009). Web Service Contract Design & Versioning for SOA. Prentice Hall.
Erlang, A. K. (1909). The theory of probabilities and telephone conversations. First published in Nyt Tidsskrift for Matematik B, 20, 131–137.
Esary, J. D., & Proschan, F. (1970). A reliability bound for systems of maintained, interdependent components. Journal of the American Statistical Association, 65(329), 329–338. doi:10.2307/2283596
Eugster, P. T., Felber, P. A., Guerraoui, R., & Kermarrec, A.-M. (2003). The many faces of publish/subscribe. ACM Computing Surveys, 35(2), 114–131. doi:10.1145/857076.857078
Eviware. (n.d.). soapUI. Retrieved from http://www.soapui.org/
Exhedra Solutions, Inc. (2009). Planet Source Code. Retrieved from http://www.planet-source-code.com/
Fagan, M. E. (1976). Design and code inspections to reduce errors in program development. IBM Systems Journal, 15(3), 182–211. doi:10.1147/sj.153.0182
Fang, C.-L., Liang, D., Lin, F., & Lin, C.-C. (2007). Fault tolerant Web services. Journal of Systems Architecture, 53(1), 21–38. doi:10.1016/j.sysarc.2006.06.001
Fensel, D., Lausen, H., Polleres, A., de Bruijn, J., Stollberg, M., Roman, D., & Domingue, J. (2006). Enabling Semantic Web Services: The Web Service Modeling Ontology. Springer.
Ferguson, P., & Huston, G. (1998). Quality of Service: Delivering QoS on the Internet and in Corporate Networks. John Wiley & Sons.
Grid’5000. (n.d.). Retrieved May 20, 2010, from http://www.grid5000.fr/
Ferraiolo, D., Sandhu, R., Gavrila, S., Kuhn, D., & Chandramouli, R. (2001). Proposed NIST standard for role-based access control. ACM Transactions on Information and System Security, 4(3), 224–274. doi:10.1145/501978.501980
Figueiredo, C. M. S., & Loureiro, A. A. (2010). On the design of self-organizing ad hoc networks. In Designing Solutions-Based Ubiquitous and Pervasive Computing: New Issues and Trends (pp. 248–262). Hershey, PA: IGI Global. doi:10.4018/978-1-61520-843-2.ch013
Fishburn, P. C. (1970). Utility theory for decision making. New York, NY: Wiley.
Fonseca, J., Vieira, M., & Madeira, H. (2007). Testing and comparing Web vulnerability scanning tools for SQL injection and XSS attacks. In 13th Pacific Rim International Symposium on Dependable Computing (PRDC 2007) (pp. 365–372). Melbourne, Australia. doi:10.1109/PRDC.2007.55
Fortify Software. (2008). Fortify 360 Software Security Assurance. Retrieved June 8, 2010, from http://www.fortify.com/products/fortify-360/
Global Grid Forum. (2005). Web Services Agreement Specification (WS-Agreement) (Version 2005/09 ed.). OGF.
Foster, I., & Kesselman, C. (Eds.). (2003). The Grid 2: Blueprint for a New Computing Infrastructure. San Francisco, CA: Morgan Kaufmann.
Foster, H., Uchitel, S., Magee, J., & Kramer, J. (2003). Model-based verification of web service compositions. In Proceedings of ASE 2003 (pp. 152–163). IEEE Computer Society.
Foundstone, Inc. (2005). Foundstone WSDigger. Foundstone Free Tools. Retrieved September 29, 2008, from http://www.foundstone.com/us/resources/proddesc/wsdigger.htm
Fraga, J., & Powell, D. (1985). A fault- and intrusion-tolerant file system. In Proceedings of the Third International Conference on Computer Security (pp. 203–218).
Fisler, K., Krishnamurthi, S., Meyerovich, L. A., & Tschantz, M. C. (2005). Verification and change-impact analysis of access-control policies. In Proc. of ICSE 2005 (pp. 196–205).
Franch, X., & Botella, P. (1996). Putting non-functional requirements into software architecture. In Proceedings of the 9th International Workshop on Software Specification and Design (pp. 60-67). Los Alamitos, CA: IEEE Computer Society.
Fitzgerald, S., Foster, I., & Kesselman, C. (1997). A directory service for configuring high-performance distributed computations. In Proceedings of the 6th IEEE Symposium on High-Performance Distributed Computing (pp. 365–375).
Frank, E., Holmes, G., Mayo, M., Pfahringer, B., Smith, T., & Witten, I. (2010). Weka 3 - Data mining with open source machine learning software in Java. Retrieved July 4, 2008, from http://www.cs.waikato.ac.nz/ml/weka/
Fogie, S., Grossman, J., Hansen, R., Rager, A., & Petkov, P. D. (2007). XSS Attacks: Cross site scripting exploits and defense. Syngress Publishing.
Frantzen, L., Tretmans, J., & Willemse, T. (2005). Test generation based on symbolic specifications. In Grabowski, J., & Nielsen, B. (Eds.), FATES 2004, number 3395 in LNCS (pp. 1–15). Springer.
Frantzen, L., & Tretmans, J. (2006). Towards model-based testing of Web services. In International Workshop on Web Services - Modeling and Testing (WS-MaTe 2006), Palermo, Italy, June 9 (pp. 67–82).
Fu, X., Bultan, T., & Su, J. (2004). Analysis of interacting BPEL Web services. In Proceedings of the International Conference on World Wide Web (WWW 2004), New York, NY, USA, May 17–22.
Freedman, D. P., & Weinberg, G. M. (2000). Handbook of Walkthroughs, Inspections, and Technical Reviews: Evaluating Programs, Projects, and Products. Dorset House Publishing Co., Inc. Retrieved October 22, 2010, from http://portal.acm.org/citation.cfm?id=556043
Freier, A. O., Karlton, P., & Kocher, P. C. (1996). The SSL protocol version 3.0. Internet Draft.
Frey, N. (2000). A Guide to Successful SLA Development and Management. Stamford, CT: Gartner Group Research, Strategic Analysis Report.
Frolund, S., & Koistinen, J. (1998). A language for quality of service specification. HP Labs Technical Report. California, USA.
Frolund, S., & Koistinen, J. (1998). Quality-of-service specification in distributed object systems. Hewlett-Packard Laboratories, Technical Report 98-158.
Fu, X., Lu, X., Peltsverger, B., Chen, S., Qian, K., & Tao, L. (2007). A static analysis framework for detecting SQL injection vulnerabilities. In Proc. of COMPSAC (pp. 87–96).
Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1995). Design Patterns: Elements of Reusable Object-Oriented Software. Reading, MA, USA: Addison-Wesley.
Gao, T., Ma, H., Yen, I.-L., Bastani, F., & Tsai, W.-T. (2005). Toward QoS analysis of adaptive service-oriented architecture. In IEEE International Symposium on Service-Oriented System Engineering (SOSE) (pp. 219–226).
García-Fanjul, J., Tuya, J., & de la Riva, C. (2006). Generating test cases specifications for BPEL compositions of Web services using SPIN. In Proceedings of the International Workshop on Web Services Modeling and Testing (WS-MaTe 2006).
Garlan, D., & Schmerl, B. (2006). Architecture-driven modeling and analysis. In T. Cant (Ed.), Proceedings of the 11th Australian Workshop on Safety Related Programmable Systems.
Geihs, K., Barone, P., Eliassen, F., Floch, J., Fricke, R., & Gjorven, E. (2009). A comprehensive solution for application-level adaptation. Software, Practice & Experience, 39(4), 385–422. doi:10.1002/spe.900
Gereffi, G., & Sturgeon, T. J. (2004). Globalization, employment, and economic development. Sloan.
Ghosh, A. K., & Voas, J. M. (1999). Inoculating software for survivability. Communications of the ACM, 42(7), 38–44. doi:10.1145/306549.306563
Gibbens, R., Mason, R., & Steinberg, R. (2000). Internet service classes under competition. IEEE Journal on Selected Areas in Communications, 18(12), 2490–2498. doi:10.1109/49.898732
Gjorven, E., Rouvoy, R., & Eliassen, F. (2008). Cross-layer self-adaptation in service-oriented architectures. In MW4SOC ’08: Proceedings of the 3rd Workshop on Middleware for Service Oriented Computing (pp. 37–42).
Gnedenko, B. V., & Ushakov, I. A. (1995). Probabilistic Reliability Engineering (J. A. Falk, Ed.). Wiley-Interscience. doi:10.1002/9780470172421
Goad, R. (2008, December). Social Xmas: Facebook’s busiest day ever, YouTube overtakes Hotmail, social networks = 10% of UK Internet traffic [Web log comment]. Retrieved from http://weblogs.hitwise.com/robin-goad/2008/12/facebook_youtube_christmas_social_networking.html
Gönczy, L., Chiaradonna, S., Di Giandomenico, F., Pataricza, A., Bondavalli, A., & Bartha, T. (2006). Dependability evaluation of web service-based processes. In Formal Methods and Stochastic Models for Performance Evaluation, LNCS (pp. 166–180).
Gong, Y. L., Dong, F. P., Li, W., & Xu, Zh. W. (2003). VEGA Infrastructure for Resource Discovery in Grids. Journal of Computer Science and Technology, 18(4), 413–422. doi:10.1007/BF02948915
Griffith, R. E., & Stewart, A. (1961). A nonlinear programming technique for the optimization of continuous processing systems. Management Science, 7, 379–392. doi:10.1287/mnsc.7.4.379
Google Mashup Editor. (n.d.). Retrieved from http://editor.googlemashups.com
Hoyer, V., & Stanoevska-Slabeva, K. (2009). Towards a reference model for grassroots enterprise mashup environments. In The 17th European Conference on Information Systems (ECIS 2009).
IBM Mashup Center. (n.d.). Retrieved from http://www-10.lotus.com/ldd/mashupswiki.nsf
Gross, D., & Harris, G. (1985). Fundamentals of Queuing Theory. New York: John Wiley.
Gorbenko, A., Kharchenko, V., & Romanovsky, A. (2007). On composing dependable Web services using undependable web components. Int. J. Simulation and Process Modelling, 3(1/2), 45–54. doi:10.1504/IJSPM.2007.014714
Gorbenko, A., Kharchenko, V., & Romanovsky, A. (2009). Using inherent service redundancy and diversity to ensure Web services dependability. In M. J. Butler, C. B. Jones, A. Romanovsky, & E. Troubitsyna (Eds.), Methods, Models and Tools for Fault Tolerance, LNCS 5454 (pp. 324–341). Springer-Verlag. doi:10.1007/978-3-642-00867-2_15
Gorbenko, A., Kharchenko, V., Tarasyuk, O., Chen, Y., & Romanovsky, A. (2008). The threat of uncertainty in service-oriented architecture. In Proceedings of the RISE/EFTS Joint International Workshop on Software Engineering for Resilient Systems (SERENE 2008) (pp. 49–50). ACM.
Gorbenko, A., Mikhaylichenko, A., Kharchenko, V., & Romanovsky, A. (2007). Experimenting with exception handling mechanisms of Web services implemented using different development kits. Technical report CS-TR-1010, Newcastle University. Retrieved from http://www.cs.ncl.ac.uk/research/pubs/trs/papers/1010.pdf
Goyeva-Popstojanova, K., Mathur, A. P., & Trivedi, K. S. (2001). Comparison of architecture-based software reliability models. Paper presented at the 12th International Symposium on Software Reliability Engineering.
Gray, J. (1992). Benchmark Handbook: For Database and Transaction Processing Systems. San Francisco, CA: Morgan Kaufmann.
Gray, J. (1985). Why do computers stop and what can be done about it? Tandem TR 85.7.
Guinea, S. (2005). Self-healing web service compositions. Paper presented at ICSE 2005.
Guitart, J., Torres, J., & Ayguadé, E. (2010). A survey on performance management for Internet applications. Concurrency and Computation: Practice and Experience, 22(1). doi:10.1002/cpe.1470
Guo, H., Huai, J., Li, H., Deng, T., Li, Y., & Du, Z. (2007). Optimal configuration for high available service composition. In IEEE International Conference on Web Services (ICWS ’07) (pp. 280–287).
Guo, H., Huai, J., Li, H., Deng, T., Li, Y., & Du, Z. (2007). ANGEL: Optimal configuration for high available service composition. Paper presented at the 2007 IEEE International Conference on Web Services (ICWS).
Guo, H., Huai, J., Li, Y., & Deng, T. (2008). KAF: Kalman filter based adaptive maintenance for dependability of composite service. Paper presented at the International Conference on Advanced Information Systems Engineering (CAiSE).
Hafner, M., & Breu, R. (2008). Security Engineering for Service-Oriented Architectures. Springer.
Halang, W. A., Gumzej, R., Colnaric, M., & Druzovec, M. (2000). Measuring the performance of real-time systems. The International Journal of Time-Critical Computing Systems, 18(1), 59–68. doi:10.1023/A:1008102611034
Halfond, W., & Orso, A. (2005). AMNESIA: Analysis and monitoring for neutralizing SQL injection attacks. In Proc. of the 20th IEEE/ACM International Conference on Automated Software Engineering (pp. 174–183).
Hamadi, R., & Benatallah, B. (2003). A Petri net-based model for Web service composition. Paper presented at the Fourteenth Australasian Database Conference (ADC 2003).
Hanna, S., & Munro, M. (2008). Fault-based Web services testing. In Proc. of the Fifth International Conference on Information Technology: New Generations (pp. 471–476).
Harel, D., & Thiagarajan, P. (2004). Message sequence charts. In UML for Real (pp. 77–105). Springer.
Harney, J., & Doshi, P. (2006). Adaptive web processes using value of changed information. Paper presented at the International Conference on Service Oriented Computing (ICSOC).
Harney, J., & Doshi, P. (2007). Speeding up adaptation of web service compositions using expiration times. Paper presented at the International World Wide Web Conference (WWW).
Hasselmeyer, P., Qu, C., Schubert, L., Koller, B., & Wieder, P. (2006). Towards autonomous brokered SLA negotiation. In Exploiting the Knowledge Economy: Issues, Applications, Case Studies. Amsterdam: IOS Press.
Haykin, S. (2002). Adaptive Filter Theory (4th ed.). Pearson Education.
Heckel, R., & Mariani, L. (2005, April 2–10). Automatic conformance testing of web services. Fundamental Approaches to Software Engineering, LNCS, 3442, 34–48.
Heiss, H.-U., & Wagner, R. (1991, September). Adaptive load control in transaction processing systems. The 17th International Conference on Very Large Data Bases (VLDB 1991), Barcelona, Spain.
Loosley, C., Douglas, F., & Mimo, A. (1997, November). High-Performance Client/Server. John Wiley & Sons.
Hewitt, E. (2009). Java SOA Cookbook. O’Reilly.
Hiles, A. (1999/2000). The Complete IT Guide to Service Level Agreements: Matching Service Quality to Business Needs. Oxford, UK: Elsevier Advanced Technology.
Hollunder, B., Hüller, M., & Schäfer, A. (2011). A methodology for constructing WS-Policy assertions. In Proceedings of the International Conference on Engineering and Meta-Engineering. International Institute of Informatics and Systemics.
Holmer, D., Nita-Rotaru, C., & Hubens, H. (2008). ODSBR: An on-demand secure byzantine resilient routing protocol for wireless ad hoc networks. ACM Transactions on Information and System Security, 10(4), 1–35. ACM Press.
Howard, M., LeBlanc, D., & Viega, J. (2005). 19 Deadly Sins of Software Security: Programming Flaws and How to Fix Them. New York, NY, USA: McGraw-Hill, Inc.
HP. (2008). HP WebInspect. Retrieved October 22, 2010, from https://h10078.www1.hp.com/cda/hpms/display/main/hpms_content.jsp?zn=bto&cp=1-11-201-200%5E9570_4000_100__
Hu, Y., Johnson, D., & Perrig, A. (2003). SEAD: Secure efficient distance vector routing for mobile wireless ad hoc networks. Ad Hoc Networks, 1, 175–192. doi:10.1016/S1570-8705(03)00019-2
Hu, Y., Johnson, D., & Perrig, A. (2005). Ariadne: A secure on-demand routing protocol for ad hoc networks. Wireless Networks, 11(1-2), 21–38. doi:10.1007/s11276-004-4744-y
Hu, V. C., Martin, E., Hwang, J., & Xie, T. (2007). Conformance checking of access control policies specified in XACML. In Proc. of the 31st Annual International Conference on Computer Software and Applications (pp. 275–280).
Holgersson, J., & Söderstrom, E. (2005). Web service security - vulnerabilities and threats within the context of WS-Security (pp. 138–146).
Huang, Y., & Lee, W. (2003). A cooperative intrusion detection system for ad hoc networks. In Proceedings of the First ACM Workshop on Security of Ad Hoc and Sensor Networks (pp. 135–147). ACM Press.
Hollunder, B. (2009a). WS-Policy: On Conditional and Custom Assertions. In Proceedings of the IEEE International Conference on Web Services. IEEE Computer Society.
Huang, Y., Yu, F., Hang, C., Tsai, C., Lee, D., & Kuo, S. (2004). Securing web application code by static analysis and runtime protection. In Proc. of 13th International Conference on World Wide Web, 40-52.
Hollunder, B. (2009b). Domain-Specific Processing of Policies or: WS-Policy Intersection Revisited. In Proceedings of the IEEE International Conference on Web Services. IEEE Computer Society.
Hudert, S., Wirtz, G., & Eymann, T. (2009). BabelNeg - A protocol description language for automated SLA negotiations. In Proceedings of the IEEE Conference on Commerce and Enterprise Computing (pp. 162–169). Shanghai, China.
Iamnitchi, A., & Foster, I. (2001). On fully decentralized resource discovery in grid environments. In Proceedings of the 2nd International Workshop on Grid Computing, (pp. 51-62). Denver, Colorado.
Jiang, Y., Hou, S. S., Shan, J. H., Zhang, L., & Xie, B. (2005). Contract-based mutation for testing components. In IEEE International Conference on Software Maintenance.
IBM. (2008). IBM Rational AppScan. Retrieved September 29, 2008, from http://www-01.ibm.com/software/awdtools/appscan/
Jin, L. J., & Machiraju, V. A. (2002). Analysis on Service Level Agreement of Web Services. Technical Report HPL-2002-180, Software Technology Laboratories, HP Laboratories.
Imamura, T., Dillaway, B., & Simon, E. (2002). W3C Recommendation: XML Encryption Syntax and Processing. Retrieved from http://www.w3.org/TR/xmlenc-core/
Institute for Ageing and Health. (2009). BASIS: Biology of Ageing e-Science Integration and Simulation System. Newcastle upon Tyne, UK: Newcastle University. Retrieved June 1, 2010, from http://www.basis.ncl.ac.uk/
International Data Corporation. (2007). Mission Critical North American Application Platform Study. IDC White Paper. Retrieved from www.idc.com
ISO/IEC. (2005). Software engineering - Software product Quality Requirements and Evaluation (SQuaRE) - Guide to SQuaRE. ISO/IEC 25000.
Ito, A., & Yano, H. (1995). The emergence of cooperation in a society of autonomous agents: The prisoner’s dilemma game under disclosure of contract histories. In Proceedings of the First International Conference on Multi-Agent Systems (ICMAS-95) (pp. 201–208).
Iyengar, A. K., Squillante, M. S., & Zhang, L. (1999). Analysis and characterization of large-scale web server access patterns and performance. IEEE World Wide Web Conference (pp. 85–100).
Jaeger, M., Muhl, G., & Golze, S. (2005). QoS-aware composition of web services: A look at selection algorithms. In Proceedings of the IEEE International Conference on Web Services.
Jensen, M., Gruschka, N., Herkenhoner, R., & Luttenberger, N. (2007). SOA and Web services: New technologies, new standards - new attacks. In ECOWS ’07: Proceedings of the Fifth European Conference on Web Services (pp. 35–44). Washington, DC, USA: IEEE Computer Society.
JetBrains. (2009). IntelliJ IDEA. Retrieved October 22, 2010, from http://www.jetbrains.com/idea/free_java_ide.html
John, K., Dennis, C., Alexander, H., Antonio, W., Jonathan, C., Premkumar, H., & Michael, D. (2002). The Willow architecture: Comprehensive survivability for large-scale distributed applications. In Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks. IEEE Computer Society.
Johnsen, E. (1968). Studies in multiobjective decision models. Lund, Sweden: Studentlitteratur.
Joita, L., Rana, O. F., Chacín, P., Chao, I., & Ardaiz, O. (2005). Application deployment using catallactic grid middleware. In Proceedings of the 3rd International Workshop on Middleware for Grid Computing (pp. 1–6). Grenoble, France.
Jordan, D., & Evdemon, J. (2007). Web services business process execution language version 2.0. Technical report. The OASIS Consortium.
Jovanovic, N., Kruegel, C., & Kirda, E. (2006). Pixy: A static analysis tool for detecting Web application vulnerabilities (short paper). In IEEE Symposium on Security and Privacy (pp. 258–263). Berkeley/Oakland, CA: IEEE Computer Society.
Juric, M. B. (2006). Business Process Execution Language for Web Services: BPEL and BPEL4WS (2nd ed.). Packt Publishing. Retrieved from http://portal.acm.org/citation.cfm?id=1199048
Jürjens, J. (2004). Secure Systems Development with UML. Springer.
Just, M., Kranakis, E., & Wan, T. (2003). Resisting malicious packet dropping in wireless ad hoc networks. In Proceedings of the International Conference on Ad-Hoc Networks and Wireless (ADHOC-NOW) (pp. 151–163).
Kaâniche, M., Kanoun, K., & Rabah, M. (2003). Multi-level modeling approach for the availability assessment of e-business applications. Software, Practice & Experience, 33, 1323–1341. doi:10.1002/spe.550
Kaâniche, M., Kanoun, K., & Martinello, M. (2003). User-perceived availability of a web based travel agency. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2003), San Francisco, CA, USA (pp. 709–718).
Kaâniche, M., Kanoun, K., & Rabah, M. (2001). A framework for modeling the availability of e-business systems. International Conference on Computer Communications and Networks (ICCCN) (pp. 40–45).
Kahneman, D., & Tversky, A. (1982). The psychology of preferences. Scientific American, (January), 160–173. doi:10.1038/scientificamerican0182-160
Kähmer, M., Gilliot, M., & Müller, G. (2008). Automating privacy compliance with ExPDT. In Proceedings of the IEEE Conference on E-Commerce Technology. Springer.
Kai-Yuan, C. (1995). Base of Software Reliability Engineering. Beijing: Tsinghua University Press.
Kalam, A. A. E., Baida, R., Balbiani, P., Benferhat, S., Cuppens, F., Deswarte, Y., Miège, A., Saurel, C., & Trouessin, G. (2003). Organization based access control. In Proc. of the 4th IEEE International Workshop on Policies for Distributed Systems and Networks (pp. 120–131).
Kalyanakrishnan, M., Iyer, R. K., & Patel, J. U. (1999). Reliability of Internet hosts: A case study from the end user’s perspective. Computer Networks, 31, 47–57. doi:10.1016/S0169-7552(98)00229-3
Karaenke, P., & Kirn, S. (2010). Towards model checking & simulation of a multi-tier negotiation protocol for service chains. In Proceedings of the 9th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2010), Toronto, Canada, May 10–14, 2010.
Kargl, F., Klenk, A., Schlott, S., & Weber, M. (2004). Advanced detection of selfish or malicious nodes in ad hoc networks. In Proceedings of the First European Workshop on Security in Ad Hoc and Sensor Networks (ESAS) (pp. 152–165). Springer.
Keeney, R. L., & Raiffa, H. (1976). Decisions with multiple objectives: Preferences and value tradeoffs. New York, NY: Wiley.
Keidl, M., & Kemper, A. (2004). Towards context-aware adaptable Web services. In Proceedings of the 13th International World Wide Web Conference, Alternate Track Papers, New York, NY, USA.
Keller, A., & Ludwig, H. (2003). The WSLA framework: Specifying and monitoring service level agreements for web services. Journal of Network and Systems Management, Special Issue on E-Business Management, 11(1), 57–81.
Keller, A., Kar, G., Ludwig, H., Dan, A., & Hellerstein, J. L. (2002). Managing dynamic services: A contract based approach to a conceptual architecture. In Proceedings of the 8th IEEE/IFIP Network Operations and Management Symposium (pp. 513–528). Florence, Italy, April 15–19, 2002.
Kent, S., & Atkinson, R. (1998). RFC 2402: IP authentication header. Internet Engineering Task Force (IETF). Retrieved October 22, 2010, from http://www.ietf.org/rfc/rfc2402.txt
Kent, S., & Seo, K. (2005). Security architecture for the Internet protocol. Network Working Group, Request for Comments 4301, December 2005. Retrieved October 22, 2010, from http://tools.ietf.org/html/rfc4301
Kephart, J. O., & Chess, D. (2003). The vision of autonomic computing. IEEE Computer, 36(1), 41–50.
Kephart, J. O., & Das, R. (2007). Achieving self-management via utility functions. IEEE Internet Computing, 11(1), 40–48. doi:10.1109/MIC.2007.2
Keromytis, A. D., Parekh, J., Gross, P. N., Kaiser, G., Misra, V., Nieh, J., et al. (2003). A holistic approach to service survivability. In Proceedings of the ACM Workshop on Survivable and Self-Regenerative Systems (pp. 11–22). ACM Press.
Khalil, I., Bagchi, S., & Nita-Rotaru, C. (2005). DICAS: Detection, diagnosis and isolation of control attacks in sensor networks. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SECURECOMM) (pp. 89–100). IEEE Computer Society.
Kiczales, G., Lamping, J., Mendhekar, A., Maeda, C., Lopes, C., Loingtier, J. M., & Irwin, J. (1997). Aspect-oriented programming. In 11th European Conference on Object-Oriented Programming. Jyväskylä, Finland.
Kim, D., Machida, F., & Trivedi, K. S. (2009). Availability modeling and analysis of a virtualized system. In Proc. 15th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) (pp. 365–371). IEEE Computer Society.
Kim, H., Choi, Y., Lee, D., & Lee, D. (2008). Practical security testing using file fuzzing. In Proc. of the 10th International Conference on Advanced Communication Technology, 2, 1304–1307.
Kim, J. S., & Garlan, D. (2006). Analyzing architectural styles with Alloy. In Proceedings of the Workshop on the Role of Software Architecture for Testing and Analysis.
Kirkwood, T. B. L., Boys, R. J., Gillespie, C. J., Proctor, C. J., Shanley, D. P., & Wilkinson, D. J. (2003). Towards an e-biology of ageing: Integrating theory and data. Nature Reviews Molecular Cell Biology, 4, 243–249. doi:10.1038/nrm1051
Knap, T., & Mlýnková, I. (2009). Towards more secure Web services: Pitfalls of various approaches to XML Signature verification process. In ICWS ’09: Proceedings of the 2009 IEEE International Conference on Web Services (pp. 543–550). Washington, DC, USA: IEEE Computer Society.
Knudsen, L. (2005). SMASH - A cryptographic hash function. In Fast Software Encryption: 12th International Workshop, FSE 2005, volume 3557 of Lecture Notes in Computer Science (pp. 228–242). Springer.
Kolmogoroff, A. (1931). Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung [in German]. Mathematische Annalen, 104, 415–458. Springer-Verlag. doi:10.1007/BF01457949
Kontio, J. (1996). A case study in applying a systematic method for COTS selection. In Proceedings of the 18th International Conference on Software Engineering (ICSE-96) (pp. 201–209). Los Alamitos, CA: IEEE Computer Society Press.
Kootbally, Z., Madhavan, R., & Schlenoff, C. (2006). Prediction in dynamic environments via identification of critical time points. In Military Communications Conference, MILCOM 2006 (pp. 1–7). doi:10.1109/MILCOM.2006.302047
Kotz, S., & Nadarajah, S. (2000). Extreme Value Distributions: Theory and Applications. Imperial College Press. doi:10.1142/9781860944024
Krawczyk, H., Bellare, M., & Canetti, R. (1997). HMAC: Keyed-hashing for message authentication. Request for Comments (RFC 2104). Internet Engineering Task Force.
Kreidl, O. P., & Frazier, T. M. (2004). Feedback control applied to survivability: A host-based autonomic defense system. IEEE Transactions on Reliability, 53(1), 148–166. doi:10.1109/TR.2004.824833
Kritikos, K., & Plexousakis, D. (2008). QoS-based Web service description and discovery. Retrieved from http://ercim-news.ercim.eu/qos-based-web-service-description-and-discovery
Kuo, W., & Zuo, M. J. (2003). Optimal Reliability Modeling: Principles and Applications. Hoboken, NJ: John Wiley & Sons.
Kuo, D., Parkin, M., & Brooke, J. (2006). A framework & negotiation protocol for service contract. In Proceedings of the 2006 IEEE International Conference on Services Computing (SCC 2006) (pp. 253–256). Chicago, USA.
Kritikos, K., & Plexousakis, D. (2009). Mixed-integer programming for QoS-based Web service matchmaking. IEEE Transactions on Services Computing, 2(2), 122–139. doi:10.1109/TSC.2009.10
Lai, J., Wu, J., Chen, S., Wu, C., & Yang, C. (2008). Designing a taxonomy of Web attacks. In Proceedings of the 2008 International Conference on Convergence and Hybrid Information Technology (ICHIT) (pp. 278–282). Washington, DC: IEEE Computer Society.
Lallali, M., Zaidi, F., & Cavalli, A. (2008). Transforming BPEL into intermediate format language for Web services composition testing. In Proceedings of the 4th IEEE International Conference on Next Generation Web Services Practices, October.
Lamport, L., Shostak, R., & Pease, M. (1982). The Byzantine Generals Problem. ACM Trans. Programming Languages and Systems, 4(3), 382–401. doi:10.1145/357172.357176
Le Traon, Y., Mouelhi, T., & Baudry, B. (2007). Testing security policies: Going beyond functional testing. In Proc. of the 18th IEEE International Symposium on Software Reliability Engineering (pp. 93–102).
Laprie, J. C. (1992). Dependability: Basic Concepts and Terminology. Springer-Verlag.
Lecue, F., & Mehandjiev, N. (2009). Towards scalability of quality driven semantic Web service composition. In ICWS ’09: Proceedings of the 2009 IEEE International Conference on Web Services (pp. 469–476).
Laprie, J. C., Randell, B., Avizienis, A., & Landwehr, C. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33. doi:10.1109/TDSC.2004.2
Laprie, J. C. (1985). Dependable computing and fault tolerance: Concepts and terminology. In Proc. 15th IEEE Int. Symp. on Fault-Tolerant Computing (pp. 2–11).
Laprie, J. C. (1995). Dependable computing: Concepts, limits and challenges. 25th IEEE International Symposium on Fault-Tolerant Computing (FTCS-25) - Special Issue, Pasadena, CA, USA (pp. 42–54).
Laranjeiro, N., & Vieira, M. (2008). Deploying fault tolerant Web service compositions. International Journal of Computer Systems Science and Engineering (CSSE): Special Issue on Engineering Fault Tolerant Systems, 23(5).
Laranjeiro, N., & Vieira, M. (2009, March). wsTFDP: An extensible framework for timing failures prediction in Web services. Retrieved February 14, 2008, from http://eden.dei.uc.pt/~cnl/papers/2010-pdsc-wsTFDP.zip
Laranjeiro, N., Vieira, M., & Madeira, H. (2007). Assessing robustness of Web-services infrastructures. In Proceedings of the International Conference on Dependable Systems and Networks (DSN’07) (pp. 131–136).
Laranjeiro, N., Vieira, M., & Madeira, H. (2009). Protecting database centric Web services against SQL/XPath injection attacks. In Database and Expert Systems Applications (pp. 271–278).
Lasdon, L. S., Warren, A. D., Jain, A., & Ratner, M. (1978). Design and testing of a GRG code for nonlinear optimization. ACM Transactions on Mathematical Software, 4, 34–50. doi:10.1145/355769.355773
Lee, D., Netravali, A., Sabnani, K., Sugla, B., & John, A. (1997). Passive testing and applications to network management. In Proc. of the International Conference on Network Protocols (pp. 113–122).
Lee, Y. C., Wang, C., Zomaya, A. Y., & Zhou, B. B. (2010). Profit-driven service request scheduling in clouds. In Proceedings of the International Symposium on Cluster Computing and the Grid (CCGRID). Melbourne, Australia.
Leemis, L. M. (2009). Reliability: Probability Models and Statistical Methods. Lawrence Leemis.
Li, N., Hwang, J., & Xie, T. (2008). Multiple-implementation testing for XACML implementations. In Proc. of the Workshop on Testing, Analysis and Verification of Web Software (pp. 27–33).
Li, P., Chen, Y., & Romanovsky, A. (2006). Measuring the dependability of Web services for use in e-science experiments. In D. Penkler, M. Reitenspiess, & F. Tam (Eds.), International Service Availability Symposium (ISAS 2006), LNCS 4328 (pp. 193–205). Springer-Verlag.
Li, Z. J., Zhu, J., Zhang, L. J., & Mitsumori, N. (2009, May). Towards a practical and effective method for Web services test case generation. In Proceedings of the ICSE Workshop on Automation of Software Test (AST) (pp. 106–114).
Lindstrom, P. (2004). Attacking and defending Web services. A Spire research report.
Linger, R. C., Mead, N. R., & Lipson, H. F. (1998). Requirements definition for survivable network systems. In Proceedings of the Third International Conference on Requirements Engineering (pp. 1–14). IEEE Computer Society.
Linthicum, D. (2008). Design & validate SOA in a heterogeneous environment. ZapThink White Paper, WP-0171.
Liu, A. X., Chen, F., Hwang, J., & Xie, T. (2008). XEngine: A fast and scalable XACML policy evaluation engine. In Proc. of the International Conference on Measurement and Modeling of Computer Systems (pp. 265–276).
Liu, Y., Ngu, A. H., & Zeng, L. Z. (2004). QoS computation and policing in dynamic web service selection. Paper presented at the International World Wide Web Conference (WWW).
Livshits, V. B., & Lam, M. S. (2005). Finding security vulnerabilities in Java applications with static analysis. In Proceedings of the 14th USENIX Security Symposium (p. 18).
Long, D. D. E., Carroll, J. L., & Park, C. J. (1990). A study of the reliability of Internet sites. University of California at Santa Cruz. Retrieved from http://portal.acm.org/citation.cfm?id=902727
Looker, N., Munro, M., & Xu, J. (2004). Simulating errors in Web services. International Journal of Simulation Systems, Science & Technology, 5(5).
Looker, N., Munro, M., & Xu, J. (2005). Increasing Web service dependability through consensus voting. In 29th Annual International Computer Software and Applications Conference (COMPSAC) (Vol. 2, pp. 66–69). doi:10.1109/COMPSAC.2005.88
Lorenzo, G. D., Hacid, H., Paik, H., & Benatallah, B. (2009). Data integration in mashups. SIGMOD Record, 38(1), 59–66. doi:10.1145/1558334.1558343
Lou, W., Liu, W., & Fang, Y. (2004). SPREAD: Enhancing data confidentiality in mobile ad hoc networks. In Proceedings of IEEE INFOCOM (pp. 2404–2411). IEEE Computer Society.
Lowry, M. R. (1988). Component-based reconfigurable systems. Computer, 31, 44–46.
Löwy, J. (2008). Programming WCF Services. O’Reilly.
Loyall, J. P., Schantz, R. E., Zinky, J. A., & Bakken, D. E. (1998). Specifying and measuring quality of service in distributed object systems. In Proceedings of the 1st International Symposium on Object-Oriented Real-Time Distributed Computing (pp. 43–54). Kyoto, Japan.
Luce, R. D., & Raiffa, H. (1957). Games and decisions. New York, NY: Wiley.
Luckham, D. C. (2001). The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems. Boston, MA, USA: Addison-Wesley.
Ludwig, H. (2009). WS-Agreement concepts and use: Agreement-based, service-oriented architecture. In Service-Oriented Computing (pp. 199–228). The MIT Press.
Ludwig, H., Keller, A., Dan, A., & King, R. (2003). A service agreement language for dynamic electronic services. Electronic Commerce Research, 3, 43–59. doi:10.1023/A:1021525310424
Ludwig, A., & Franczyk, B. (2006). SLA lifecycle management in services grid - requirements and current efforts analysis. In Proceedings of the 4th International Conference on Grid Services Engineering and Management (GSEM) (pp. 219–246). Leipzig, Germany.
Ludwig, H. (2003). Web services QoS: External SLAs and internal policies. Or: How do we deliver what we promise? In Proceedings of the International Conference on Web Information Systems Engineering. IEEE Computer Society.
Ludwig, H., Keller, A., Dan, A., King, P., & Franck, R. (2003). Web Service Level Agreement (WSLA) language specification. Retrieved May 17, 2010, from http://www.research.ibm.com/wsla/
Lyndsay, J. (2003). A positive view of negative testing. Retrieved from http://www.workroom-productions.com/papers.html
Lyu, M. R. (Ed.). (1996). Handbook of software reliability engineering. Hightstown, NJ, USA: McGraw-Hill, Inc.
Ma, C., Du, C., Zhang, T., Hu, F., & Cai, X. (2008). WSDL-based automated test data generation for Web service. In 2008 International Conference on Computer Science and Software Engineering (pp. 731–737). Wuhan, China.
Maamar, Z., Sheng, Q., & Benslimane, D. (2008). Sustaining Web services high-availability using communities. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (pp. 834–841).
Mahbub, K., & Spanoudakis, G. (2007). Monitoring WS-Agreements: An event calculus-based approach. In Test and Analysis of Web Services.
Martin, E., & Xie, T. (2007b). Automated test generation for access control policies via change-impact analysis. In Proc. of SESS 2007, 5-11.
Mallouli, W., Bessayah, F., Cavalli, A., & Benameur, A. (2008). Security Rules Specification and Analysis Based on Passive Testing. In Proc. of IEEE Global Telecommunications Conference, 1-6.
Martin, E., Xie, T., & Yu, T. (2006). Defining and measuring policy coverage in testing access control policies. In Proc. of ICICS, 139-158.
Malrait, L., Bouchenak, S., & Marchand, N. (2010). Experience with ConSer: A system for server control through fluid modeling. IEEE Transactions on Computers.
Martinello, M. (2005). Availability modeling and evaluation of web-based services: A pragmatic approach. PhD thesis, Institut National Polytechnique de Toulouse, France.
Malrait, L., Bouchenak, S., & Marchand, N. (2009). Fluid Modeling and Control for Server System Performance and Availability. The 39th Annual IEEE/IFIP Conference on Dependable Systems and Networks (DSN 2009).
Mashood, M., & Wikramanayake, G. (2007). Architecting secure Web services through policies. In International Conference on Industrial and Information Systems (ICIIS 2007) (pp. 5–10).
Marconi, A., Pistore, M., & Traverso, P. (2006). Specifying data-flow requirements for the automated composition of web services. In Proceedings of the Fourth IEEE International Conference on Software Engineering and Formal Methods (SEFM 2006), 11–15 September 2006, Pune, India (pp. 147–156).
Maughan, D., Schertler, M., Schneider, M., & Turner, J. (1998). Internet security association and key management protocol (ISAKMP). Request for Comments (RFC2408). Internet Engineering Task Force.
Marcus, E., & Stern, H. (2003). Blueprints for High Availability. New York: Wiley.
Marilly, E., Martinot, O., Papini, H., & Goderis, D. (2002). Service level agreements: A main challenge for next generation networks. In Proceedings of the 2nd European Conference on Universal Multiservice Networks (pp. 297–304). Toulouse, France.
Marsan, M. A., Balbo, G., Conte, G., Donatelli, S., & Franceschinis, G. (1995). Modelling with Generalized Stochastic Petri Nets. John Wiley and Sons.
Marti, S., Giuli, T. J., Lai, K., & Baker, M. (2000). Mitigating routing misbehavior in mobile ad hoc networks. In Proceedings of the ACM Annual International Conference on Mobile Computing and Networking (MobiCom) (pp. 255–265). ACM Press.
Martin, D., Burstein, M., Hobbs, J., Lassila, O., McDermott, D., McIlraith, S., et al. (2004). OWL-S: Semantic markup for web services. Retrieved from http://www.daml.org/services/owl-s/1.1/overview/
Martin, E., & Xie, T. (2007a). A fault model and mutation testing of access control policies. In Proc. of the 16th International Conference on World Wide Web (pp. 667–676).
May, N. (2009). A redundancy protocol for service-oriented architectures. In Service-Oriented Computing - ICSOC 2008 Workshops (Vol. 5472, pp. 211–220). Springer-Verlag. doi:10.1007/978-3-642-01247-1_22
McGough, A. S., Akram, A., Colling, D., Guo, L., Kotsokalis, C., & Krznaric, M. (2009). Enabling scientists through workflow and quality of service. In Grid Enabled Remote Instrumentation (pp. 345–359). Springer.
Medhi, D., & Tipper, D. (2000). Multi-layered network survivability - models, analysis, architecture, framework and implementation: An overview. In Proceedings of the DARPA Information Survivability Conference and Exposition (DISCEX) (pp. 173–186).
Menascé, D. A., & Almeida, V. A. F. (2000). Scaling for e-business: Technologies, models, performance, and capacity planning. Prentice Hall PTR.
Menascé, D. A., & Almeida, V. A. F. (2001). Capacity Planning for Web Services: Metrics, Models, and Methods. Upper Saddle River, NJ: Prentice Hall.
Menascé, D. A., & Dubey, V. (2007). Utility-based QoS brokering in service oriented architectures. In Proceedings of the IEEE 2007 International Conference on Web Services (ICWS 2007) (pp. 422–430). Los Alamitos, CA: IEEE Computer Society Press.
Menascé, D. A., Barbara, D., & Dodge, R. (2001). Preserving QoS of E-Commerce Sites Through Self-Tuning: A Performance Model Approach. The ACM Conference on Electronic Commerce (EC’01), Tampa, FL, Oct. 2001.
Michiardi, P., & Molva, R. (2002). CORE: A collaborative reputation mechanism to enforce node cooperation in mobile ad hoc networks. In Proceedings of the IFIP TC6/TC11 Conference on Communications and Multimedia Security (pp. 107–121). Kluwer.
Menascé, D. A., Casalicchio, E., & Dubey, V. (2008). A heuristic approach to optimal service selection in service oriented architectures. In WOSP ’08: Proceedings of the 7th International Workshop on Software and Performance (pp. 13–24).
Microsoft. (2002). Security in a Web services world: A proposed architecture and roadmap. Retrieved October 22, 2010, from http://msdn.microsoft.com/en-us/library/ms977312.aspx
Menascé, D. A., Casalicchio, E., & Dubey, V. (2010). On optimal service selection in service oriented architectures. Performance Evaluation, 67(8), 659–675. Elsevier.
Menascé, D. A., Ewing, J., Gomaa, H., Malek, S., & Sousa, J. (2010). A framework for utility-based service oriented design in SASSY. In First Joint WOSP/SIPEW International Conference on Performance Engineering (pp. 27–36).
Meng, S. (2007). QCCS: A formal model to enforce QoS requirements in service composition. In Proceedings of the First Joint IEEE/IFIP Symposium on Theoretical Aspects of Software Engineering (pp. 389–400).
Mihindukulasooriya, N. (2008). Understanding WS-Security policy language. WSO2. Retrieved October 22, 2010, from http://wso2.org/library/3132
Milan-Franco, J., Jimenez-Peris, R., Patino-Martinez, M., & Kemme, B. (2004). Adaptive middleware for data replication. In The 5th ACM/IFIP/USENIX International Conference on Middleware (Middleware 2004), New York, NY, USA.
Microsoft. (n.d.). Optimizing database performance. Retrieved May 20, 2010, from http://msdn.microsoft.com/en-us/library/aa273605(SQL.80).aspx
Miller, D. W., & Starr, M. K. (1960). Executive decisions and operations research. Englewood Cliffs, NJ: Prentice-Hall.
Mennie, D., & Pagurek, B. (2000). An architecture to support dynamic composition of service components. Paper presented at the Workshop on Component-Oriented Programming (WCOP).
Milosevic, Z., Sadiq, S. W., & Orlowska, M. E. (2006). Towards a methodology for deriving contract-compliant business processes. In Proceedings of the 4th International Conference on Business Process Management (pp. 395–400).
Merad, S., de Lemos, R., & Anderson, T. (1999). Dynamic selection of software components in the face of changing requirements. Department of Computing Science, University of Newcastle upon Tyne, UK, Technical Report No 664.
Miranda, H., & Rodrigues, L. (2003). Friends and foes: Preventing selfishness in open mobile ad hoc networks. In Proceedings of the IEEE International Conference on Distributed Computing Systems Workshops (ICDCSW) (p. 440). IEEE Computer Society.
Merrill, D., & Grimshaw, A. (2009). Profiles for conveying the secure communication requirements of Web services. Concurrency and Computation, 21(8), 991–1011. doi:10.1002/cpe.1403
Mirkovic, J., & Reiher, P. (2004). A taxonomy of DDoS attack and DDoS defense mechanisms. SIGCOMM Comput. Commun. Rev., 34(2), 39–53. doi:10.1145/997150.997156
Merwe, J. V., Dawoud, D., & McDonald, S. (2007). A survey on peer-to-peer key management for mobile ad hoc networks. ACM Computing Surveys, 39(1), 1–45. doi:10.1145/1216370.1216371
Meyer, J. F. (1980). On evaluating the performability of degradable computer systems. IEEE Transactions on Computers, 29(8), 720–731. doi:10.1109/TC.1980.1675654
Misra, K. B. (Ed.). (2008). Handbook of Performability Engineering. Springer. doi:10.1007/978-1-84800-131-2
Miyazaki, S., & Sugawara, H. (2000). Development of DDBJ-XML and its application to a database of cDNA. Genome Informatics, 2000, 380–381. Tokyo: Universal Academy Press Inc.
Mobach, D. G. A., Overeinder, B. J., & Brazier, F. M. T. (2006, March). A WS-Agreement based resource negotiation framework for mobile agents. Scalable Computing: Practice and Experience, 7(1), 23–26.
Neumann, J. V. (1956). Probabilistic logics and the synthesis of reliable organisms from unreliable components. In Shannon, C., & McCarthy, J. (Eds.), Automata Studies (pp. 43–98). Princeton University Press.
Mogollon, M. (2008). Cryptography and security services: Mechanisms and applications. IGI Global. doi:10.4018/978-1-59904-837-6
Newkirk, J. W., & Vorontsov, A. A. (2004). Test-Driven Development in Microsoft .NET. Microsoft Press. Retrieved October 22, 2010, from http://portal.acm.org/citation.cfm?id=983793
Nordbotten, N. A. (2009). XML and Web services security standards. IEEE Communications Surveys & Tutorials, 11(3), 4–21.
Molloy, M. K. (1981). On the Integration of Delay and Throughput Measures in Distributed Processing Models. Ph.D. thesis, UCLA.
Moore, E. F. (1958). Gedanken-experiments on sequential machines. The Journal of Symbolic Logic, 23(1), 60. doi:10.2307/2964500
Nurmela, T., & Kutvonen, L. (2007). Service Level Agreement Management in Federated Virtual Organizations. Lecture Notes in Computer Science, 4531, 62–75. doi:10.1007/978-3-540-72883-2_5
Mouelhi, T., Le Traon, Y., & Baudry, B. (2009). Transforming and selecting functional test cases for security policy testing. In Proc. of International Conference on Software Testing Verification and Validation, 171-180.
Nyman, J. (2008). Positive and negative testing. GlobalTester, TechQA, 15(5). Retrieved from http://www.sqatester.com/methodology/PositiveandNegativeTesting.htm
Mukherjee, A., & Siewiorek, D. P. (1997). Measuring software dependability by robustness benchmarking. IEEE Transactions on Software Engineering, 23(6). doi:10.1109/32.601075
Muscariello, L., Mellia, M., Meo, M., Marsan, A., & Lo Cigno, R. (2005). Markov models of Internet traffic and a new hierarchical MMPP model. Computer Communications, 28, 1835–1851. doi:10.1016/j.comcom.2005.02.012
Nagappan, R., Skoczylas, R., & Sriganesh, R. P. (2003). Developing Java Web services. New York, NY, USA: John Wiley & Sons, Inc.
Nahman, J. M. (2002). Dependability of Engineering Systems: Modeling and Evaluation. Springer-Verlag.
Narayanan, S., & McIlraith, S. (2003). Analysis and simulation of Web services. Computer Networks, 42(5), 675–693. doi:10.1016/S1389-1286(03)00228-7
Nash, J. F. (1950). The bargaining problem. Econometrica, 18, 155–162. doi:10.2307/1907266
Natkin, S. (1980). Les réseaux de Petri stochastiques et leur application à l’évaluation des systèmes informatiques. Thèse de Docteur-Ingénieur, CNAM, Paris, France.
O’Brien, L., Merson, P., & Bass, L. (2007). Quality attributes for service-oriented architectures. In Proceedings of the International Workshop on Systems Development in SOA Environments. IEEE Computer Society.
O’Neill, M. (2003). Web services security. New York, NY, USA: McGraw-Hill, Inc.
OASIS. (2005a). eXtensible Access Control Markup Language (XACML).
OASIS. (2006). Web Services Security: SOAP Message Security 1.1.
OASIS. (2008). Web Services Reliable Messaging.
OASIS. (2009). WS-Security Policy.
OASIS. (2010). Web Services Quality Factors.
OASIS. (2005b). Assertions and Protocols for the OASIS Security Assertion Markup Language (SAML).
OASIS. (2005). Assertions and Protocols for the OASIS Security Assertion Markup Language (SAML) V2.0. Retrieved from http://docs.oasis-open.org/security/saml/v2.0/
OASIS Consortium. (2005). Universal Description, Discovery, and Integration (UDDI). Retrieved June 30, 2010, from http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uddi-spec
OASIS. (2005). eXtensible Access Control Markup Language (XACML) version 2.0. Retrieved from http://docs.oasis-open.org/xacml/2.0/access_control-xacml-2.0-core-spec-os.pdf
OASIS. (2006). OASIS Standard Specification: Web Services Security: SOAP Message Security 1.1 (WS-Security 2004). Retrieved from http://www.oasis-open.org/committees/download.php/16790/wss-v1.1-spec-os-SOAPMessageSecurity.pdf
Oppenheimer, D., Ganapathi, A., & Patterson, D. A. (2003). Why do Internet services fail, and what can be done about it? In USITS ’03: Proceedings of the 4th Conference on USENIX Symposium on Internet Technologies and Systems.
Ort, E. (2005). Service-oriented architecture and Web services: Concepts, technologies, and tools. Sun Microsystems. Retrieved October 22, 2010, from http://java.sun.com/developer/technicalArticles/WebServices/soa2/
OASIS WSBPEL. (2007). Web Services Business Process Execution Language Version 2.0. Retrieved from http://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.pdf
Overton, C. (2002). On the theory and practice of Internet SLAs, Computer Measurement Group. Journal of Computer Resource Measurement, 106, 32–45.
OASIS. (2006a). Web services security (WSS) TC. Retrieved October 22, 2010, from http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wss
OWASP. (2008). OWASP WSFuzzer Project. Retrieved October 22, 2010, from http://www.owasp.org/index.php/Category:OWASP_WSFuzzer_Project
OASIS. (2007a). WS-SecurityPolicy 1.2. Retrieved October 22, 2010, from http://docs.oasis-open.org/ws-sx/ws-securitypolicy/v1.2/ws-securitypolicy.html
OWASP. (2010). OWASP Top 10 2010. Retrieved October 22, 2010, from http://www.owasp.org/index.php/Top_10_2010-Main
OASIS. (2007c). WS-Trust 1.3. Retrieved October 22, 2010, from http://docs.oasis-open.org/ws-sx/ws-trust/200512
Pacyna, P., Rutkowski, A., Sarma, A., & Takahashi, K. (2009). Trusted identity for all: Toward interoperable trusted identity management systems. Computer, 42(5), 30–32. doi:10.1109/MC.2009.168
OASIS. (2008b). Web services federation (WSFED) TC. Retrieved October 22, 2010, from http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsfed
Object Management Group. (2010a). Unified Modeling Language, Infrastructure, Version 2.3.
Object Management Group. (2010b). MDA specifications. Retrieved May 15, 2010, from http://www.omg.org/mda/specs.html
Object Management Group. (2010c). Object Constraint Language, Version 2.2.
O’Connor, P. D. T. (2009). Practical Reliability Engineering. New York, NY: John Wiley and Sons.
Oehlert, P. (2005). Violating assumptions with fuzzing. IEEE Security and Privacy, 3(2), 58–62. doi:10.1109/MSP.2005.55
Offutt, J., & Xu, W. (2004). Generating test cases for web services using data perturbation. SIGSOFT Software Engineering Notes, 29(5), 1–10. doi:10.1145/1022494.1022529
Padgett, J., Djemame, K., & Dew, P. (2005). Grid-based SLA management. Lecture Notes in Computer Science, 3470, 1076–1085. doi:10.1007/11508380_110
Pahl, C., Giesecke, S., & Hasselbring, W. (2007). An ontology-based approach for modeling architectural styles. Lecture Notes in Computer Science, 4758, 60–75.
Papadimitratos, P., & Haas, Z. J. (2003). Secure data transmission in mobile ad hoc networks. In Proceedings of the ACM Workshop on Wireless Security (WiSe) (pp. 41–50). ACM Press.
Papazoglou, M. P., & Georgakopoulos, D. (2003). Service-oriented computing. Communications of the ACM, 46(10), 24–28. doi:10.1145/944217.944233
Papazoglou, M. P., Traverso, P., Dustdar, S., & Leymann, F. (2007). Service-oriented computing: State of the art and research challenges. Computer, 40(11), 38–45. doi:10.1109/MC.2007.400
Papazoglou, M. P., & Traverso, P. (2007). Service-oriented computing: State of the art and research challenges. IEEE Computer, 40, 38–45.
Parasoft. (n.d.). Retrieved from http://www.parasoft.com/jsp/home.jsp
Parekh, S., Gandhi, N., Hellerstein, J. L., Tilbury, D. M., Jayram, T. S., & Bigus, J. P. (2002). Using control theory to achieve service level objectives in performance management. Real-Time Systems, 23(1-2).
Patterson, D. A. (2008). The data center is the computer. Communications of the ACM, 51(1), 105. New York, NY, USA. doi:10.1145/1327452.1327491
Paxson, V. (1997). Measurements and analysis of end-to-end Internet dynamics. PhD thesis, University of California.
PCI. (2009). Payment Card Industry (PCI) data security standard - Requirements and security assessment procedures - Version 1.2.1. PCI Security Standards Council.
Pekilis, B., & Seviora, R. (1997). Detection of response time failures of real-time software. In The Eighth International Symposium on Software Reliability Engineering (pp. 38–47). doi:10.1109/ISSRE.1997.630846
Periorellis, P., & Dobson, J. (2001). The travel agency case study. DSoS Project IST-1999-11585.
Phan, T., Han, J., Schneider, J., Ebringer, T., & Rogers, T. (2008). A survey of policy-based management approaches for service oriented systems. In Proceedings of the Australian Conference on Software Engineering. IEEE Computer Society.
Phan, T., Han, J., Schneider, J., & Wilson, K. (2008). Quality-driven business policy specification and refinement for service-oriented systems. In Proceedings of the International Conference on Service Oriented Computing. Springer.
Philipp, W., Jan, S., Oliver, Z., Wolfgang, Z., & Ramin, Y. (2005). Using SLA for resource management and scheduling. Grid Middleware and Services - Challenges and Solutions, 8(1), 335–347.
Pierce, W. H. (1965). Failure-tolerant computer design. New York, NY: Academic Press.
Pistore, M., Marconi, A., Bertoli, P., & Traverso, P. (2005). Automated composition of Web services by planning at the knowledge level. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) (pp. 1252–1259).
PLASTIC Consortium. (2008). Retrieved from http://plastic.isti.cnr.it/wiki/tools
Pramstaller, N., Rechberger, C., & Rijmen, V. (2005). Breaking a new hash function design strategy called SMASH. In Selected Areas in Cryptography, 12th International Workshop, SAC 2005, volume 3897 of Lecture Notes in Computer Science (pp. 234–244). Springer.
Proschan, F., & Barlow, R. E. (1975). Statistical Theory of Reliability and Life Testing: Probability Models. New York: Holt, Rinehart and Winston.
PushToTest. (n.d.). TestMaker. Retrieved from http://www.pushtotest.com/
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience.
Qian, L., Luo, Z., Du, Y., & Guo, L. (2009). Cloud computing: An overview. Lecture Notes in Computer Science, 5931, 626–631. doi:10.1007/978-3-642-10665-1_63
Qian, Y., Lu, K., & Tipper, D. (2007). A design for secure and survivable wireless sensor networks. IEEE Wireless Communications, 14(5), 30–37. doi:10.1109/MWC.2007.4396940
Raimondi, F., Skene, J., & Emmerich, W. (2008). Efficient online monitoring of web-service SLAs. In SIGSOFT ’08/FSE-16: Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (pp. 170–180). New York, NY, USA: ACM.
Ramamoorthy, C. V., & Bastani, F. B. (1982). Software reliability - status and perspectives. IEEE Transactions on Software Engineering, SE-8(4), 354–371. doi:10.1109/TSE.1982.235728
Ramanujan, R., Kudige, S., & Nguyen, T. (2003). Techniques for intrusion resistant ad hoc routing algorithms (TIARA). In Proceedings of the DARPA Information Survivability Conference and Exposition (DISCEX) (p. 98). IEEE Computer Society.
Ran, S. (2003). A model for web services discovery with QoS. ACM SIGecom Exchanges, 4(1), 1–10.
Rana, O. F., Warnier, M., Quillinan, T. B., Brazier, F., & Cojocarasu, D. (2008). Managing Violations in Service Level Agreements. In Proceedings of the 5th International Workshop on Grid Economics and Business Models (GECON) (pp. 349-358). Gran Canaria, Spain.
Randell, B., Stroud, R., Verissimo, P., Neves, N., O'Halloran, C., Creese, S., et al. (2010). Malicious- and accidental-fault tolerance for internet applications. Retrieved June 30, 2010, from www.laas.fr/TSF/cabernet/maftia/
Rao, J., & Su, X. (2005). A survey of automated web service composition methods. In Cardoso, J., & Sheth, A. (Eds.), Semantic Web Services and Web Process Composition. Lecture Notes in Computer Science 3387 (pp. 43–54). Berlin, Germany: Springer. doi:10.1007/978-3-540-30581-1_5
Rashid, A. A., Hafid, A., Rana, A., & Walker, D. (2004). An approach for quality of service adaptation in service-oriented Grids. Concurrency and Computation: Practice and Experience, 16, 401–412. doi:10.1002/cpe.819
Rausand, M., & Hoyland, A. (2004). System Reliability Theory: Models, Statistical Methods, and Applications. New York, NY: John Wiley & Sons.
Ravi, S., Raghunathan, A., Kocher, P., & Hattangady, S. (2004). Security in embedded systems: Design challenges. ACM Transactions on Embedded Computing Systems, 3(3), 461–491. doi:10.1145/1015047.1015049
Reinecke, P., van Moorsel, A., & Wolter, K. (2006). Experimental Analysis of the Correlation of HTTP GET invocations. In A. Horvath & M. Telek (Eds.), European Performance Engineering Workshop (EPEW 2006), LNCS 4054 (pp. 226-237). Springer-Verlag.
Riehle, D., & Zuellighoven, H. (1996). Understanding and using patterns in software development. Theory and Practice of Object Systems, 2(1), 3–13. doi:10.1002/(SICI)1096-9942(1996)2:1<3::AID-TAPO1>3.0.CO;2-#
Rinderle, S., Reichert, M., & Dadam, P. (2004). Correctness criteria for dynamic changes in workflow systems - a survey. Data & Knowledge Engineering, 50, 9–34. doi:10.1016/j.datak.2004.01.002
Litan, R. E., & Rivlin, A. M. (2001). Projecting the Economic Impact of the Internet. The American Economic Review, 91(2), 313–317.
Rodrigues, D., Estrella, J., Monaco, F., Branco, K., Antunes, N., & Vieira, M. (2011). Engineering Secure Web Services. Chapter 16 in this book.
Ron, S., & Aliko, P. (2001). Service level agreements. Internet NG project (1999-2001). http://ing.ctit.utwente.nl/WU2/
Rosario, S., Benveniste, A., Haar, S., & Jard, C. (2008). Probabilistic QoS and soft contracts for transaction-based Web services orchestrations. IEEE Transactions on Services Computing, 1(4), 187–200. doi:10.1109/TSC.2008.17
Rosenberg, F., Platzer, C., & Dustdar, S. (2006). Bootstrapping performance and dependability attributes of web services. In IEEE International Conference on Web Services (ICWS '06) (pp. 205-212).
Rosenberg, I., & Juan, A. (2009). The BEinGRID SLA framework. Report available at http://www.gridipedia.eu/slawhitepaper.html
Rosenschein, J. S., & Zlotkin, G. (1994). Rules of encounter: Designing conventions for automated negotiation among computers. Boston, MA: MIT Press.
Reiser, M., & Lavenberg, S. S. (1980). Mean-Value Analysis of Closed Multi-Chain Queuing Networks. Journal of the ACM, 27(2), 313–322. doi:10.1145/322186.322195
Ross, S. M. (2003). Introduction to probability models (9th ed.). Elsevier.
Rick, L. (2002). IT Services Management: A Description of Service Level Agreements. RL Consulting.
Russell, S. J., & Norvig, P. (2003). Artificial intelligence: A modern approach (2nd ed.). Upper Saddle River, New Jersey: Prentice Hall. Saaty, T. L. (1990). The analytic hierarchy process. New York, NY: McGraw-Hill.
Sahai, A., Machiraju, V., Sayal, M., van Moorsel, A. P. A., & Casati, F. (2002). Automated SLA Monitoring for Web Services. Lecture Notes in Computer Science, 2506, 28–41. doi:10.1007/3-540-36110-3_6
Schmid, M., & Kroeger, R. (2008). Decentralised QoS-Management in Service Oriented Architecture. Lecture Notes in Computer Science, 5053, 44–57. doi:10.1007/978-3-540-68642-2_4
Sahai, A., Durante, A., & Machiraju, V. (2001). Towards Automated SLA Management for Web Services. HP Technical Report HPL-2001-310(R.1).
Schneider, F. B. (1990). Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys, 22, 299–319. doi:10.1145/98163.98167
Sahai, A., Graupner, S., Machiraju, V., & Van Moorsel, A. (2003). Specifying and Monitoring Guarantees in Commercial Grids through SLA. In Proceedings of the Third IEEE International Symposium on Cluster Computing and the Grid (p. 292). Tokyo, Japan.
Sahner, R. A., Trivedi, K. S., & Puliafito, A. (1996). Performance and Reliability Analysis of Computer Systems - An Example-Based Approach Using the SHARPE Software Package. Kluwer Academic Publishers.
Sakellariou, R., & Yarmolenko, V. (2005). On the flexibility of WS-Agreement for job submission. In Proceedings of the 3rd International Workshop on Middleware for Grid Computing (MGC05) (pp. 1-6). Grenoble, France.
Salas, J., Perez-Sorrosal, F., Patiño-Martínez, M., & Jiménez-Peris, R. (2006). WS-replication: A framework for highly available web services. In Proceedings of the 15th International Conference on World Wide Web (pp. 357-366).
Salatge, N., & Fabre, J.-C. (2007). Fault Tolerance Connectors for Unreliable Web Services. In Proceedings of the International Conference on Dependable Systems and Networks (DSN'07) (pp. 51-60).
Salehie, M., & Tahvildari, L. (2009). Self-adaptive software: Landscape and research challenges. ACM Transactions on Autonomous and Adaptive Systems, 4(2).
Santhanam, G. R., Basu, S., & Honavar, V. (2008). On utilizing qualitative preferences in web service composition: A CP-net based approach. In SERVICES '08: Proceedings of the 2008 IEEE Congress on Services - Part I (pp. 538-544).
Schaffer, S. (1994). Babbage's Intelligence: Calculating Engines and the Factory System. Critical Inquiry, 21(1), 203–227.
Scovetta, M. (2008). Yet Another Source Code Analyzer. Retrieved October 8, 2008, from www.yasca.org
Serhani, M. A., Dssouli, R., Hafid, A., & Sahraoui, H. (2005). A QoS broker based architecture for efficient Web services selection. In Proceedings of the 2005 IEEE International Conference on Web Services (ICWS 2005) (pp. 113-120). Los Alamitos, CA: IEEE Computer Society Press.
Service Level Agreement in the Data Center. (April 2002). Retrieved March 28, 2010, from Sun Microsystems: http://www.sun.com/blueprints
Sha, L., Abdelzaher, T., Årzén, K., Cervin, A., Baker, T., Burns, A., Buttazzo, G., et al. (2004). Real Time Scheduling Theory: A Historical Perspective. Real-Time Systems, 28(2), 101–155. doi:10.1023/B:TIME.0000045315.61234.1e
Shamir, A. (1979). How to share a secret. Communications of the ACM, 22(11), 612–613. doi:10.1145/359168.359176
Shannon, C. E. (1948). A Mathematical Theory of Communication. The Bell System Technical Journal, 27, 379–423, 623–656.
Shaw, M. (1995). Comparing architectural design styles. IEEE Software, 12(6). doi:10.1109/52.469758
Sheng, Q. Z., Benatallah, B., Maamar, Z., & Ngu, A. H. H. (2009). Configurable composition and adaptive provisioning of Web services. IEEE Transactions on Services Computing, 2(1). doi:10.1109/TSC.2009.1
Shetti, N. M. (2003). Heisenbugs and Bohrbugs: Why are they different? DCS/LCSR Technical Reports, Department of Computer Science, Rutgers, The State University of New Jersey.
Schrage, L. (1991). User's manual for LINGO. Chicago, IL: LINDO Systems Inc.
Su, S.-F., Lin, C.-B., & Hsu, Y.-T. (2002). A high precision global prediction approach based on local prediction approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 32(4), 416–425. doi:10.1109/TSMCC.2002.806745
Song, Y., Li, Y., Wang, H., Zhang, Y., Feng, B., Zang, H., & Sun, Y. (2008). A Service-Oriented Priority-Based Resource Scheduling Scheme for Virtualized Utility Computing. Lecture Notes in Computer Science, 5374, 220–231. doi:10.1007/978-3-540-89894-8_22
Siddiqui, B. (2002). Exploring XML Encryption, part 1. IBM Corporation. Retrieved October 22, 2010, from http://www.ibm.com/developerworks/xml/library/x-encrypt/
Sousa, J. P., Balan, R. K., Poladian, V., Garlan, D., & Satyanarayanan, M. (2008). User guidance of resource-adaptive systems. In International Conference on Software and Data Technologies (pp. 36-44). Setubal, Portugal: INSTICC Press.
Sidharth, N., & Liu, J. (2008). Intrusion resistant SOAP messaging with IAPF. In APSCC '08: Proceedings of the 2008 IEEE Asia-Pacific Services Computing Conference (pp. 856–862). Washington, DC, USA: IEEE Computer Society.
SpringSource. (2010). Aspect Oriented Programming with Spring. Retrieved July 6, 2008, from http://static.springframework.org/spring/docs/2.5.x/reference/aop.html
Singh, M. P. (2004). A framework and ontology for dynamic web services selection. IEEE Internet Computing, 8(5), 84–93. doi:10.1109/MIC.2004.27
Stankovic, J. A. (1988). Misconceptions about real-time computing: A serious problem for next-generation systems. Computer, 21(10), 10–19. doi:10.1109/2.7053
Sivasubramanian, S., Pierre, G., van Steen, M., & Bhulai, S. (2006). SLA-Driven Resource Provisioning of Multi-Tier Internet Applications. Technical Report, Department of Mathematics and Computer Science, Vrije Universiteit, Amsterdam.
Sterbenz, J. P. G., Krishnan, R., Hain, R., Levin, D., Jackson, A. W., & Zao, J. (2002). Survivable mobile wireless networks: issues, challenges, and research directions. In Proceedings of the ACM Workshop on Wireless Security (WiSe) (pp. 31–40). ACM Press.
Skene, J., Lamanna, D. D., & Emmerich, W. (2004). Precise Service Level Agreements. In Proceedings of the 26th International Conference on Software Engineering (ICSE’04), (pp. 179-188).
Stott, H. G. (1905). Discussion on “Time-Limit Relays” and “Duplication of Electrical Apparatus to Secure Reliability of Services” at New York.
SLA@SOI. Empowering the service industry with SLA-aware infrastructures. Retrieved October 24, 2010, from http://sla-at-soi.eu/research/
Stuart, H. R. (1905). Discussion on “Time-Limit Relays” and “Duplication of Electrical Apparatus to Secure Reliability of Services” at Pittsburg.
Smith, D. J. (2009). Reliability, Maintainability and Risk. Elsevier.
Stuttard, D., & Pinto, M. (2007). The web application hacker’s handbook: discovering and exploiting security flaws. Wiley Publishing, Inc.
Sneed, H. M., & Huang, S. (2006). WSDLTest - a tool for testing web services. In Proceedings of the IEEE International Symposium on Web Site Evolution (pp. 12-21). Philadelphia, PA, USA: IEEE Computer Society.
Sun Microsystems, Inc. (2010). JAX-WS Reference Implementation. Retrieved February 14, 2008, from https://jax-ws.dev.java.net/
Solanki, M., Cau, A., & Zedan, H. (2006). ASDL: A Wide Spectrum Language For Designing Web Services. Paper presented at the 15th International Conference on the World Wide Web (WWW’06).
Sun, H., Wang, X., Zhou, B., & Zou, P. (2003). Research and Implementation of Dynamic Web Services Composition. Paper presented at the Advanced Parallel Processing Technologies (APPT) conference.
Sommerville, I. (2004). Software Engineering. Pearson Education.
Sutton, M., Greene, A., & Amini, P. (2007). Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley.
Symons, F. J. W. (1978). Modelling and analysis of communication protocols using numerical Petri nets. Ph.D. thesis, University of Essex.
Tang, L., Dong, J., Peng, T., & Tsai, W. T. (2010). Modeling Enterprise Service-Oriented Architectural Styles. Service Oriented Computing and Applications (SOCA), 4(2), 81–107. doi:10.1007/s11761-010-0059-2
Tang, L., Zhao, Y., & Dong, J. (2009). Specifying Enterprise Web-Oriented Architecture. In High Assurance Services Computing (pp. 241–260). Springer.
Tang, L., & Dong, J. (2007). A Survey of Formal Methods for Software Architecture. In Proceedings of the International Conference on Software Engineering Theory and Practice (pp. 221-227).
Tang, L., Dong, J., & Peng, T. (2008). A Generic Model of Enterprise Service-Oriented Architecture. In 4th IEEE International Symposium on Service-Oriented System Engineering (SOSE) (pp. 1-7).
Tang, L., Dong, J., Peng, T., & Tsai, W. T. (2010). A Classification of Enterprise Service-Oriented Architecture. In 5th IEEE International Symposium on Service-Oriented System Engineering (SOSE) (pp. 74-81).
Tang, L., Dong, J., Zhao, Y., & Zhang, L.-J. (2010). Enterprise Cloud Service Architecture. In The 3rd IEEE International Conference on Cloud Computing (pp. 27-34).
Tartanoglu, F., Issarny, V., Romanovsky, A., & Levy, N. (2003). Coordinated forward error recovery for composite Web services. Paper presented at the IEEE Symposium on Reliable Distributed Systems (SRDS).
Taylor, H., Yochem, A., Phillips, L., & Martinez, F. (2009). Event-Driven Architecture. Addison-Wesley.
Tesauro, G., Das, R., Walsh, W. E., & Kephart, J. O. (2005). Utility-function-driven resource allocation in autonomic systems. In ICAC '05: Proceedings of the Second International Conference on Autonomic Computing (pp. 342-343).
Thayer, R., Doraswamy, N., & Glenn, R. (1998). RFC 2411 - IP security document roadmap. National Institute of Standards and Technology (NIST). Retrieved October 22, 2010, from http://csrc.nist.gov/archive/ipsec/papers/rfc2411-roadmap.txt
Tipper, D., & Sundareshan, M. (1990). Numerical Methods for Modeling Computer Networks Under Nonstationary Conditions. IEEE Journal on Selected Areas in Communications, 8(9).
TPC-W: A transactional web e-Commerce benchmark. Retrieved May 20, 2010, from http://www.tpc.org/tpcw/
Tosic, V., Pagurek, B., Patel, K., Esfandiari, B., & Ma, W. (2005). Management applications of Web Service Offerings Language (WSOL). Information Systems, 30(7), 564–586. doi:10.1016/j.is.2004.11.005
Tosic, V., Patel, K., & Pagurek, B. (2002). WSOL - Web Service Offerings Language. Lecture Notes in Computer Science, 2612, 57–67. doi:10.1007/3-540-36189-8_5
Transaction Processing Performance Council. (2008, February 28). TPC Benchmark App (Application Server) Standard Specification, Version 1.3. Retrieved December 7, 2008, from http://www.tpc.org/tpc_app/
Tretmans, J. (1996). Test generation with inputs, outputs, and quiescence. In Margaria, T., & Steffen, B. (Eds.), Tools and Algorithms for Construction and Analysis of Systems, Second International Workshop, TACAS '96, Passau, Germany, March 27-29, 1996, Proceedings, volume 1055 of Lecture Notes in Computer Science (pp. 127–146). Springer.
Trivedi, K. S. (2002). Probability and Statistics with Reliability, Queueing and Computer Science Applications. New York, NY: John Wiley & Sons.
Trivedi, K. S., & Grottke, M. (2007). Fighting Bugs: Remove, Retry, Replicate, and Rejuvenate. IEEE Computer, 40(2), 107–109.
Trivedi, K. S., & Sahner, R. (2009). SHARPE at the age of twenty two. ACM SIGMETRICS Performance Evaluation Review, 36(4), 52–57. doi:10.1145/1530873.1530884
Trivedi, K., Muppala, J., Woolet, S., & Haverkort, B. (1992). Composite performance and dependability analysis. Performance Evaluation, 14, 197–215. doi:10.1016/0166-5316(92)90004-Z
Trivedi, K. S., Wang, D., Hunt, D. J., Rindos, A., Smith, W. E., & Vashaw, B. (2008). Availability Modeling of SIP Protocol on IBM WebSphere. In Proceedings of the 14th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) (pp. 323-330). IEEE Computer Society.
Tsai, W. T., Bai, X., Paul, R., Feng, K., & Yu, L. (2002). Scenario-Based Modeling and Its Applications. In Proceedings of the IEEE International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS).
Tsai, W. T., Paul, R., Song, W., & Cao, Z. (2002). Coyote: An XML-based Framework for Web Services Testing. In 7th IEEE International Symposium on High Assurance Systems Engineering (HASE 2002).
Tsai, W. T., Shao, Q., Sun, X., & Elston, J. (2010). Real-Time Service-Oriented Cloud Computing. In The 6th World Congress on Services (pp. 473-478).
Tsai, W. T., Sun, X., & Balasooriya, J. (2010). Service-Oriented Cloud Computing Architecture. In The Seventh International Conference on Information Technology (pp. 684-689).
Tsai, W. T., Zhang, D., Chen, Y., Huang, H., Paul, R., & Liao, N. (2004). A Software Reliability Model for Web Services. Paper presented at the 8th IASTED International Conference on Software Engineering and Applications.
Tsai, W., Huang, Q., Xiao, B., & Chen, Y. (2006). Verification framework for dynamic collaborative services in service-oriented architecture. In Proceedings of the International Conference on Quality Software (pp. 313–320).
UN/CEFACT. (2003). ebXML Business Process Specification Schema, Version 1.09.
University of Maryland. (2009). FindBugs - Find Bugs in Java Programs. Retrieved March 12, 2009, from http://findbugs.sourceforge.net/
Urgaonkar, B., Pacifici, G., Shenoy, P., Spreitzer, M., & Tantawi, A. (2007). Analytic Modeling of Multi-Tier Internet Applications. ACM Transactions on the Web, 1(1).
Ushakov, I. (2007). Is Reliability Theory Still Alive? e-journal Reliability: Theory & Applications, 1(2).
Uto, N., & Melo, S. P. (2009). Vulnerabilidades em aplicações Web e mecanismos de proteção [Vulnerabilities in Web applications and protection mechanisms]. In Minicursos SBSeg 2009 (pp. 237–283). IX Simpósio Brasileiro em Segurança da Informação e de Sistemas Computacionais.
Utting, M., & Legeard, B. (2006). Practical Model-Based Testing: A Tools Approach. Morgan Kaufmann.
van der Aalst, W., ter Hofstede, A., Kiepuszewski, B., & Barros, A. (2003). Workflow Patterns. Distributed and Parallel Databases, 14(1), 5–51. doi:10.1023/A:1022883727209
Van Rijsbergen, C. J. (1979). Information retrieval. London: Butterworths.
Venugopal, S., Chu, X., & Buyya, R. (2008). A Negotiation Mechanism for Advance Resource Reservation using the Alternate Offers Protocol. In Proceedings of the 16th International Workshop on Quality of Service (IWQoS 2008), Twente, NL. New York, USA: IEEE Communications Society Press.
Verma, K., Sivashanmugam, K., Sheth, A., Patil, A., Oundhakar, S., & Miller, J. (2005). METEOR-S WSDI: A Scalable P2P Infrastructure of Registries for Semantic Publication and Discovery of Web Services. Information Technology Management, 6(1), 17–39. doi:10.1007/s10799-004-7773-4
Vieira, M., Antunes, N., & Madeira, H. (2009). Using web security scanners to detect vulnerabilities in web services. In IEEE/IFIP International Conference on Dependable Systems & Networks (DSN '09) (pp. 566-571).
Vieira, M., Costa, A. C., & Madeira, H. (2006). Towards Timely ACID Transactions in DBMS. In 12th Pacific Rim International Symposium on Dependable Computing (PRDC 2006) (pp. 381-382). doi:10.1109/PRDC.2006.63
Vieira, M., Laranjeiro, N., & Madeira, H. (2007). Assessing robustness of web-services infrastructures. In 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07) (pp. 131–136).
Villela, D., Pradhan, P., & Rubenstein, D. (2007). Provisioning Servers in the Application Tier for E-Commerce Systems. ACM Transactions on Internet Technology, 7(1).
Vitvar, T., Kopecky, J., Viskova, J., & Fensel, D. (2008). WSMO-lite annotations for web services. In 5th European Semantic Web Conference.
von Neumann, J., & Morgenstern, O. (1947). Theory of games and economic behavior. Princeton, NJ: Princeton University Press.
W3C. (2002). The Platform for Privacy Preferences 1.0 (P3P1.0) Specification.
W3C. (2002). XML Encryption syntax and processing. Retrieved October 22, 2010, from http://www.w3.org/TR/xmlenc-core/
W3C. (2005). Web Services Choreography Description Language (ver. 1.0 ed.). W3C.
W3C. (2007a). Web Services Policy 1.5 - Attachment.
W3C. (2007b). Web Services Policy 1.5 - Framework.
W3C. (2008b). XML Signature syntax and processing (second edition). Retrieved October 22, 2010, from http://www.w3.org/TR/xmldsig-core/
W3C. (2007). Semantic annotations for WSDL and XML schema. http://www.w3.org/TR/sawsdl/
Wada, H., Suzuki, J., & Oba, K. (2008). A Model-Driven Development Framework for Non-Functional Aspects in Service Oriented Architecture. International Journal of Web Services Research, 5(4). doi:10.4018/jwsr.2008100101
Wada, H., Champrasert, P., Suzuki, J., & Oba, K. (2008). Multiobjective Optimization of SLA-Aware Service Composition. In The IEEE Congress on Services - Part I (pp. 368-375).
Walsh, W. E., Tesauro, G., Kephart, J. O., & Das, R. (2004). Utility functions in autonomic systems. In International Conference on Autonomic Computing (pp. 70-77).
Wang, C., Wang, G., Chen, A., & Wang, H. (2005). A Policy-Based Approach for QoS Specification and Enforcement in Distributed Service-Oriented Architecture. In Proceedings of the IEEE International Conference on Services Computing. IEEE Computer Society.
Wang, F., & Uppalli, R. (2003). SITAR: A scalable intrusion-tolerant architecture for distributed services. In Proceedings of the DARPA Information Survivability Conference and Exposition (DISCEX) (pp. 153–155).
Wang, G., Wang, C., Chen, A., Wang, H., Fung, C., Uczekaj, S., et al. (2005). Service Level Management using QoS Monitoring, Diagnostics, and Adaptation for Network Enterprise Systems. In Proceedings of the Ninth IEEE International EDOC Enterprise Computing Conference (pp. 239-250).
Wang, H., Wang, G., Wang, C., Chen, A., & Santiago, R. (2007). Service Level Management in Global Enterprise Services: From QoS Monitoring and Diagnostics to Adaptation, a Case Study. In Proceedings of the Eleventh International IEEE EDOC Conference Workshop (pp. 44-51).
Wang, W.-P., Tipper, D., & Banerjee, S. (1996). A Simple Approximation for Modeling Nonstationary Queues. In Proceedings of the 15th Annual Joint Conference of the IEEE Computer and Communications Societies, Networking the Next Generation (IEEE INFOCOM '96), San Francisco, CA, USA, March 1996.
Wang, Y., Bai, X., Li, J., & Huang, R. (2007). Ontology-based test case generation for testing web services. In Proceedings of the Eighth International Symposium on Autonomous Decentralized Systems (pp. 43–50). Sedona, AZ, USA: IEEE Computer Society.
Weiss, M., Esfandiari, B., & Luo, Y. (2007). Towards a classification of web service feature interactions. Computer Networks, 51(2), 359–381. doi:10.1016/j.comnet.2006.08.003
Wetzstein, B., Karastoyanova, D., Kopp, O., Leymann, F., & Zwink, D. (2010). Cross-organizational process monitoring based on service choreographies. In SAC '10: Proceedings of the 2010 ACM Symposium on Applied Computing (pp. 2485–2490). New York, NY, USA: ACM.
Wieder, P., Seidel, J., Yahyapour, R., Waldrich, O., & Ziegler, W. (2008). Using SLA for Resource Management and Scheduling - A Survey. Grid Middleware and Services, 4, 335–347. doi:10.1007/978-0-387-78446-5_22
Xie, L., Xu, L., & de Vrieze, P. T. (2010a). Lightweight Business Process Modelling. In International Conference on E-Business and E-Government (ICEE 2010), 7-9 May 2010, Guangzhou, China.
Windley, P. (2006). SOA governance: Rules of the game. Online at http://www.infoworld.com
Xie, L., Xu, L., & de Vrieze, P. T. (2010b). Process Modelling in Process-oriented Enterprise Mashups. In The 2nd IEEE International Conference on Information Management and Engineering (IEEE ICIME 2010), 16-18 April 2010, Chengdu, Sichuan, China.
Windows Azure Service Level Agreement. Retrieved March 28, 2010, from http://www.microsoft.com/windowsazure/sla/
Wong, C., & Grzelak, D. (2006). A Web Services Security Testing Framework. SIFT: Information Security Services. Retrieved from http://www.sift.com.au/assets/downloads/SIFT-Web-Services-Security-Testing-Framework-v1-00.pdf
Working Group 26, ISO/IEC JTC1/SC7 Software and Systems Engineering Committee. (2008). ISO/IEC 29119 Software Testing - Part 2.
Workshop Series in Industry Studies, Rockport, Massachusetts, June 14-16, 2004.
WorldTravel. (2008). Retrieved from http://www.cc.gatech.edu/systems/projects/WorldTravel
WSDL. (2007). Web Services Description Language (WSDL) Version 2.0. Retrieved from http://www.w3.org/TR/wsdl20/
Wu, B., Wu, J., Fernandez, E. B., Ilyas, M., & Magliveras, S. (2007). Secure and efficient key management in mobile ad hoc networks. Journal of Network and Computer Applications, 30(3), 937–954. doi:10.1016/j.jnca.2005.07.008
Wu, L., & Buyya, R. (2011). Service Level Agreement (SLA) in Utility Computing Systems. Chapter 1 in this book.
Wurman, P. R., Wellman, M. P., & Walsh, W. E. (1998). The Michigan Internet Auctionbot: A configurable auction server for human and software agents. In Proceedings of the 2nd International Conference on Autonomous Agents (pp. 301-308). Irsee, Germany.
Wylie, J. J., Bigrigg, M. W., Strunk, J. D., Ganger, G. R., Kiliccote, H., & Khosla, P. K. (2000). Survivable information storage systems. IEEE Computer, 33(8), 61–68.
Xu, L., & Jeusfeld, M. (2003). Pro-active Monitoring of Electronic Contracts. In The 15th Conference on Advanced Information Systems Engineering (CAiSE 2003), 16-20 June 2003, Klagenfurt, Austria.
Xu, L., de Vrieze, P. T., Phalp, K. T., Jeary, S., & Liang, P. (2010). Lightweight Process Modelling for Virtual Enterprise Process Collaboration. In PRO-VE 2010: 11th IFIP Working Conference on Virtual Enterprises, 11-13 October 2010, Saint-Etienne, France.
Xue, Y., & Nahrstedt, K. (2004). Providing fault-tolerant ad hoc routing service in adversarial environments. Wireless Personal Communications: An International Journal, 29(3-4), 367–388. doi:10.1023/B:WIRE.0000047071.75971.cd
Yahoo Pipes. http://pipes.yahoo.com
O'Reilly, T. (2005). What is Web 2.0. http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html
Yan, J., Li, Z., Yuan, Y., Sun, W., & Zhang, J. (2006). BPEL4WS Unit Testing: Test Case Generation Using a Concurrent Path Analysis Approach. In Proceedings of the 17th International Symposium on Software Reliability Engineering (ISSRE 2006), 7-10 November, Raleigh, North Carolina, USA (pp. 75-84).
Yau, S. S., & An, H. (2009). Adaptive resource allocation for service-based systems. In Proceedings of the First Asia-Pacific Symposium on Internetware.
Yang, Y., Tan, Q. P., Yong, X., Liu, F., & Yu, J. (2006). Transform BPEL Workflow into Hierarchical CP-Nets to Make Tool Support for Verification. In Proceedings of Frontiers of WWW Research and Development - APWeb 2006, 8th Asia-Pacific Web Conference (pp. 275-284). Harbin, China, January 16-18, 2006.
Yau, S. S., Ye, N., Sarjoughian, S., Huang, D., Roontiva, A., Baydogan, M. G., & Muqsith, M. A. (2009). Toward development of adaptive service-based software systems. IEEE Transactions on Services Computing, 2(3). doi:10.1109/TSC.2009.17
Ye, X., Zhou, J., & Song, X. (2003). On reachability graphs of Petri nets. Computers & Electrical Engineering, 29(2), 263–272. doi:10.1016/S0045-7906(01)00034-9
Ye, X., & Mounla, R. (2008). A hybrid approach to QoS-aware service composition. In ICWS '08: Proceedings of the 2008 IEEE International Conference on Web Services (pp. 62-69).
Ye, X., & Shen, Y. (2005). A Middleware for Replicated Web Services. Paper presented at the IEEE International Conference on Web Services (ICWS '05).
Yeo, C. S., & Buyya, R. (2007, November). Pricing for Utility-driven Resource Management and Allocation in Clusters. International Journal of High Performance Computing Applications, 21(4), 405–418. doi:10.1177/1094342007083776
Yeo, C. S., DeAssuncao, M. D., Yu, J., Sulistio, A., Venugopal, S., Placek, M., & Buyya, R. (2006). Utility computing on Global Grids. In Bidgoli, H. (Ed.), Handbook of Computer Networks. New York, USA: John Wiley & Sons.
Yeo, C. S., & Buyya, R. (2005). Service Level Agreement based Allocation of Cluster Resources: Handling Penalty to Enhance Utility. In Proceedings of the 7th IEEE International Conference on Cluster Computing (Cluster 2005) (pp. 1-10). MA, USA.
Yeo, C. S., & Buyya, R. (2006). A Taxonomy of Market-based Resource Management Systems for Utility-driven Cluster Computing. Software: Practice and Experience, 36(13), 1381–1419.
Yeo, C. S., & Buyya, R. (2007). Integrated Risk Analysis for a Commercial Computing Service. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007) (pp. 1-10). CA, USA.
Yeom, G., Tsai, W.-T., Bai, X., & Min, D. (2009). Design of a Contract-Based Web Services QoS Management System. In Proceedings of the 29th IEEE International Conference on Distributed Computing Systems Workshops (pp. 306-311).
Youseff, L., Butrico, M., & Da Silva, D. (2008). Toward a unified ontology of cloud computing. In Grid Computing Environments Workshop (pp. 1-10). Austin, Texas.
Yu, Q., & Bouguettaya, A. (2010). Guest Editorial: Special Section on Query Models and Efficient Selection of Web Services. IEEE Transactions on Services Computing, 3(3), 161–162. doi:10.1109/TSC.2010.43
Yu, T., Zhang, Y., & Lin, K.-J. (2007). Efficient algorithms for Web services selection with end-to-end QoS constraints. ACM Transactions on the Web, 1(1), 1–26. doi:10.1145/1232722.1232728
Yuan, Y., Li, Z., & Sun, W. (2006). A Graph-Search Based Approach to BPEL4WS Test Generation. In Proceedings of the International Conference on Software Engineering Advances (ICSEA 2006), October 28 - November 2, Papeete, Tahiti, French Polynesia.
Yue-Sheng, G., Bao-Jian, Z., & Wu, X. (2009). Research and realization of Web services security based on XML Signature. In International Conference on Networking and Digital Society (Vol. 2, pp. 116–118).
Zaiane, O. R., Xin, M., & Han, J. (1998). Discovering web access patterns and trends by applying OLAP and data mining technology on web logs. In Advances in Digital Libraries Conference (pp. 19-29).
Zeng, L., Benatallah, B., Ngu, A. H. H., Dumas, M., Kalagnanam, J., & Chang, H. (2004). QoS-Aware Middleware for Web Services Composition. IEEE Transactions on Software Engineering, 30(5), 311–327. doi:10.1109/TSE.2004.11
Zeng, J., Sun, H., Liu, X., Deng, T., & Huai, J. (2010). Dynamic Evolution Mechanism for Trustworthy Software Based on Service Composition [in Chinese]. Journal of Software, 21(2), 261–276. doi:10.3724/SP.J.1001.2010.03735
Zeng, J., Huai, J., Sun, H., Deng, T., & Li, X. (2009). LiveMig: An Approach to Live Instance Migration in Composite Service Evolution. Paper presented at the 2009 IEEE International Conference on Web Services (ICWS).
Zeng, L., Lei, H., & Chang, H. (2007). Monitoring the QoS for Web Services. In Proceedings of the International Conference on Service-Oriented Computing (ICSOC). Springer.
Zheng, Z., Zhang, Y., & Lyu, M. (2010). Distributed QoS Evaluation for Real-World Web Services. In Proceedings of the IEEE International Conference on Web Services (ICWS'10) (pp. 83-90).
Zhang, Q., Cherkasova, L., & Mi, N. (2008). A Regression-Based Analytic Model for Capacity Planning of Multi-Tier Applications. Journal of Cluster Computing, 11(3). doi:10.1007/s10586-008-0052-0
Zhong, S., Chen, J., & Yang, Y. R. (2003). Sprite: A simple, cheat-proof, credit-based system for mobile ad-hoc networks. In Proceedings of IEEE INFOCOM (pp. 1987-1997). IEEE Computer Society.
Zhang, L.-J., Zhang, J., & Cai, H. (2007). Services Computing. Beijing: Tsinghua University Press.
Zhou, L., & Haas, Z. J. (1999). Securing ad hoc networks. IEEE Network, 13(6), 24–30. doi:10.1109/65.806983
Zhang, L.-J., & Zhou, Q. (2009). CCOA: Cloud Computing Open Architecture. IEEE International Conference on Web Services (pp. 607-616).
Zhou, C., Chia, L.-T., & Lee, B.-S. (2004). DAML-QoS Ontology for Web Services. In Proceedings of the IEEE International Conference on Web Services (ICWS'04) (p. 472).
Zhang, Z., Dey, D., & Tan, Y. (2006). Price and QoS competition in communication services. European Journal of Operational Research, 186(2), 681-693.
Zheng, Y., Zhou, J., & Krause, P. (2007). An Automatic Test Case Generation Framework for Web Services. Journal of Software, 2(3), September.
Zheng, Z., & Lyu, M. (2009). A QoS-Aware Fault Tolerant Middleware for Dependable Service Composition. In Proceedings of the International Conference on Dependable Systems and Networks (DSN'09) (pp. 239-248).
Zheng, Z., & Lyu, M. (2010). Collaborative Reliability Prediction for Service-Oriented Systems. In Proceedings of the ACM/IEEE 32nd International Conference on Software Engineering (ICSE'10) (pp. 35-44).
Zhou, J., & Niemela, E. (2006). Toward Semantic QoS-Aware Web Services: Issues, Related Studies and Experience. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (pp. 553-557).
Zhou, L., Pung, H. K., & Ngoh, L. H. (2006). Towards Semantic for QoS Specification. In Proceedings of the 31st IEEE International Conference on Local Computer Networks (LCN).
Zoutendijk, G. (1960). Methods of feasible directions. Amsterdam, The Netherlands: Elsevier.
About the Contributors
Valeria Cardellini is Assistant Professor in the Department of Computer Science, Systems and Production of the University of Roma “Tor Vergata”, Italy. She received her PhD degree in computer science in 2001 and her Laurea degree in computer engineering in 1997, both from the University of Roma “Tor Vergata”. She was a visiting researcher at IBM T.J. Watson Research Center in 1999. Her research interests are in the field of distributed computing systems, with special emphasis on large-scale systems and services based on the Internet and the Web. On these subjects she has (co)authored more than 50 papers in international journals, book chapters, and conference proceedings. She has been co-chair of AAA-IDEA 2009, has served as a member of program and organizing committees of international conferences in the Web and performance analysis areas, and serves as a frequent reviewer for various well-known international journals. She is a member of ACM and IEEE.

Emiliano Casalicchio, Ph.D., is a researcher at the University of Roma “Tor Vergata”, Italy. Since 1998 his research has mainly focused on large scale distributed systems, with a specific emphasis on performance-oriented design of algorithms and mechanisms for resource allocation and management. Domains of interest have been locally distributed web servers, enterprise, grid, and mobile systems, Service Oriented Architectures, and Cloud systems. He is the author of about 70 publications in international journals and conferences, and serves as a reviewer for international journals and conferences (IEEE, ACM, Elsevier, Springer). His research is and has been funded by the Italian Ministry of Research, CNR, ENEA, the European Community, and private companies. Moreover, he is and has been involved in many Italian and EU research projects, among which PERF, CRESCO, MIA, MOTIA, D-ASAP, MOBI-DEV, AEOLUS, DELIS, and SMS.

Kalinka Regina Lucas Jaquie Castelo Branco is an Assistant Professor at the Institute of Mathematics and Computer Science, University of São Paulo, working in the Department of Computer Systems. She has experience in computer science, with emphasis on distributed computing systems and parallel computing, working mainly in the following areas: distributed systems, computer networks, security, performance evaluation, and process scheduling. She is a member of the Brazilian Computer Society.

Julio Cezar Estrella received the Ph.D. degree in Computer Science in 2010 and the MSc degree in Computer Science in 2006, both at the Institute of Mathematics and Computer Science, University of São Paulo, and the BSc in Computer Science at the State University of São Paulo - Julio de Mesquita Filho in 2002. He has experience in computer science with emphasis on computer systems architecture, working on the following themes: service oriented architectures, Web services, performance evaluation,
distributed systems, computer networks, and computer security. He is an Assistant Professor working in the Department of Computer Systems at the Institute of Mathematics and Computer Science of the University of São Paulo. He is a member of the Brazilian Computer Society, IEEE, and ACM.

Francisco José Monaco, PhD in Electrical Engineering, is a professor at the University of São Paulo, Brazil and a researcher at the Brazilian National Institute of Science and Technology INCT-SEC. At the Department of Computer Systems of ICMC-USP, his main scientific interests are in distributed real-time systems and self-adaptive techniques. He has authored several papers on related subjects and his current research activities include adaptive resource management in service computing, and modeling and simulation of performance and dependability attributes of distributed services.

***

Ahmed Al-Moayed graduated in 2009 from Furtwangen University of Applied Sciences with the degree “Dipl.-Inform (FH)”. From 2007 to 2009, he worked with IBM as a trainee software developer for DB2 Performance Expert and IBM WebSphere Portal. Since 2009, he has been working as a research assistant in the computer science department at Furtwangen University of Applied Sciences, Germany. His main research focus comprises the development of tools for creating QoS-aware Web services, event-based architectures, and software design.

Tom Anderson is the Dean of Business Development for the Science, Agriculture and Engineering Faculty at Newcastle University. He is also Professor of Computing Science and Director of the Centre for Software Reliability (CSR) at the University. In 1971 he joined the academic staff at Newcastle; in 1986 he was appointed to a Chair; from 1992 to 1997 he was Head of Computing Science, and from 1998 to 2002 Dean of Science. During this time he spent a year with NASA and a summer at UCLA, as well as engaging in a range of external consultancy and expert witness work. His research interests are in the area of system dependability, with particular emphasis on requirements, critical systems, and fault tolerance. He has over 145 scientific publications, including the co-authorship or editorship of 27 books. CSR is one of the oldest research units (over 25 years) at Newcastle; research is funded by EPSRC, industry, and the CEC. CSR also coordinates a highly regarded series of industry-oriented technology transfer events.

Jean Arnaud works at Thales. Prior to that, he received his PhD in computer science from Grenoble University and conducted his research at INRIA. His research focuses on modeling, controlling and optimizing distributed services. He received his MS in computer science from Grenoble University in 2007.

Nuno Antunes is a Ph.D. student at the University of Coimbra. He received his B.Sc. and M.Sc. degrees in Informatics Engineering at the Department of Informatics Engineering of the University of Coimbra in 2007 and 2009, respectively. He started his Ph.D. studies in Information Science and Technology in 2009, also at the University of Coimbra. Since 2008 he has been with the Software and Systems Engineering Group (SSE) of the Centre for Informatics and Systems of the University of Coimbra (CISUC), researching topics related to methodologies and tools for the development of non-vulnerable web services. He has authored or co-authored 5 papers in refereed conferences, including the most prestigious conferences in the Dependability and Services areas.
Cesare Bartolini received his master’s degree in IT Engineering at the University of Pisa in 2003. His main research focus after his degree was on real-time systems. From 2004 to 2007 he held a PhD position at Scuola Superiore Sant’Anna in Pisa, researching platform-based design for real-time systems. During this time, he did an internship at United Technologies Research Center in East Hartford, CT, USA, working on real-time modeling projects. After earning his PhD, he started to collaborate with the Software Engineering Lab at ISTI-CNR in Pisa, mainly focusing on web service testing, where he is currently working under a research grant.

Antonia Bertolino (http://www.iei.pi.cnr.it/~antonia/) is a CNR Research Director in ISTI-CNR, Pisa. She is an internationally renowned researcher in the fields of software testing and dependability, and participates in the FP7 projects CHOReOS, TAS3, and CONNECT. She is an Associate Editor of the IEEE Transactions on Software Engineering and the Springer Empirical Software Engineering Journal, and serves as the Software Testing Area Editor for the Elsevier Journal of Systems and Software. She has been the Program Chair of the flagship conference ESEC/FSE 2007. She has (co)authored over 100 papers in international journals and conferences.

Sara Bouchenak has been Associate Professor in computer science at Grenoble University since 2004. She conducts research at INRIA, focusing on highly-available, dependable and manageable distributed systems. She was a Visiting Professor at the Technical University of Madrid, Spain, in 2010, and she worked at EPFL, Switzerland, in 2003. Sara Bouchenak is an ACM EuroSys member, and an officer of the French chapter of ACM-SIGOPS. She received her PhD in computer science from the Grenoble Institute of Technology in 2001, and her MS in computer science from Grenoble University in 1998.

Athman Bouguettaya is a Science Leader at the CSIRO ICT Centre, Canberra, Australia. He was previously a tenured faculty member in the Computer Science Department at Virginia Polytechnic Institute and State University (Virginia Tech). He holds adjunct professorships at several leading Australian universities. He received his PhD degree in Computer Science from the University of Colorado at Boulder (USA). He is on the editorial boards of several journals including the IEEE Transactions on Services Computing, the VLDB Journal, and the Distributed and Parallel Databases Journal. He is also on the editorial board of the Springer-Verlag book series on services science. He was a guest editor of a special issue of the IEEE Transactions on Services Computing on Service Query Models. He has published more than 130 articles in the area of databases and service computing. He is a fellow of the IEEE.

Rajkumar Buyya is Professor of Computer Science and Software Engineering, and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also serving as the founding CEO of Manjrasoft Pty Ltd., a spin-off company of the University, commercializing its innovations in Grid and Cloud Computing. He has authored and published over 300 research papers and four text books. The books on emerging topics that Dr. Buyya edited include High Performance Cluster Computing (Prentice Hall, USA, 1999), Content Delivery Networks (Springer, Germany, 2008), Market-Oriented Grid and Utility Computing (Wiley, USA, 2009), and Cloud Computing (Wiley, USA, 2011).
He is one of the most highly cited authors in computer science and software engineering worldwide. Software technologies for Grid and Cloud computing developed under Dr. Buyya’s leadership have gained rapid acceptance and are in use at several academic institutions
and commercial enterprises in 40 countries around the world. Dr. Buyya has led the establishment and development of key community activities, including serving as foundation Chair of the IEEE Technical Committee on Scalable Computing and of four IEEE conferences (CCGrid, Cluster, Grid, and e-Science). He has presented over 250 invited talks on his vision of IT futures and advanced computing technologies at international conferences and institutions in Asia, Australia, Europe, North America, and South America. These contributions and the international research leadership of Dr. Buyya are recognized through the award of the “2009 IEEE Medal for Excellence in Scalable Computing” from the IEEE Computer Society, USA. Manjrasoft’s Aneka technology for Cloud Computing developed under his leadership has received the “2010 Asia Pacific Frost & Sullivan New Product Innovation Award”.

Yuhui Chen received his PhD in Computer Science from Newcastle University, UK in 2008, after his MSc in Computer Science from the same University (2003). In 2008-2010, he was a postdoc at the Institute for Ageing and Health, Newcastle University, where he was involved in two BBSRC-funded bioinformatics projects for Systems Biology research: BASIS and CaliBayes. He moved to the University of Oxford in 2010 for a postdoctoral post (senior scientific computing in high-throughput sequencing analysis) with The Wellcome Trust Centre for Human Genetics. Dr. Chen has worked in several highly interdisciplinary projects in a variety of areas, from dependability in Service-Oriented Architecture to computational science. His research interests are system dependability, fault tolerance, high performance computing and high throughput computing. His research agenda aims at dependable scientific computing for bioinformatics and the life sciences.

Guglielmo De Angelis received a first-class honors degree in Computer Science in 2003 from the University of L’Aquila and a PhD in Industrial and Information Engineering from Scuola Superiore Sant’Anna of Pisa in 2007. Currently he is a researcher at ISTI-CNR. His research mainly focuses on Model-Driven Engineering and on Service Oriented Architectures. In particular, he is studying approaches for the generation of environments for QoS testing, the generation of adaptable monitoring infrastructures for QoS, analytically predicting performance parameters, empirically testing performance parameters, and testing on-line QoS parameters such as trustworthiness.

Rogério de Lemos has been a Lecturer in Computing Science at the University of Kent since 1999. During 2009, he was Invited Assistant Professor at the Department of Informatics Engineering, University of Coimbra, Portugal. Before joining Kent, he was a Senior Research Associate at the Centre for Software Reliability (CSR) at the University of Newcastle upon Tyne. He was the program committee co-chair for the Latin American Symposium on Dependable Computing 2003 (LADC 2003). He is a member of the Steering Committees of LADC, SEAMS (chair), and ISARCS. He is on the editorial board of the Journal of Hybrid Systems. He has over 50 scientific publications in international journals, book chapters and conferences. He has co-edited seven books on Architecting Dependable Systems, and one book on Software Engineering for Self-adaptive Systems. His main research areas are in software architectures for dependable systems, and self-adaptive software systems.

Paul de Vrieze is a Lecturer in the Software Systems Research Centre of Bournemouth University.
Previously he was a Senior Researcher at SAP Research, Switzerland, and worked as a Postdoctoral Research Fellow at CSIRO ICT Centre, Australia. He received his Ph.D. in Information Systems from
Radboud University Nijmegen, The Netherlands in 2006. His main research interests are in enterprise information system integration, user modelling, systems modelling, adaptive systems, enterprise service design, and semantics integration.

Jing Dong received the BS degree in computer science from Peking University and the PhD degree in computer science from the University of Waterloo. He has been on the faculty of the Computer Science department at the University of Texas at Dallas and consulting in the software industry. His research interests include services computing, formal and automated methods for software engineering, software modeling and design, and visualization. He is a senior member of the IEEE and the ACM.

Vinod K. Dubey is an Associate at Booz Allen Hamilton in McLean, Virginia, USA. He completed his MS in Information Systems and PhD in Information Technology at the Volgenau School of Engineering at George Mason University, Fairfax, Virginia. Currently, he is also an Adjunct Professor in the Computer Science department at George Mason University. His research interests include system performance engineering, services computing, business process management, distributed computing, and QoS management in Service Oriented Architectures.

Anatoliy Gorbenko received the PhD degree in Computer Science from the National Aerospace University, Kharkiv, Ukraine in 2005. He is an Associate Professor at the Department of Computer Systems and Networks of the National Aerospace University in Kharkiv (Ukraine), where he co-coordinates the DESSERT (Dependable Systems, Services and Technologies) research group. His main research interests centre on assessing and ensuring dependability and fault tolerance in service-oriented architectures, investigating web services diversity and exception handling, and on applying these results in real industrial applications. Dr. Gorbenko is a member of EASST (European Association of Software Science and Technology).

Huipeng Guo is a postdoctoral researcher at the School of Economics and Management, Beihang University, Beijing, China. He received his Ph.D. in Computer Software and Theory from Beihang University in 2009. His main research interests include dependable software and service computing.

Bernhard Hollunder received his degree “Dipl.-Inform.” from the University of Kaiserslautern in 1989 and his Ph.D. (Dr. rer. nat.) from the University of Saarland in 1994. From 1995 to 2005 he worked as a technology consultant for software architectures, distributed systems, and application development and led several R&D projects. Since 2005 he has been a professor in the computer science department at Furtwangen University of Applied Sciences, Germany. His research interests include innovative middleware technologies, security in distributed systems, and software architecture. Currently, he leads research projects that focus on comprehensive tool support for the development and monitoring of services with well-defined quality attributes.

Jinpeng Huai is a Professor and President of Beihang University. He serves on the Steering Committee for the Advanced Computing Technology Subject of the National High-Tech Program (863) as Chief Scientist. He is a member of the Consulting Committee of the Central Government’s Information Office, and Chairman of the Expert Committee in both the National e-Government Engineering Taskforce
and the National e-Government Standard Office. He and his colleagues are leading key e-Science projects of the National Science Foundation of China (NSFC) and Sino-UK collaborations. He has authored over 100 papers. His research interests include middleware, peer-to-peer (P2P), grid computing, trustworthiness and security.

Sheridan Jeary is co-Director of the Software Systems Research Centre at Bournemouth University and has research interests that span requirements engineering for Web development, Web systems, and the alignment of Information Technology with business (strategy and process). She had several years of management and systems experience across a variety of domains before moving into academia. She was the BU Project Manager for the EU Commission-funded VIDE project on model-driven development and is currently investigating the production of useful models in complex enterprises.

Mohamed Kaâniche is currently “Directeur de Recherche” of CNRS, the French National Organization of Scientific Research. He has been with the Dependable Computing and Fault Tolerance Research Group of LAAS-CNRS, Toulouse, France, since 1988. From March 1997 to February 1998, he was a Visiting Research Assistant Professor at the University of Illinois at Urbana-Champaign, IL, USA. His research addresses the dependability and security assessment of computer systems and critical infrastructures, using analytical modeling and experimental measurement techniques. He has (co)authored two books on these subjects and more than 100 publications in international journals and conference proceedings. He has contributed to several national and European research contracts, and acted as a consultant for companies in France and as an expert for the European Commission. He served on the organizing and program committees of the major dependability conferences in the area. He was Program Chair of PRDC-2004, EDCC-5, DSN-PDS-2010, and currently LADC-2011.

Karama Kanoun is Directeur de Recherche of CNRS, heading the Dependable Computing and Fault Tolerance Research Group (http://www.laas.fr/~kanoun/). Her research interests include modeling and evaluation of computer system dependability considering hardware as well as software, and dependability benchmarking. She has (co)authored more than 150 conference and journal papers, 5 books and 10 book chapters. She has co-directed the production of a book on Dependability Benchmarking (Wiley and IEEE Computer Society, 2008). She is vice-chair of the IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance. She is a member of the Editorial Boards of IEEE Transactions on Dependable and Secure Computing, the International Journal of Performability Engineering, and the International Journal of Critical Computer-Based Systems. She is chairing the Steering Committee of the European Dependable Computing Conference and is a member of the Steering Committees of DSN, ISSRE, and SSIRI. She has been involved in several national and European research projects.

Vyacheslav Kharchenko received his PhD in Technical Science from the Military Academy named after Dzerzhinsky (Moscow, Russia) in 1981 and the Doctor of Technical Science degree from the Kharkiv Military University (Ukraine) in 1995. He is a Professor and head of the Computer Systems and Networks Department at the National Airspace University, Ukraine. He is also a senior research investigator in the field of safety-related software at the State Science-Technical Centre of Nuclear and Radiation Safety (Ukraine).
He has published nearly 200 scientific papers, reports and book chapters, more than 500 inventions, and is the co-author or editor of 28 books. He has been the head of the DESSERT International Conference (http://www.stc-dessert.com) from 2006 to 2010. His research interests include critical computing, dependable and safety-related I&C systems, multi-version design technologies, and software and FPGA-based systems verification and expert analysis.
national Conference (http://www.stc-dessert.com) in 2006-2010. His research interests include critical computing, dependable and safety-related I&C systems, multi-version design technologies, software and FPGA-based systems verification and expert analysis. Dong Seong Kim received the B.S. degrees in Electronic Engineering from Korea Aerospace University, Republic of Korea in 2001. And he received M.S. and Ph.D. degree in Computer Engineering from Korea Aerospace University, Republic of Korea in 2003, 2008, respectively. And he was a visiting researcher in University of Maryland at College Park, USA in 2007. Since June 2008, he has been a postdoctoral researcher in Duke University. His research interests are in dependable and secure systems and networks, in particular, intrusion detection systems, wireless ad hoc and sensor networks, virtualization, and cloud computing system dependability and security modeling and analysis. Nuno Laranjeiro is a Ph. D. student at the University of Coimbra. He received his B.Sc. and M.Sc. Degrees in Informatics Engineering at the Department of Informatics Engineering of the University of Coimbra in 2007. He started his Ph.D. studies in Information Sciences and Technologies in 2007, also at the University of Coimbra. Since 2006 he has been with the Software and Systems Engineering Group (SSE) of the Centre for Informatics and Systems of the University of Coimbra (CISUC), researching topics related to dependability in web services. Nuno Laranjeiro has authored nearly 20 papers in the services computing and dependability area, has participated in national and European research projects and also acted as referee for several international conferences and journals in the dependability and web services areas. Peng Liang is an associate professor in the State Key Lab of Software Engineering (SKLSE) at Wuhan University. His research interests include software engineering and collaborative business process. He received his PhD in computer science from the Wuhan University. He is a member of the IEEE, ACM, and CCF (China Computer Federation). Contact him at the State Key Lab of Software Engineering, Wuhan University, Wuhan, P.R. China. Qianhui Liang holds a PhD in Computer Engineering from the University of Florida, USA. She has published in journals and conferences like IEEE Transactions on Services Computing, IEEE Transactions on Systems, Man and Cybernetics, International Journal of Web Services Research, and Knowledge and Information Systems. She served as PC vice chairs, publicity chairs and PC members of a number of international conferences, including IEEE International Conference on Information Reuse and Integration, IEEE International Conference on Services Computing, IEEE International Conference on Cloud Computing, and International Conference on Service Oriented Computing. Her research interests are services computing and cloud computing. She is a researcher at HP Labs, Singapore. Xudong Liu is a Professor in the School of Computer Science and Engineering, Beihang University. He received his PhD from Beihang University in 2008. His research interests include SOA, web service and middleware.
Francesca Lonetti received the PhD degree in Computer Science from the University of Pisa in December 2007, working on RFID systems and multimedia transmission over wireless networks. Currently, she is a post-doctoral research fellow at Scuola Superiore Sant’Anna in Pisa, and she collaborates with the Software Engineering Lab at ISTI-CNR in Pisa. Her current research interests focus on software and security testing, in particular test case derivation from XML schemas and XACML policies.

Paulo R. M. Maciel graduated in Electronic Engineering in 1987, and received his MSc and PhD degrees in Electronic Engineering and Computer Science, respectively, from the Universidade Federal de Pernambuco. He was a faculty member of the Electric Engineering Department of the Universidade de Pernambuco from 1989 to 2003. Since 2001 he has been a member of the Informatics Center of the Universidade Federal de Pernambuco, where he is currently Associate Professor. He is a research member of the Brazilian research council (CNPq) and an IEEE member. His research interests include Petri nets, formal models, performance and dependability evaluation, and power consumption analysis. He has acted as a consultant and as a coordinator of research projects funded by companies such as HP, EMC, CELESTICA, FOXCONN, ITAUTEC, and CHESF.

Henrique Madeira is a full professor at the University of Coimbra, where he has been involved in research on dependable computing since 1987. His main research interests focus on experimental evaluation of dependable computing systems, fault injection, error detection mechanisms, and transactional systems dependability, subjects on which he has authored or co-authored more than 100 papers in refereed conferences and journals. He has coordinated or participated in tens of projects funded by the Portuguese government and by the European Union. Henrique Madeira was the Vice-Chair of the IFIP Working Group 10.4 Special Interest Group (SIG) on Dependability Benchmarking from the establishment of the SIG in the summer of 1999 until 2002. He has organized several workshops and scientific events and was the Program Co-Chair of the International Performance and Dependability Symposium track of the IEEE/IFIP International Conference on Dependable Systems and Networks, DSN-PDS 2004. He was appointed Conference Coordinator for DSN 2008. He has also acted as a referee for many international conferences and journals and has served on the program committees of the major conferences of the dependability and database areas.

Eda Marchetti (http://www1.isti.cnr.it/ERI/eda_marchetti/) is a researcher at CNR-ISTI. She graduated summa cum laude in Computer Science from the University of Pisa (1997) and received a PhD from the same university (2003). Her research activity focuses on software testing, in particular: developing automatic methodologies for testing; defining approaches for scheduling testing activities; implementing UML-based tools for test case generation; and defining methodologies for test effectiveness evaluation. She has served as a reviewer for several international conferences and journals, and has been part of the organizing and program committees of several international workshops and conferences.

Magnos Martinello completed his Ph.D. in 2005 at INPT (Institut National Polytechnique de Toulouse), with research carried out at LAAS-CNRS (Laboratoire d’Analyse et d’Architecture des Systèmes), France. He is currently an Associate Professor at the Federal University of Espírito Santo in Brazil. His main research interests include performability evaluation of web-based applications and services, dependability evaluation of mobile and dynamic systems, and performance evaluation of multimedia distributed systems.
Rivalino Matias Jr. received his B.S. (1994) in informatics from the Minas Gerais State University, Brazil. He earned his M.S. (1997) and Ph.D. (2006) degrees in computer science and in industrial and systems engineering, respectively, from the Federal University of Santa Catarina, Brazil. In 2008 he was with the Department of Electrical and Computer Engineering at Duke University, Durham, NC, working as a research associate under the supervision of Dr. Kishor Trivedi. He also works with IBM Research Triangle Park on research related to embedded system availability and reliability analytical modeling. He is currently an Associate Professor in the Computing School at the Federal University of Uberlândia, Brazil. Dr. Matias has served as a reviewer for IEEE Transactions on Dependable and Secure Computing, the Journal of Systems and Software, and several international conferences. His research interests include reliability engineering applied to computing systems, software aging theory, dependability analytical modeling, and diagnosis protocols for computing systems.

Daniel A. Menascé is the Senior Associate Dean at the Volgenau School of Engineering and a Professor of Computer Science at George Mason University, Virginia, USA. He received a Ph.D. degree in Computer Science from UCLA, and an MS in Computer Science and a BSEE, both from the Pontifical Catholic University in Rio de Janeiro (PUC-RIO), Brazil. He is a Fellow of the ACM, a Senior Member of the IEEE, and an elected member of IFIP’s Working Group 7.3. The Computer Measurement Group (CMG) selected him as the recipient of the 2001 A.A. Michelson Award for “outstanding contributions to computer metrics.” In 2009 he was inducted as an honorary member of the Golden Key International Honour Society. He has published over 200 technical papers and was the chief author of five books published by Prentice Hall. Menascé is a member of the editorial board of Elsevier’s Performance Evaluation Journal. His areas of interest include autonomic computing, e-commerce, performance modeling and analysis, service-oriented architectures, and software performance engineering.

Salah Merad worked as a research associate in the Department of Computing Science at Newcastle University; his research was on the application of decision theory to software engineering problems. Before joining the department, he taught statistics and operational research at the Universities of Plymouth and Newcastle. Merad received a first degree in mathematics and operational research from the university USTHB in Algeria, and a PhD in applied probability from the University of Bristol. He is currently a principal methodologist at the UK Office for National Statistics, working on sampling and estimation problems in surveys.

Michele Nogueira is a professor at the Department of Informatics of the Federal University of Paraná. She received her Ph.D. in Computer Science from the Université Pierre et Marie Curie, Laboratoire d’Informatique de Paris 6 (LIP6), and her M.Sc. in Computer Science from the Federal University of Minas Gerais, Brazil, in 2004. She has worked in the security area for many years; her interests include security, wireless networks, intrusion tolerance, and dependability. She is a member of the IEEE Communications Society (ComSoc) and the Association for Computing Machinery (ACM).

Michael Parkin obtained a Ph.D. in Computer Science from the University of Manchester in 2008.
He then worked at the Barcelona Supercomputing Centre as a CoreGRID Industrial Fellow before joining the European Research Institute in Service Science (ERISS), hosted at Tilburg University, the Netherlands, as a post-doctoral researcher. Michael works in the EC’s S-Cube Network of Excellence, performing foundational research into all aspects of large-scale service-based systems, and in particular into the creation, management, and monitoring of Service Level Agreements (SLAs).
Keith Phalp is Associate Dean: Head of Computing & Informatics at Bournemouth University. He is particularly interested in the early, most crucial phases of software projects, and in how best to produce software that meets the needs of its sponsors, stakeholders, and users. This involves a variety of topics, including understanding business needs (strategic and operational), process modelling, software requirements, business and IT alignment, and software modelling.

Andrea Polini received a first-class honors degree in Computer Science in 2000 from the University of Pisa and a PhD in Computer Engineering in 2004 from Scuola Superiore Sant’Anna, Pisa. Currently, he is an Assistant Professor at the Computer Science Department of the University of Camerino. His research interests are mainly related to verification and testing of complex software systems, in particular in relation to component-based software systems and service-oriented applications.

Guy Pujolle received his Ph.D. in Computer Science from the University of Paris IX in 1975. He is currently a Professor at the University Pierre et Marie Curie and a member of the Scientific Advisory Board of the France Telecom Group. Pujolle is chairman of the IFIP Working Group on “Network and Internetwork Architectures”. His research interests include wireless networks, security, protocols, high performance networking, and intelligence in networking.

Alexander Romanovsky received a PhD degree in Computer Science from St. Petersburg State Technical University. In 1992-1998 he was involved in the Predictably Dependable Computing Systems ESPRIT Basic Research Action and the Design for Validation ESPRIT Basic Project. In 1998-2000 he worked on the Diversity in Safety Critical Software EPSRC/UK Project. Prof. Romanovsky was a co-author of the Diversity with Off-The-Shelf Components EPSRC/UK Project and was involved in this project in 2001-2004. In 2000-2003 he was on the executive board of the Dependable Systems of Systems IST Project. In 2004-2007 he coordinated the FP6 IST Rigorous Open Development Environment for Complex Systems Project. Prof. Romanovsky is now the Coordinator of the major FP7 Integrated Project on Industrial Deployment of System Engineering Methods Providing High Dependability and Productivity (DEPLOY, 2008-2012). His main research interests are system dependability, fault tolerance, software architectures, exception handling, error recovery, system structuring, and verification of fault tolerance.

Douglas Rodrigues received the M.Sc. degree in Computer Science and Computational Mathematics from the Institute of Mathematics and Computer Science, University of São Paulo, and the B.Sc. in Computer Science from Univem, Marília/SP, in 2008. His main research topics are SOA, web services, performance evaluation, computer security, encryption, and digital signatures.

Antonino Sabetta is a researcher at SAP Research, Sophia-Antipolis, France. Before joining SAP in October 2010, he was a researcher at CNR-ISTI, Pisa, Italy. Antonino received his PhD in Computer Science and Automation Engineering from the University of Rome Tor Vergata, Italy, in June 2007. He had previously earned his Laurea cum laude degree in Engineering in 2003 from the same university.
Antonino’s research interests cover various topics in software engineering, including automated techniques, based on model-driven engineering, to achieve predictive analysis of non-functional properties of software systems; online monitoring; and the application of model-driven techniques to various design and analysis problems. Antonino serves regularly as a reviewer for the major journals and conferences on software engineering; he is on the organizing committee of international schools (MDD4DRES, Modeling Wizards) as well as a program committee member of several international conferences and workshops.

Aldri Santos is a professor at the Department of Informatics of the Federal University of Paraná and was a visiting researcher at the Department of Computer Science of the Federal University of Ceará. Aldri received his Ph.D. in Computer Science from the Department of Computer Science of the Federal University of Minas Gerais, Belo Horizonte, Brazil, and both his M.Sc. and B.Sc. in Informatics from the Federal University of Paraná, Curitiba, Brazil. He is a member of the Brazilian Computing Society (SBC) and the IEEE Communications Society (ComSoc).

Hailong Sun is an Assistant Professor with the School of Computer Science and Engineering, Beihang University, Beijing, China. He received his Ph.D. in Computer Software and Theory from Beihang University, and his B.S. degree in Computer Science from Beijing Jiaotong University in 2001. In 2008, he worked for AT&T Labs-Research as a visiting scholar for about half a year. His research interests include web services, service oriented computing, and distributed systems. He is a member of the IEEE and the ACM.

Longji Tang received his ME in Computer Science & Engineering and MA in Applied Mathematics from Penn State University in 1995. He is a PhD candidate in Software Engineering in the Department of Computer Science at the University of Texas at Dallas. His research interests include software architecture and design, service-oriented architecture, enterprise service computing and applications, cloud service computing, system modeling, and formalism. He was a senior software consultant who worked on several large software projects at Caterpillar and IBM. He is now a senior technical advisor at FedEx IT as well as a leader and/or architect in several critical e-commerce projects.

Kishor S. Trivedi holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke University, Durham, NC. He has been on the Duke faculty since 1975. He is the author of the well-known text Probability and Statistics with Reliability, Queuing and Computer Science Applications, published by Prentice-Hall; a thoroughly revised second edition (including its Indian edition) of this book has been published by John Wiley. He has also published two other books, Performance and Reliability Analysis of Computer Systems (Kluwer Academic Publishers) and Queueing Networks and Markov Chains (John Wiley). He is a Fellow of the Institute of Electrical and Electronics Engineers and a Golden Core Member of the IEEE Computer Society. He has published over 420 articles and has supervised 42 Ph.D. dissertations. He is on the editorial boards of IEEE Transactions on Dependable and Secure Computing, the Journal of Risk and Reliability, the International Journal of Performability Engineering, and the International Journal of Quality and Safety Engineering. He is the recipient of the IEEE Computer Society Technical Achievement Award for his research on software aging and rejuvenation. His research interests are in reliability, availability, performance, performability, and survivability modeling of computer and communication systems. He works closely with industry in carrying out reliability/availability analysis, providing short courses on reliability, availability, and performability modeling, and in the development and dissemination of software packages such as SHARPE and SPNP.
Marco Vieira is an Assistant Professor at the University of Coimbra, Portugal, and an Adjunct Associate Teaching Professor at Carnegie Mellon University, USA. Marco Vieira is an expert on experimental dependability and security assessment and benchmarking. His research interests also include robustness assessment and improvement in SOA, fault injection, security in database systems, software development processes, and software quality assurance, subjects in which he has authored or co-authored more than 70 papers in refereed conferences and journals. He has participated in several research projects at both the national and European level. Marco Vieira has served on the program committees of the major conferences of the dependability and database areas and has acted as a referee for many international conferences and journals in those areas. He currently has several projects with industry in areas such as service-oriented architectures, databases, decision support systems, and software development processes.

Lai Xu is a lecturer in the Software Systems Research Centre at Bournemouth University. Previously she was a senior researcher at SAP Research, Switzerland; a research leader of the data management group and a senior research scientist at the CSIRO ICT Centre, Australia; and a post-doctoral researcher at the Institute of Information and Computing Sciences of Utrecht University and at the Department of Computer Science of the Free University Amsterdam, the Netherlands. She received her Ph.D. in Computerized Information Systems from Tilburg University, the Netherlands, in 2004. Her research interests include service-oriented computing, enterprise systems, business process management, and support for business process collaboration and virtual enterprises.

Alexander Wahl received his degree “Dipl.-Inform.” from the University of Ulm in 2004. From 2004 to 2007 he worked as a software engineer for surgical navigation systems. From 2007 to 2009 he held a position as a research assistant for protective lung ventilation at the University Medical Center Freiburg, Germany. Since 2009 he has been a research assistant in the computer science department at Furtwangen University of Applied Sciences, Germany. His research interests are distributed systems, quality attributes, software architectures, and software engineering in medical technologies. His current activities include the development of services with well-defined quality attributes as well as the improvement of corresponding tool support.

Linlin Wu is a PhD candidate under the supervision of Professor Rajkumar Buyya in the CLOUDS Laboratory at the University of Melbourne, Australia. She received the Master of Information Technology from the University of Melbourne and then worked for CA (Computer Associates) Pty Ltd as a Quality Assurance Engineer, before joining the National Australia Bank (NAB) as a Knowledge Optimization Officer. At the University of Melbourne she has been awarded an APA scholarship supporting her PhD studies. She received the Best Paper Award from the AINA 2010 conference for her first publication. Her current research interests include Service Level Agreements, QoS measurement, resource allocation, and market-oriented cloud computing.
Jin Zeng is a development and testing engineer at the China Software Testing Center (CSTC) in Beijing, China. He received his Ph.D. in Computer Software and Theory from the School of Computer Science and Engineering of Beihang University in 2010. His research interests include service computing, dynamic evolution of composite services, business process management, and workflow evolution. He is now engaged in research on cloud testing and middleware testing technologies.

Yajing Zhao received the B.S. degree in computer science from Nankai University in 2005. She received the M.S. and Ph.D. degrees in software engineering from the University of Texas at Dallas in 2007 and 2010, respectively. She is a software engineer working on real-time multi-threaded systems. Her research interests include software architecture, design patterns, loosely coupled software designs, algorithms and performance, web services, semantic web services, ontology, cyber-physical systems, real-time systems, and network security.
Index
A Abstract Business Process 152-157, 159, 161-166, 168, 171 Abstract Service Design Language (ASDL) 297, 315 Acunetix Web Vulnerability Scanner 375, 404-405, 410, 423 AJAX (Asynchronous Javascript and XML) 117-118 AMNESIA 375, 377, 393, 398, 425 Analytic Hierarchy Process (AHP) 182-184, 188 Aneka 10, 16, 22 Aspect Oriented Programming (AOP) 322-323, 326, 329, 336, 338, 395 Availability (A) 1-10, 12-16, 18-51, 53-56, 58-61, 63-77, 79-81, 83-132, 134-146, 148-221, 223-224, 226-236, 238-241, 243-315, 317-358, 360-379, 381-417, 419-426
B Bargaining Games 172, 176, 178, 185 Benchmarking 265-267, 278, 284, 292, 374, 379, 402-403, 415, 417-423, 425 Best-Effort Fault Tolerant Routing (BFTR) 350 Biology of Ageing E-Science Integration and Simulation System (BASIS) 4, 15, 20, 49, 123, 125, 131, 157, 174, 179, 185, 239, 250, 267, 271-280, 282-284, 288, 291, 293, 310, 365, 384 Business Process Execution Language (BPEL) 32, 103, 115, 131, 139, 171, 244, 327, 335, 369-370, 382, 384-386, 398-401 Business Process Modeling Notation (BPMN) 153, 169, 171 Bytecode Instrumentation 338, 375
C Capacity Planning 212-213, 219-220, 223-230, 232, 236, 239-241, 264
Category Partition (CP) 378, 387-390, 423 Certificate Authority (CA) 22, 24-25, 93, 187-188, 241, 263-264, 353, 423 Cloud Computing 1, 3, 8, 14-16, 18-19, 21-22, 24, 26-28, 31, 33-46, 48, 50-51, 114, 239, 243, 263, 291 Coercive Parsing 367 Collaborative Business Process modeling 116, 126 Combined Utility Function (CUF) 182, 184 Component Services Cache (S-Cache) 300 Composite Services Cache (CS-Cache) 300 Computing Infrastructures 1 Content Management Systems (CMS) 125 CP algorithm 389-390 Cross-Site Scripting (XSS) 371-372, 377, 404, 423 Customer Relationship Management (CRM) 117, 119, 125
D Database Management System (DBMS) 286, 331, 336 Data Encryption Standard (DES) 97, 362 DB Query (Database Query) 326 Decision Maker’s (DM) 173, 176-177, 180, 182-184, 215, 229 Denial of Service (DoS) 342, 349-351, 355, 367-369 Dependability 18, 27, 45, 53-55, 61, 63-66, 70, 76-77, 81, 83-84, 86-88, 95-98, 152-153, 159, 172, 174, 189, 209, 243-247, 257, 263-267, 271, 274, 278-279, 283-285, 291-293, 295-303, 306, 309-310, 313-314, 320-321, 336, 340, 342, 344-345, 348, 379, 415, 425 dependability modeling 53-54, 86, 95, 247 Development Tools 189-190, 199 Dijkstra’s Shortest Path Computation Algorithm 318, 328, 339 Diversity Coding approach 353
DNA Data Bank of Japan (DDBJ) 267-270 Dynamo-AOP 107, 114
E ebXML 122, 132, 138 ebXML business process specification schema (ebXML BPSS) 122, 132 Enterprise Cloud Service Architecture (ECSA) 26-27, 30, 33-36, 42, 46-48, 51 Enterprise Service Bus (ESB) 107, 138 Enterprise Service Computing (ESC) 26-28, 30-34, 45-47 Enterprise Service-Oriented Architecture (ESOA) 26-27, 30, 33-36, 39-40, 42, 46-48, 51 EU project FAST 125 EU project SOA4All 125 Extended Kalman Filter (EKF) 302 eXtensible Access Control Markup Language (XACML) 211, 361, 381-382, 394-400 extreme value theory 54
F False Positive 333, 408-409, 412, 420, 422, 425 fault tree analysis (FTA) 55 fault trees (FT) 55-56, 66, 70-73, 86 Federal No-Fly-List 143 Finite State Processes (FSP) 382, 385 Flow-based Route Access Control (FLAC) 350 F-Measure 417-419, 421, 425 Fortify 360 373, 377, 405, 423 Foundstone WSDigger 375, 377, 404, 423 Fuzzing Techniques 374, 393, 403
G Game Theory 172, 176, 186 Genetic Algorithm 158, 161-163, 165-166, 168, 170 GlassFish 195, 198, 202-203, 205, 209, 272 Google 16, 39, 121, 124, 131, 167, 406, 425 Gridbus Broker 10 Grid Computing 1, 3, 8, 10, 12-15, 21-25, 30, 32, 48, 50, 168, 170, 187, 229, 240, 272, 283-284, 291, 377
H HAF algorithm 303, 305 Hard real-time 320 Heuristic Service Selection 136, 139, 148 High Availability First 305 Hill-climbing 139, 238
HP WebInspect 375, 378, 404-405, 410, 423 HTTP (hypertext transfer protocol) 21, 24, 48-50, 114-115, 122, 131-132, 167, 169, 189, 192, 204, 211, 240-241, 267, 284, 288, 292-294, 319, 335-336, 362, 368-371, 374, 377-380, 391, 397-401, 403-404, 406, 423-424 Hybrid Redundancy 299, 303, 313
I IBM Rational AppScan 375, 378, 404-405, 410, 423 Individual Utility Functions 183 Integrated Development Environments (IDEs) 189-191, 195-196, 198-200, 209, 405 Intermediate Format Language (IF) 109, 385, 399 International Moving Services (IMS) 118-119, 128, 130 Internet Service 49, 212-217, 219-221, 223-224, 226-230, 232-234, 236, 238-239, 270, 273, 292 Intrusion Detection Systems (IDS) 343-345, 349, 369 Intrusion Tolerance (IT) 9, 14, 22-23, 27, 48, 88, 117, 123, 126, 128-129, 290, 292, 344, 346, 351, 355, 367, 382
J JBoss 114, 195, 198, 326
K KAF algorithm 303 KAF architecture 300 Key Performance Indicator (KPI) 2, 33, 36 Key Quality Indicators (KQI) 32-33, 36
L Lines of Code (LoC) 373, 411-412, 417 Lines of Code per Operation (LoC/Op) 411
M Malicious- and Accidental-Fault Tolerance for Internet Applications (MAFTIA) 344, 358 MANET 351 Market Turbulence 116-117 Markov chain 54-55, 86-87, 95 Mashup 116-131 Maximum Weighted Product (MWP) 183-184 Mean Time To Failure 61, 89, 94, 245, 261
Mean Value Analysis (MVA) 221 Message Sequence Charts (MSC) 383, 385, 398 Minimum Weighted Sum of Ranks (MWSR) 183-184 Model Driven Architecture (MDA) 196, 211 Model Driven Development (MDD) 196 MoKa 219-220, 228-229, 234-236, 238-239 moving company 120 MPL limit 215, 221, 223 Multi-Attribute Optimisation 172 Multi Attribute Utility Theory (MAUT) 157-158, 161, 168 Multidimensional QoS 154, 156, 159, 161-162, 168
N Nash solution 173, 178-180, 184-185 NetBeans 190-191, 195-196, 200, 202-205, 209, 285, 288, 294
O Object Constraint Language (OCL) 206-207, 211 Object-Oriented Programming (OOP) 335, 338 Open Grid Forum (OGF) 10, 21, 30, 48, 114, 168 Open Source Library for OCL (OSLO) 207 Operations 8, 18, 22-23, 30-32, 34, 46, 48, 125, 141, 152, 158-159, 180, 184, 188, 240, 252, 267-268, 285, 307-309, 311, 318-319, 321, 323, 326-327, 329, 334, 338, 341-342, 344-346, 348-352, 367, 369, 372, 374, 376, 383-384, 389, 393, 410-411 Organization Based Access Control (OrBAC) 394-395, 399 Organization for the Advancement of Structured Information Standards (OASIS) 101, 115, 153, 169, 190, 192, 195, 197, 199, 208, 211, 344, 356, 361, 364-365, 379, 384, 391, 394, 400 Oversize Cryptography 368 Oversize Payload 367-368
P Parasoft 383, 400 PeekaCity 121 Penetration Testing 361, 374, 402-407, 409-415, 418-419, 422-425 Performability Preference 220, 228 Performance Overhead 331 Precision 183, 207, 331, 336, 417-419, 421, 425 Pretty Good Privacy (PGP) 353
Process-oriented Mashup 116-118, 122-125, 127-128, 130-131 Procurement 125 Product Line Architectures (PLAs) 157 Protocol State Machine (PSM) 383 Public Utilities 1, 159, 175, 177, 180, 182-184
Q QoS-aware Services 189, 199, 209 QoS Broker (QB) 32, 134-135, 137-142, 145, 147-151, 174, 188, 246 QoS Distributions 141 QoS Level 29 QoS Metric 31, 151 QoS Profile Compatibility 31 Quality of Service (QoS) 1-2, 4, 6, 8, 12, 18-34, 36-38, 40, 42, 46-52, 101, 103, 106-107, 114, 134-143, 146, 148-163, 165, 167-169, 171, 173-175, 178, 185, 187-194, 196-211, 239-241, 244-246, 260, 262-264, 266, 278, 294, 297, 300, 313-315, 320, 346-347 Quality of Service (QoS) parameters 1-2, 18 Quantitative Assessment 243
R Real-time Computing 336, 338 Recall 417-419, 421, 425-426 Redundancy 53, 55, 63, 290-293, 295-301, 303-306, 309-314, 336, 341, 344-346, 349, 352-354 Reflected XSS 371 reliability block diagrams (RBD) 55, 66, 69-70, 74, 79, 82, 84, 86 Request Processing Time (RPT) 267, 274, 276, 279-284 Response Time (RT) 19, 27, 29, 31-32, 45-46, 134-143, 151, 193, 197-198, 210, 223, 246, 265-270, 274-284, 291-292, 320, 322, 336-337 Robustness Testing 361, 374 Role-Based Access Control (RBAC) 382, 394 Round Trip Time (RTT) 267-271, 273, 276, 279-284 Rubinstein’s Alternating Offers protocol 10
S SABER 348-349 Sale 119, 125 Secure Ad hoc On-Demand Distance Vector (SAODV) 349 Secured Data based Multi Path (SDMP) 352
Secure Protocol for Reliable Data Delivery (SPREAD) 176, 352, 357 Secure Routing Protocol (SRP) 349 Secure Sockets Layer (SSL) 361-364, 369, 377, 391 Security Assertion Markup Language (SAML) 211, 361, 394, 400 Security Aware Routing (SAR) 349 Security Vulnerabilities 373, 376, 402-403, 405, 415, 423 Service Availability 217, 239, 245-246, 254, 260, 293, 300, 303-304, 367 Service Composition 32, 43, 48, 50-51, 98, 136-137, 149-151, 155, 168, 170-171, 174, 186-188, 263, 292, 294-301, 304-305, 309, 313-314, 316, 327, 382, 385, 388 Service Computing 26-28, 31-34, 45-46, 48, 88, 90, 117, 189, 295, 340, 354 Service Information Manager (SIM) 300, 303 Service Level Agreement (SLA) 1-10, 12-16, 18-24, 26-41, 43-51, 106, 110, 153-154, 157, 197, 211-214, 219-220, 223-224, 227-228, 232, 234, 236, 238-239 Service Level Management (SLM) 26-27, 30, 32-38, 40-41, 43-47, 51 Service Level Objectives (SLO) 2, 5, 9, 27-28, 33, 36, 214, 220, 224, 227, 241 Service Oriented Architecture (SOA) 26-30, 33, 35-36, 39-40, 42-48, 50, 98-107, 109-115, 117-118, 125, 129, 134, 141, 148, 151, 169, 189, 197, 209-211, 265-267, 271-272, 279, 283-284, 289-292, 295-296, 360-361, 367-368, 376, 378-379, 381-382, 384, 386, 396-397, 401 Service-oriented business process 116, 131 Service-oriented computing 1, 22, 25, 50, 132, 170, 172-173, 176, 188, 211, 243, 264, 336 Service Oriented Coverage Testing (SOCT) 386-387 Service Oriented Infrastructure (SOI) 18-19, 30, 32, 50 Service Provider Selection 137, 148, 151 Service Providers (SP) 1, 3-4, 6, 8, 13, 15-16, 18, 20, 26-30, 34, 38, 44-46, 105, 108, 128-129, 134-136, 138-139, 141-145, 148-149, 151, 153, 171, 174, 244, 246, 253, 273, 292, 297, 316, 319, 386 Service Quality Dimension 171 Service Redundancy (SR) 293, 295, 297, 299-300, 303, 313 Services selection 150-151, 170, 172, 178, 188, 406 Shortest Path (ShP) 318, 324, 328, 339, 350 Simple Object Access Protocol (SOAP) 28, 105-106, 189, 192, 198, 200, 203, 206, 211, 243, 268, 277, 285-286, 317-319, 335, 338, 360-365, 367-371, 380, 388-389, 391, 393, 400, 404, 424, 426 Simulated Annealing 153, 158, 161-166, 168-170 Simultaneous Crawl and Audit (SCA) 33, 404 SITAR 348, 358 SLA constraints 213, 219, 224, 228, 234 SLA latency 223 SLA lifecycle 1, 3, 6-7, 9-10, 14, 16, 18, 20, 23, 47-48 SLA management (SLM) 2-4, 8, 10, 15-16, 18-21, 27-28, 30, 32-35, 37-41, 43-51 SLAngMon 107 SLA-oriented management systems 3 SLA@SOI consortium 32 SLA violation 3, 5-6, 8-10, 13-15, 18-21, 228 SOA monitoring 98, 101, 106 SOAP Web Services 338 SOA Test Governance (STG) 99, 104, 109-114 SOA validation 98-99, 101, 105, 109, 112-113, 396 Soft real-time 320 Source Code Editor 199-202, 204 Spoofing Attacks 369 SQL Injection 366, 369, 371, 375, 393, 398, 404, 407-410, 412, 415-417, 420-424 SSL Change Cipher protocol 363 Static Analysis 375, 378, 393, 398-399, 402, 404, 411, 413-415, 417, 419-420, 422-424 Static Code Analysis 361, 373, 375, 403, 405, 410, 412-414, 418, 423-424, 426 Stochastic Modeling 243 Stochastic Petri nets 55, 66, 86-87, 96, 244, 249, 263 Storage Area Network (SAN) 22, 29, 90-91, 241, 263, 423 Stored XSS 371 Sun Java System (SJS) 284-286 Sun Microsystems Internet Data Center Group 2-3, 6 Support Vector Machines (SVM) 321, 335 Survivability 340-341, 344-349, 351, 353-358 Symbolic Transition System (STS) 103, 383 System of Services 177 Systems Biology Markup Language (SBML) 273, 279
Simple Object Access Protocol (SOAP) 28, 105106, 189, 192, 198, 200, 203, 206, 211, 243, 268, 277, 285-286, 317-319, 335, 338, 360365, 367-371, 380, 388-389, 391, 393, 400, 404, 424, 426 Simulated Annealing 153, 158, 161-166, 168-170 Simultaneous Crawl and Audit (SCA) 33, 404 SITAR 348, 358 SLA constraints 213, 219, 224, 228, 234 SLA latency 223 SLA lifecycle 1, 3, 6-7, 9-10, 14, 16, 18, 20, 23, 47-48 SLA management (SLM) 2-4, 8, 10, 15-16, 18-21, 27-28, 30, 32-35, 37-41, 43-51 SLAngMon 107 SLA-oriented management systems 3 SLA@SOI consortium 32 SLA violation 3, 5-6, 8-10, 13-15, 18-21, 228 SOA monitoring 98, 101, 106 SOAP Web Services 338 SOA Test Governance (STG) 99, 104, 109-114 SOA validation 98-99, 101, 105, 109, 112-113, 396 Soft real-time 320 Source Code Editor 199-202, 204 Spoofing Attacks 369 SQL Injection 366, 369, 371, 375, 393, 398, 404, 407-410, 412, 415-417, 420-424 SSL Change Cipher protocol 363 Static Analysis 375, 378, 393, 398-399, 402, 404, 411, 413-415, 417, 419-420, 422-424 Static Code Analysis 361, 373, 375, 403, 405, 410, 412-414, 418, 423-424, 426 Stochastic Modeling 243 Stochastic Petri nets 55, 66, 86-87, 96, 244, 249, 263 Storage Area Network (SAN) 22, 29, 90-91, 241, 263, 423 Stored XSS 371 Sun Java System (SJS) 284-286 Sun Microsystems Internet Data Center Group 2-3, 6 Support Vector Machines (SVM) 321, 335 Survivability 340-341, 344-349, 351, 353-358 Symbolic Transition System (STS) 103, 383 System of Services 177 Systems Biology Markup Language (SMBL) 273, 279
T Tag-Aware text file Fuzz testing Tool (TAFT) 393
475
Index
Techniques for Intrusion-resistant Ad Hoc Routing Algorithms (TIARA) 349-350, 358 Test-driven Development (TDD) 374, 377, 379 Testing Driver Service (TDS) 105 Test Suites 103-104, 384, 388-391 TimeExceededException 325 TimeExceededRuntimeException 325 Timing Failure 291, 318, 320, 322-323, 325, 328329, 333-334, 338 TPC-W benchmark 229 Transport Layer Security (TLS) 361-362, 364, 369, 377, 391 Transport Security Administration (TSA) 143 Travel Agency (TA) 243-245, 249-254, 256, 263264, 388 Two-Level Redundancy 295, 297-299, 313
U Unified Modeling Language (UML) 12, 196-197, 200, 209-211, 383, 398 Uniform Resource Locator (URL) 28, 169, 192, 369, 371-372 Universal Description, Discovery and Integration (UDDI) 19, 28, 101, 105, 114-115, 138, 156, 167, 174, 285, 317, 319, 335, 338, 361, 424 User Level Modeling 250 Utility Architecture 3-4 Utility Computing 1-4, 8-10, 12-15, 19-21, 24-25, 50, 197, 211 Utility Function 134-138, 140, 143, 151, 157-159, 163, 175-178, 182-184, 212-213, 219-220, 223, 239, 246 Utility Threshold (UThreshold) 143-145, 147-148
V Value Trade-offs 173, 176-177, 182, 184 VE-IMS 119-121 Virtual Enterprise (VE) 118-119, 132 VOC 298 Vulnerability 360, 367-369, 371, 374-375, 392-393, 400, 402-405, 407-408, 410-412, 414-415, 417418, 420, 422-426 Vulnerability Detection Coverage 426
W Web applications 29, 203, 240, 243, 245, 248, 334, 367-368, 371-374, 377, 382, 393, 403-404, 410-411, 425
476
Web Service Compositions (WSC) 28, 124, 293, 314-315, 384-385, 396, 398 Web Service Level Agreement (WSLA) 3, 10, 1213, 15, 20, 23, 27, 30-31, 37, 41, 47, 49, 197, 211 Web Service Offering Language (WSOL) 12, 24, 31, 41, 47, 51 Web Services Dependability Assessment Tool (WSsDAT) 274 Web Services Description Language (WSDL) 28, 31, 37, 103-104, 118, 123, 131, 174, 189, 191192, 198, 202, 204, 206-207, 285, 287, 317, 319, 335, 360-361, 369, 382-384, 388-389, 392-393, 396, 401, 406, 424 Web Services Interoperability Project (WSIT) 204 Web Services Timing Failures Detection and Prediction (wsTFDP) 318, 321-324, 326-335 Web Services (WS) 10, 12, 15, 19, 21-24, 27-32, 34, 38, 46, 48-52, 96, 102, 113-115, 117-118, 120-121, 123-125, 130-132, 137, 149-151, 168-170, 174, 186-192, 195-205, 208-211, 240, 243-244, 246-247, 252, 254, 257-259, 261-268, 270-280, 282-294, 296-298, 314-323, 328, 330331, 334-336, 338, 360-361, 363-369, 371-372, 374-387, 391-393, 396-411, 413-417, 420-426 WebSphere 39, 97, 122, 195, 198, 267, 284-287 Weighted Scoring Method (WSM) 182 Widgets 117-118, 120-121, 125, 130 Willow 348, 357 Wired Equivalent Privacy (WEP) 352-353 Wireless Self-organized Networks (WSONs) 340342, 346, 349, 354-355 Wireless Sensor Networks (WSNs) 349, 358 Workflow Management Service (WfMS) 32 WorldTravel 387-388, 391, 401 WS-Authorization 366 WS-Federation 365 WSMO-lite 118, 123, 130, 132 WS-Policy 30, 189-192, 194-195, 197-208, 210, 365, 377 WS-Privacy 365 WS-SecureConversation 365, 379 WS-Security 30, 190, 204, 211, 285, 361, 363-365, 368, 370, 378-379, 391, 400 WS-SecurityPolicy 190, 195, 202, 204, 365, 368, 379 WS-Spoofing 369 WS-TAXI 384, 386-390, 397 wsTFDP framework 322 WS-Trust 365, 379
X X-CREATE 395 XML (eXtensible markup language) 10, 12, 30-31, 47, 131, 191-192, 194-195, 200, 204-205, 267, 273, 294, 319, 361, 363-365, 367, 369-370, 377-380, 383-384, 387, 389, 391-394, 397, 399, 407 XML schema definition (XSD) 387, 389, 394
Y Yahoo 121, 124, 132
Z Zillow 121