Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
6153
Paulo Borba Ana Cavalcanti Augusto Sampaio Jim Woodcock (Eds.)
Testing Techniques in Software Engineering
Second Pernambuco Summer School on Software Engineering, PSSE 2007
Recife, Brazil, December 3-7, 2007
Revised Lectures
Volume Editors
Paulo Borba, Universidade Federal de Pernambuco, Centro de Informática, CEP 50732-970, Recife, PE, Brazil. E-mail: [email protected]
Ana Cavalcanti, University of York, Department of Computer Science, Heslington, York YO10 5DD, UK. E-mail: [email protected]
Augusto Sampaio, Universidade Federal de Pernambuco, Centro de Informática, CEP 50732-970, Recife, PE, Brazil. E-mail: [email protected]
Jim Woodcock, University of York, Department of Computer Science, Heslington, York YO10 5DD, UK. E-mail: [email protected]
Library of Congress Control Number: 2010929777
CR Subject Classification (1998): D.2.4, D.2, D.1, F.3, K.6.3, F.4.1
LNCS Sublibrary: SL 2 – Programming and Software Engineering
ISSN 0302-9743
ISBN-10 3-642-14334-2 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-14334-2 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
The Pernambuco School on Software Engineering (PSSE) 2007 was the second in a series of events devoted to the study of advanced computer science and to the promotion of international scientific collaboration. The main theme in 2007 was testing. Testing is nowadays a key activity for assuring software quality. The summer school and its proceedings were intended to give a detailed tutorial introduction to the scientific basis of this activity and its state of the art.

These proceedings record the contributions from the invited lecturers. Each of the chapters is the result of a thorough revision of the initial notes provided to the participants of the school. The revision was inspired by the synergy generated by the opportunity for the lecturers to present and discuss their work among themselves and with the school's attendees. The editors have tried to produce a coherent view of the topic by harmonizing these contributions, smoothing out differences in notation and approach, and providing links between the lectures. We apologize to the authors for any errors introduced by our extensive editing.

Although the chapters are linked in several ways, each one is sufficiently self-contained to be read in isolation. Nevertheless, Chap. 1 should be read first by those interested in an introduction to testing. Chapter 1 introduces the terminology adopted in this book. It also provides an overview of the testing process, and of the types (functional, structural, and so on) and dimensions (unit, integration, and so on) of the testing activity. The main strategies employed in the central activity of test selection are also discussed. Most of the material presented in this introductory chapter is addressed in more depth in the following chapters.

Chapter 2 gives an overview of the foundations and practice of testing. It covers functional, structural, and fault-based testing, and discusses automation extensively. Evaluation and comparison are based on experimentation, with emphasis on mutation testing.

Chapter 3 discusses test-case generation and selection for reactive systems. The focus is on model-based approaches. A particular tool, Target, and a case study using mobile-phone applications are used for extensive illustration of techniques, strategies, and algorithms. Modelling is an expensive activity in this context, and this chapter also discusses the automatic generation of models from requirements.

Chapter 4 discusses the main challenges of testing software product lines. It presents several techniques and an overall process for testing, in a productive way, products that have related functionalities but specific behavior differences. This chapter considers testing and validation not only of code but of other artifacts as well.
Chapter 5 describes how to design, implement, and test software using the methodology of Parameterized Unit Testing. This is supported by the tool Pex, an automated test input generator, which leverages dynamic symbolic execution to test whether the software under test agrees with the specification. Pex produces a small test suite with high code coverage from Parameterized Unit Tests. The basic concepts are introduced and the techniques are illustrated with some examples. Deterministic, single-threaded applications are assumed.

Chapter 6 summarizes the experience of a leading tool vendor and developer. It addresses the issues governing the design and use of both static and dynamic testing tools. It argues that a well-designed modern tool relies on an underlying mathematical theory: there is now a convergence between formal methods and principled testing.

Chapter 7 presents a generic framework for developing testing methods based on formal specifications, and its specialization to several formal approaches: Finite State Machines, Algebraic Specifications, Input-Output Transition Systems, and Transition Systems with Priorities. Assuming some testability hypothesis on the system under test, a notion of exhaustive test suite is derived from the semantics of the formal notation and from the definition of correct implementation. Then, a finite test suite can be selected from the exhaustive one via some selection hypotheses.

Chapter 8 revisits fault-based and mutation testing. It gives a foundational account of this technique using a relational model of programming based on refinement, namely, that of Hoare and He's Unifying Theories of Programming (UTP). The theory suggests and justifies novel test-generation techniques, which are also discussed. Tool support for the presented techniques is considered.

We are grateful to the members of the Organizing Committee, who worked very hard to provide an enjoyable experience for all of us. Without the support of our sponsors, PSSE 2007 could not have been a reality. Their recognition of the importance of this event for the software engineering community in Latin America is greatly appreciated. We would also like to thank all the lecturers for their invaluable technical and scientific contribution, and for their commitment to the event; the effort of all authors is greatly appreciated. Finally, we are grateful to all the participants of the school. They are the main focus of the whole event.
March 2010
Paulo Borba Ana Cavalcanti Augusto Sampaio Jim Woodcock
Organization
PSSE 2007 was organized by the Centro de Informática, Universidade Federal de Pernambuco (CIn/UFPE), Brazil, in cooperation with the University of York, UK.
Executive Committee
Paulo Borba, CIn/UFPE, Managing Director and Guest Editor
Ana Cavalcanti, University of York
Augusto Sampaio, CIn/UFPE
Jim Woodcock, University of York
Sponsoring Institutions
Formal Methods Europe
Sociedade Brasileira de Computação, Brazil
United Nations University, Macau
Universidade Federal de Pernambuco (CIn/UFPE), Brazil
University of York, UK
Acknowledgements

Auri Vincenzi, Márcio Delamaro, Erika Höhn, and José Carlos Maldonado would like to thank the Brazilian funding agencies CNPq, FAPESP, and CAPES, and the QualiPSo Project (IST-FP6-IP-034763), for their partial support of the research they report in Chap. 2.

Patricia Machado and Augusto Sampaio would like to emphasize that most of their chapter covers results achieved from a research cooperation in software testing between Motorola Inc., CIn-UFPE, and UFCG. They would like to thank the entire group for all the support, effort, criticisms, and suggestions throughout this cooperation. In particular, their chapter is based on joint papers with Gustavo Cabral, Emanuela Cartaxo, Sidney Nogueira, Alexandre Mota, Dante Torres, Wilkerson Andrade, Laisa Nascimento, and Francisco Oliveira Neto.

John McGregor would like to thank Kyungsoo Im and Tacksoo Im for their work on implementations and John Hunt for the implementations from his dissertation.

Nikolai Tillmann, Jonathan de Halleux, and Wolfram Schulte would like to thank their past interns and visiting researchers Thorsten Schuett, Christoph Csallner, Tao Xie, Saswat Anand, Dries Vanoverberghe, Anne Clark, Soonho Kong, Kiran Lakhotia, Katia Nepomnyashchaya, and Suresh Thummalapenta for their work and experiments to improve Pex, Nikolaj Bjørner and Leonardo de Moura for their work on the constraint solver Z3, the developers and testers of .NET components and Visual Studio within Microsoft for their advice, and all users of Pex for giving feedback.

Marie-Claude Gaudel's text is extracted from or inspired by previous articles co-authored by the author. She would particularly like to thank Pascale Le Gall for Sect. 2, Richard Lassaigne, Michel de Rougemont, and Frédéric Magniez for Sect. 3, Perry James and Grégory Lestiennes for Sects. 4 and 5, and Ana Cavalcanti for new exciting work on testing CSP refinements.

Bernhard K. Aichernig's work was carried out as part of the EU-funded research project in Framework 6: IST-33826 CREDO (Modeling and analysis of evolutionary structures for distributed services). His Theorem 12 was contributed by He Jifeng, then at UNU-IIST. Several people contributed to the implementations of the theories discussed in Sect. 6 of his chapter. The OCL test-case generator was implemented by Percy Antonio Pari Salas during his fellowship at UNU-IIST. The Spec# test-case generator is an idea of Willibald Krenn, TU Graz, and was realized in the EU project in Framework 7: ICT-216679 MOGENTES (Model-based Generation of Tests for Dependable Embedded Systems). The first LOTOS mutation-testing case study was carried out by Carlo Corrales Delgado during his fellowship at UNU-IIST. The more recent achievements in the mutation testing of protocols are the work of Martin Weiglhofer, TU Graz.
Table of Contents
Software Testing: An Overview ....... 1
  Patrícia Machado, Auri Vincenzi, and José Carlos Maldonado
Functional, Control and Data Flow, and Mutation Testing: Theory and Practice ....... 18
  Auri Vincenzi, Márcio Delamaro, Erika Höhn, and José Carlos Maldonado
Automatic Test-Case Generation ....... 59
  Patrícia Machado and Augusto Sampaio
Testing a Software Product Line ....... 104
  John D. McGregor
Parameterized Unit Testing with Pex: Tutorial ....... 141
  Nikolai Tillmann, Jonathan de Halleux, and Wolfram Schulte
Software Tool Issues ....... 203
  Michael Hennell
Software Testing Based on Formal Specification ....... 215
  Marie-Claude Gaudel
A Systematic Introduction to Mutation Testing in Unifying Theories of Programming ....... 243
  Bernhard K. Aichernig
Author Index ....... 313
Software Testing: An Overview

Patrícia Machado (1), Auri Vincenzi (3), and José Carlos Maldonado (2)

(1) Universidade Federal de Campina Grande, Brazil, [email protected]
(2) Universidade de São Paulo, Brazil, [email protected]
(3) Universidade Federal de Goiás, Brazil, [email protected]
The main goal of this chapter is to introduce common terminology and concepts on software testing that are assumed as background in this book. The chapter also presents the multidimensional nature of software testing, showing its different variants and levels of application. After a brief introduction, Section 2 presents a set of basic definitions used in the remainder of this book. Section 3 gives an overview of the essential activities and documents involved in most test processes. Section 4 discusses the kinds of properties we may want to test, including functional, non-functional, and structural properties. In Section 5, we discuss the various dimensions of software testing, covering unit, integration, system, and acceptance testing. Section 6 highlights that different domains have demanded effort from the research community to work on tailored strategies; we discuss object-oriented, component-based, product-line, and reactive-systems testing. Test selection is a main activity of a test process, and we discuss the main strategies in Section 7. We conclude this introduction in Section 8 with some final considerations.
1 Introduction

In recent years the interest in and the importance of software testing have grown, mainly due to the rising demand for higher software quality. Shull et al. [298] present a discussion and an interesting synthesis of the current knowledge on software defect reduction. One point they comment on concerns the software modules that contribute most to defects. They warn that, during development, virtually no modules are defect-free when implemented and that about 40% of the modules may be defect-free after their release. Therefore, as was pointed out by Boehm and Basili [47], it is almost impossible to deliver a software product free of defects. Moreover, it is important to observe that the later a fault is detected, the greater the cost of its correction (see Figure 1). In 1987, Boehm [48] evaluated the relative cost of defect correction against the development phase in which the defect is detected and concluded that, from requirements to maintenance, the cost-escalation factor ranges from 1 to 100 (Figure 1(a)), where 1 is the relative cost in the requirements phase and 100 the cost in the maintenance phase. More recently, Boehm and Basili [47] provided new and less dramatic figures: the cost-escalation factor for small, noncritical software systems is more likely to be 5:1 than 100:1 (Figure 1(b)).
Phase                 (a) Boehm, 1987 [48]   (b) Boehm and Basili, 2001 [47]
Requirement           0.5-1                  0.5-1
Project               2.5                    1.5
Coding                5                      2.0
Unit Testing          10                     3.0
Acceptance Testing    25                     4.0
Maintenance           100                    5.0

Fig. 1. Cost-escalation factor of defect correction
However, even a factor of 5 can represent a high cost rate, which emphasizes the need to bring verification and validation (V&V) activities earlier into development. Software development methodologies like Extreme Programming (XP) [86] and Test-Driven Development (TDD) [15] implement this practice.

Software testing is a dynamic activity in the sense that it requires a product that can be executed or simulated with a given input and can provide an observable output. Its main goal is to reveal the existence of faults in the product under testing (PUT) during its different development phases. We use the term "product" instead of "program" or "system" because the concepts hereby presented can be equally applied, without loss of generality, to products from the specification to the implementation level. There are other, complementary, static verification and validation activities, such as formal reviews, which should be used in conjunction with testing in order to detect a larger number of faults as early as possible (see Chapter 4 and [245], for example).

Despite its fundamental importance, software testing is known as one of the most expensive activities in the development process; it can take up to 50% of the total cost of software projects [157]. Besides serving its main objective of revealing faults, the data collected during testing is also important for debugging, maintenance, reliability assessment, and software process improvement. Experimental software engineering [196,299] is a recent field of study that aims at improving the current state of the art and practice by providing evidence on the costs and benefits of methods, techniques, criteria, and tools in different application domains. Such evidence is relevant for establishing effective and efficient verification, validation, and testing strategies that combine the benefits of these different testing methods, techniques, criteria, and tools [26,300,187,229,64].
2 Basic Terminology

To decide whether the actual output obtained by exercising a PUT with a given input is correct against the product specification, an "oracle" is needed; it is in general represented by a domain expert who is able to determine whether such an input has revealed a fault or not. Once a fault is detected, the testing activity is usually interrupted and the debugging activity takes place; its objective is to identify, from the incorrect output, the point in the product where the defect is located. Once the fault (the cause) is located and corrected, the testing activity is resumed.
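As an illustration of the oracle concept, the sketch below is a minimal, hypothetical model (the interface and class names are not from this chapter): an oracle is simply a predicate that compares the output actually produced by the PUT with the output prescribed by the specification S for the same input.

// Hypothetical sketch of an oracle: it decides whether an observed output is
// correct with respect to the specification for a given input.
public interface Oracle<I, O> {
    boolean accepts(I input, O actualOutput);
}

// Example: an oracle for a factorial method, using a trusted reference
// computation in the role of the specification S(d).
class FactorialOracle implements Oracle<Integer, Long> {
    public boolean accepts(Integer input, Long actualOutput) {
        long expected = 1;
        for (int k = 2; k <= input; k++) {
            expected *= k;                    // reference computation of input!
        }
        return expected == actualOutput;      // verdict: pass (true) or fail (false)
    }
}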
 1  public class Factorial {
 2      public static long compute(int x)
 3              throws NegativeNumberException {
 4          if (x < 0) {                        // Should be (x >= 0)
 5              long r = 1;
 6              for (int k = 2; k <= x; k++) {
 7                  r += k;                     // Should be r *= k;
 8              }
 9              return r;
10          } else {
11              throw new NegativeNumberException();
12          }
13      }
14  }

Listing 1.1. Factorial class and the compute method
The IEEE carries out many standardization efforts. One of them is the Standard Glossary of Software Engineering Terminology, IEEE Standard 610.12-1990 (R2002) [177]. This glossary differentiates the following terms.

1. Mistake: a human action that produces an incorrect result. Example: an incorrect action taken by the programmer.
2. Fault: an incorrect step, process, or data definition in a computer program. In common usage, the terms "error", "bug", and "defect" are used to express this meaning. Example: an incorrect instruction or statement.
3. Error: the difference between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition.
4. Failure: the inability of a system or component to fulfill its required functions within specified performance requirements.

Example 1. To illustrate these concepts, we consider the statement (z = y + x). We suppose that, because of a programmer's mistake, it is changed to (z = y - x), thus characterizing a fault. If such a fault is activated (executed) with x = 0, regardless of the value of y, no incorrect output is produced because, although the defect is activated, the fault does not lead to an error and no failure occurs. For any other value, x ≠ 0, the fault activation causes an error on the variable z. Such an error, when propagated to the product output, will result in a failure.

In general, faults can be classified as computational faults (the fault causes an incorrect computation, but the sequence of executed statements corresponds to the expected one) and domain faults (the executed sequence of statements does not correspond to the expected one). In both cases, there exist faults that are data sensitive, those that can only be detected by test cases with specific input values, and faults that are not data sensitive, those that are detected by any test case that executes the faulty statement, regardless of the actual input value. The following example illustrates these kinds of faults.

Example 2. We consider a method (of some Java program) that should compute the factorial of a given integer number (Listing 1.1). The first fault, located at line 4 (it should be x >= 0), is responsible for a domain fault because, regardless of the value of x, a wrong
set of statements is executed. The second fault (supposing the first one is corrected), located at line 7 (it should be r *= k), is responsible for a computational fault since, except for negative values and for x = {0,1,3}, the method computes the factorial function incorrectly. We observe that this is also an example of a data-sensitive fault, since for x = 3 the fault is activated but the correct value 3! = 6 is computed. Therefore, in order to detect the fault, a value of x different from {0,1,3} should be used.

Despite its considerable importance, the testing activity is permeated by a series of limitations [175,283,157]. In general, the following problems are undecidable and represent such limitations.

– Correctness: there is no general-purpose algorithm to prove the correctness of a product.
– Equivalence: given two programs, whether they are equivalent or not; or given two paths (sequences of statements), whether they compute the same function or not.
– Executability: given a path (sequence of statements), whether there exists input data that can execute it or not.
– Coincidental Correctness: a product can coincidentally present a correct result for a given input data d ∈ D because one fault masks the error of another.

All these limitations bring important considerations in the context of software testing, mainly the impossibility of fully automating all necessary testing activities. Considering product correctness, it is said that a product P with input domain D is correct with respect to its specification S if S(d) = P(d) for any data d belonging to D, that is, if the behavior of the product is in accordance with the expected behavior (with respect to the specification) over the entire input domain. Figure 2 illustrates the input and output domains of a given product; observe that both can be infinite and, in such a case, it is not possible to execute P with its entire input domain. Given two products, P1 and P2, if P1(d) = P2(d) for every d ∈ D, it is said that P1 and P2 are equivalent. Since there is no general-purpose algorithm to decide the correctness of a product, as discussed above, a common assumption in software testing is the existence of an oracle, which can determine, for any input data d ∈ D, whether S(d) = P(d), within a reasonable limit of time and effort. An oracle simply decides whether output values are correct against what is specified.
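To make the data-sensitive fault of Example 2 concrete, the JUnit 4 tests below sketch how the choice of test data determines whether the fault is revealed. The Factorial class and NegativeNumberException come from Listing 1.1 (with only the first fault corrected); the test class itself is hypothetical and assumes that faulty version is on the classpath.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical test driver for the compute method of Listing 1.1,
// assuming only the fault at line 4 has been corrected.
public class FactorialTest {

    @Test
    public void inputThreeDoesNotRevealTheFault() throws NegativeNumberException {
        // Coincidentally correct: the faulty loop yields 1 + 2 + 3 = 6 = 3!
        assertEquals(6L, Factorial.compute(3));
    }

    @Test
    public void inputFourRevealsTheFault() throws NegativeNumberException {
        // Fails against the faulty version, which returns 1 + 2 + 3 + 4 = 10 instead of 24.
        assertEquals(24L, Factorial.compute(4));
    }
}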
Fig. 2. Input and output domains of a given program (the product maps an input domain to an output domain, and both may be infinite)
 1  public class Factorial {
 2      public static long compute(int x)
 3              throws NegativeNumberException {
 4          if (x >= 0) {
 5              long r = 1;
 6              for (int k = 1; k <= x; k++) {  // int k = 2;
 7                  r *= k;
 8              }
 9              return r;
10          } else {
11              throw new NegativeNumberException();
12          }
13      }
14  }

Listing 1.2. Functionally equivalent compute method
Example 3. As an example of equivalent products, we consider the version of the Factorial program in Listing 1.2. The only difference with respect to the previously mentioned corrected version (the one without the two discussed faults) is the assignment in line 6, which initializes k with 1 instead of 2. Although this version performs one additional iteration when calculating the factorial function, the resulting value is always the same, regardless of the value of x. Thus, this version is equivalent to the correct version.

Example 4. To illustrate the executability problem, we consider another method of the Identifier program (see Listing 1.3), namely validateIdentifier. More details about this program are given in Section 2 of Chapter 2. We also consider a given testing criterion that requires the statements of this method to be executed in such a way that the statements in lines 15 and 21 are covered by the same test case. The problem is that the statement in line 15 indicates that an invalid identifier was detected, whereas the statement in line 21 indicates the opposite. Therefore, considering the given implementation, executing the statements in lines 15 and 21 in the same execution is impossible, that is, this testing requirement is infeasible.

Coverage analysis is mainly used when conducting and evaluating the testing activity. It basically consists of determining the percentage of the testing requirements established by a testing criterion that were exercised by the test suite. (We employ the term "test suite" for a set of test cases.) Based on this information, the test suite can be improved by adding new test cases that aim at exercising those requirements not yet covered. In this perspective, the knowledge that some required elements may be infeasible is crucial, and in general the determination of the feasibility of a testing requirement demands tester intervention.

An exhaustive test executes the PUT with all possible values from its input domain (see Figure 2). It is well known that an exhaustive test is not feasible or practical. This is because, in general, it is impossible to execute a product with all its possible input values (the entire input domain), simply because the input domain can be infinite. Even for finite but large input domains, the exhaustive test can be prohibitive due to time and cost constraints. The only option available is to use only a part of the input domain to execute the PUT. The question to be answered is which values should be selected to maximize the chance of detecting faults. Each of these values is used to create what is called a test case.
 4  public boolean validateIdentifier(String s) {
 5      char achar;
 6      boolean valid_id = false;
 7      achar = s.charAt(0);
 8      valid_id = valid_s(achar);
 9      if (s.length() > 1) {
10          achar = s.charAt(1);
11          int i = 1;
12          while (i < s.length() - 1) {
13              achar = s.charAt(i);
14              if (!valid_f(achar))
15                  valid_id = false;
16              i++;
17          }
18      }
19
20      if (valid_id && (s.length() >= 1) && (s.length() < 6))
21          return true;
22      else
23          return false;
24  }

Listing 1.3. validateIdentifier method
The following example is an adaptation of the one provided by Dijkstra [105] in "On the reliability of mechanisms", where he demonstrates the general impossibility of an exhaustive test and states the famous corollary: "Program testing can only be used to show the presence of bugs, but never to show their absence!"

Example 5. We consider a simple Java method which takes two arguments of primitive double type, each one 64 bits long. This method has a clearly finite input domain with 2^128 elements (2^64 * 2^64), considering all possible combinations. We also assume that we are running this method on a machine capable of performing 2^40 instructions per second. On this machine, which has a processor speed compatible with those commonly found currently, this method will take 2^128/2^40 = 2^88 ≈ 10^26 seconds to be completed. Since one year corresponds to approximately 10^7 seconds, the test will only finish after 10^26/10^7 = 10^19 years, clearly an infeasible deadline. Moreover, it is important to remember that there is no general-purpose algorithm that can be used to prove the correctness of a product.
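As a back-of-the-envelope check of the arithmetic in Example 5, the sketch below (the class name is hypothetical; double precision is more than enough for an order-of-magnitude estimate) reproduces the calculation:

// Rough check of Example 5: 2^64 * 2^64 input combinations executed at 2^40
// instructions per second, converted to years.
public class ExhaustiveTestEstimate {
    public static void main(String[] args) {
        double inputs = Math.pow(2, 64) * Math.pow(2, 64);    // 2^128 input combinations
        double instructionsPerSecond = Math.pow(2, 40);       // assumed machine speed
        double seconds = inputs / instructionsPerSecond;      // 2^88, about 3.1e26 seconds
        double years = seconds / 3.15e7;                      // one year is about 3.15e7 seconds
        System.out.printf("%.1e seconds, about %.1e years%n", seconds, years);
        // Prints roughly 3.1e+26 seconds, about 9.8e+18 years, i.e., on the order of 10^19 years.
    }
}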
Even though in general it is not possible to prove product correctness with testing, the test, when conducted in a systematic and clear-sighted way, helps to increase the confidence that the product behaves according to its specification, and also to establish some minimal characteristics of product quality.

Two important questions arise in the context of software testing: "How is the selection of test data performed?" and "How may one decide when a product P has been sufficiently tested?". Testing criteria for test-suite selection and evaluation are crucial to the success of the testing activity. Such criteria provide an indication of how test cases should be selected in order to increase the chances of detecting faults or, when no faults are found, to establish a high level of confidence in correctness. A testing criterion is used to help the tester subdivide the input and output domains and provides a systematic way to select a finite number of elements to compose a test suite. The objective is to create the smallest test suite for which the output indicates the largest set of faults. For instance, a test suite selected according to a given testing criterion corresponds to a finite set of points in the input domain of Figure 2.

The simplest way to represent a test case is as a tuple (d, S(d)), where d ∈ D and S(d) represents the corresponding expected output for the input d according to specification S. A more detailed definition states that a test case is a 4-tuple (pre-conditions, input, expected output, execution order). We observe that this definition, resulting from the combination of the test-case definitions provided by Copeland [85] and McGregor [245], includes the former. The pre-conditions establish the constraints that must hold before the input is provided. The input is a given element from the input domain taken to execute the PUT. The expected output represents what should be the correct output according to the product specification. Finally, the execution order can be classified as "cascade" or "independent". It is cascade when the order in which each test case is executed is important, that is, there is a correct order for executing the test cases because each test case assumes, as a pre-condition, the state resulting from the execution of the previous ones. On the other hand, if a test case is independent, the order in which it is executed does not matter. As a consequence of this order of execution, cascade test cases are in general shorter than independent test cases. However, once a cascade test case fails, it is probable that all subsequent cascade test cases will also fail, due to the dependence between them.

Given a product P and a test suite T, we adopt the following terminology in relation to testing criteria.

– Adequacy criterion of test case: predicate to evaluate T when testing P; and
– Selection criterion of test case: procedure to choose test cases in order to test P.

Goodenough and Gerhart [144] define a software testing adequacy criterion as "a predicate that defines what properties of a program must be exercised to constitute a 'thorough' test". There is a strong correspondence between selection methods and adequacy criteria because, given an adequacy criterion C, there exists a selection method MC which establishes: select T such that T is adequate according to C. Analogously, given a selection method M, there exists an adequacy criterion CM which establishes: T is adequate if it was selected according to M. In this way, it is common to use the term "adequacy criterion" (or simply "testing criterion") to refer to selection methods as well [283,228].

Testing techniques and criteria have been defined to systematize the testing activity. A testing technique defines the source of information which will be used, and a testing criterion derives a set of testing requirements from such a source of information. Figure 3 depicts the relations between the several concepts presented so far. As can be observed, once a testing requirement is obtained, it can be used both for test-case generation and for test-case evaluation (coverage analysis). Given P, T, and a criterion C, it is said that the test suite T is C-adequate to test P if T contains test cases which, when executed against P, satisfy all the testing requirements established by C. As an example, we consider a given testing criterion that requires the execution of each statement of a unit under testing. This criterion, called All-Statements, generates a list of all unit statements as testing requirements.
Based on such a list, the tester tries to find inputs that execute each specific statement. Alternatively, the tester can be given several test cases and, in this case, needs to know whether such test cases fulfill all the testing requirements demanded by the All-Statements criterion.
Fig. 3. Our view of the relations between software testing definitions: a testing technique defines a source of information, testing criteria derive testing requirements from it, and the requirements are used for test-case generation and evaluation
In the first case, the tester is using the testing requirements to generate test cases. In the second case, the tester is using the testing requirements to evaluate already existing test cases. Once a test suite T satisfies all the requirements of a given testing criterion C, we say that such a test suite is adequate for that criterion, or simply that T is C-adequate.

Every time a test case is executed against a PUT, some statements of the product are executed while others are not. All the executed parts are said to be covered by the test case, and such coverage can be used as a measure of test-case quality. Based on the control flow or the data flow of the PUT, different coverage measures can be defined. For instance, considering the criterion mentioned above, a given test suite T1 can cover 70% of the unit statements while a test suite T2 covers 100% of the statements (T2 is All-Statements-adequate). We observe that the concept of coverage does not apply only to source code. There may exist a functional criterion which requires specific parts of the product specification, not of its implementation, to be covered.

It is important to observe that, from a given test suite T that is C-adequate, it is possible to obtain, in theory, an infinite number of C-adequate test suites simply by including more test cases in T, that is, any test suite T′ ⊇ T is also C-adequate. Obviously T′ will contain redundant test cases in relation to the elements required by C, and this is not always desired, due to time and cost constraints on executing the complete test suite. There are also certain situations where we would like to find a minimum test suite Tm such that Tm is C-adequate. This process is called test-suite minimization [352]. As defined by Wong et al. [352], if T is a test suite for P, then |T| denotes the number of elements in T. If C is a testing adequacy criterion, C(T) denotes the set of testing requirements of P covered by T. A suite Tm is a minimal subset of T in terms of the number of test cases if, and only if, C(Tm) = C(T) and, for all T′ ⊆ T such that C(T′) = C(T), |Tm| ≤ |T′|. In more informal terms, considering a test suite T which covers a set of testing requirements demanded by C, a minimal test suite Tm corresponds to the minimum subset of T which also covers the same set of testing requirements, and there is no other subset of T with fewer elements than Tm covering the same set of testing requirements. As is also mentioned by Wong et al. [352], the relative "cost" of a given test suite can be calculated in different ways. In the definition presented above we use the number of test cases, but the purpose of minimization is to reduce some associated cost of the test suite, for instance, the computation time needed to execute it, instead of its number of elements.
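A simple greedy heuristic conveys the idea behind test-suite reduction. The sketch below is hypothetical (it is not the minimization technique of [352], and greedy selection does not guarantee a truly minimum suite): each test case is represented by the set of testing requirements it covers, and test cases are picked until the coverage of the original suite is matched.

import java.util.*;

// Greedy sketch of test-suite reduction: repeatedly pick the test case that covers
// the most still-uncovered requirements until the reduced suite covers C(T).
public class GreedyReduction {

    public static List<String> reduce(Map<String, Set<String>> coverage) {
        Set<String> toCover = new HashSet<>();
        coverage.values().forEach(toCover::addAll);       // C(T): everything covered by the full suite
        List<String> reduced = new ArrayList<>();
        while (!toCover.isEmpty()) {
            String best = null;
            int bestGain = 0;
            for (Map.Entry<String, Set<String>> e : coverage.entrySet()) {
                Set<String> gain = new HashSet<>(e.getValue());
                gain.retainAll(toCover);                  // requirements this test still adds
                if (gain.size() > bestGain) {
                    best = e.getKey();
                    bestGain = gain.size();
                }
            }
            reduced.add(best);
            toCover.removeAll(coverage.get(best));
        }
        return reduced;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> coverage = Map.of(
                "t1", Set.of("s1", "s2"),
                "t2", Set.of("s2", "s3"),
                "t3", Set.of("s1", "s2", "s3"));          // t3 alone already covers every statement
        System.out.println(reduce(coverage));             // prints [t3]
    }
}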
Another relevant question in this context is: given a C1-adequate test suite T, is there another criterion C2 which contributes to improving the quality of T? This question has been investigated from both theoretical and experimental perspectives. In general, we may state that the minimal properties which a testing criterion C should fulfill are as follows [228].

1. To guarantee, from the control-flow perspective, the coverage of all conditional deviations.
2. To require, from the data-flow perspective, at least one use of every computational result.
3. To require a finite test suite.

Advantages and disadvantages of testing criteria can be evaluated through theoretical and experimental studies. From a theoretical point of view, the subsume relation and the complexity of the testing criteria are the most investigated aspects [339,283,175]. The subsume relation establishes a partial order between testing criteria, characterizing a hierarchy among them. It is said that a criterion C1 subsumes a criterion C2 if, for every program P, any C1-adequate test suite T1 is also C2-adequate, and there is some program P and C2-adequate test suite T2 such that T2 is not C1-adequate. Section 3.2 of Chapter 2 presents a subsume relation that considers the main control- and data-flow testing criteria. Complexity is defined as the maximum number of test cases required to satisfy a criterion in the worst case. Considering the data-flow criteria, studies have shown that they have an exponential complexity, which motivates experimental studies capable of determining their application cost in practice [228]. Some authors have also explored the efficacy of testing criteria from a theoretical point of view. They have worked on the definition of different relations between criteria to capture their capability to detect faults, since this capability cannot be expressed by the subsume relation [340,124].

From an experimental point of view, three aspects of testing criteria are commonly evaluated: cost, efficacy, and strength [341,239,266]. The cost reflects the effort required to use the criterion; it is in general measured by the number of test cases needed to satisfy the criterion. The efficacy corresponds to the capability of the criterion to detect faults. The strength refers to the probability of satisfying a given criterion C2 after satisfying a criterion C1 [239]. An important research area known as Experimental Software Engineering has emerged in an attempt to provide evidence of the advantages and disadvantages of methods, methodologies, techniques, and tools used during software development processes. Section 8 of Chapter 2 also provides some information on Experimental Software Engineering in the context of verification and validation activities.

Evaluation of the efficacy of test suites is commonly carried out using mutation testing. It appeared in the 1970s at Yale University and at the Georgia Institute of Technology, strongly influenced by a classical method for digital circuit testing known as the "single fault test model" [126]. One of the first papers describing mutation testing was published in 1978 [101]. This criterion uses a set of products that differ slightly from the product P under testing, named mutants, in order to evaluate the adequacy of a test suite T. The goal is to find a set of test cases which is able to reveal the differences between
P and its mutants, making them behave differently. When a mutant is identified as behaving differently from P, it is said to be "dead"; otherwise, it is a "live" mutant. A live mutant must be analyzed in order to check whether it is equivalent to P or whether it can be killed by a new test case, thus promoting the improvement of T. Mutants are created based on mutation operators: rules that define the (syntactic) changes to be carried out in P in order to create the mutants. It is widely known that one of the problems with mutation testing is the high cost of executing a large number of mutants. Moreover, there is also the problem of deciding mutant equivalence, which in general is undecidable.
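To make the idea concrete, the sketch below (the class and method names are hypothetical) applies a typical mutation operator, relational operator replacement, to the loop condition of the corrected compute method of Listing 1.2. A test case with x = 2 kills the mutant, since original(2) = 2 while mutant(2) = 1; test cases with x = 0 or x = 1 leave it alive, because both versions produce the same output.

// The original (corrected) factorial computation and a mutant produced by a
// relational operator replacement: "k <= x" is mutated into "k < x".
public class FactorialMutant {

    static long original(int x) {
        long r = 1;
        for (int k = 1; k <= x; k++) r *= k;
        return r;
    }

    static long mutant(int x) {
        long r = 1;
        for (int k = 1; k < x; k++) r *= k;   // mutated loop condition
        return r;
    }

    public static void main(String[] args) {
        // x = 0 and x = 1 leave the mutant alive; x = 2 kills it (2 versus 1).
        for (int x : new int[] {0, 1, 2}) {
            System.out.printf("x=%d original=%d mutant=%d%n", x, original(x), mutant(x));
        }
    }
}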
3 A Generic Testing Process

A testing process involves several activities and related documents that should be produced before and during the software development process. In general, the testing process involves four subprocesses: 1) Test Planning; 2) Test Project; 3) Test Execution; and 4) Test Record. Figure 4, adapted from [61], depicts these subprocesses, the activities to be carried out, and the related artifacts. The testing process artifacts can be produced based on the IEEE Standard 829-1998 for Software Test Documentation [176], which provides general guidelines about the several kinds of documents to be produced during the testing process.

As can be observed, Figure 4 depicts the subprocesses and correlated artifacts of the test process. In the "Test Planning" subprocess, all related testing activities are planned and documented in a test plan that should contain information about which parts of the product will be tested, which test levels or phases will be covered (see the next section), which test techniques and criteria will be used in each test phase, the necessary resources, the operational environment, and also the schedule of each task.
Fig. 4. Generic Testing Process Adapted from IEEE Standard 829-1998 [61]
Once the test plan is defined, it is refined during the "Test Project" subprocess, when the test cases are actually created, aiming at reaching the desired level of quality. Three documents can be generated during the test project: the "Test Project Specification" contains details about how the test will be performed, the "Test Case Specification" registers the test cases generated for each phase, and the "Test Procedure Specification" details each step previously created and defined in the "Test Project Specification".

Once the test cases are defined, they can be run to collect the test results. This activity is performed during the "Test Execution" subprocess. For this purpose, the existence of testing tools is almost mandatory, since several tasks, if executed manually, are very error prone and subject to human mistakes. Finally, the results produced during the test execution are documented in the "Test Daily Report" and, if any discrepancy is observed with respect to the software specification, the "Test Incident Report" is filled out and corrective actions can be taken. At the end, a "Test Summary Report" is produced containing a synthesis of the test execution process.

All these testing activities have been studied for a long time, and it can be observed that the earlier they are integrated into the development process, the better. A well-known system development model which aggregates verification and validation (V&V) activities with the development process is the V-Model [279] (see Figure 5). This model suggests that at every development phase a verification and validation activity may be performed, mainly to anticipate fault detection as early as possible. It is called V-Model because it is inspired by the traditional waterfall development model, but the concept of integrating V&V activities into the development process can be equally applied to any development model.
Fig. 5. Software development versus test execution: V-Model (adapted from [61,301]). The specification activities on the left-hand side (requirement specification based on the user's requirements, system specification with functional and non-functional requirements, system project defining the system architecture, and detailed project) produce the acceptance, system, and integration test plans during test preparation; test execution then proceeds on the right-hand side through unit, integration, system, and acceptance testing, leading to the operational system.
For instance, once the product specification has been completed, if it is based on a formal specification or on a state-machine model, it is possible to use the techniques presented in Chapters 7 and 3 to automate the generation of test cases, which can be used to test the specification itself and also, later, its implementation. To systematize test-case generation, it is important to combine two or more testing criteria, aiming at selecting elements from several input subdomains and thus maximizing the chance of exposing the faults in the PUT.

As mentioned before, the implementation of testing tools supporting the application of a testing criterion is of great importance to help the tester reduce the cost associated with the testing activity and also to make this activity less error prone and more productive. Moreover, testing tools can be used for teaching testing concepts and for technology transfer. They also help in the execution of experimental studies aiming at comparing different testing criteria and at defining incremental, effective, and low-cost testing strategies.
4 Types of Testing

Software requirements are usually classified as functional and non-functional. This naturally induces a classification of the testing techniques related to them. A different classification has often been used in the literature: functional versus structural testing. However, this leaves out a very important group of testing techniques: the ones devoted to non-functional properties. In the sequel, we present an alternative classification that is more helpful to identify the different goals of the strategies presented in this book.

Testing for functional properties. Strategies in this group focus on testing, at different phases of the development process (as presented in Section 3), the functional requirements that describe how the PUT should react to particular stimuli. The behaviour of the system is observed, usually by exercising a usage scenario, to decide whether the test is successful or not (whether the system behaves as expected or not). Functional testing strategies included in this group are the ones presented in this book in Chapters 3 and 4 (model-based testing), and Chapters 6, 7, and 8 (specification-based testing).

Testing for non-functional properties. This involves checking constraints under which the system should operate. Testing strategies here are often applied at system level only, since the properties are often emergent ones, such as performance, timing, safety, security, and quality standards. Chapter 4 presents techniques that can be used to evaluate these properties in the context of product-line engineering and to sort out conflicts among them based on an inspection guided by tests and architecture analysis. Testing for design properties has also been investigated in the literature with the goal of either detecting or anticipating design faults that may lead to software failures. This is discussed in Chapter 8.

Testing for structural properties. Since testing is applied at different phases during the software development process, adequate coverage of the structure is often required. Structural testing techniques that rely on either checking or covering structures of the system, from the architecture to the code, are included in this group. These are discussed in Chapter 2. The coverage criteria commonly used for structural testing are also often used to evaluate the quality of a test suite in general.
5 Levels or Phases of Testing

The development process is divided into several phases, allowing the system engineer to implement the solution step by step, and so is the testing activity. The tester is able to concentrate on different aspects of the software and use different testing criteria in each one [215]. In the context of procedural software, the testing activity can be divided into four incremental phases: unit, integration, system, and acceptance testing [279]. Variations in this pattern are identified for object-oriented and component-based software, as will be discussed later.

Unit testing focuses on each unit to ensure that its algorithmic aspects are correctly implemented. The aim is to identify faults related to logic and implementation in each unit. In this phase structural testing is widely used, requiring the execution of specific elements of the control structure in each unit. Mutation testing is also an alternative for unit testing; it is discussed later, in Section 7. In this phase the need to develop drivers and stubs is common (Figure 6). If we consider that F is the unit to be tested, the driver is responsible for coordinating its testing. It gathers the data provided by the tester, passes it to F in the form of arguments, collects the results produced by F, and shows them to the tester. A stub is a unit that replaces another unit used (called) by F during unit testing. Usually, a stub is a unit that simulates the behavior of the used unit with minimum computation effort or data manipulation. The development of drivers and stubs may represent a high overhead for unit testing. Tools like the traditional xUnit frameworks, such as JUnit [236], may provide a test driver for the PUT, with the advantage of also providing additional facilities for automating the test execution.
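As a minimal sketch of the driver and stub roles (the class and interface names are hypothetical; only JUnit's use as a test driver comes from the text), suppose the unit under test depends on a collaborator that is not yet available. The JUnit test class plays the role of the driver, and a trivial hand-written stub replaces the missing unit:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical unit used (called) by the unit under testing.
interface TaxTable {
    double rateFor(String category);
}

// Hypothetical unit under testing (F): it delegates part of its work to TaxTable.
class PriceCalculator {
    private final TaxTable taxes;
    PriceCalculator(TaxTable taxes) { this.taxes = taxes; }
    double finalPrice(double net, String category) {
        return net * (1.0 + taxes.rateFor(category));
    }
}

// The JUnit test class acts as the driver: it supplies inputs, runs F, and checks outputs.
public class PriceCalculatorTest {
    @Test
    public void addsTheTaxRateReturnedByTheCollaborator() {
        TaxTable stub = category -> 0.10;          // stub: fixed, minimal behaviour
        PriceCalculator f = new PriceCalculator(stub);
        assertEquals(110.0, f.finalPrice(100.0, "books"), 0.001);
    }
}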
Fig. 6. Required environment for unit testing: a driver supplies the input to the unit under testing (F) and collects its output, while stubs (stub1, stub2, ..., stubn) replace the units called by F
Once each unit has been tested, the integration phase begins, and consequently so does integration testing. However, is integration testing really necessary? Why should a product built from previously tested units not work adequately? The answer is that unit testing has limitations and cannot ensure that each unit functions in every single possible situation. For example, a unit may suffer from the adverse influence of another unit. Subfunctions, when combined, may produce unexpected results, and global data structures may raise problems.

After being integrated, the software works as a whole and must be submitted to system testing. The goal is to ensure that the software and the other elements that are part of the system (hardware and databases, for instance) are adequately combined and that adequate function and performance are obtained. Functional testing has been the most widely used in this phase [279].

Acceptance testing is, in general, performed by the user, who checks whether the product meets the expectations. Functional testing is also the most widely used for acceptance testing.

All the previous kinds of tests are run during the software development process, and the information obtained from them is useful for other software engineering activities, like debugging and maintenance. Any change required in the software after its release demands that some tests be rerun to make sure the changes did not introduce any side effects into the previously working functionality. This kind of testing is called regression testing. The ideal situation is to validate the changed project by rerunning all previous test cases that execute the modified parts of the code. Several works provide guidance on how to select a subset of the previous test suite for regression testing [202,353,270,269].
6 Domain Specific Approaches

Research on software testing also focuses on domain-specific concerns. The reason is that testing is a rather complex task whose success usually depends on investigating particular features of the PUT. On the one hand, the tester can usually run only very few test cases, which need to be carefully selected. On the other hand, the features of each domain usually determine what is most important to test. Some domains that have classically been considered, and for which a number of approaches have been developed, are briefly described in the sequel. This list is not exhaustive, because the aim here is only to classify the different approaches presented in this book.

Object-oriented software testing. Approaches focus on concepts and properties related to classes and objects, such as interfaces, inheritance, polymorphism, dynamic binding, dependencies between classes and methods, and the life cycle of objects. Techniques and tools have been developed for exploring both functional and structural aspects of object-oriented software. For instance, Chapter 6 presents a tool for unit testing based on the code and on assertions to be met. On the other hand, Chapter 4 discusses functional issues of object-oriented testing in the context of product lines. Moreover, Chapter 7 presents fundamental concepts on testing from algebraic specifications, which have been used as the basis for a number of specification-based testing approaches for object-oriented software as well as for interface testing in general. Finally, the testing theory presented in Chapter 8 has also been applied to test generation from OCL constraints.

Component-based software testing. In this case, approaches are usually integrated with component-based software development methods, where a component can be defined as a unit of composition that is identified by a contract specification. The concerns here are more abstract than the ones considered by object-oriented testing. Testing is aimed at checking whether contracts can be satisfied, independently of the internal structure of the component. A component is committed to fulfilling its behaviour as long as the dependencies it requires can also be fulfilled by the environment where it is deployed. Therefore, both the component and the environment have to be tested. Chapter 3 briefly reviews a test-case generation strategy based on Markov chains for component testing.
Product-line software testing. Independently of the technology used, the main challenge here is how to define, select, and maintain test suites that are appropriate for each member of a family of products, by starting from general test suites and taking variabilities into account. The main concerns are how the different variation mechanisms, as well as the architectural design, can guide test-case selection. Chapter 4 presents an overview of testing in product lines along with an example.

Reactive-systems testing. Reactive systems interact with their environment by accepting inputs and producing outputs. These systems may also be composed of a number of concurrent processes and networked distributed services, where interruptions in a flow of execution can occur at any time. Testing approaches for these systems are usually defined at system level, where observable behaviours are the information used to decide on the success or failure of a test. The main concerns are related to synchronisation, scheduling, timing, and interruption, as well as properties such as livelock and deadlock. Chapters 3 and 7 present fundamental concepts and test-case generation algorithms based on input-output labelled transition systems as the specification formalism. Chapter 3 also presents approaches for dealing with interruptions, particularly considering process algebras.
7 Test Selection

Test-case selection is a key activity that is executed at different stages of the software life cycle. The problem of selecting the most cost-effective test suite has been a major research subject in the software testing area since the 1970s. Strategies ranging from ad hoc and deterministic selection to fully automated solutions have already been proposed and applied in industry. They usually focus on specific criteria that guide the tester in measuring the adequacy of a test suite and also in deciding when to stop testing. When selection strategies are applied to define a representative test suite from another one, the term test-suite reduction is more commonly applied [287]. Strategies for test-case selection can be classified according to the kind of criterion that is addressed. A quick overview of them is given in the sequel.

Based on faults. Fault detection is the main goal of the tester who opts for this kind of strategy. Fault-based criteria focus on detecting certain kinds of faults that are usually defined by a fault model [41]. The selected test cases are the ones that are capable of revealing the faults. A popular technique in this category is mutation testing. Chapter 8 presents a theory of fault-based testing that uses refinement techniques and applies mutation testing at the specification level to anticipate faults that may be introduced in the design of the software. Fault-based testing, particularly mutation testing, is also covered in Chapter 2.

Based on structural coverage. These strategies are based on structural coverage criteria, such as control-flow, data-flow, and transition-based criteria, measured at code level (white-box testing) and also at the abstract-model level of the application (model-based testing). Structural criteria were originally defined for white-box testing. Model-based testing has inherited and extended them to express aspects of the software that need to be tested, aiming not particularly at covering code fragments, but at covering the high-level structure of the specification [324]. The test-generation algorithms presented in Chapter 3
focus on transition coverage of test models, whereas Chapter 2 presents commonly applied structural coverage criteria focusing on control flow and data flow.

Based on property coverage (test purpose). Given a specification and a PUT, any testing activity is, explicitly or not, based on a satisfaction relation (often called a conformance relation): does the PUT satisfy the specification? The tests are derived from the specification on the basis of the satisfaction relation, and often on the basis of some additional knowledge of the PUT and of its operational environment, called the testability hypothesis. Property-based testing is a kind of conformance testing where the selection criterion is associated with a given property that needs to be checked in the software [225], and that is verified on the specification if one exists. This property is often represented as a test purpose that targets testing at a particular functionality or behaviour, limiting the scope of the specification from which test cases are selected. In other words, test purposes can be used to focus test selection on specific parts of a specification. Test-selection strategies based on test purposes are presented in Chapter 3. Test purposes are often expressed using a formal notation so that automatic test selection is possible. Moreover, test purposes are often defined from test objectives, which are descriptions, at requirements level, of what should be tested. However, they can also represent safety and security properties to be checked.

Based on controlling the size of the test suite. The main goal is to select a minimum yet representative set of test cases from a test suite. The algorithms applied are usually based on statistics, heuristics, and coverage of test requirements, such as how to choose the minimum suite that achieves a given coverage criterion [158,287]. They can be classified according to their goal as follows (a minimal greedy sketch of the reduction idea is given at the end of this section).

– Test-case selection: algorithms that select a subset of the original test suite to achieve a given goal. The subset may not have the same coverage capability as the original test suite [25];
– Test-suite reduction: algorithms that select a representative subset of the original test suite that provides the same coverage as the original suite [287];
– Test-case prioritisation: algorithms that schedule test cases for regression testing in an order that attempts to maximise some objective function [288].

Chapter 3 presents a test-case selection strategy based on the similarity of test cases. The idea is, given a number of test cases as a goal, to select the most different ones. This chapter also reviews work on statistical testing, where probabilities assigned to the transitions of a Markov chain guide the random selection of the most representative test cases.

Based on formal specifications. Sentences in a formal specification may guide test-case selection by precisely pointing out behaviour that needs to be tested. From a different perspective when compared to traditional structural coverage, the main goal here is to cover all possible behaviours that can be observed from the specification. A technique called unfolding of sentences is usually applied, such as that presented in Chapter 7 in the context of algebraic specifications. The idea is to replace a function call by its definition. Since this may lead to an infinite number of test cases, selection hypotheses are usually considered to make it possible for a finite test suite to be selected.
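To make the reduction idea mentioned above concrete, the following sketch shows the classic greedy heuristic: repeatedly pick the test case that covers the most still-uncovered requirements. The sketch is ours, not taken from this book or from the specific algorithms of [158,287]; the class and variable names and the small data set in main are illustrative only.

import java.util.*;

// Illustrative greedy test-suite reduction: keep picking the test case that
// covers the largest number of still-uncovered requirements until every
// requirement covered by the original suite is covered by the reduced one.
public class GreedyReduction {

    static List<String> reduce(Map<String, Set<String>> coverage) {
        Set<String> uncovered = new HashSet<>();
        coverage.values().forEach(uncovered::addAll);

        List<String> reduced = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            String best = null;
            int bestGain = -1;
            for (Map.Entry<String, Set<String>> e : coverage.entrySet()) {
                Set<String> gain = new HashSet<>(e.getValue());
                gain.retainAll(uncovered);
                if (gain.size() > bestGain) {
                    best = e.getKey();
                    bestGain = gain.size();
                }
            }
            reduced.add(best);
            uncovered.removeAll(coverage.get(best));
        }
        return reduced;
    }

    public static void main(String[] args) {
        // Hypothetical coverage data: test case -> requirements it covers.
        Map<String, Set<String>> coverage = new LinkedHashMap<>();
        coverage.put("t1", new HashSet<>(Arrays.asList("r1", "r2")));
        coverage.put("t2", new HashSet<>(Arrays.asList("r2", "r3", "r4")));
        coverage.put("t3", new HashSet<>(Arrays.asList("r1")));
        System.out.println(reduce(coverage)); // prints [t2, t1]
    }
}

Note that this heuristic preserves the coverage of the original suite, which is what distinguishes test-suite reduction from test-case selection in the classification above; prioritisation would instead order the whole suite rather than shrink it.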
8 Final Remarks

This chapter presents basic terminology and concepts that constitute essential knowledge for understanding the remaining chapters of this book. Standard as well as novel classifications are presented to put into context the different aspects of the modern practice of software testing.

The scope and importance of testing have been recognised in the last decades. From a terminal activity of the development process that received little attention, software testing has evolved into an essential activity that starts at the beginning of the development process to support planning, specification, and design. It is now a way of systematically investigating the problem domain and design issues, aiming at testability and, consequently, feasibility, as well as a way of establishing concrete goals towards the validation of the proposed solutions as early as possible in the development process. The practice of software testing has become more and more feasible and effective due to the maturity of current methods for the selection, maintenance, and evaluation of test suites, which are usually supported by tools. Therefore, current research in the area has mostly been devoted to specific domains, empirical evaluation of solutions, integration of solutions with development processes, tool automation, and technology transfer.

Besides the classical task of detecting faults, current practices of software testing, mainly the ones that bring it close to the development process, also have a great influence on the final quality of the development process through the early detection of faults in specification and design documents during test planning and specification. This is due to the fact that the testing perspective is usually more investigative and rigorous than the development one, and it is more likely to detect lack of completeness, consistency, and conformance between different artifacts. This book presents interesting approaches that can be applied in practice to achieve these benefits.
Functional, Control and Data Flow, and Mutation Testing: Theory and Practice

Auri Vincenzi¹, Márcio Delamaro², Erika Höhn², and José Carlos Maldonado²

¹ Instituto de Informática, Universidade Federal de Goiás, Goiânia, Goiás, Brazil
[email protected]
² Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, São Paulo, Brazil
{delamaro,hohn,jcmaldon}@icmc.usp.br
The growth of user demand for higher software quality has motivated the definition of methods and techniques to improve the way software is developed. Several works have investigated a variety of testing criteria in an attempt to obtain a testing strategy with lower application cost and higher efficacy in detecting faults. The aim of this chapter is to present the theoretical and practical aspects related to the software testing activity. A synthesis of functional, structural, and fault-based testing techniques is presented. A comparison of the testing criteria (cost, efficacy, and strength) is also considered from the theoretical and experimental points of view. The importance of testing automation is discussed, characterizing the main efforts of the testing community in this direction. Emphasis is given to state-of-practice tools; efforts from academia are also discussed. The testing activity and its related problems are presented and illustrated through practical examples with the support of different testing tools, to provide information about software testing in terms of both theory and practice. The need for a systematic evaluation of these criteria and tools from the perspective of experimental software engineering is also discussed.

After a short introduction, Section 2 briefly describes a software product named Identifier, which is used to illustrate the testing concepts discussed in this chapter. In Section 3, a synthesis of the functional, structural, and fault-based testing techniques is presented. Sections 4 to 6 provide some examples of different testing criteria, considering the three testing techniques. We comment on their complementarity and show how they can be used in an incremental testing strategy. The importance of testing automation is also discussed in these sections, characterizing the main efforts of the testing community in this direction. In Section 7 the need for a systematic evaluation of these criteria and tools from the perspective of experimental software engineering is discussed. We illustrate that section by describing an experimental evaluation of test suites generated by random, functional, and Pex [263] (described in detail in Chapter 5) approaches against mutation testing. Finally, Section 8 presents the final considerations of this chapter.
1 Introduction

Regarding the functional techniques mentioned in Chapter 1, testing requirements are obtained from the software specification (see Figure 2 in Chapter 1). The structural
techniques use implementation features to obtain such requirements, and the fault-based technique uses information about common faults that may appear during development. It is important to notice that these testing techniques are complementary, and the question to be answered is not "Which one should be used?" but "How can they be used in a coordinated way, taking advantage of each other?".

Each of the above-mentioned techniques has several criteria that define specific requirements which should be satisfied by a test suite. In this way, the requirements determined by a testing criterion can be used either for test-suite evaluation or for test-suite generation. But why is a testing criterion required? The answer is that, as discussed in Chapter 1, executing the PUT on its entire input domain is in general not possible or practical, because the input and output domains may be infinite or too large. Testing techniques and criteria are the mechanisms available to assess testing quality. In this scenario, a testing criterion helps the tester subdivide the input and output domains and provides a systematic way to select a finite number of test cases that satisfy the criterion. The objective is to create the smallest test suite whose execution reveals the largest number of faults.

In general, the application of a testing criterion without the support of a testing tool is an unproductive and error-prone activity. Moreover, a testing tool is very useful for conducting experimental studies, teaching testing concepts, and transferring technology. Regardless of the kind of software and the way it is produced, the use of software testing techniques, criteria, and supporting tools is crucial to ensure the desired level of quality and reliability.

Since 2007, a large project called QualiPSo (www.qualipso.org) has been under development. QualiPSo is a unique alliance of European, Brazilian, and Chinese industry players, governments, and academics that was created to help industries and governments fuel innovation and competitiveness with Open Source Software (OSS). In order to meet this goal, the QualiPSo consortium intends to define and implement technologies, processes, and policies to facilitate the development and deployment of OSS components with the same level of reliability traditionally offered by proprietary software. QualiPSo is the largest Open Source initiative ever funded by the European Commission, under the EU's sixth framework program (FP6), as part of the Information Society Technologies (IST) initiative. As part of the Brazilian group working in the context of QualiPSo, we intend to cooperate by making our own tools, such as the JaBUTi tool presented in this chapter, available as OSS to increase their use by both OSS and non-OSS development communities [282].
2 The Illustrative Example

We use the same didactic example throughout this chapter: the Identifier program that was already mentioned in Example 4 of Chapter 1. Though simple, the example contains the necessary elements to illustrate the concepts used in this chapter. Furthermore, we also provide some exercises using a more complex program.
> cd identifier
identifier> java -cp bin identifier.Identifier
Usage: identifier.Identifier <string>
identifier> java -cp bin identifier.Identifier "abc12"
Valid
identifier> java -cp bin identifier.Identifier "cont*1"
Invalid
identifier> java -cp bin identifier.Identifier "1soma"
Invalid
identifier> java -cp bin identifier.Identifier "a123456"
Invalid
Listing 2.1. Executing identifier.Identifier
The Identifier program implements the following specification: "The program determines whether a given identifier is valid or not in a variant of the Pascal language called Silly Pascal. A valid identifier must begin with a letter and contain only letters or digits. Moreover, it must be at least one and at most six characters long." Listing 2.1 shows five executions of a possible implementation of this specification in Java, shown in Listing 2.2. Observe that, with the exception of the first call (which only prints the usage message), the program behaves according to the specification for all executions, thus judging the four identifiers correctly.

The Identifier class implemented in Listing 2.2 has four methods. It contains a known number of faults, which will be used to illustrate the adequacy and effectiveness of various testing criteria. The validateIdentifier method is the most important, since it is responsible for deciding whether a given String corresponds to a valid or invalid identifier, returning true or false, respectively. The valid_s and valid_f methods are utility methods used to decide whether the starting character and each of the following characters, taken one by one, are valid according to the specification. Finally, the main method allows the program to be called from the command line with a single parameter – a string – printing "Valid" or "Invalid" depending on whether the program judges the given string to be a valid identifier.
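As an aside (the sketch below is ours and not part of the original chapter), the specification can be captured almost literally by a regular expression. This is convenient later on, when an oracle is needed to decide the expected output for a given input string; the class and method names are hypothetical.

public class IdentifierOracle {
    // A possible reference oracle for the Silly Pascal identifier
    // specification: one letter followed by zero to five letters or digits,
    // that is, a total length between one and six characters.
    public static boolean isValidIdentifier(String s) {
        return s != null && s.matches("[A-Za-z][A-Za-z0-9]{0,5}");
    }
}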
3 Testing Techniques

3.1 Functional Testing

Functional or black-box testing is so named because the software is handled as a box with unknown content: only its external side is visible. Hence the tester basically uses the specification to obtain testing requirements or test data, without any concern for the implementation [256]. A high-quality specification that matches the client's requirements is fundamental to support the application of functional testing. Examples of such criteria [279] are equivalence partition, boundary value analysis, cause-effect graphing, and the category-partition method [265].
 4  public boolean validateIdentifier(String s) {
 5      char achar;
 6      boolean valid_id = false;
 7      achar = s.charAt(0);
 8      valid_id = valid_s(achar);
 9      if (s.length() > 1) {
10          achar = s.charAt(1);
11          int i = 1;
12          while (i < s.length() - 1) {
13              achar = s.charAt(i);
14              if (!valid_f(achar))
15                  valid_id = false;
16              i++;
17          }
18      }
19
20      if (valid_id && (s.length() >= 1) && (s.length() < 6))
21          return true;
22      else
23          return false;
24  }
25
26  public boolean valid_s(char ch) {
27      if (((ch >= 'A') && (ch <= 'Z')) ||
28          ((ch >= 'a') && (ch <= 'z')))
29          return true;
30      else
31          return false;
32  }
33
34  public boolean valid_f(char ch) {
35      if (((ch >= 'A') && (ch <= 'Z')) ||
36          ((ch >= 'a') && (ch <= 'z')) ||
37          ((ch >= '0') && (ch <= '9')))
38          return true;
39      else
40          return false;
41  }
42
43  public static void main(String[] args) {
44      if (args.length == 0) {
45          System.out.println("Usage: identifier.Identifier <string>");
46      } else {
47          Identifier id = new Identifier();
48          if (id.validateIdentifier(args[0])) {
49              System.out.println("Valid");
50          } else {
51              System.out.println("Invalid");
52          }
53      }
54  }
Listing 2.2. Identifier class: methods implementation
One of the biggest advantages of functional testing is that its criteria can be applied virtually without any modification at any testing phase, as is illustrated in Figure 1. Moreover, since functional criteria are based solely on specification, they are also platform- and language-independent, that is, they can be used to test procedural, object-oriented, and aspect-oriented programs as well as software components [265,41].
Fig. 1. Black-box testing applicability [85]
Regarding disadvantages, the use of functional techniques may make it difficult to quantify the testing activity, because it is not possible to ensure that certain essential parts of the product's implementation have been exercised. Another problem is that specifications (mainly informal ones) may be incomplete or ambiguous, in which case the test suites created from them will be as well. In Section 4 some functional testing criteria for unit testing are described and illustrated.

3.2 Structural Testing

Structural testing techniques, also known as white-box techniques (as opposed to black-box ones), take implementation or structural aspects into consideration in order to determine testing requirements. Structural testing faces several constraints and disadvantages, such as the need to determine infeasible testing requirements, for example, infeasible paths and data-flow associations [339,283,175]. These constraints pose serious problems for testing automation. Nevertheless, this technique is seen as complementary to functional testing [279], and the information obtained with its application is also relevant to several software engineering activities such as maintenance, debugging, software reliability estimation, and process improvement [162,279,157].

A common approach to applying structural testing is to abstract the PUT into a graphical representation from which required elements are extracted by the testing criteria. For instance, for unit testing, each unit is abstracted as a Control Flow Graph – CFG (also called a Program Graph). For a product P represented by a CFG, there is a correspondence between the nodes of the graph and blocks of code, and between the edges of the graph and the possible control-flow transfers between two blocks of code. It is then possible to select elements from the CFG to be exercised during testing, thus characterizing structural testing. For integration testing a different kind of graph is used, and so on.

The first structural criteria were based exclusively on control-flow structures. The best known are All-Nodes, All-Edges, and All-Paths [256]. In the mid 1970s the data-flow-based criteria [166] first appeared. They require interactions among variable definitions and variable uses to be exercised [166,283,228]. The reasoning behind the definition of such criteria is the observation that, even for small programs, control-flow-based testing may fail to be effective in revealing even trivial faults. The data-flow criteria provide a hierarchy of criteria between All-Edges and All-Paths, in an attempt to make testing a more rigorous activity. Among the best known data-flow criteria are those introduced by Rapps and Weyuker in the mid 1980s, such as All-Defs, All-Uses, All-C-Uses, and All-P-Uses [283].
At the beginning of the 1990s, Maldonado [228] presented a family of criteria named Potential-Uses, as well as the corresponding feasible criteria obtained with the elimination of infeasible associations. These criteria are based on associations between a variable definition and the possible points in the program where it can be used, not necessarily requiring its actual use. Several extensions of data-flow criteria can be found in the literature, either for integration testing of procedural programs [160,327] or for unit and integration testing of object-oriented programs [184]. However, it is more common to use such criteria during the unit testing phase, when it is important to detect logical or conditional faults.

In terms of advantages and disadvantages, structural criteria are considered complementary to functional testing criteria. Since structural testing is based on an implementation of the product, its criteria can only detect faults that exist in that implementation. This means that, if the implementation omits a given part of the product specification, structural testing criteria will in general not detect this missing functionality. On the other hand, given an implemented functionality, structural testing criteria enable us to detect how well it is implemented and whether there are as yet uncovered parts of its implementation. In Section 5 we provide more information regarding some structural testing criteria for unit testing – considering both control-flow and data-flow criteria – and the corresponding abstraction graph employed by them.

3.3 Fault-Based Testing

The fault-based technique uses information about faults that are frequently found in software development and about specific types of faults that we may want to uncover [101]. Two criteria that typically concentrate on faults are error seeding and mutation testing.

Error seeding introduces a known number of artificial faults in the program under testing before it is tested. After the test, from the total number of faults found and the ratio between natural and artificial faults, it is possible to estimate the number of remaining natural faults. The problems with this approach are: artificial faults may hide natural faults; in order to obtain a statistically reliable result it is necessary to use programs that can support 10,000 faults or more; and the technique is based on the assumption that faults are uniformly distributed in the program, which in general is not the case – real programs tend to contain long stretches of simple code with few faults and small, highly complex pieces with an elevated concentration of faults [59].

Mutation testing has already been discussed in Chapter 1. Extensions of this criterion have also been proposed for integration testing as well as for program specifications. Delamaro et al. [96] defined the interface mutation criterion, which applies the mutation concept to the integration testing phase. With that criterion, a new set of mutation operators that model integration errors was proposed. At the specification level, mutation can be used to test Finite State Machines [115], Petri Nets [116], Statecharts [114,309], algebraic specifications [357], Estelle [280,302], and SDL [310].

Recently, researchers have also been investigating the use of mutation testing for the object-oriented paradigm. Kim et al. [192] proposed the use of a technique called Hazard and Operability Studies (HAZOP) to determine a set of mutation operators for Java.
In general, this approach does not differ significantly from traditional mutation
with respect to the creation of mutation operators, but it introduces a more rigorous and disciplined way to do it. The technique identifies candidate points for mutation in the target language grammar, and mutation operators are then created based on predefined "guide words". A more extensive set of mutation operators for Java (inter-class mutation operators), which includes those proposed in [192], is defined in [220]. Researchers have also explored mutation testing in the context of distributed components communicating through CORBA [138]. In addition, Delamaro et al. [98] defined a set of mutation operators to deal with concurrent aspects of Java programs. In Section 6 we provide more information about mutation testing for unit testing and illustrate its use with a practical example.
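As a concrete, hand-made illustration of the fault-based idea (this example is ours and is not drawn from any of the operator sets cited above), consider a single mutant of the valid_f method of Listing 2.2 in which one relational operator is changed. A test case whose output differs between the original program and the mutant kills the mutant.

// Mutant of valid_f (Listing 2.2): the ">=" of the digit test is mutated to ">".
public boolean valid_f(char ch) {
    if (((ch >= 'A') && (ch <= 'Z')) ||
        ((ch >= 'a') && (ch <= 'z')) ||
        ((ch >  '0') && (ch <= '9')))   // original: (ch >= '0')
        return true;
    else
        return false;
}
// The test case ("a0b", Valid) kills this mutant: the original implementation
// accepts the character '0' and prints "Valid", whereas the mutant classifies
// '0' as an invalid character and prints "Invalid".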
4 Functional Testing Criteria

In this section, we first present the state of practice of functional testing techniques. A general overview of the most common testing criteria is illustrated with the Identifier program (described in Section 2) and with the support of a testing tool. We conclude by presenting the state of the art with respect to this testing technique.

4.1 State of Practice

Functional testing involves two main steps: identifying the functionalities the product implementation should perform, and creating test cases that are capable of verifying whether such functionalities are fulfilled correctly according to the product specification [279]. In order to identify product functionalities, the tester has to consult the product specification; doing so often also helps to improve the specification itself, which facilitates the application of functional testing and enhances the quality of the generated functional test suite. Two examples of functional testing criteria are discussed in the sequel.

Equivalence partition. Input conditions are represented by the restrictions that are imposed on each product input domain. From the input conditions identified in the product specification, each input domain is divided into valid and invalid equivalence partitions. The name "equivalence partition" refers to the fact that the criterion considers the elements of a given partition as equivalent in the sense that, if a certain element of a given partition is able to detect a fault, every other element of that same partition is also able to detect the same fault. Clearly such an assumption is too strong in practice, since it is quite common to have data-sensitive faults (see Section 2 of Chapter 1) that require a specific value inside an equivalence partition in order to be detected. Nevertheless, despite this limitation, this criterion is effective in systematizing the creation of a subset of the input domain, hence reducing the cost of software testing. Once the set of partitions is defined, the tester selects the smallest number of test cases necessary to cover the valid partitions and one test case to cover each of the invalid partitions. The use of equivalence partition allows us to examine product requirements more systematically and to restrict the number of test cases. Some authors consider that the output domain of the product under testing should also be taken into account when establishing the equivalence classes.
Table 1. Equivalence classes for the Identifier program

Input conditions                 Valid Classes      Invalid Classes
Identifier's size t              1 ≤ t ≤ 6 (1)      t < 1 (2)   t > 6 (3)
First character c is a letter    Yes (4)            No (5)
Only contains valid characters   Yes (6)            No (7)
Boundary value analysis. This criterion is complementary to the previous one in the sense that, instead of selecting an arbitrary element of a given equivalence class, it requires test cases to be selected on the boundaries of each equivalence class, since previous studies have shown that many faults are related to values on the boundaries of equivalence partitions. In the same way, the product output may also be considered, and input values which produce outputs on the boundaries of the output equivalence partitions are also required. More information about these and other functional testing criteria can be found elsewhere [33,85].

In order to illustrate the application of the equivalence partition criterion, we consider the specification of the Identifier program. Table 1 summarizes our understanding of how the product input domain (according to its specification) can be partitioned into valid and invalid classes. According to the Identifier specification, we identified three input conditions: the size of the identifier, the initial character of the identifier, and the remaining characters of the identifier. With respect to the identifier's size, it may range from 1 to 6 in the case of a valid identifier, and be less than 1 or greater than 6 in the case of invalid identifiers. We numbered the equivalence classes from (1) to (7) to facilitate referring to them.

As mentioned in Section 2 of Chapter 1, a testing criterion aims at deriving a set of testing requirements. With regard to the equivalence partition criterion, such testing requirements correspond to the set of equivalence partitions. From such partitions we can generate a test suite that is adequate for the criterion, or evaluate whether a given test suite is adequate for it. For instance, according to the criterion's description, the tester needs to find the smallest number of test cases that cover the valid equivalence partitions. The test case (a1, Valid) covers the valid classes (1), (4), and (6), since a1 is a two-character-long identifier which starts with the valid character a and whose remaining characters contain only one digit. Therefore, a1 is considered a Valid identifier according to the product specification. Subsequently, we have to create one test case for each invalid partition. This is important since invalid partitions are in general related to special implementation code that produces differentiated error messages, and combining several invalid partitions in a single test case may conceal such error messages. For this reason, four additional test cases are needed:

– ("", Invalid): covers the equivalence partition (2). "" is an empty identifier with size 0.
– (A1b2C3d, Invalid): covers the equivalence partition (3). "A1b2C3d" is seven characters long.
– (2B3, Invalid): covers the equivalence partition (5). "2" is not a valid first character.
– (Z#12, Invalid): covers the equivalence partition (7). "#" is neither a letter nor a digit.

Therefore, the test suite just described corresponds to an adequate test suite for the equivalence partition criterion, that is, it is Equivalence Partition-adequate. In later references we call it TEquivalence Partition. The next step is to execute the product implementation with these test cases in order to verify whether the obtained outputs correspond to the expected ones. Every time the product implementation changes, it is necessary to execute the test cases again to ensure that the fault was corrected and that the correction did not introduce any side effects into the implementation. For small programs and test suites with few test cases this can be done by hand, but for large programs and test suites with several hundreds of test cases it is a very time-consuming and error-prone task and should be performed by an automatic tool.

4.2 Automation

In this section we present a tool named JUnit [236], which can be used to automate test-case execution. JUnit originated from a previous similar framework for Smalltalk, called SUnit, developed by Beck [29]. Later, Beck, together with Gamma, developed the first version of JUnit in 1999. Since then, the JUnit framework has evolved, and it is used extensively in open source projects and commercial products. At the time of writing, JUnit is in version 4.4, and this is the version that we use. Besides its wide acceptance, JUnit has also been ported to several languages, such as Ada (AUnit), C# (NUnit), Python (PyUnit), Delphi (DUnit), and even Fortran (fUnit). All of these variations are known in an abstract way as xUnit.

JUnit is not the only tool which enables automatic test execution. The following web sites provide more examples (not only for Java): Open Source Testing Tools in Java – java-source.net/open-source/testing-tools, Software Testing FAQs – www.testingfaqs.org, and Open Source Developer Testing Tools – www.opensourcetesting.org. We are using JUnit here simply because of its popularity and convenience, since most IDEs (Integrated Development Environments) provide plug-ins to support the execution of JUnit test suites. For instance, in the case of Eclipse, the regular distribution already comes with JDT (Java Development Tools) and supports JUnit.

Test-suite creation. A JUnit test suite is simply a Java class, and a test case is a method inside such a class identified by the JUnit annotation @Test. This is the most basic notion we need to know to start building JUnit test suites. As an example, we convert the test suite of the previous section for the Identifier program into a JUnit test suite. We start by creating a test suite named IdentifierTestCase with a single test case, validate1. Listing 2.3 shows a JUnit test suite containing a single test case corresponding to (a1, Valid). Before commenting on the source code, it is important to understand that JUnit works like a driver, by allowing a given method of a certain class to be tested. With this in mind, the basic steps to create a test case are:
 1  package identifier;
 2
 3  import org.junit.Test;
 4  import org.junit.Assert;
 5
 6  public class IdentifierTestCase {
 7
 8      protected Identifier id;
 9
10      @Test
11      public void validate1() {
12          id = new Identifier();
13          boolean result =
14              id.validateIdentifier("a1");
15          Assert.assertEquals(true, result);
16      }
17  }
Listing 2.3. IdentifierTestCase class with one test case
1. Create an object of the class under testing (required to test instance methods);
2. Invoke the method under testing with the desired input value;
3. Verify whether the obtained output corresponds to the specified one. Here we need an oracle to decide, for a given input, what the correct output is according to the specification.

Now, let us analyze the test suite of Listing 2.3.

– It is common practice to include the test suite in the same package as the application under testing, so that methods with package visibility may be tested (line 1).
– The imports in lines 3 and 4 are required by the use of the @Test annotation (line 10) and the Assert.assertEquals method (line 15).
– Line 6 defines the test suite name, IdentifierTestCase, and consequently the name of the Java file, IdentifierTestCase.java. This file is saved as identifier/src/identifier/IdentifierTestCase.java.
– Since we are testing the instance method validateIdentifier, we need an instance of the Identifier class to call this method; in line 8 we declare an instance variable id which stores a reference to such an object.
– The @Test annotation (line 10) indicates that the method following it corresponds to a test case.
– A test case must be a public method with no parameters, such as the one declared in the signature in line 11.
– In line 12 the body of our test case starts: the first step is to create an instance of the class that contains the method we wish to test.
– In lines 13 and 14 we invoke the instance method validateIdentifier with the input value a1 and store the returned value in the result variable.
– Finally, we use an assertion mechanism provided by JUnit to check whether the expected value corresponds to the obtained one. The Assert.assertEquals method has two parameters: the first is the value expected for that input with respect to the product specification, and the second is the value obtained after the execution of the method under testing. With such a mechanism the framework allows the automatic execution and verification of test cases; it is important to observe, however, that neither the input value nor the expected result is identified automatically by JUnit. For these tasks the framework requires tester intervention.
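For completeness, the remaining test cases of TEquivalence Partition (Section 4.1) can be written in the same style and added to the IdentifierTestCase class of Listing 2.3. The sketch below is ours, not part of the original chapter; the method names validate2 to validate5 are arbitrary, the expected values follow the specification, and, as discussed next, the empty-string case is precisely the one that exposes the fault corrected in Listing 2.4.

    @Test
    public void validate2() {   // partition (2): empty identifier
        id = new Identifier();
        Assert.assertEquals(false, id.validateIdentifier(""));
    }

    @Test
    public void validate3() {   // partition (3): seven characters long
        id = new Identifier();
        Assert.assertEquals(false, id.validateIdentifier("A1b2C3d"));
    }

    @Test
    public void validate4() {   // partition (5): first character is not a letter
        id = new Identifier();
        Assert.assertEquals(false, id.validateIdentifier("2B3"));
    }

    @Test
    public void validate5() {   // partition (7): '#' is neither a letter nor a digit
        id = new Identifier();
        Assert.assertEquals(false, id.validateIdentifier("Z#12"));
    }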
Test-suite compilation and execution. The steps necessary to compile and execute a test suite are no different from the ones required to compile and execute any other Java class. When executing this test suite, a fault is found, and Listing 2.4, presented in the next section, shows one possible way to correct it: we include a conditional statement before the use of charAt(0) (line 7), ensuring that only non-empty identifiers are evaluated.

To conclude this section, it is important to observe that the objective of JUnit and other similar frameworks is only to provide a mechanism for automating test-case execution. They do not provide any guidance to the tester on identifying which inputs must be selected to exercise a given PUT. This is the real problem of testing, and testing criteria are the best option to address it. We also observe that no coverage information regarding which parts of the source code were executed was provided, since this is not the aim of functional testing but of structural testing, which is discussed in Section 5. In the following section we provide a brief discussion of the state of the art in functional testing.

4.3 State of the Art

Even though functional testing does not intend to cover specific parts of the product implementation, it is difficult not to relate input domains and their partitions to the conditional statements of the product implementation. In this sense, if the product implementation respects the product specification and if functional testing is effectively performed, high code coverage may be obtained, thus reducing the cost of structural testing and also helping to ensure that the product implementation satisfies its specification.

An experiment was performed some time ago by Steve et al. [214] with the use of a criterion which they named systematic functional testing. It is basically a combination of two functional testing criteria – equivalence partition and boundary value analysis – with some modifications that aim at obtaining more than one element of each equivalence class, together with guidelines on how to consider numerical values, illegal values, real numbers, variable ranges, arrays, and so on. The idea behind the experiment was to evaluate the adequacy of functional and random test suites with respect to mutation testing. Therefore, different test suites were generated and their ability to kill mutants was evaluated. The subject was Cal, a simple calendar program available on Unix operating systems. The test suite generated by systematic functional testing killed 100% of the non-equivalent mutants, while the test suites based on the other criteria and applied to the same program achieved significantly lower scores. The authors concluded that it is necessary to repeat the same experiment on a number of programs in order to see whether the results obtained for Cal are consistently repeated.

Obviously there are many other studies related to the state of the art of functional testing in different areas, such as testing from UML specifications, testing from models, testing from formal specifications, and so on. All of them have their applicability and
limitations. The general suggestion is to study the applicability of each testing criterion within a given context, product, and domain, extracting the advantages of each and attempting to minimize their disadvantages.
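For reference, the adequacy measure used in experiments such as the one above is the mutation score. We restate its usual textbook form here for convenience (mutation testing itself is introduced in Chapter 1):

    ms(P, T) = DM(P, T) / (M(P) − EM(P))

where, for a program P and a test suite T, DM(P, T) is the number of mutants killed by T, M(P) is the total number of mutants generated for P, and EM(P) is the number of mutants that are equivalent to P. Killing 100% of the non-equivalent mutants, as reported above for systematic functional testing applied to Cal, corresponds to a mutation score of 1.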
5 Structural Testing

Structural testing techniques present a series of disadvantages due to the inherent limitations of the testing activity, as mentioned in Section 2 of Chapter 1. Such limitations introduce many problems into the automation of the software testing process through structural testing [228]. Despite these disadvantages, structural techniques are seen as complementary to functional testing [279], and the information provided by structural criteria is valuable for many other software engineering activities, such as maintenance, debugging, software reliability, and others [162,157].

5.1 State of Practice

As already said, the majority of structural testing criteria use an abstraction of the internal structure of the product, represented by a graph. In particular, when testing a product at the unit level, we have a control-flow graph (CFG). In more detail, a product P can be decomposed into a set of disjoint blocks of statements, such that the execution of the first statement inside a given block leads to the execution of all the other statements of that block in the order they appear. Every statement in a block except the first has a single predecessor, and every statement except the last has exactly one successor; in other words, there is no control-flow deviation from or to statements in the middle of a block.

The control-flow graph G = (N, E, s) representing a program P establishes a correspondence between nodes and blocks, and indicates the possible control-flow deviations between blocks using edges. Therefore, a CFG is a directed graph with a single entry node s ∈ N and one or more exit nodes. We observe that this definition allows a graph with a single node, which is both the entry and the exit node.

From the CFG, different elements can be selected to be exercised during testing, characterizing structural testing. For instance, in the implementation of the validateIdentifier method presented in Listing 2.4, the comments on the left indicate the number of the block to which each statement belongs. Figure 2 shows the corresponding CFG, which abstracts the control structures of the program. From the CFG we can establish requirements to be fulfilled by a test suite, for instance, the execution of each node or each edge at least once. Because such criteria employ only characteristics related to the control structure of the program to determine the set of testing requirements, they are called "control-flow-based criteria".

If G = (N, E, s) is a CFG, where N is the set of nodes, E the set of edges, and s the entry node, then a path is a finite sequence of nodes (n1, n2, ..., nk), k ≥ 2, such that there is an edge from ni to ni+1 for i = 1, 2, ..., k − 1. A path is a simple path if all the nodes that compose it are distinct, except possibly the first and
 4           public boolean validateIdentifier(String s) {
 5               char achar;
 6  /* 01 */     boolean valid_id = false;
 7  /* 01 */     if (s.length() > 0) {
 8  /* 02 */         achar = s.charAt(0);
 9  /* 02 */         valid_id = valid_s(achar);
10  /* 02 */         if (s.length() > 1) {
11  /* 03 */             achar = s.charAt(1);
12  /* 03 */             int i = 1;
13  /* 04 */             while (i < s.length() - 1) {
14  /* 05 */                 achar = s.charAt(i);
15  /* 05 */                 if (!valid_f(achar))
16  /* 06 */                     valid_id = false;
17  /* 07 */                 i++;
18                       }
19                   }
20               }
21
22  /* 08 */ /* 09 */ /* 10 */ if (valid_id && (s.length() >= 1) && (s.length() < 6))
23  /* 11 */         return true;
24               else
25  /* 12 */         return false;
26           }
Listing 2.4. Commented validateIdentifier method

Table 2. Testing requirements and criterion relationship based on the CFG (Figure 2)

Element                            Testing requirement          Criterion
Node                               6                            All-Nodes
Edge                               (8,12)                       All-Edges
Loop                               (4,5,6,7,4)                  Boundary-Interior
Path                               (1,2,3,4,8,12)               All-Paths
Variable definition                valid_id=false               All-Defs
Predicative use of a variable      while(i<s.length()-1)        All-P-Uses
Computational use of a variable    i++                          All-C-Uses
the last nodes; if all the nodes are distinct, the path is said to be a loop-free simple path. In a complete path, the first node is the entry node and the last node is an exit node of G. Considering the graph shown in Figure 2, the path (2,8,9,10) is a loop-free simple path and the path (1,2,3,4,5,6,7,4,8,12) is a complete path. We observe that the path (5,6,7,4,8,9,10,11) is infeasible, and so is any complete path which includes it; that is, there is no input value which causes the execution of such a path. This is because an identifier is marked as invalid (valid_id=false) in node 6, and only valid identifiers can reach node 11.

Control-flow-based testing criteria. The control-flow-based testing criteria use different elements of the CFG to determine testing requirements. Table 2 illustrates some required elements and the associated criteria.

Example 1. In the case of the CFG presented in Figure 2, considering All-Nodes, it is possible to execute all of its nodes with two test cases. For instance, (1#@, Invalid) passes through the nodes (1,2,3,4,5,6,7,4,8,12) and the test case (i, Valid) passes through the nodes (1,2,8,9,10,11), covering all the testing requirements demanded by the
Fig. 2. Control-flow graph of the validateIdentifier method
All-Nodes criterion. Therefore, the test suite TAll-Nodes = {(1#@, Invalid), (i, Valid)} is All-Nodes-adequate. On the other hand, such a test suite is not adequate with respect to All-Edges, since the edges (1, 8), (5, 7), (9, 12), and (10, 12) are not executed by any test case in TAll-Nodes. This illustrates the complementary aspect of the testing criteria and also shows that one criterion may demand more effort to be satisfied than another. In this example, additional test cases are required to cover All-Edges. For instance, the suite TAll-Edges = TAll-Nodes ∪ {("", Invalid), (A1b2C3d, Invalid)} is adequate for All-Edges, since ("", Invalid) passes through (1, 8), and (A1b2C3d, Invalid) passes through the nodes (1,2,3,4,5,7,4,5,7,4,···,4,8,12), including the edges (5, 7) and (10, 12); the edge (9, 12) is infeasible due to the short circuit in the evaluation of expressions involving the "&&" operator (logical and).

Short circuit is used by programming languages to accelerate the evaluation of logical "and" expressions. With this mechanism, once the first logical expression is false, there is no need to evaluate the remaining logical expressions, since "false" and "anything" is always false for a logical "and". In our example, the second expression is only evaluated as false if an identifier of size zero is provided, but a zero-size identifier makes valid_id assume the value false, so that edge (8, 12) is executed instead of (9, 12). The Java operator "&" can be used in a logical expression to disable the short-circuit mechanism. Actually, it is safe to remove the second logical expression (s.length()>=1) from the condition at line 22 (see Listing 2.4) with no side effect on the product functionality. However, in the present text we maintain this "redundant" logical condition for teaching purposes.

As previously mentioned, this relation between testing criteria defines the so-called subsume relation. For the control-flow-based testing criteria, the strongest criterion is All-Paths, which subsumes All-Edges, which in turn subsumes All-Nodes. Later in this section we provide more details of the subsume relation.
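To make the adequacy checks of Example 1 concrete, the sketch below (ours, not part of the chapter's tooling; the class name is illustrative) encodes the CFG of Figure 2 as an edge list and measures node and edge coverage for the paths executed by TAll-Nodes. The edge set is the one that can be read off Figure 2 and the preceding discussion.

import java.util.*;

// Illustrative node/edge coverage computation for the CFG of Figure 2.
public class CfgCoverage {

    // Edges of the CFG of Figure 2, written as {from, to}.
    static final int[][] EDGES = {
        {1, 2}, {1, 8}, {2, 3}, {2, 8}, {3, 4}, {4, 5}, {4, 8}, {5, 6}, {5, 7},
        {6, 7}, {7, 4}, {8, 9}, {8, 12}, {9, 10}, {9, 12}, {10, 11}, {10, 12}
    };

    public static void main(String[] args) {
        // Paths exercised by TAll-Nodes = {(1#@, Invalid), (i, Valid)} (Example 1).
        int[][] executedPaths = {
            {1, 2, 3, 4, 5, 6, 7, 4, 8, 12},   // input "1#@"
            {1, 2, 8, 9, 10, 11}               // input "i"
        };

        Set<Integer> coveredNodes = new HashSet<>();
        Set<String> coveredEdges = new HashSet<>();
        for (int[] path : executedPaths) {
            for (int k = 0; k < path.length; k++) {
                coveredNodes.add(path[k]);
                if (k > 0) coveredEdges.add(path[k - 1] + "-" + path[k]);
            }
        }

        Set<Integer> allNodes = new HashSet<>();
        Set<String> allEdges = new HashSet<>();
        for (int[] e : EDGES) {
            allNodes.add(e[0]);
            allNodes.add(e[1]);
            allEdges.add(e[0] + "-" + e[1]);
        }

        Set<String> missingEdges = new TreeSet<>(allEdges);
        missingEdges.removeAll(coveredEdges);

        System.out.printf("All-Nodes: %d of %d nodes covered%n",
                coveredNodes.size(), allNodes.size());
        System.out.printf("All-Edges: %d of %d edges covered, missing %s%n",
                allEdges.size() - missingEdges.size(), allEdges.size(), missingEdges);
    }
}

Running it reports 12 of 12 nodes covered but only 13 of 17 edges, with (1,8), (5,7), (9,12), and (10,12) missing, matching the discussion above.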
Due to the complementary nature of the testing techniques, the test suite resulting from the application of functional testing criteria may be used as the initial test suite for structural testing. Since in general such a test suite is not enough to satisfy a structural criterion, new test cases are generated and included in the test suite until the desired level of coverage is reached, hence exploring the complementarity of the techniques. A problem related to structural testing criteria is the general impossibility of automatically determining whether a given path is feasible; that is, there is no general-purpose algorithm which, given any complete path, decides whether such a path is executable and, if so, computes the input values which cause its execution [326]. Therefore, tester intervention is required to determine both the feasible and the infeasible paths of the PUT.

Data-flow-based testing criteria. The introduction of these criteria "bridges the gap" between All-Paths and All-Edges, in an attempt to make testing more rigorous, since, in general, the All-Paths criterion is not practical due to the infinite number of possible paths. According to Ural [323], data-flow-based criteria are more adequate for detecting some classes of faults, such as computational faults. Once data dependencies are identified, functional segments of the product are required to be exercised by the testing requirements.

Rapps and Weyuker proposed the use of a Definition-Use Graph (Def-Use Graph, for short), or DUG, which consists of an extension of the CFG [283]. The DUG contains information about the data flow of the PUT, characterizing associations between statements in which a value is assigned to a variable (known as a "definition") and statements in which the value of that variable is used (known as a "use"). For instance, by taking the CFG of Figure 2 and extending it with information about the variable definitions and uses present in the implementation of the validateIdentifier method (see Listing 2.4), we obtain the DUG presented in Figure 3.

More generally, the occurrences of variables in a program can be classified as variable definitions, variable uses, or undefined variables. Usually, the different types of occurrences are defined by a data-flow model. Considering the model defined by Maldonado [228], a variable definition occurs when a value is stored in a memory location. In general, in a product implementation, the occurrence of a variable is a definition when the variable: is on the left side of an assignment statement; is an input parameter; or is an output parameter, passed by reference during a unit call. The occurrence of a variable is a use when it is not a definition. There are two different types of uses: computational use (c-use) and predicative use (p-use). The former directly affects the computation being performed or allows the result of a previous definition to be observed; the latter directly affects the control flow of the product implementation. We note that c-uses are associated with nodes and p-uses are associated with outgoing edges. A variable is considered undefined when it is not possible to access its value or when its location in memory is not defined.

If x is a variable occurring in a given program, a path c = (i, n1, ..., nm, j), m ≥ 0, containing no definition of x in nodes n1, ..., nm is called a definition-clear path (def-clear path) with respect to x from node i to node j and from node i to the edge (nm, j).
A node i has a global definition of a variable x if there is a definition of x in i and there is a def-clear path from i to a node or an edge with a c-use or a p-use of x. A c-use of x is a
Fig. 3. Def-use graph of validateIdentifier method
global c-use if there is no definition of x in the same node preceding the c-use. A data-flow association is represented by a triple ⟨x, i, j⟩ when corresponding to a c-use, and by a triple ⟨x, i, (j, k)⟩ when corresponding to a p-use, where x is a variable, i is a node containing a global definition of x, and j/(j, k) is a node/edge with a c-use/p-use of x.

Example 2. In order to exemplify the previous concepts, we consider the DUG of Figure 3 and the corresponding source code of the validateIdentifier method in Listing 2.4. There are three different kinds of information annotated on the DUG: a set d of defined variables assigned to each node, variables in a c-use assigned to nodes, and variables in a p-use assigned to edges. We observe that at node 1 there are definitions of the variables s and valid_id, since lines 4 and 6, respectively, contain statements causing such definitions. At nodes 2 and 5 there are c-uses of the variables achar (line 9), s, and i (line 14). Finally, we have p-uses of variables assigned to the outgoing edges of decision nodes. For instance, node 4 is a decision node and has a p-use of the variables i and s, which means that these variables decide which edge is going to be taken – (4, 5) or (4, 8) – due to the conditional statement at line 13.

To give an example of a c-use association, consider the variable s defined at node 1. There is a c-use of this variable at node 3, represented by the association ⟨s, 1, 3⟩. Similarly, ⟨valid_id, 1, (8, 9)⟩ and ⟨valid_id, 1, (8, 12)⟩ correspond to the p-use associations of the variable valid_id defined at node 1 and its p-uses on the edges (8, 9) and (8, 12). The path (1,8,12) is a def-clear path with respect to valid_id defined at node 1 that covers the association ⟨valid_id, 1, (8, 12)⟩. On the other hand, the path (1,2,8,12) is not a def-clear path with respect to valid_id defined at node 1, since there is a redefinition of valid_id at node 2; in this way, when reaching the p-use of valid_id at edge (8, 12), such a use does not correspond to the value of valid_id defined at node 1 but to the one defined at node 2. In order to
cover the data-flow association with respect to the definition of valid_id at node 1, we have to find a test case that follows a complete path not passing through node 2.

The most basic data-flow-based criterion is the All-Defs criterion, which is part of the family of criteria defined by Rapps and Weyuker [283]. Among the remaining criteria of this family, the most used and investigated is the All-Uses criterion.

– All-Defs: requires each variable definition to be exercised at least once, by a def-clear path from the definition to a c-use or a p-use of the variable.
– All-Uses: requires all data-flow associations between a variable definition and all its subsequent uses (c-uses and p-uses) to be exercised by at least one def-clear path.

Example 3. To exercise the definition of the variable valid_id at node 6, according to the All-Defs criterion, one of the following paths must be executed: (6,7,4,8,12) or (6,7,4,8,9). However, we must bear in mind that the path (6,7,4,8,9) is infeasible, as are all complete paths which include it. Therefore, in this example, there is only one way to satisfy the testing requirement, namely an executable path passing through (6,7,4,8,12), and the test case (1#@, Invalid), for instance, follows such a path. In order to satisfy the All-Defs criterion, this analysis has to be done for each variable definition in the PUT. For the All-Uses criterion, with respect to the same definition, the following associations are required: ⟨valid_id, 6, (8, 9)⟩ and ⟨valid_id, 6, (8, 12)⟩. As previously discussed, the association ⟨valid_id, 6, (8, 9)⟩ is infeasible and can be dropped, while the other can be covered by the same test case. This analysis has to be carried out for all other variable definitions and their corresponding def-use associations in order to satisfy the All-Uses criterion.

Most of the data-flow-based testing criteria that require the coverage of a certain element (path, association, and so on) demand the explicit occurrence of a variable use, and therefore do not necessarily guarantee the subsumption of the All-Edges criterion in the presence of infeasible paths, which occur frequently. With the introduction of the concept of potential use, a family of Potential-Uses testing criteria was defined [228]. What distinguishes these criteria from the ones mentioned before is that they introduce testing requirements regardless of the explicit occurrence of a use with respect to a given definition: it is enough that a use of the variable may exist, that is, the existence of a def-clear path with respect to a given variable from its definition to some node or edge characterizes a potential use, and the corresponding association is required.

Similarly to the other data-flow-based criteria, the Potential-Uses criteria use the DUG as the basis for deriving the set of testing requirements (potential-use associations). Actually, all that is needed is an extension of the CFG, called the Definition Graph (DEG), in which each node contains the information about its set of defined variables. Figure 4 illustrates the DEG for the validateIdentifier method. The All-Pot-Uses criterion requires, for every node and every edge that can be reached from a node i containing a definition of a variable x, at least one def-clear path with respect to x from i to that node or edge to be exercised.

Example 4. The potential associations ⟨s, 1, 6⟩, ⟨achar, 3, (8, 9)⟩, and ⟨achar, 3, (8, 12)⟩ are required by the All-Pot-Uses criterion, but are not required by
Fig. 4. Def graph of validateIdentifier method
the other data-flow-based criteria. Moreover, since each data-flow association is also a potential data-flow association, the associations required by the All-Uses criterion are a subset of the potential associations required by the All-Pot-Uses criterion. In other words, All-Pot-Uses subsumes All-Uses by definition. The subsume relation is an important property of testing criteria and is used to evaluate them from a theoretical point of view. As discussed before, the All-Edges criterion, for instance, subsumes the All-Nodes criterion; that is, any All-Edges-adequate test suite is also necessarily All-Nodes-adequate, but the opposite does not hold. When it is not possible to establish such a subsume order between two criteria, such as the All-Defs and All-Edges criteria, the criteria are said to be incomparable [283]. It is important to observe that the Potential-Uses criteria are the only data-flow-based criteria that, even in the presence of infeasible paths, satisfy the minimum properties required of a test criterion, and no other data-flow-based testing criterion subsumes them. Figure 5 depicts both situations with respect to the subsume relation. In Figure 5(a) the relationship among the criteria does not consider the presence of infeasible paths, which, as mentioned previously, is quite common in the majority of real product implementations. Figure 5(b), on the other hand, presents the subsume relation considering the presence of infeasible paths.
As previously mentioned, one of the disadvantages of structural testing is the existence of required infeasible paths. There is also the problem of missing paths: when a given functionality is not implemented, the structural testing criteria cannot select testing requirements to test such a functionality, because there is no corresponding implementation and, therefore, no test case is required to test it. Nevertheless, such criteria rigorously establish the testing requirements to be satisfied by the test suite in terms of paths, def-use associations, or other structural elements, allowing the objective
[Figure: two diagrams of the subsume relation among the criteria All-Paths, All-Pot-Du-Paths, All-Pot-Uses/Du, All-Du-Paths, All-Pot-Uses, All-Uses, All-C-Uses/Some-P-Use, All-P-Uses/Some-C-Use, All-Defs, All-C-Uses, All-P-Uses, All-Edges, and All-Nodes.]
(a) Does not consider infeasible paths. (b) Considers infeasible paths.
Fig. 5. The relationship among data-flow-based testing criteria
measure of the adequacy of a test suite for testing a product. The strict definition of structural testing criteria also facilitates the automation of their application.
5.2 Automation
In this section we discuss the basic use of control-flow- and data-flow-based testing criteria. In order to illustrate the use of control-flow-based testing, we use a tool named Emma (emma.sourceforge.net), which supports the use of the All-Nodes criterion. As with JUnit, there are several testing tools to choose from, open source or not. We employ Emma because it is easy to use and allows full integration with JUnit. Moreover, it can also be integrated with IDEs via plug-ins; in the case of Eclipse, for instance, the EclEmma plug-in (www.eclemma.org) provides such an integration. Below we describe the basic functionalities of Emma version 2.0.5312 and how to use it through the command line, since our intention is to present the concepts behind the tool rather than how to operate it within a particular IDE. To show the complementarity between control-flow- and data-flow-based testing criteria, we use JaBUTi (incubadora.fapesp.br/projects/jabuti), which supports the application of the All-Nodes, All-Edges, All-Uses, and All-Pot-Uses criteria for Java bytecode [330].
Emma testing tool. Emma supports the application of the All-Nodes criterion. One of its benefits is that it can be integrated with JUnit, so that the tester is able to evaluate how much of the PUT was executed by a particular JUnit test suite.
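To make this integration concrete, the sketch below shows how a few of the functional test cases used in this chapter, such as (1#@, Invalid) and (c, Valid), could be coded as a JUnit suite whose coverage Emma then measures. The class and method signatures are assumptions for illustration only: we suppose that the Java class identifier.Identifier exposes a boolean validateIdentifier(String) method, mirroring the C signature shown later in Listing 2.6; adapt the names to the actual implementation.

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;

    import org.junit.Test;

    // Illustrative JUnit 4 suite; the Identifier API is an assumption,
    // not the chapter's actual source code.
    public class IdentifierCoverageTest {

        private final identifier.Identifier id = new identifier.Identifier();

        @Test
        public void invalidIdentifierStartingWithDigit() {
            // test case (1#@, Invalid)
            assertFalse(id.validateIdentifier("1#@"));
        }

        @Test
        public void validSingleCharacterIdentifier() {
            // test case (c, Valid)
            assertTrue(id.validateIdentifier("c"));
        }

        @Test
        public void invalidEmptyIdentifier() {
            // test case ("", Invalid)
            assertFalse(id.validateIdentifier(""));
        }
    }

Running such a suite under Emma, for instance through the emmarun class loader described next, produces reports like the ones shown in Figures 6, 7, and 8.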
In order to execute the PUT and collect execution-trace information, Emma uses a specialized class loader, called emmarun, which is responsible for instrumenting the product before its execution. Each execution corresponds to a new test case, and the tester can generate a testing report in order to evaluate the progress of the testing activity. There are different levels of reports, as illustrated by Figures 6, 7, and 8. The report details different “levels of coverage”, as explained below.
– class: corresponds to the total number of classes under testing versus the number of classes executed during testing. Since we have just one class, the tool considers it executed and, therefore, we executed 100% of the classes under testing.
– method: corresponds to the total number of methods of the classes under testing versus the number of methods executed during testing. In our example, Identifier has five methods and three of them were executed during this test (60%).
– block: corresponds to the total number of blocks of all methods of all classes versus the number of blocks actually covered during testing. In our example, the five Identifier methods have a total of 121 blocks and 29 of these were covered (24%).
– line: remember that a block is a set of statements; therefore, based on the computed blocks, it is possible to infer the total number of lines of all methods of all classes under testing. This coverage relates the total number of lines in the PUT to the number of lines covered during testing. In our case, Identifier has 29 executable lines and 9.2 were executed during testing (32%).
It is important to note that Emma works at the bytecode level and computes the number of source-code lines based on the bytecode information. Because of this conversion, the number of executed lines can be fractional, since a single source-code line can be broken down into several bytecode instructions, and these may belong to different CFG blocks, which are not always executed together due to conditions in the product implementation.
On a more detailed analysis, we observe that the Equivalence Partition-adequate test suite used previously was able to cover all statements of four out of the five methods of the Identifier class, as illustrated in Figure 7. The only method that does not reach full coverage is the main method, with 68% block coverage, which is natural since the tests are run by JUnit. To reach 100% coverage of all methods of the Identifier class, two additional test cases have to be executed via the main method: one calling identifier.Identifier with no parameters, in order to cover line 47, and another calling it with a valid identifier, in order to cover line 52.
JaBUTi testing tool. To illustrate the complementary aspects of the testing criteria, we use another testing tool, named JaBUTi (incubadora.fapesp.br/projects/jabuti/). This tool was originally developed to support the use of the All-Nodes, All-Edges, All-Uses, and All-Pot-Uses criteria at the unit level when testing Java product implementations [330]. Since its definition in 2003, the original tool has been extended to test products beyond its original purpose. Figure 9 shows the current JaBUTi family, described below. All of the elements share the JaBUTi core, which includes the static analysis of bytecode and other basic features. They are all implemented in Java and run as
Fig. 6. Emma overall report after JUnit execution, excluding third-party packages and test suites
Fig. 7. Identifier method coverage after the execution of functional test cases
Fig. 8. main method coverage
desktop applications. Through our participation in the QualiPSo Project, we aim to make such tools available as open source software and to migrate them to a service-oriented architecture, supported by the QualiPSo Factory, which is currently under development.
[Figure: the JaBUTi family of tools, with JaBUTi [330] at the core and the extensions JaBUTi/MA [99], JaBUTi/ME [100], JaBUTi/Web [119], JaBUTi/DB [257], JaBUTi/Integration [123], JaBUTi/AJ [207], and eXVantage [360].]
Fig. 9. JaBUTi family
– JaBUTi/MA [99] is an extension of JaBUTi which enables the use of structural testing criteria to test mobile agents while they are running in their real environment.
– JaBUTi/ME [100] applies the same concepts to the structural testing of J2ME (Java Micro Edition) applications, both during their development, via emulators, and afterwards, when the applications are deployed to target devices such as PDAs (Personal Digital Assistants) and mobile phones.
– JaBUTi/DB [257] implements the data-flow testing criteria defined by Spoto et al. [303] specifically for applications that manipulate persistent data in the form of relational databases.
– JaBUTi/AJ [207] is a JaBUTi extension which provides support for applying structural testing to aspect-oriented products.
– JaBUTi/Integration [123] is an extension of JaBUTi which provides support for the application of structural testing at the integration level.
– JaBUTi/Web [119] is an initiative to extend the JaBUTi testing tool and its corresponding criteria to test Java-based Web applications.
– eXVantage [360] (www.research.avayalabs.com) is in fact a reduced version of JaBUTi which implements only its control-flow-based testing criteria. With this reduction, eXVantage avoids several time-consuming tasks related to the calculation and evaluation of data-flow-based testing requirements, thus improving the general performance of the tool.
The first step in JaBUTi is the creation of a hierarchical abstraction of the program being tested, in which the tester indicates which parts of the product should really be tested and which should be ignored, that is, excluded from the program structure during the instrumentation process. Such information is stored in a testing project, which allows the testing activity to be stopped and resumed at any time. Once the testing project is created, the tester has eight structural criteria to work with (radio buttons below the main menu in Figure 10). These criteria are summarized in Table 3. By selecting a criterion, the tester visualizes information about the program concerning the selected criterion. For example, by using the All-Nodesei criterion we are able to see the source code (if available – see Figure 10), the bytecode, or the def-use graph (also shown in Figure 10). In any of these views, the tester is provided with hints about which testing requirement should be covered in order to achieve a higher coverage.
Fig. 10. Source code and DUG visualization for the All-Nodesei criterion

Table 3. Testing criteria implemented by JaBUTi
– All-Nodesei (all nodes, regardless of exceptions): requires the execution of each node in the graph that can be executed without the occurrence of an exception.
– All-Edgesei (all edges, regardless of exceptions): requires the execution of each edge in the graph that can be executed without the occurrence of an exception.
– All-Usesei (all uses, regardless of exceptions): requires the coverage of each def-use pair that can be executed without the occurrence of an exception.
– All-Pot-Usesei (all potential uses, regardless of exceptions): requires the coverage of each def-potential-use [228] pair that can be executed without the occurrence of an exception.
– All-Nodesed, All-Edgesed, All-Usesed, All-Pot-Usesed (the same criteria, dependent on exceptions): respectively require the coverage of nodes, edges, def-use pairs, and def-potential-use pairs that can only be executed with the occurrence of an exception.
We observe that, although this DUG has a similar layout to the DUG presented in Figure 3, the labels assigned to the nodes of this DUG are in general the offset of the first bytecode instruction in the block. For instance, node 9 in the DUG of Figure 10 represents the instructions from offset 0 (“Start PC”) to 26 (“End PC”) at bytecode level, or lines 8 to 10 at source code level (“Corresponding Source Lines”). This node 9 corresponds to node 2 in the DUG of Figure 3. Actually, the latter was manually generated by editing the DUG generated by JaBUTi. The tester can manage testing requirements, for example, by marking a requirement as infeasible. A testing requirement may be covered by the execution of a test case. This is done “outside” the tool by a test driver that instruments the PUT and then starts
Fig. 11. Summary report by method after the execution of the JUnit test suite
Fig. 12. Summary report by criterion after 100% of statement coverage
the instrumented program, in the same way as emmarun does in the Emma tool. JaBUTi also allows the import and evaluation of the coverage of JUnit test suites against its criteria. For instance, Figure 11 shows one of JaBUTi's testing reports after the execution of our previous JUnit test suite.
Since we wish to obtain 100% code coverage with respect to the complete source code, two additional test cases were added to the JUnit test suite, which was then imported and evaluated by JaBUTi. The test suite TAll-Nodesei is therefore an All-Nodesei-adequate test suite made up of the following set of test cases: TAll-Nodesei = TEquivalence Partition ∪ {("", Invalid), (a1, Valid)}.
Figure 12 shows the summary report by criterion after the execution of the two additional test cases. As we can observe, despite obtaining 100% of statement coverage, the next criterion supported by JaBUTi (All-Edgesei) reaches only 88% of coverage, followed by All-Usesei with 85% of coverage and All-Pot-Usesei with 73% of coverage. Therefore, if there are enough resources in terms of time and cost, the tester may continue with the next criterion (All-Edgesei) and verify how its testing requirements can be used to improve the quality of the test suite. In this case, the tester needs to provide
additional test cases by forcing each boolean expression to evaluate to true and to false at least once, and so on. By clicking on any of the highlighted decisions, JaBUTi changes the color of the chosen decision point and highlights the corresponding branches of that decision point. For instance, Figure 13 shows the coverage status obtained by clicking on the decision point in line 10. As we can observe, the true branch is already covered, since the statements inside the if statement are marked in white (which means “covered”). On the other hand, the statement outside the if statement appears in a different color, an indication that it is not yet covered. If we take a look at the source code and at our test suite, we can conclude that a test case with a single-character identifier (valid or invalid) is missing, so that the if condition can evaluate to false. We can use the test case (c, Valid), for instance, to improve the current test suite and cover such an edge.
Fig. 13. False branch not yet covered: All-Edgesei criterion
By analyzing all the other uncovered edges, we have to include two additional test cases in TAll-Nodesei to make it All-Edgesei-adequate. Below, we give the test suite TAll-Edgesei, which is an All-Edgesei-adequate test suite composed of the following set of test cases: TAll-Edgesei = TAll-Nodesei ∪ {(c, Valid), ({, Invalid), (a{b, Invalid)}.
Fig. 14. Summary report by criterion after running TAll-Edgesei
Figure 14 gives the resulting coverage obtained with respect to all remaining JaBUTi testing criteria. Observe that, after covering 100% of the All-Nodes and All-Edges criteria, the coverage concerning the data-flow-based testing criteria is below 95%. At this point, considering the strongest criterion of JaBUTi (All-Pot-Usesei), there are 34 uncovered data-flow associations, all of which are shown in Table 4.

Table 4. Uncovered data-flow associations for the All-Pot-Usesei criterion (× marks infeasible associations)
01) ⟨L@0, 0, (76, 95)⟩ ×     18) ⟨L@4, 29, (84, 95)⟩ ×
02) ⟨L@1, 0, (76, 95)⟩ ×     19) ⟨L@4, 29, (76, 95)⟩ ×
03) ⟨L@3, 0, 93⟩ ×           20) ⟨L@4, 29, (72, 95)⟩
04) ⟨L@3, 0, (84, 93)⟩ ×     21) ⟨L@2, 49, (76, 95)⟩ ×
05) ⟨L@3, 0, (84, 95)⟩ ×     22) ⟨L@3, 64, (49, 64)⟩
06) ⟨L@3, 0, (76, 84)⟩ ×     23) ⟨L@3, 64, 93⟩ ×
07) ⟨L@3, 0, (76, 95)⟩ ×     24) ⟨L@3, 64, (84, 93)⟩ ×
08) ⟨L@3, 0, (72, 76)⟩ ×     25) ⟨L@3, 64, (84, 95)⟩ ×
09) ⟨L@3, 0, (9, 29)⟩ ×      26) ⟨L@3, 64, (76, 84)⟩ ×
10) ⟨L@3, 0, (9, 72)⟩ ×      27) ⟨L@3, 64, (76, 95)⟩ ×
11) ⟨L@2, 29, 95⟩            28) ⟨L@3, 64, (72, 76)⟩ ×
12) ⟨L@2, 29, (84, 95)⟩ ×    29) ⟨L@4, 66, (49, 64)⟩
13) ⟨L@2, 29, (76, 95)⟩ ×    30) ⟨L@4, 66, (76, 95)⟩ ×
14) ⟨L@2, 29, (72, 95)⟩      31) ⟨L@4, 66, 64⟩
15) ⟨L@2, 29, (49, 66)⟩ ×    32) ⟨L@2, 9, (84, 95)⟩ ×
16) ⟨L@2, 29, (49, 64)⟩ ×    33) ⟨L@2, 9, (76, 95)⟩ ×
17) ⟨L@4, 29, 95⟩            34) ⟨L@3, 9, (76, 95)⟩ ×
Key: L@0 - this; L@1 - s; L@2 - achar; L@3 - valid_id; L@4 - i.

The variable names L@x are used internally by JaBUTi to designate the local variables 0, 1, 2, ..., n. The real names of such variables can be identified from the bytecode if the class file is compiled with the -g parameter, which generates all debugging information; to facilitate the identification, a key with the corresponding source-code name of each variable is given at the bottom of the table. Since the edge (76, 95) is infeasible, all associations related to this edge are also infeasible and can be dropped. Such infeasible associations are marked with the symbol “×”. The associations ⟨L@3, 0, ·⟩ also fail to be feasible, either because there is no def-clear path to cover them or because the only way to take the edge (0, 72) is via a zero-length
identifier, which is always considered invalid when following the edge (72, 95). All other possible uses with respect to the variable L@3 defined at node 0 are infeasible. The potential associations ⟨L@2, 29, 95⟩, ⟨L@2, 29, (72, 95)⟩, ⟨L@4, 29, 95⟩, and ⟨L@4, 29, (72, 95)⟩ can be translated to the source-code representation as ⟨achar, 3, 12⟩, ⟨achar, 3, (8, 12)⟩, ⟨i, 3, 12⟩, and ⟨i, 3, (8, 12)⟩, respectively (see Listing 2.4 and Figure 4). We observe that all of them require an invalid test case with two characters to be covered; the test case (%%, Invalid), for instance, covers these associations. By continuing this analysis, we found one additional test case, (%%%a, Invalid), to cover the remaining feasible associations. All the others are infeasible and, therefore, after running these test cases we obtained 100% coverage with respect to all JaBUTi testing criteria. In summary, the test suite TAll-Pot-Usesei below is an All-Pot-Usesei-adequate test suite composed of the following set of test cases: TAll-Pot-Usesei = TAll-Edgesei ∪ {(%%, Invalid), (%%%a, Invalid)}. It must be highlighted that, even though we found an All-Pot-Usesei-adequate test suite and there are at least two faults in the validateIdentifier method, this test suite was not able to detect any of them.
5.3 State of the Art
Structural testing criteria have been used mainly for unit testing, since their testing requirements are easier to compute at this level of testing. Several efforts to extend them to integration testing can be identified. Haley and Zweben proposed a testing criterion to select paths within a module that should be tested at the integration level based on its interface [152]. Linnenkugel and Müllerburg presented a family of criteria that extends traditional control- and data-flow-based unit-testing criteria to integration testing [215]. Harrold and Soffa proposed a technique to determine interprocedural def-use associations, allowing the application of data-flow-based testing at the integration level [160]. Jin and Offutt defined some criteria based on a classification of the coupling between modules [181]. Vilela, supported by the concept of potential use, extended the family of Potential-Uses criteria to integration testing [327].
Harrold and Rothermel [159] extended data-flow testing to the context of object-oriented products at the class level. The authors observed that some data-flow-based testing criteria originally proposed for testing procedural products [283,160] can be used for testing both individual methods and the interactions between methods within the same class. However, such criteria do not consider data-flow associations that arise when a user invokes sequences of methods in an arbitrary order. To solve this problem, Harrold and Rothermel proposed an approach that covers different types of data-flow interactions between classes. This approach uses traditional (procedural) data-flow testing for testing methods and the method interactions inside the same class. To test methods that are visible outside the current class and can be called by other classes, a new representation, called the class control-flow graph (CCFG), was defined. From the CCFG, new testing requirements at the inter-method, intra-class, and inter-class levels can be determined [159].
Vincenzi et al. have also investigated the use of control- and data-flow-based testing criteria for testing object-oriented and component-based products [328]. Aiming at a common solution to deal with both kinds of products, they decided to carry out the static analysis of Java programs to determine the testing requirements directly at the bytecode level since, in general, when testing software components the product source code is not always available. With such an approach, regardless of the availability of the source code, it is possible to derive a set of testing requirements concerning different structural testing criteria, which can be used for the evaluation or the development of a test suite. The JaBUTi testing tool described above was developed to support such an approach, allowing the application of control- and data-flow-based criteria at the bytecode level.
6 Fault-Based Testing Technique
6.1 State of Practice
The mutation testing technique was briefly described in Chapter 1. The basic idea behind mutation testing is the competent-programmer hypothesis, which states that a good programmer writes correct or close-to-correct programs. Assuming this hypothesis is valid, we can say that errors are introduced into a program through small syntactic deviations (faults) that lead its execution to an incorrect behavior. In order to reveal such errors, mutation testing identifies the most common of these deviations and, by applying small changes to the PUT, encourages the tester to construct test cases that show that such modifications produce incorrect programs [2].
A second hypothesis explored by mutation testing is the coupling effect. It states that complex errors result from the composition of simple ones. Thus, it is expected – and some experimental studies [58] have confirmed this – that test suites which reveal simple faults are also able to discover complex errors. Accordingly, a single mutation is applied to the PUT P; that is, each mutant contains a single syntactic transformation with respect to the original program. A mutant with k changes is referred to as a k-mutant. Mutants of order higher than 1 have not been used in the literature and will therefore not be explored in this chapter.
Considering these two hypotheses, the tester should provide a PUT P and a test suite T whose adequacy is to be assessed. The PUT is executed against T and, if a failure occurs, a fault has been revealed and the test is over. If no problem is observed, P may still have hidden faults that T is unable to reveal. In this case, P is submitted to a set of “mutation operators” which transform P into P1, P2, ..., Pn, called mutants of P.
Mutation operators are rules that model the most frequent faults or syntactic deviations related to a given programming language. A mutation operator defines the (single) change to be applied to the original program in order to create a mutant. It is designed for a target language and should fulfill one of the following objectives: create a simple syntactic change based on typical errors made by programmers (changing the name of a variable, for instance); or force test cases to have a desired property (covering a given branch in the program, for instance) [266].
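As a minimal illustration of these ideas, not tied to any particular mutation tool, the Java sketch below shows an original unit, a mutant obtained by a single syntactic change (a relational operator is replaced), and the effect of two test cases: one is unable to distinguish the mutant from the original, while the other distinguishes (kills) it, in the sense made precise below.

    public class MutantIllustration {

        // Original unit under test
        static boolean isPositive(int x) {
            return x > 0;
        }

        // Mutant: a single syntactic change ('>' replaced by '>=')
        static boolean isPositiveMutant(int x) {
            return x >= 0;
        }

        public static void main(String[] args) {
            // x = 5: both versions return true, so this test case leaves the mutant alive
            System.out.println(isPositive(5) + " " + isPositiveMutant(5)); // true true
            // x = 0: the versions disagree (false vs. true), so this test case kills the mutant
            System.out.println(isPositive(0) + " " + isPositiveMutant(0)); // false true
        }
    }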
Mutants are executed against the same test suite T. Dead mutants are those whose result differs from that of P on at least one test case in T. The others, that is, the ones whose results coincide with those of P for every test case of T, are called live mutants. The ideal situation is to have all mutants dead, which indicates that T is adequate for testing P (according to that set of mutants). As already mentioned, a serious problem in this approach is the possible existence of equivalent mutants. In some cases, the syntactic change used to create the mutant does not result in a behavioral change and, for every element in the input domain, P and the mutant compute the same results. A few heuristics have been proposed to automatically identify equivalent mutants, but this is not always possible [59]: the problem of deciding whether two programs compute the same function is undecidable. Therefore, in general, mutation testing requires tester intervention to determine equivalent mutants, in the same way as structural testing requires the determination of infeasible paths.
After executing the mutants and – probably with tester intervention – identifying the equivalent ones, an objective measure of test-suite adequacy is provided by the mutation score:

    ms(P, T) = DM(P, T) / (M(P) − EM(P))

where
– DM(P, T): number of mutants killed by the test suite T;
– M(P): total number of mutants;
– EM(P): number of mutants equivalent to P.
The mutation score ranges from 0 to 1; the higher it is, the more adequate the test suite. The value of DM(P, T) depends only on the test suite used to execute P and the mutants, and the value of EM(P) is obtained as the tester, manually or aided by heuristics, decides that a given live mutant is in fact equivalent. Besides the existence of equivalent mutants, the cost of applying the criterion – represented mainly by the effort to execute the mutants against the test suite – is regarded as a serious obstacle to the adoption of mutation testing.
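The score itself is trivial to compute once the mutants have been classified; the small Java sketch below simply restates the formula. The numbers in the example correspond to the Identifier test session discussed later in this section (447 mutants from the essential operators, 80 of them equivalent, 187 killed by the equivalence-partition-adequate suite).

    public class MutationScore {

        // ms(P, T) = DM(P, T) / (M(P) - EM(P))
        static double ms(int killedMutants, int totalMutants, int equivalentMutants) {
            return (double) killedMutants / (totalMutants - equivalentMutants);
        }

        public static void main(String[] args) {
            // 187 killed out of 447 - 80 = 367 non-equivalent mutants
            System.out.printf("ms = %.4f%n", ms(187, 447, 80)); // 0.5095, reported as 0.509 below
        }
    }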
application of the interface mutation criterion. Moreover, due to its multi-language architecture, it may also be configured to test products written in other languages. Basically, PROTEUM/IM offers the tester all the necessary resources for evaluating or generating test suites based on the mutation testing criterion. Based on the information provided by PROTEUM/IM 2.0, the tester is able to improve the quality of T until an adequate test suite is obtained. The resources provided by PROTEUM/IM 2.0 allow the execution of the following operations: define the test cases; execute the PUT; select mutation operators used to generate the mutants; generate mutants; execute mutants; analyse alive mutants; and compute the mutation score. Some of these functions are completely automated by the tool (such as the execution of mutants) while others require the aid of the tester in order to be carried out (such as the identification of equivalent mutants). In addition, some features have been added to the tool in order to facilitate the execution of experimental studies. For instance, the tool allows the tester to choose to execute the mutants with all test suites, even if it has already been distinguished. With this kind of test session (called a research session), additional data can be gathered regarding the efficacy of mutation operators or the determination of strategies for test case minimization [95]. An important point to consider when applying mutation testing criteria is the decision over which set or subset of mutation operators should be used. PROTEUM/IM 2.0 implements two different sets of mutation operators, one composed by 75 operators that perform mutations at unit level and another composed by 33 operators for integration testing. Mutation operators for unit testing are divided into four classes: statement mutations, operator mutations, variable mutations, and constant mutations, depending on the syntactical entity manipulated by the mutation operator. It is possible to select operators according to the classes or faults to be addressed, allowing the creation of mutants to be done stepwise or even to be divided between testers working independently. Table 5 shows a few operators for each of these classes. The use of PROTEUM/IM 2.0 is based on “testing sessions” in which the tester can perform operations in steps, create a session, interrupt it, and resume it. For a C version of the Identifier product, a test session can be conducted using a GUI or testing scripts. The first allows the tester to learn and explore the concepts related to mutation testing and to the tool itself. In addition, it offers a better means to visualize test cases and mutants, thus facilitating some tasks such as the identification of equivalent mutants. Conducting a test session using the graphical interface is probably easier, but less flexible than using command line scripts. The graphical interface requires constant Table 5. Example of mutation operators for unit testing of C products Operator Description u-SSDL u-ORRN u-VTWD u-Ccsr u-SWDD u-SMTC u-OLBN u-Cccr u-VDTR
Removes a statement from the program. Replaces a relational operator. Replaces the reference to a scalar by its predecessor and successor. Replaces the reference to a scalar by a constant. Replaces a while by a do-while. Breaks a loop execution after two executions. Replaces a logical operator by a bitwise operator. Replaces a constant by another. Forces each reference to a scalar to be: negative, positive, and zero.
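To give a concrete feel for what two of these operators do, the fragment below sketches, on a Java paraphrase of the length check of validateIdentifier (cf. Listing 2.6), the kind of single change each operator introduces. The operators themselves are defined for C and are applied by PROTEUM/IM 2.0 at the source level; this Java rendering is an informal illustration of ours, not output from the tool. These two operators, ORRN and Cccr, are precisely the ones that generate the fault-revealing mutants shown later in Listing 2.5.

    class OperatorSketch {

        // Original length check, paraphrased in Java from the C code of Listing 2.6
        static boolean lengthOk(boolean validId, String s) {
            return validId && (s.length() <= 6);
        }

        // u-ORRN mutant: a relational operator is replaced ('<=' becomes '<')
        static boolean lengthOkOrrn(boolean validId, String s) {
            return validId && (s.length() < 6);
        }

        // u-Cccr mutant: a constant is replaced by another one (6 becomes 1)
        static boolean lengthOkCccr(boolean validId, String s) {
            return validId && (s.length() <= 1);
        }
    }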
The graphical interface requires constant intervention on the part of the tester, while scripts allow the execution of long test sessions in batch mode. In script mode, the tester constructs a script specifying how the test should be carried out, and the tool simply follows the script without requiring tester intervention. On the other hand, this mode requires an additional effort to create the script and a complete knowledge of both the concepts behind mutation testing and the commands provided by PROTEUM/IM 2.0. Test scripts are very useful for performing experimental studies, which require several repetitive steps to be carried out many times in order to obtain statistically significant information, and should be written by experienced users.
We now evaluate the adequacy of our previous test suites against the mutation testing criterion, considering a C version of Identifier. To perform this evaluation, only a subset of the unit mutation operators is used. This subset (presented in Table 5) is called the “essential set of mutation operators”, in the sense that a test suite that is adequate for this subset of operators should be adequate, or almost adequate, for the entire set [21]. For our example, the complete set of mutation operators, when applied to the Identifier product, generates 1,135 mutants, while the essential set generates 447, which amounts to a reduction of 60.6%. This is an important reduction, since one of the biggest costs of mutation testing is the large number of mutants that have to be executed and analyzed for possible equivalence. In our specific case, before evaluating the adequacy of the test suites against mutation testing, we determined by hand all the equivalent mutants: 80 out of 447, as can be seen in Figure 15(a). Figure 16 shows the original product (on the left) and one of its equivalent mutants (on the right), as well as the way the tester can determine an equivalent mutant using the PROTEUM/IM 2.0 graphical interface. Disregarding the equivalent mutants, 367 non-equivalent mutants are used in the evaluation of the adequacy of the test suites.
At first, only the test suite adequate for the equivalence partition criterion was imported and evaluated against the mutation testing criterion; Figure 15(a) shows the status of the test session after the mutant execution. We observe that this test suite killed 187 out of 367 mutants, hence determining a mutation score of 0.509. Subsequently, we evaluated the test suites TAll-Nodesei, TAll-Edgesei, and TAll-Pot-Usesei, which produced the status reports shown in Figures 15(b), 15(c), and 15(d), respectively. There was no increase in the mutation score from the equivalence partition test suite to the test suite for All-Nodesei, nor from the suite for All-Edgesei to the suite for All-Pot-Usesei. Therefore, at the end, after all test cases adequate for the All-Pot-Usesei criterion were executed, 128 out of 367 mutants were still alive. If we interrupt the testing activity at this point, we are accepting that any of these 128 mutants could be considered “correct” with respect to our current testing activity, since there is no test case capable of distinguishing their behavior from that of the original product. During the analysis of these live mutants we observed that 119 were alive due to missing test cases; 23 further test cases were therefore required to kill them. The mutation-testing-adequate test suite, TMutation Testing, is as follows.
[Figure: four test-session status windows: (a) TEquivalence Partition × Mutation Testing; (b) TAll-Nodesei × Mutation Testing; (c) TAll-Edgesei × Mutation Testing; (d) TAll-Pot-Usesei × Mutation Testing.]
Fig. 15. Test-session status window
TMutation Testing = TAll-Pot-Usesei ∪ { (1#, Invalid), (#-%a, Invalid), (zzzz, Valid), (aAA, Valid), (A1234, Valid), (ZZZZ, Valid), (AAAA, Valid), (aa09a, Valid), ([, Invalid), (X::, Invalid), (X18, Invalid), (X[[a, Invalid), (X{{a, Invalid), (aax#a, Invalid), (aaa, Valid), ({, Invalid), (a, Valid), (a#a, Valid), (111, Invalid), (`111, Invalid), (a11, Invalid), (a`11, Invalid), (a/a, Invalid)}
After executing the mutants against these test cases, we obtain the status report shown in Figure 17. As we can see in this figure, there are still 9 live mutants and the maximum mutation score is 0.975. However, a mutation score of 1.00 cannot be obtained for this product, since it has two remaining faults that were not detected even by an All-Pot-Usesei-adequate test suite. These live mutants are known as fault-revealing. A mutant is said to be fault-revealing if, for any test case t such that P(t) ≠ M(t), we can conclude that P(t) is not in accordance with the expected result, that is, the presence of a fault is revealed. The fault-revealing mutants were generated by the Cccr and ORRN mutation operators, and any test case developed to kill any of them actually detects the presence of a fault in the original product. Moreover, it is important to observe that these mutation operators belong to the essential set, so that, besides the cost reduction, the efficacy of this subset of mutants was not compromised in this case.
Fig. 16. Mutant visualization
Fig. 17. Status report after TMutation Testing execution
For instance, Listing 2.5 illustrates two such mutants (we present them in the same source code for the sake of brevity), each of them revealing a different fault. The mutations can be observed in lines 20 and 29. Listing 2.6 shows the implementation of the validateIdentifier function without such faults. Once these faults are corrected, we can apply the mutation testing criterion once more to verify whether a mutation score of 1.00 can be reached and, if so, we can incrementally apply the remaining mutation operators, depending on our time and cost constraints.
6.3 State of the Art
Experimental studies have provided evidence that mutation testing is among the most promising criteria in terms of fault detection [354,266]. However, mutation testing often imposes unacceptable demands on computing and human resources because of the large number of mutants that need to be executed and analyzed for possible equivalence with respect to the original program. Besides the existence of equivalent mutants, the cost of applying the criterion, represented mainly by the effort of executing the mutants against the test suite, is pointed out as a serious obstacle to the adoption of mutation testing.
    11  int validateIdentifier(char *s) {
    12    char achar;
    13    int i, valid_id = FALSE;
    14    if (strlen(s) > 0) {
    15      achar = s[0];
    16      valid_id = valid_s(achar);
    17      if (strlen(s) > 1) {
    18        achar = s[1];
    19        i = 1;
    20        while (i < strlen(s) - 0) {   // Orig.: (i < strlen(s) - 1)
    21          achar = s[i];
    22          if (!valid_f(achar))
    23            valid_id = FALSE;
    24          i++;
    25        }
    26      }
    27    }
    28    /* Orig.: (strlen(s) < 6) */
    29    if (valid_id && (strlen(s) >= 1) && (strlen(s) <= 6))
    30      return TRUE;
    31    else
    32      return FALSE;
    33  }

Listing 2.5. Fault-revealing mutants: line 20 (Cccr operator); line 29 (ORRN operator)
Several approaches have been taken to overcome this problem. Randomly Selected Mutation, Constrained Mutation, and Selective Mutation [266,254,21] are alternatives that may reduce the number of generated mutants, but if we wish to obtain an adequate test suite, equivalent mutants must still be determined. Initially, a few special hardware architectures were proposed [200,238,78]; they represent interesting theoretical work but are not really useful in practice. The most successful approaches propose alternative criteria in which the number of mutants is reduced by some kind of rule, for instance, the use of x% of randomly selected mutants or the use of only a few specific mutation operators. As already discussed, the definition of sufficient sets of mutation operators has also been explored, aiming at reducing the number of mutants without loss of criterion effectiveness.
With respect to the determination of equivalent mutants, Offutt and Craft [264] and Offutt and Pan [267] addressed the problems of test-data generation and equivalent-mutant detection by using both constraint-based techniques and compiler optimizations. The idea explored by Offutt and Craft [264] consisted of implementing and evaluating a set of compiler-optimization heuristics. This approach looks at the mutants which, compared to the original program, implement traditional “peephole” compiler optimizations. Compiler optimizations are designed to create faster yet equivalent programs, so a mutant which implements a compiler optimization is, by definition, an equivalent mutant. The set of heuristics they implemented was able to detect about 10% of the equivalent mutants [21]. DeMillo and Offutt [103] used the concept of constraints to develop an automatic way to generate test cases. The idea they proposed was that, by solving a set of constraints, it is possible to generate a test case that kills a given mutant. Even when this set of constraints cannot be completely satisfied, it is still useful for determining equivalent mutants. Experimental studies showed that this approach could achieve an equivalent-mutant detection rate of about 50% [267].
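The Java sketch below illustrates the underlying observation with an example of our own (it is not one of the mutants handled by the heuristics of [264]): the mutant differs syntactically from the original, yet corresponds to a classic strength-reduction optimization and therefore computes the same value for every input, so no test case can kill it.

    class EquivalentMutantSketch {

        // Original
        static int twice(int x) {
            return x * 2;
        }

        // Mutant: '*' replaced by a shift. For Java's 32-bit two's-complement ints,
        // x << 1 equals x * 2 for every input (overflow wraps identically), so the
        // mutant is equivalent: it can never be killed.
        static int twiceMutant(int x) {
            return x << 1;
        }
    }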
    11  int validateIdentifier(char *s) {
    12    char achar;
    13    int i, valid_id = FALSE;
    14    if (strlen(s) > 0) {
    15      achar = s[0];
    16      valid_id = valid_s(achar);
    17      if (strlen(s) > 1) {
    18        achar = s[1];
    19        i = 1;
    20        while (i < strlen(s)) {
    21          achar = s[i];
    22          if (!valid_f(achar))
    23            valid_id = FALSE;
    24          i++;
    25        }
    26      }
    27    }
    28
    29    if (valid_id && (strlen(s) <= 6))
    30      return TRUE;
    31    else
    32      return FALSE;
    33  }

Listing 2.6. Function validateIdentifier with removed faults
Hierons et al. [171] show how amorphous slicing [154] can be used to assist the human analysis of live mutants, rather than as a way of automatically determining equivalent mutants. Another study, by Harman et al. [155], reveals the relationship between program dependence and mutation testing. The idea of the authors is to combine dependence-analysis tools with existing mutation-testing tools, supporting test-data generation and the determination of equivalent mutants. They also proposed a new mutation-testing process, which starts and ends with dependence-analysis phases. The pre-analysis phase removes a class of equivalent mutants from further analysis, while the post-analysis phase is used to reduce the human effort spent in studying the few mutants that evade the automated phases of the process.
Vincenzi et al. [329] use Bayesian learning as a mechanism to reduce the effort needed to analyze the live mutants, instead of providing a way to automatically detect the equivalent ones. The idea of the proposed approach is to provide guidelines that facilitate the determination of equivalent mutants as well as the identification of non-equivalent ones, which is useful to improve the test suite. Based on historical data collected in previous experiments [21], the Bayesian-Learning-Based Equivalent-Detection Technique (BaLBEDeT) uses a brute-force algorithm to estimate the most promising group of mutants to be analyzed.
Mutation operators are, in fact, the representation of a fault model for a specific domain. In the above example we used the fault model developed to represent common faults in the context of the C programming language at the unit level, that is, a C function. As a representation of a fault model, mutation testing is frequently used to evaluate the quality of other testing criteria and, according to Andrews [14], it is a good way to evaluate the efficacy of a given test suite or criterion.
7 Experimental Software Engineering
Experimental studies are important in software engineering because we need objective and significant results with regard to the comprehension, control, prediction, and improvement of the software development process [349]. This is also the case for testing techniques, criteria, and tools, and evidence and data about them can be obtained from experimental studies. According to Kitchenham et al. [196], software engineering must find means by which the best evidence from current research can be integrated with practical experience and with human values in the decision-making process related to software development and maintenance. A technology should not be rated as completely good or bad, but only as suitable or not in some circumstances and for some organizations. Therefore, professionals should find experimental research studies about a given technology and evaluate these studies with respect to their particular circumstances [109].
Aiming to identify an appropriate level of evidence for the characterization of a given technology, software engineering employs two types of studies: primary studies and secondary studies [227]. Primary studies aim at characterizing a given technology within a specific context [227]; the most common types are surveys, case studies, and controlled experiments [349]. An example of a primary study is Basili and Selby's controlled experiment [26]. Secondary studies aim at identifying, evaluating, and interpreting all available research relevant to a particular research question, topic area, or phenomenon of interest [272,195]. One type of secondary study is the systematic review [53]; some examples of systematic reviews can be seen in [247]. Both primary and secondary studies have an underlying process [349,42].
In 2003, we developed a controlled experiment at USP intended to evaluate the efficacy of random testing, functional testing – boundary analysis and equivalence partition – and systematic functional testing [214]. The efficacy of the test suites generated by these criteria was evaluated against mutation testing; that is, we used mutation testing as a fault model, as discussed in [14], so that we could assess the capability of the generated test suites of exposing such faults. In an attempt to extend this experiment, besides the random, functional, and systematic functional test suites, Tillman provided us with additional test suites generated by Pex [263] – a tool for automatic test-suite generation based on symbolic execution. The extended experiment is described below, following a general experiment description framework.
Example 5 (Extended Experiment from Linkman et al. [214]).
Techniques: random testing; functional testing (boundary value analysis and equivalence partition); systematic functional testing; symbolic execution (used by Pex); and mutation testing (mutation analysis and interface mutation) for test-suite evaluation.
Software artifacts: the Unix Cal program.
Main objectives: to evaluate the adequacy of random, functional, and automatically generated test suites against mutation testing. Different test suites were generated and their effectiveness in killing the non-equivalent mutants of mutation analysis and interface mutation was determined.
Subject characterization: six subjects with knowledge of the PUT and of functional and random testing were selected.
Experimental project: no training of the subjects was necessary, since they already had the required knowledge about the testing criteria to be applied. The only program artifact available to the subjects was the textual specification of the Cal program (as a Unix man page), and they were asked to generate and submit a test suite based on the boundary analysis and equivalence partition criteria.
How the techniques were used:
Systematic functional testing: based on the program specification, one of the subjects generated a test suite, referenced as TSSFT, following the guidelines of this criterion, with 76 test cases.
Boundary analysis and equivalence partition: once again based on the program specification, another five test suites were generated by five different subjects applying these criteria. The generated test suites are referenced as TSPB1, TSPB2, TSPB3, TSPB4, and TSPB5, with 14, 21, 15, 21, and 14 test cases, respectively.
Random testing: from a pool of 162 test cases generated by Wong [351], test suites with different cardinalities were incrementally derived. First, we drew 10 test cases at random from the pool to compose the first test suite, TSRA10, with 10 elements. Then, from the remaining 152 test cases we drew 10 additional test cases to create the second test suite, TSRA20, with 20 elements, and so on, up to TSRA80 with 80 elements (Table 6 also reports the larger random suites, up to the entire pool TSRA with 162 test cases).
Test suites using Pex: using Pex, which is described in detail in Chapter 6, four different test suites were generated with different parameterizations of the tool. The first two test suites, TSPXS1 and TSPXS2, correspond to the symbolic execution of Cal considering its input as an array of strings; TSPXS1 contains 9 essential generated test cases and TSPXS2 contains 232 test cases. Two additional test suites, TSPXI1 and TSPXI2, were then generated from the symbolic execution of Cal considering its input as an array of integer numbers instead of an array of strings; TSPXI1 is composed of 13 test cases and TSPXI2 of 83 test cases. Similarly to what we did for random testing, we used TSPXS2 and TSPXI2 as pools of test cases and incrementally created subsets by taking the first 10 test cases, then the next 10, and so on. These subsets are referenced as TSPXS2-10 to TSPXS2-90 and as TSPXI2-10 to TSPXI2-80, respectively.
Mutation testing: we generated all possible mutants for the Cal program using the complete set of unit and interface mutation operators implemented by PROTEUM/IM 2.0. After the mutant generation, all of them were analyzed by hand and the equivalent mutants were identified and eliminated. A total of 8,891 mutants were generated and 885 were identified as equivalent. Therefore, the remaining 8,006 non-equivalent mutants were used as a set of possible faults, all of which could be killed by some test case.
Main obtained results: Table 6 shows, for each test suite, the number of live mutants, the percentage of live mutants with respect to the total number of generated mutants, and the resulting mutation score. As we can observe, only the TSSFT test suite, obtained
from the systematic functional testing criterion, killed all non-equivalent mutants and determined a mutation score of 1.0000. Only one additional test suite determined a mutation score greater than 0.98: TSPB3. On average, considering the test suites with the same cardinality, the functional test suites – the ones generated based on boundary analysis and equivalence partition – determined higher mutation scores than the random or automatically generated test suites. For instance, the number of test cases in TSPB1 to TSPB5 ranges from 14 to 21 and the maximum and minimum mutation scores are 0.9859 and 0.8734, respectively. If we analyze the random test suites with 10 to 50 test cases, TSRA10 to TSRA50, all of them determined mutation scores below 0.87. Only with 60 or more random test cases does the mutation score rise above 0.9, but even when running the entire random test suite TSRA, with 162 test cases, the maximum mutation score is 0.9735.
As concerns the automated test suites generated by Pex and based on string arrays, the first Pex test suite, TSPXS1, is the one that determined the lowest mutation score: 0.7872. The entire TSPXS2 test suite, with 232 test cases, determined a mutation score of 0.9599. The test suites generated from integer arrays performed better, determining mutation scores in the same range as random test suites of the same cardinality, around 0.96; moreover, with only 30 test cases, TSPXI2-30 already determined a mutation score above 0.95, which random test suites only reach with at least 70 test cases (TSRA70).
When considering that, of the 76 test cases of the TSSFT test suite, only 21 were effective, that is, killed at least one non-equivalent mutant, and by comparing the mutation scores obtained with test suites of such a cardinality, such as TSRA20, TSRA30, TSPXS2-20, TSPXS2-30, TSPXI2-20, and TSPXI2-30, with mutation scores of 0.8297, 0.8472, 0.8125, 0.8134, 0.8592, and 0.9599, respectively, we observe that the differences in the mutation scores range from 4.01% for TSPXI2-30 to 18.75% for TSPXS2-20 with respect to the mutation score determined by TSSFT.
Obviously this is just one example of a controlled experiment, and a broader experiment involving more subjects and programs should be carried out; the discussion presented above should therefore be interpreted with care. Given the low cost of applying random testing or automatic test-suite generation using Pex, either can be used as a starting point for creating an initial test suite within an incremental testing strategy. Such a test suite can be improved later based on a functional testing strategy, considering the guidelines provided by systematic functional testing, which includes the boundary analysis and equivalence partition criteria.
There are other examples of controlled experiments, such as the Extended Basili and Selby experiment run in 2000 at ICMC/USP [229]. This experiment was based on Basili and Selby [26] and Kamsties and Lott [188], with the addition of incremental testing, including the All-Nodes, All-Edges, All-Uses, and All-Potential-Uses criteria of the structural technique and mutation analysis.
Replications of experiments. Replications are necessary to build a body of knowledge from experimental results, increasing confidence in them under other conditions or formulating new research questions [349].
Table 6. Test-suite coverage against mutation testing

Test suite    # Live Mutants   % Live Mutants   Mutation Score   # Test Cases
TSPB1         708              7.96             0.9116           14
TSPB2         1014             11.4             0.8734           21
TSPB3         113              1.27             0.9859           15
TSPB4         578              6.5              0.9278           21
TSPB5         836              9.4              0.8956           14
TSSFT         0                0                1.0000           76
TSRA10        1457             16.39            0.8181           10
TSRA20        1364             15.34            0.8297           20
TSRA30        1224             13.77            0.8472           30
TSRA40        1212             13.63            0.8487           40
TSRA50        1103             12.41            0.8623           50
TSRA60        521              5.86             0.9350           60
TSRA70        250              2.81             0.9688           70
TSRA80        250              2.81             0.9688           80
TSRA90        250              2.81             0.9688           90
TSRA100       248              2.79             0.9690           100
TSRA110       248              2.79             0.9690           110
TSRA120       248              2.79             0.9690           120
TSRA130       248              2.79             0.9690           130
TSRA140       248              2.79             0.9690           140
TSRA150       248              2.79             0.9690           150
TSRA160       248              2.79             0.9690           160
TSRA          212              2.38             0.9735           162
TSPXS1        1704             19.17            0.7872           9
TSPXS2-10     1700             19.12            0.7877           10
TSPXS2-20     1501             16.88            0.8125           20
TSPXS2-30     1494             16.8             0.8134           30
TSPXS2-40     1458             16.4             0.8179           40
TSPXS2-50     823              9.26             0.8972           50
TSPXS2-60     799              8.99             0.9002           60
TSPXS2-70     799              8.99             0.9002           70
TSPXS2-80     799              8.99             0.9002           80
TSPXS2-90     535              6.02             0.9332           90
TSPXS2        321              3.61             0.9599           232
TSPXI1        484              5.44             0.9396           13
TSPXI2-10     1387             15.6             0.8268           10
TSPXI2-20     1127             12.68            0.8592           20
TSPXI2-30     321              3.61             0.9599           30
TSPXI2-40     306              3.44             0.9618           40
TSPXI2-50     306              3.44             0.9618           50
TSPXI2-60     275              3.09             0.9657           60
TSPXI2-70     274              3.08             0.9658           70
TSPXI2-80     270              3.04             0.9663           80
TSPXI2        270              3.04             0.9663           82
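The columns of Table 6 are tied together by the mutation-score formula of Section 6: with 8,891 generated mutants, 885 of which are equivalent, a suite's score is one minus the fraction of the 8,006 non-equivalent mutants it leaves alive, while the percentage column is taken over all generated mutants. The small Java check below recomputes the TSPB1 row from these relations, which is consistent with the published values; only the numbers come from the table, the names are ours.

    class Table6Check {

        static final int GENERATED = 8891;   // mutants generated for the Cal program
        static final int EQUIVALENT = 885;   // identified by hand

        public static void main(String[] args) {
            int live = 708;                                       // live mutants for TSPB1
            int nonEquivalent = GENERATED - EQUIVALENT;           // 8006
            double score = 1.0 - (double) live / nonEquivalent;   // about 0.9116
            double livePercent = 100.0 * live / GENERATED;        // about 7.96
            System.out.printf("ms = %.4f, live = %.2f%%%n", score, livePercent);
        }
    }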
Basili et al. [27] categorized some types of replications, which we describe below.
– Replications that do not alter the hypothesis:
  • replications that repeat the original experiment as closely as possible;
  • replications that alter the manner in which the first experiment was run.
– Replications that alter the hypothesis:
  • replications that alter design issues;
  • replications that alter factors of the experiment setting.
– Replications that reformulate the goals and the experiment hypothesis.
Another classification was proposed by Mendonça et al. [64], based on the relationship between the original experimenters and the replicators.
– Replications conducted by the same researchers who conducted the original experiment (Type I).
– Replications in which the replicators and the original experimenters are at different locations, but the former coordinate their efforts with the original experimenters (Type II).
– Replications in which the replicators and the original experimenters are at different locations and the former do not coordinate their efforts with the original team (Type III).
Therefore, Experimental Software Engineering, Experiment Replication, and Evidence-based Software Engineering should be used together to improve the body of knowledge in the Software Engineering field, since they provide the means to better understand the quality of, and the relationships among, the different methods, methodologies, techniques, criteria, and tools employed and developed so far.
8 Conclusions
In this chapter, we have presented theoretical and practical aspects of the functional, structural, and fault-based testing techniques, providing a general overview of basic and advanced concepts of software testing. We have presented several software testing criteria and related concepts, emphasizing the most used or prominent ones: the data-flow- and mutation-based testing criteria. The automated support for functional, structural, and fault-based testing criteria was illustrated with JUnit, Emma, JaBUTi, PROTEUM/IM 2.0, and Pex. Moreover, other initiatives and automation efforts for these and other criteria were mentioned, given the relevance of this topic for improving the quality and productivity of the software testing activity. Furthermore, our research group, as one of the partners in the QualiPSo Project, is performing several activities related to open source software. One of our objectives is to make our tools available as open source software so that they can be used and improved by the community.
We highlighted the complementary aspects of the different testing techniques and criteria, as well as the relevance of conducting experimental studies in order to build a knowledge base which can serve as the basis for the establishment of incremental testing strategies, combining the different criteria and taking advantage of each. In such strategies, weaker criteria – and maybe less effective ones – are first applied to evaluate the quality of an initial test suite or to generate one. Then, depending on time
and cost constraints, stronger criteria – and maybe more effective ones – can be used to improve the quality of the test suite and, consequently, the confidence in the PUT. Finally, we provided a discussion of the importance of experimental software engineering. Experimental studies are carried out to evaluate different aspects of the testing criteria, such as cost, effectiveness, and strength, aiming at the establishment of low-cost and highly effective testing strategies which are able to transform the state of practice related to the use of testing tools and criteria.
Automatic Test-Case Generation
Patrícia Machado¹ and Augusto Sampaio²
¹ Universidade Federal de Campina Grande, Brazil
[email protected]
² Universidade Federal de Pernambuco, Brazil
[email protected]
This chapter is an introduction to the theory, techniques, and tool support for automatic test-case generation. We discuss how test models can be generated, for instance, from requirements specifications, and present different criteria and strategies for generating and selecting test cases from these models. The TARGET tool is presented and used in this chapter for illustrating test-case generation techniques, along with a case study in the domain of mobile-phone applications. TARGET generates abstract test cases from use-case specifications presented as templates whose contents are described using a controlled natural language. Section 2 presents a generic model-based testing (MBT) process model, briefly discusses models that are suitable for test generation, and introduces MBT in the mobile-phone applications domain. Section 3 presents models that are commonly used for test-case generation along with test-case generation strategies and algorithms. Section 4 presents different strategies for test-case selection from test models such as labelled transition systems. Test selection allows for a customised yet automatic choice of test cases among the possible ones to be generated according to a given coverage goal. The focus is on test-purpose selection, random selection and similarity-based selection. Section 5 presents a strategy for generating test models from use-case specifications. The specifications are formalised according to use-case templates that have been designed with two main goals: to be used by analysts for requirements documentation and also by test designers, who can complement requirements with testing goals and concerns. The use-case specification is translated into a labelled transition system from which test cases can be generated. Section 6 presents the TARGET tool. This tool generates abstract test cases from use-case specifications with controlled natural language according to the templates presented in Section 5. TARGET has been developed for supporting feature and feature-interaction testing of mobile-phone applications. However, it can be broadly applied for functional testing of reactive systems whose behavior can be expressed by use-case scenarios. The use of the TARGET tool is illustrated in a case study. Section 7 presents perspectives on the use of MBT and test generation in practice and the obstacles yet to be faced.
1 Introduction
As explained in Chapter 1, a testing process usually includes activities such as planning, specification, execution and result analysis. During planning, testing goals, resources
required, and constraints are set. From this, the total testing effort can be estimated, as well as how many tests are needed and feasible to construct and maintain. Then, during specification, test cases are defined and documented, usually in a high-level language tailored for test-case specification. This activity may also include test-code creation if tests are automated. Once test cases and the system under test (SUT) are available, test cases are executed. The execution activity is necessarily followed by analysis of the results obtained. From this analysis, conclusions regarding the need for more tests, along with fixing defects, are drawn. Abstraction is a key concept used in any of these activities. For planning, it is necessary so that a general understanding of what needs to be tested is acquired [246]. For specification, abstraction is crucial; it allows testers to focus on specific characteristics to be tested and also to be systematic by guaranteeing effective coverage of a given characteristic [41]. As previously discussed in Chapters 1 and 2, test-case selection is always a challenge. Finally, during execution and analysis, abstraction is fundamental for allowing intended behavior to be observed in loco. Real software exhibits a number of behaviors, but most of them are irrelevant for a particular test-case execution. The use of models for testing can formalise the use of abstraction during these different activities. High-level models can be used to represent general functionalities of an application, making them explicit for test planning. Also, test cases can be automatically derived from models, favouring effective test coverage of intended software functionalities, and improving reliability and productivity in the testing process, particularly if testing activities are automated. Moreover, test execution can be automated and result analysis can be more precisely determined, providing accurate results. These are the main reasons that justify the wide use of models as part of testing methods and tools, ranging from purely functional to purely structural approaches. In this chapter, we focus on automatic test-case generation, and particularly on Model-Based Testing (MBT) [111]. This is a testing approach that makes use of models, usually called test models, to represent the behavior we want to test in a target implementation. From the test model, test cases are generated according to a given coverage goal. When formal models for generating test cases and viewing the SUT are considered, this approach is commonly known as conformance testing [320]. The main promises of MBT are as follows.
– Increased effectiveness in testing, by leading to a focus on and fair coverage of what needs to be tested;
– Test-cost reduction, by possible reuse of development artifacts and automation of test-artifact generation;
– Increased reliability in testing, due to the fact that most of the test artifacts are automatically generated, in contrast with coding them manually.
As already discussed in the previous chapter, test automation is essential for making testing processes feasible in practice, and this is particularly true for MBT processes [161]. Such processes, even though highly dependent on key decision-making, are usually composed of a huge number of automatic, repetitive and reproducible activities that can be error-prone when executed by hand. For instance, test-case coding can be just a direct translation of abstract specifications of conditions and scenarios of execution.
In addition, due to the huge number of artifacts produced in these processes, such as
test-case documents and code, it is usually difficult to keep track of the evolution of test suites in response to changes in the application. In this sense, automatic test-case generation is particularly interesting, since changes made to abstract models are automatically passed on to the test cases simply by re-generating them. In other words, test cases are re-generated to meet changes in requirements that are reflected in the test models. As a result, traceability may become automatic and under control. Moreover, automatic generation is essential for precisely addressing the coverage goals that are needed to allow for a systematic and complete investigation of a given functionality we want to test in the SUT. Unlike application-code generation, automatic test-case code generation is usually possible, even in full, since there is no creative step, for instance, to provide algorithms [20]. Assuming that test models are reasonably complete and consistent, the main obstacles that may impede full generation are the absence of automatic data-refinement mappings (from abstract types to concrete ones), which may need to be supplied when not formally defined by the development process, as well as platform-dependent concerns. Test-execution architectures and procedures can be quite standard. In this sense, a number of patterns have been proposed [147,16,148]. Apart from the test cases themselves, all testing artifacts that are needed at test execution, from test components to the drivers that execute the test cases, can be automatically generated from development models, particularly by using model-driven engineering approaches [16,54,212]. The main goal of automatic test-case generation is to generate all possible combinations of test cases from a given test model according to coverage criteria. Test cases are either documented in a test-case description language such as TTCN [361] or translated into a target language for automated execution. For the sake of simplicity, in this chapter, we focus on how test-generation algorithms produce test suites without considering test-case coding in a particular programming language. From abstract test-case descriptions, test cases can either be manually executed or translated into a target programming language. Automatic test-case generation algorithms are usually exhaustive with respect to the coverage goal defined. Therefore, depending on the magnitude of the model, these algorithms may produce a considerable number of test cases. If a coverage criterion is not properly chosen, the process can generate more test cases than is feasible for practical execution. Also, most of the test cases can be redundant in the sense of exercising common features of the SUT (for instance, the same lines of code) and revealing common sets of defects. Therefore, beyond structural coverage criteria, test-case generation may need to be combined with selection strategies that help to focus test generation on particular functionalities of interest, minimise redundancy in test suites, and limit the size of test suites. These selection strategies usually combine statistics and semantic analysis of models. In this chapter, we introduce and motivate model-based test-case generation by presenting theories and techniques together with a tool and a case study. Classical test-case synthesis and generation algorithms are presented along with selection techniques.
We discuss how test models can be generated, for instance, from requirements specifications and present different criteria and strategies for generating and selecting test cases
from these models. The TARGET tool is presented and used in this chapter for illustrating test-case generation techniques along with a case study in the domain of mobile-phone applications. A distinguishing feature of our approach is the use of templates to describe use cases from which the test model is derived and test cases are generated. These templates are filled with text in a Controlled Natural Language (CNL, a subset of English with a fixed grammar) relevant for the domain of mobile applications. The use of CNL allows the text to be mechanically processed.
2 Model-Based Testing
Figure 1 shows a typical MBT process model in which test-case generation is applied [111]. The model is composed of the following phases.
1. Building the Model. In this phase, a mental representation of the system’s requirements is formed and mapped to the model. The goal is to produce a test-ready model, that is, one that contains enough information to make automatic generation possible [41]. Development models and requirements documents can often be used as a starting point. However, test models usually need to be more detailed and complete.
2. Generating Test Cases. Test cases are derived from the model. The automation of this phase depends on the nature of the model. Formal models usually make it possible.
3. Running Test Cases. This phase consists of running the test cases generated previously.
4. Collecting and Analysing Results. Execution time and test-case results are collected and analysed to help to decide whether the tests are enough or more test cases are needed, and whether the model needs to be improved.
To be effective, test-case generation in an MBT process usually has the following requirements.
– A reasonable understanding of the SUT. This is usually achieved by studying the application domain and the requirements to be met.
– A suitable model for representing the requirements for the system that suits the application domain and testing goals, and that also makes test generation possible. It is also important to make sure that the test hypotheses that underlie test generation from the kind of model chosen can be met. For instance, it is usually required that the SUT can be modelled using a particular language (test hypothesis in [320]) to make reasoning relating specifications and implementations possible.
– A systematic procedure for building the model, either by hand or by automatic generation from requirements documents, for instance, by natural language translation.
– Test-generation algorithms that can be automatically applied according to coverage and test-selection criteria.
– A validation strategy with which the test cases generated can be assessed regarding whether they are consistent and feasible with respect to the SUT. The results of applying this strategy may lead to the need for model refinement and repetition of test-case generation or test-case execution. In formal testing frameworks, consistency is usually guaranteed by test-generation algorithms. Feasibility can also be automatically assessed by using statistical tools.
Fig. 1. A Model-Based Testing Process Model [258]
2.1 Test Models
In this chapter, we define a test model as an abstract model that is developed with the main purpose of test-case generation for testing applications against an expected behavior. This can be built from scratch or derived from development models of an application. A suitable test model must support both manual and automatic generation, so that both deterministic and automatic choices of test cases can be made [41]. Also, the model must be testable in the sense that we can produce ready-to-run test cases from it. Moreover, the model must meet the following requirements.
– Be a complete and accurate representation of all features to be tested, allowing valid coverage conclusions.
– Abstract detail that is not the focus of the test and would make the cost of testing prohibitive.
– Preserve detail that is key for revealing faults of interest and also for setting up test execution, and make test-oracle generation possible. Finding out what is the right
abstraction level for models is crucial. Too much detail may lead to state-space explosion, whereas too little may make full test generation impossible.
– Represent all stimuli and states that are visible externally. It is also crucial that an executable means of determining them in the SUT is available.
Additionally, it is important that the model is based on a well-founded test theory, such as the ones presented in [36,320], for instance, and has a formal semantics, so that valid properties of test suites and conformance relations can be defined to make the process of analysing results more precise. It is often the case that success in testing does not mean conformance of the SUT with the model. This is the usual scenario. However, failure in testing does not always mean non-conformance of the SUT with the model either. This usually happens when the test suites are faulty or not adequate for the testing goal. Test-generation algorithms based on formal models can often avoid this second situation, whereas the former is usually acceptable and untreatable. Furthermore, formal models enforce absence of conflicts, structural errors and behavior errors. They are also free from ambiguities. This favours generation of test cases that are valid and directly applicable. Such models can also be automatically verified, traced and analysed.
2.2 MBT in the Mobile-Phone Applications Domain
The mobile-phone applications domain is characterised by applications composed of a number of features that are highly interactive, having their flow of execution guided mostly by external input. A feature is a clustering of individual requirements that describe a cohesive, identifiable unit of functionality [321]. Mobile-phone applications have two specific types of testing (besides the conventional ones): feature and interaction testing. Feature testing is the process of validating feature requirements by testing, whereas interaction testing is the process of validating the integration of features to compose an application, as well as their possible ways of interacting with other features according to interaction requirements. Because features are usually developed and tested in isolation, or with a particular service [68], feature testing is very important to reduce the number of defects that escape from one phase to another during the test process. In addition, interaction testing usually assumes that features conform to their requirements. Feature and interaction testing are functional testing techniques, but have some particularities [258]. In the mobile-applications area, for example, the majority of test cases are manually executed. This fact increases the probability of human error during test-case execution. Therefore, test cases usually have a very small number of steps. Another characteristic of feature testing is the concept of redundancy, especially when two test cases have the same steps but in a different order, which formally distinguishes them. However, these test cases can be considered redundant because the order of steps is not important, for instance, when filling a form. Therefore, test-case generation and selection needs further investigation. In addition, time-to-market demands that development time be minimised and, consequently, feature testing needs to be executed with minimum-time requirements. Moreover, it is necessary to acquire an adequate level of knowledge
of requirements and also of the application domain to devise and execute test cases. Finally, it is usually necessary to know potential application behavior, because the feature can interact with other features. In this context, MBT seems to be a promising approach, since the testing-process effort can be considerably reduced by automatic test-case generation. Also, both feature and interaction testing have as their main goal to validate conformance with requirements. As we present in Section 5, it is possible to automatically generate test models from use-case specifications of feature requirements, and this strategy is extended in [120] to support interaction-requirements specification. In addition, MBT can make it possible to pursue a systematic investigation of possible combinations of test cases to be explored during interaction testing (as we discuss later on in this section). Another point that favours the application of MBT in this context is that requirements coverage can be guaranteed. Following feature testing, interaction testing is often executed to check behaviors that are expected when features are composed to form applications. In this context, we call interactions either the scenarios where a feature functionality depends on another feature (static interaction) or the scenarios where there are combinations of independent behaviors [217] (dynamic interactions). Static interaction occurs when a feature, in order to execute a certain functionality, uses a service or data of another feature; that is, one feature depends on the other. We note that there is no parallelism; the first feature is executing, uses a service of another feature and stops. On the other hand, dynamic interaction occurs when a feature or a network-service event interrupts the execution of another feature that is running in the foreground. As a result, the executions of the features are intermingled. These interactions are considered as independent behavior combinations. In this chapter, we call testing of dynamic interactions simply interruption testing. As an example of a static interaction, we consider the following situation: when we finish writing a message, we can choose a contact from the phonebook to send the message to. In this case, there is a static interaction between the message feature and the phonebook feature. An example of dynamic interaction is when a user is composing a text message and suddenly a call arrives, characterising the interaction between the sending-message and the incoming-call features. In the last example, we identify the flow that is executing (called the main flow) and the flow that caused the interruption. In this case, composing a message is the main flow and the interruption is caused by the incoming-call flow. The possible number of combinations of allowed interruptions at different points of a flow of execution of a feature is huge. MBT can handle this complexity by automatic generation once features and possible interactions are specified. However, this cannot be done by using conventional notations and strategies, since exhaustive specification can be required. Section 3.3 presents a test model capable of representing interruptions. This model focuses on allowing an interruption specification to be defined and plugged in at different points of execution of the interrupted model, so that test cases can be effectively generated and selected.
In summary, to provide an effective MBT solution for interruption testing, it is crucial to define a test model capable of representing such interruptions and, consequently, to make the automatic test-case generation process possible. In addition, the model has to be composable, allowing interruptions to be combined at different points of possibly different flows of execution. Moreover, due to the huge number of possible test cases, selection strategies need to be applied to reduce the size of the test suites.
3 Test Models and Test Generation
In this section, test models that can be used for test-case generation are presented together with corresponding test-case generation strategies. We focus on Markov chains, labelled transition systems (LTSs) and CSP models. Markov chains are used to describe systems as stochastic processes for statistical testing. On the other hand, LTSs are largely applied as test models of reactive systems. These models usually define the semantics of more abstract formalisms for the specification of reactive systems, and a variety of tools are available for generating them. Test-case generation algorithms from LTSs and tools to support this generation are also available [320,180]. Finally, the CSP process algebra [172] is a formal specification language primarily designed to describe the behavior of concurrent and distributed systems. The fundamental idea of CSP is communication based on events; events are abstractions of real-world actions. Therefore, CSP is very appropriate for modelling test behavior of reactive systems, with the additional advantage of making it possible for complex patterns of concurrency to be modelled.
3.1 Markov Chains
The use of Markov chains as test models has been proposed by Whittaker and Thomason [342] as a basis for statistical testing. This style of testing is based on the selection of samples that can represent a population of test cases according to probabilities that are attributed to the occurrence of events during software execution. Since these samples are randomly selected, valid statistical inferences can be made for the expected performance of the system in operation (see Figure 2). Statistical testing considers the testing process as a statistical experiment, with test selection based on an operational profile that covers all possible uses and their probability of occurrence [317]. This is the model adopted for statistical testing as part of the certification process of the Cleanroom Reference Model (CRM) [281]. It is also applied in many other testing approaches, such as the AFCT method [20] for component testing based on UML models. The model and algorithms used in this method are illustrated in the sequel.
The model. A usage chain for a software system consists of nodes that represent usage states, that is, the system's visible modes of operation, and arcs that represent stimuli that cause transitions between states, along with transition probabilities. In the CRM, arcs are associated with inputs received by the system, whereas in the AFCT method they are associated with operation calls. Two special states are added: a start and a terminate state (for limiting test-case length during test-case generation) [281]. Probabilities are required to represent the best estimate of real usage for the software.
Fig. 2. Statistical Testing [317]
The idea behind this model is to capture the stochastic nature of software. For instance, the probability of occurrence of inputs may vary significantly depending on the state being considered. From usage models, any number of test cases can be generated randomly, guided by different probability distributions. The process of constructing the usage chain usually involves three steps: identify and represent states and transitions in a state-transition diagram; assign probabilities to the transitions; and validate the model by transition-coverage analysis. From the usage chain, test suites can be selected as representative samples of test-case suites for the application. Additionally, statistical methods can be applied to help test planning and reliability assessment by calculating measures such as the probability of a transition appearing in a test sequence, the expected number of sequences to cover a state or transition, the expectation and variance of state or transition first passage, and the number of sequences needed to cover all states or transitions. This information can help to define the optimal test allocation, the estimate and variance of the overall failure rate, the probability of finding a failure, and the expected number of failures, and also helps to decide when to stop testing. Figure 3 presents an example of a chain that is automatically generated by the SPACES tool [20] for a functionality RemoveCar in a car-rental system, with an even probability distribution. From this graphical view, the probability distribution of the outgoing arcs of a node can be changed as long as the change keeps the distribution consistent. States are incrementally numbered during model generation, with 0 as the start state and 1 as the terminate state. The sequence of transitions that covers states 0, 2, 3, 4, 1 (in this order) is the main flow of the use case, which involves calls and returns between the CarRentalSystem and the CarManager classes. This simple example has only one alternative flow, denoted by the transitions to and from state 5, in which the car does not exist in the system.
Fig. 3. A usage model generated from UML sequence diagrams
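To make the notion of sampling test sequences from a usage chain concrete, the sketch below (in Python) encodes a small, hypothetical chain loosely inspired by the RemoveCar example of Figure 3 and draws test sequences by a random walk guided by the transition probabilities. The state numbering, stimulus labels and probabilities are illustrative assumptions, not output of the SPACES tool, and the walk is only a simplified stand-in for the selectPaths algorithm of Figure 4 discussed next.

import random

# Hypothetical usage chain: state -> list of (stimulus, probability, next state).
# State 0 is the start state and state 1 the terminate state, as in the text.
USAGE_CHAIN = {
    0: [("removeCar(plate)", 1.0, 2)],
    2: [("findCar(plate)", 1.0, 3)],
    3: [("carFound", 0.5, 4), ("carNotFound", 0.5, 5)],   # even distribution
    4: [("confirmRemoval", 1.0, 1)],
    5: [("reportNonexistentCar", 1.0, 1)],
}

def generate_sequence(chain, start=0, terminate=1):
    """One random walk from start to terminate: a statistically selected test case."""
    state, sequence = start, []
    while state != terminate:
        stimuli, probs, targets = zip(*chain[state])
        i = random.choices(range(len(stimuli)), weights=probs, k=1)[0]
        sequence.append(stimuli[i])
        state = targets[i]
    return sequence

if __name__ == "__main__":
    random.seed(0)
    for _ in range(3):
        print(generate_sequence(USAGE_CHAIN))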
Test generation. Rather than being manually created from scratch, the transition diagram of a usage chain can be automatically generated from development artifacts, starting with an even distribution of probabilities. Then, various probability distributions can be applied to this model. In the CRM, it is generated directly from the canonical sequences (state-box definition) defined during software specification. In the AFCT method, it is generated from UML interaction diagrams that may also be defined in the component specification, with additional information required by a testing specification. In the AFCT method, test cases are generated from the usage model by extracting selected paths from the graph to define the set of test cases. Figure 4 presents the algorithm that selects paths from the usage model. This algorithm, inspired by the CRM, explores paths that begin at the start node and finish at the end node, by applying a probability-distribution function. The algorithm also ignores paths that were already explored, avoiding duplicated paths. The function selectPaths receives a usage model as input, together with the number n of paths to be selected. The function selectableLinks monitors the visited transitions (those that were fully explored), avoiding repetition of paths. When exploring the non-visited transitions, the associated probability is considered and is crucial in defining the selected path.
3.2 Input-Output Labelled Transition Systems
Input-output labelled transition systems have been largely applied as test models, and a number of well-founded test-generation algorithms have already been proposed. This section focuses on the model and algorithms defined for the TGV tool. TGV (Test Generation with Verification technology) [180] is a black-box testing tool that provides automatic synthesis of conformance test cases for non-deterministic reactive systems. It has already been applied in industrial experiments [118]. Test-case synthesis is based on verification techniques such as synchronous product, on-the-fly verification and traversal algorithms. The test model and generation algorithms of TGV are briefly and informally
Fig. 4. Algorithm for selecting test cases in a usage model
described and illustrated by examples in this section. The complete theory that underlies the algorithms can be found in [180].
The model. A reactive system can be modelled as a transition system where transitions represent input and output actions as well as internal actions. From a testing perspective, the idea is to represent the behavior of the system as sequences of events that can be used as stimuli for the system (to control its execution during testing) and also events that are expected to be observed during test execution. Formally, an Input-Output Labelled Transition System (IOLTS) is a quadruple M = (Q^M, A^M, →_M, q0^M), where:
– Q^M is a finite non-empty set of states;
– q0^M ∈ Q^M is the initial state;
– A^M = A_I^M ∪ A_O^M ∪ I^M is the alphabet of actions, with A_I^M the input alphabet, A_O^M the output alphabet, and I^M the internal-actions alphabet;
– →_M ⊆ Q^M × A^M × Q^M is the transition relation.
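As a purely illustrative rendering of this definition, an IOLTS can be encoded directly as a small data structure. The sketch below, in Python, stores the quadruple with the alphabet split into inputs, outputs and internal actions; the tiny fragment encoded mirrors part of the behavior that Section 3.4 later expresses in CSP, and is an assumption for illustration rather than a faithful copy of Figure 5.

from dataclasses import dataclass

@dataclass
class IOLTS:
    # M = (Q, A, ->, q0), with A partitioned into inputs, outputs and internal actions
    states: set
    inputs: set
    outputs: set
    internal: set
    transitions: set        # triples (source, action, target)
    initial: int

    def enabled(self, state):
        """Actions labelling the transitions leaving the given state."""
        return {a for (q, a, _) in self.transitions if q == state}

    def output_quiescent(self, state):
        """A state is output quiescent when it offers no outputs and no internal moves."""
        return not (self.enabled(state) & (self.outputs | self.internal))

SPEC = IOLTS(
    states={0, 2, 4, 6, 7, 8, 9},
    inputs={"a", "b", "c"},
    outputs={"y", "z"},
    internal={"tau"},
    transitions={(0, "tau", 9), (0, "tau", 2), (2, "tau", 0), (2, "c", 6),
                 (2, "b", 4), (4, "z", 2), (4, "tau", 8), (6, "y", 7),
                 (7, "c", 6), (8, "y", 0), (9, "a", 8)},
    initial=0,
)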
Fig. 5. An IOLTS Specification [180]
The IOLTS distinguishes between events of the system that are controllable by the environment and those that are only observable by the environment, that is, inputs and outputs, respectively. Internal actions are also defined. Figure 5 shows an example of an IOLTS specification. We note that output labels have "!" as suffixes, whereas input labels have "?" as suffixes. Internal actions have a τ prefix. States are numbered from 0 to 9. TGV is based on the ioco theory presented in [318]. This theory defines a conformance relation between implementation and specification. It states that a SUT is ioco-correct with respect to its specification if (i) the SUT can never produce an output which could not have been produced by the specification in the same situation (that is, after the same sequence of inputs and outputs), and (ii) the SUT may only be quiescent if the specification can be quiescent as well. Quiescence is defined as the absence of outputs (as in a deadlock or a livelock situation), and it may be observed by timers. Since we cannot reason about a SUT directly, the TGV theory assumes that it can be modelled by an IOLTS SUT = (Q^SUT, A^SUT, →_SUT, q0^SUT) compatible with the alphabets of a specification S = (Q^S, A^S, →_S, q0^S) from which test cases are to be generated, that is, A_I^S ⊆ A_I^SUT and A_O^S ⊆ A_O^SUT. The SUT must be (weakly) input complete, that is, all inputs must be accepted, possibly after internal actions. In other words, the SUT never refuses an invalid input: it either ignores the request or answers negatively. The TGV tool generates abstract test cases from IOLTS specifications guided by a formal test purpose that targets the generation process at particular behaviors. Test purposes are modelled by deterministic and complete (each state allows all actions) IOLTSs extended with marked states: Accept^TP (which define sequences to be included in test cases) and Refuse^TP (which define sequences not to be included in the test cases). The use of Refuse^TP reduces the scope of test generation. Figure 6 shows a test purpose that
Fig. 6. A Test Purpose compatible with the specification of Figure 5 [180]
is compatible with the specification of Figure 5. Sequences of actions in which labels end with 5 or z, before a y, are not to be considered, and, when y occurs, it has to be followed by a z.
Test generation. For the sake of simplicity, we present only an overview of test-case generation as performed by the TGV tool. More details on test-case synthesis and selection algorithms can be found in [180]. Figure 7 shows the general infrastructure in which the TGV tool operates. As input, TGV receives the specification as a model of the system behavior S and a test purpose TP, and produces abstract test cases TC as result. From Figure 7, we see that the first step is to perform a synchronous product between the specification S of the system and the test purpose TP, resulting in a third IOLTS named SP. This solves the problem of identifying the behaviors of S accepted or refused by TP. Basically, SP is an intersection of S and TP that contains only states that are reachable from its initial state, defined as (q0^S, q0^TP) (the initial states of S and TP, respectively), and the transition relation is defined as
(q^S, q^TP) −a→_SP (q'^S, q'^TP) ⟺ q^S −a→_S q'^S ∧ q^TP −a→_TP q'^TP
Fig. 7. TGV infrastructure [180]
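The synchronous product itself is a simple reachability construction over pairs of states. The sketch below (Python, assuming dictionary-encoded transition relations) is a minimal illustration of the SP construction just described; it ignores the handling of internal actions, quiescence and the Accept/Refuse marking that TGV performs.

from collections import deque

def sync_product(spec, tp, spec_init, tp_init):
    """Synchronous product SP of a specification and a (complete) test purpose.
    Both arguments map a state to a list of (action, target) pairs; the result
    contains only the product states reachable from (spec_init, tp_init)."""
    init = (spec_init, tp_init)
    states, transitions = {init}, set()
    queue = deque([init])
    while queue:
        qs, qt = queue.popleft()
        for action, qs2 in spec.get(qs, []):
            for action_tp, qt2 in tp.get(qt, []):
                if action_tp == action:
                    target = (qs2, qt2)
                    transitions.add(((qs, qt), action, target))
                    if target not in states:
                        states.add(target)
                        queue.append(target)
    return states, transitions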
Fig. 8. The suspension automaton of the specification presented in Figure 5 [180]
As TP is complete, all behaviors of S are preserved in SP, including quiescence. As the ioco relation can be established based on observable behaviors, TGV defines another IOLTS which contains only the observable actions of the system and provides a way to explicitly define the quiescent states. This IOLTS is called SPvis. It is built from the suspension automaton of SP, named Δ(SP), which explicitly represents quiescence such as livelock (an infinite sequence of internal loops) and output lock (the system keeps waiting for an input from the environment). Figure 8 presents the suspension automaton of the specification presented in Figure 5. The quiescence transition !δ is added to states 0, 2 and 4 since they are livelocks. Also, this transition is added to states 1, 7 and 9 as they are output quiescent. From the suspension automaton Δ(SP), SPvis is created by a determinization operation, that is, SPvis = det(Δ(SP)). Roughly, this operation returns a deterministic automaton from Δ(SP) that does not include internal actions; its states are meta-states defined as subsets of the states of Δ(SP), and its transitions are defined for each reachable trace between meta-states. For instance, the initial state of the determinization of the suspension automaton in Figure 8 corresponds to the set of states {0, 1, 2, 3}. Also, ?b is fireable from state {0, 1, 2, 3} to state {4, 8}. From SPvis, the selection procedure (see Figure 7) builds another IOLTS, called CTG, representing the complete test graph for the test purpose. The CTG is built through extraction of the accepted behaviors (identified from the accept states) and inversion of inputs and outputs, to provide the link between the SUT and the test cases. Sequences that do not lead to marked states are discarded. Figure 9 shows the CTG generated from the specification in Figure 5 and the test purpose in Figure 6. The CTG has three additional states:
Fig. 9. Complete test graph generated from the specification in Figure 5 and the test purpose in Figure 6 [180]
– Pass, which ends traces that denote acceptable behavior during test-case execution;
– Fail, which ends traces that denote unacceptable behavior during test-case execution; and
– Inconc, which ends traces that are not being considered by the test purpose (sequences that end with the refuse state in SPvis).
Concluding the process, the so-called controllability conflicts are suppressed to obtain the test cases. These controllability conflicts are represented by the presence of a choice between outputs, or between inputs and outputs, in some states. These conflicts are solved by pruning some transitions. In this case, either one output is kept and all other inputs and outputs are pruned, or all inputs are kept and outputs are pruned. Optionally, a test graph TG may be constructed during the selection phase by suppressing some conflicts on-the-fly. An example of a test case selected from the CTG in Figure 9 is shown in Figure 10. The abstract test cases generated by TGV are sound and exhaustive with respect to the ioco relation. A test suite is sound when all correct SUTs and possibly some incorrect SUTs pass it. On the other hand, a test suite is exhaustive when all incorrect SUTs and possibly some correct SUTs fail it [320]. However, the generated CTG may contain an infinite number of possible test cases, and the tool does not provide a mechanism to select the tests to be implemented. In practice, it is difficult to achieve soundness and
Fig. 10. A possible test case from the CTG presented in Figure 9 [180]
exhaustiveness of selected parts of the CTG. Also, the tool does not provide mechanisms to implement and execute the test cases either. However, TGV is part of a wider project, called AGEDIS [161], which aims at the development of a methodology and tools for automated model-driven test generation and execution for distributed systems.
3.3 Annotated Labelled Transition Systems
In this section, we present an LTS model that has been created to represent feature and feature-interruption behavior (see Section 2.2), along with a test-case generation algorithm [70]. The model has been developed for the domain of applications composed of interactive features, such as mobile-phone applications.
The model. There are situations where it is important to distinguish between types of input actions because they may demand a different treatment during test-case generation. For instance, application-level interruptions often require that another application executes in place of the current one, which will resume its execution after the interruption is handled. Also, test execution may demand that some conditions be met. These conditions are often not part of the test scenario itself, but they can be represented as setup information. Annotated Labelled Transition Systems (ALTS) are a special kind of input-output transition system that supports the distinction of input and output actions by using annotations. Formally, an ALTS is a quadruple ⟨Q, R, T, q0⟩, where:
– Q is a countable, non-empty set of states;
– R = A ∪ N is a countable set of labels, where A is a countable set of actions and N is a countable set of annotations;
– T ⊆ Q × R × Q is a transition relation; and
– q0 ∈ Q is the initial state.
Annotations are inserted into the LTS with specific goals: to guide the test-case generation process, by making it easier to focus on particular interruptions; to make it possible for interruption models to be plugged in without interfering with the main model; to guide test-case documentation; to make it possible for conditions to be associated with actions; and to indicate points where interruptions can be reasonably observed externally.
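The following sketch (Python) is a minimal illustration of how such an annotated model might be represented and traversed: each transition carries an annotation and a label, and an edge is used at most once in the path being built, so that loops do not cause an explosion of paths. The labels are adapted from the examples in this chapter purely for illustration, and the sketch is not the LTS-BT implementation; the actual generation algorithm, with its special handling of the interruption annotations, is the one shown in Figure 13 and discussed below.

def extract_paths(alts, root):
    """Enumerate paths of an ALTS encoded as a dict mapping a state to a list
    of (annotation, label, target) triples. Each edge is taken at most once
    per path, so the traversal terminates even in the presence of loops."""
    paths = []

    def dfs(state, path, used):
        extended = False
        for annotation, label, target in alts.get(state, []):
            edge = (state, annotation, label, target)
            if edge in used:
                continue
            extended = True
            dfs(target, path + [(annotation, label)], used | {edge})
        if not extended:                      # leaf, or only already-used edges
            paths.append(path)

    dfs(root, [], frozenset())
    return paths

# A small hypothetical fragment in the spirit of Figure 11:
EXAMPLE = {
    0: [("conditions", "Message Storage is not full", 1)],
    1: [("steps", "Go to inbox", 2)],
    2: [("expectedResults", "Inbox is displayed", 3)],
}
print(extract_paths(EXAMPLE, 0))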
Fig. 11. ALTS for the Remove Message Behavior of a Message Feature
Figure 11 presents an ALTS model that represents the behavior of removing a message from the inbox. Three annotations are used in the ALTS model: steps, conditions, and expectedResults. The steps annotation is used to indicate input actions, the conditions associated with input actions are indicated using the conditions annotation, and finally the expected results are indicated using the expectedResults annotation. Figure 12 shows an ALTS model that represents two features, where a feature named Incoming Message interrupts the flow of execution of the feature in Figure 11 at node 4. This is annotated by using the begininterruption and endinterruption labels. This annotation connects the LTS models of the two feature behaviors and indicates at which point the interruption can happen and what the expected behavior is whether or not the interruption occurs. The idea is that features can be specified separately and then combined at the allowed points of interruption by composing the ALTSs with the interruption annotations.
Test-generation algorithm. A feature test case consists of a path extracted from the test model in the scope of a given feature, without considering the interruptions. On the other hand, a feature-interruption test case is a path that includes an interruption. The same basic algorithm presented in the sequel is used to extract test cases, but its application is different depending on the kind of test case (feature or feature interruption). A path can be obtained from the ALTS model, using Depth First Search (DFS), by traversing an ALTS starting from the initial state. As a general coverage criterion, all labelled transitions need to be covered, that is, all ALTS labelled transitions need to be visited at least once. Since we are considering functional testing, total coverage is a reasonable and feasible goal for feature testing, to guarantee a thorough investigation of the feature functionalities [246]. Nevertheless, this is not always feasible for feature-interruption testing, due to the possibly infinite number of combinations of interruptions that can happen at different points of different features. Therefore, test-case selection techniques must also be applied (this is discussed in Section 4). In Figure 13, the algorithm to generate test cases is shown. For the execution of the algorithm, we use three parameters: vertex is a vertex in the model, indicating the current vertex during the DFS; path is a sequence of edges from the model, indicating the path visited during the DFS; and interruptionModel is a graph G[V, E], used to separate the edges of the feature model and the feature-interruption model already connected. We begin the extraction from the root (the initial node of the ALTS model), checking whether the current vertex indicates the end of a path in the model, which means that a test case has been extracted. In this case, it needs to be recorded. (The LTS-BT tool [70] records the extracted test cases in tables inside a file.) If the current vertex does not indicate the end of a path, then each of its descendants is visited using the DFS algorithm. To visit each descendant, the edge between the current vertex and that descendant is analysed. The search proceeds only if: (i) the edge does not belong to the currently analysed path (that is, it has not been "visited" yet), or (ii) it is an edge from the feature-interruption model (an edge with the endInterruption label). Due to these conditions, two scenarios are encountered:
Fig. 12. ALTS with Interruption
Fig. 13. Test Case Generation Algorithm
Fig. 14. Feature test case
– Conditions (i) and (ii) are not satisfied: the search stops, recording the entire path as a test case. In this case, the recursion step of the algorithm returns to the next branch that needs to be analysed, continuing the DFS algorithm.
– Condition (i) or (ii) is satisfied: the edge between the vertex and its descendant is added to the test case, and the DFS continues until it finds the end of the path, which happens when either a leaf in the graph or an edge going back to the root of the graph is found.
These constraints over the extraction, when using the DFS approach, are required to avoid a burst of paths during the test-case extraction caused by the loops in the test model. This may reduce the number of extracted test cases, but without these constraints the number of paths extracted becomes infeasible, and most of them may be obtained by combining the extracted test cases. Also, practice has shown that these excluded paths generally add redundancy to the test suite, that is, they do not add test cases that would uncover additional faults. To demonstrate the application of the algorithm presented above, we extracted test cases from the test model of Figure 12. Figure 14 presents a feature test case and
Fig. 15. Feature interruption test case
Figure 15 presents a feature-interruption test case. To generate feature test cases, the interruption transitions of the model are not reached, but to generate feature-interruption test cases the search goes through them.
3.4 Process Algebra
While LTSs and Finite State Machines are the main models used to automate test generation, they are very concrete models and are often adopted as the operational semantics of more abstract process algebras like CSP [172,286], CCS [251] and LOTOS [240]. In contexts where a process algebra is adopted as the specification formalism, automated test-case generation usually involves translating a specification into an operational model to generate the test cases, which are themselves expressed in terms of the concrete model, as illustrated in the previous sections. Here we summarise an approach for the automated guided generation of sound test cases [262], using the CSP process algebra. More generally, this approach characterises a testing theory in terms of CSP: test models, test purposes, test cases, test execution, test verdicts, soundness and an implementation relation are defined in terms of CSP processes and refinement notions. Test-case generation, particularly, is achieved via counterexamples of refinement checking, and is mechanised using the CSP model checker FDR [285]. The approach is currently based on the trace semantics of CSP.
The Model. CSP (Communicating Sequential Processes) is a process algebra that provides a rich set of operators to describe the behavior of concurrent and distributed systems. We introduce some of the operators of CSP through the following example.
S0 = t → S9 □ t → S2
S2 = t → S0 □ (c → S6 □ b → S4)
S4 = z → S2 □ t → S4 □ t → S8
S6 = y → S7
S7 = c → S6
S8 = y → S0
S9 = a → S8
SYSTEM = S0 \ {t}
The process SYSTEM captures part of the behavior of the IOLTS presented in Section 3.2 (see Figure 5). The alphabet of a CSP process is the set of events it can communicate. In this chapter, we assume that each process alphabet is split into two disjoint sets (input and output), since the approach to generate test cases from CSP specifications is based on that of Tretmans [320] for IOLTS. Thus, given a CSP process P, its alphabet is αP = αPi ∪ αPo. For instance, αSYSTEMi = {a, b, c} and αSYSTEMo = {y, z}. The fragment c → S6 □ b → S4 of the process S2 uses the external choice operator (□) to indicate that it can communicate c and behave like S6 (S2/⟨c⟩ = y → S7), or communicate b and behave as S4 (S2/⟨b⟩ = z → S2 □ t → S4 □ t → S8). The notation P/s indicates the behavior of the process P after performing the trace s. The process S2/⟨b, t⟩ is non-deterministic, behaving recursively as S4, or as S8; the decision is made internally since S4 offers t in two branches of the choice. The set initials(P) contains the initial events offered by the process P. Thus, initials(S2/⟨b⟩) = {z, t}. Moreover, the special event t is used in this specification exclusively to create non-deterministic behavior using the hiding operator (\): the process S0 \ {t} behaves like S0, but all occurrences of t become internal (invisible) events. Some additional CSP constructs are used in the rest of this section. The process Stop specifies a broken process (deadlock), and Skip a process that communicates an event and terminates successfully. The process P; Q behaves like P until it terminates successfully, when the control passes to Q. The process P |[ Σ ]| Q stands for the generalised parallel composition of the processes P and Q with synchronisation set Σ. This expression states that the processes P and Q have to agree (synchronise) on events that belong to Σ. Each process can evolve independently for events that are not in Σ. The parallel composition P ||| Q represents the interleaving between the processes P and Q. In this case, both processes communicate any event freely. A process used later is RUN(s) = □ e : s • e → RUN(s). The construction
□ e : s • e → P represents the indexed external choice of all events (e) that belong to the set s, which, after communicating e, behaves as P. Therefore RUN recursively offers all the events in s. Trace semantics is the simplest model for a CSP process, and is adopted in this approach to characterise a testing theory. The traces of a process P, represented by traces(P), correspond to the set of all possible sequences of visible events (traces) that P can produce. For instance, traces(S6) is the infinite set that contains the traces ⟨⟩, ⟨y⟩,
⟨y, c⟩∗ and ⟨y, c⟩∗⌢⟨y⟩, where w∗ means zero or more occurrences of w, and s1⌢s2 indicates the concatenation of sequences s1 and s2. It is possible to compare the trace semantics of two processes by a refinement verification. This can be automatically performed with FDR. A process Q trace-refines a process P, which we write as P ⊑τ Q, if, and only if, traces(P) ⊇ traces(Q). For instance, the process S9 refines S0 (S0 ⊑τ S9) since the traces of S9 are contained in those of S0. However, the relation S0 ⊑τ S8 does not hold, since ⟨y⟩ ∈ traces(S8) but ⟨y⟩ ∉ traces(S0).
Test-generation algorithms. This approach introduces cspioco (CSP Input-Output Conformance), the implementation relation that is the basis for the test-case generation strategy. The test hypothesis for this relation assumes that the SUT behaves like some CSP process, say SUT. In this section, we consider SUT to be a CSP process representing the implementation, and S a specification, such that αSi ⊆ αSUTi and αSo ⊆ αSUTo. Informally, if SUT cspioco S, the set of output events of the implementation, after performing any trace s of the specification, is a subset of the outputs performed by S after s. Formally,
SUT cspioco S ⇔ (∀ s : traces(S) • out(SUT, s) ⊆ out(S, s))
where out(P, s) = initials(P/s) ∩ αPo, if s ∈ traces(P), and out(P, s) = ∅, otherwise. The following result establishes that cspioco can be alternatively characterised in terms of process refinement: the relation SUT cspioco S holds if, and only if, the following refinement holds.
S ⊑τ (S ||| RUN(αSUTo)) |[ αSUT ]| SUT
The intuition for this refinement expression is as follows. If we consider an input event that occurs in SUT, but not in S, then on the right-hand side of the refinement, the parallel composition cannot progress through this event, so it is refused. Because refused events are ignored in the traces model, new SUT inputs are allowed by the above refinement. As a consequence, partial specifications are allowed. The objective of the interleaving with the process RUN(αSUTo) is to avoid the right-hand side process refusing output events that the implementation can perform but S cannot. Thus, RUN(αSUTo) allows such outputs to be communicated to SUT. Finally, if SUT can perform such output events, then they appear in the traces of the right-hand side process, which falsifies the traces refinement. In summary, the expression on the right-hand side captures new inputs performed by SUT, generating deadlock from the trace where the input has occurred, in such a way that any event that comes after is allowed. Furthermore, it keeps in the traces all the output events of SUT for the inputs from S, therefore allowing a comparison in the traces model. The rest of this section discusses how to obtain a set of test scenarios from the specification process, and how test cases can be mechanically generated via refinement checking using FDR, where test scenarios correspond to counterexamples of refinement verifications. The first step of the test-selection approach is to mark certain traces of the specification according to a test purpose, also specified as a CSP process. This can be directly achieved using parallel composition. Assuming that there is a test scenario that can be
selected by TP from S, the parallel composition of S with a test purpose TP (the parallel product), with synchronisation set αS, is PP^S_TP = S |[ αS ]| TP. The process TP synchronises on all events offered by S until the test purpose matches a test scenario, when TP communicates a special event mark ∈ MARKS. At this point, the process TP deadlocks, and consequently PP deadlocks as well. This makes the parallel product produce traces ts = t⌢⟨mark⟩, such that t ∈ traces(S), where ts is the test scenario to be selected. Because ts ∉ traces(S), we have that traces(S) ⊉ traces(PP^S_TP). Thus, using FDR, the shortest counterexample for the refinement S ⊑τ PP^S_TP, say ts1, can be produced. If S does not contain scenarios specified by TP, no mark event is communicated, the parallel product does not deadlock, and the relation S ⊑τ PP^S_TP is true. To obtain subsequent test scenarios, we use the function P that receives as input a sequence of events and generates a process whose maximum trace corresponds to the input sequence. For instance, P(⟨a, b, c⟩) yields the process a → b → c → Stop. The reason for using Stop, rather than Skip, is that Stop does not generate any visible event in the traces model, while Skip generates the event ✓. The second counterexample is selected from S using the previous refinement, augmented by the process formed by the counterexample ts1 (that is, P(ts1)) as an alternative to S on the left-hand side. The second test scenario can then be generated as the counterexample to the refinement S □ P(ts1) ⊑τ PP^S_TP. As traces(S □ P(ts1)) is equivalent to traces(S) ∪ {ts1}, ts1 cannot be a counterexample of the second iteration refinement. Thus, if the refinement does not hold again, then we can get a different trace ts2 as a counterexample. The iterations can be repeated until the desired test suite (regarding the test purpose) is obtained. For example, this limit can be based on a fixed number of tests or on some coverage criteria. If the refinement checking of some iteration holds, there are no more test scenarios to select. In general, the (n + 1)th test scenario can be generated as a counterexample of the following refinement.
S □ P(ts1) □ P(ts2) □ ... □ P(tsn) ⊑τ PP^S_TP
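The successive refinement checks can also be pictured operationally when the parallel product is available as an explicit finite transition relation: each counterexample is simply a shortest not-yet-selected trace that ends with a mark event. The sketch below (Python) illustrates this view; it is not FDR and performs no refinement checking, it merely enumerates marked traces breadth-first, shortest first, up to a bound, assuming a dictionary-encoded product.

from collections import deque

MARKS = {"accept.1", "refuse.1"}

def marked_scenarios(pp, initial, limit=5, max_len=12):
    """Enumerate traces of the parallel product PP that end with a mark event,
    shortest first, mimicking the sequence of counterexamples ts1, ts2, ...
    produced by the successive refinement iterations."""
    scenarios = []
    queue = deque([(initial, [])])
    while queue and len(scenarios) < limit:
        state, trace = queue.popleft()
        if len(trace) >= max_len:
            continue
        for event, target in pp.get(state, []):
            extended = trace + [event]
            if event in MARKS:
                scenarios.append(extended)
            else:
                queue.append((target, extended))
    return scenarios[:limit]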
In the context of our example, consider the following test purpose:
TP1 = ANY({y}, UNTIL({z}, ACCEPT(1)))
      □ ANY({z}, REFUSE(1))
      □ NOT(αSYSTEM, {y, z}, TP1)
The process TP1 is defined in terms of some primitive processes (namely, ANY, NOT, ACCEPT, REFUSE and UNTIL) that facilitate the definition of test purposes [262]. Initially, TP1 offers three choices. Each one can be chosen depending on the events offered by the specification. If y is offered, it is selected, and TP1 behaves as the process UNTIL({z}, ACCEPT(1)), which selects acceptance scenarios where eventually z is offered. If z is offered, but y has not yet been offered, then TP1 behaves as REFUSE(1), which selects refusal scenarios. Finally, if the offered events differ from y and z, TP1 restarts the selection and behaves as TP1.
The parallel product between the specification SYSTEM and the test purpose TP1 is the process PP^SPEC_TP1 = SYSTEM |[ αSYSTEM ]| TP1. The set of traces of PP^SPEC_TP1 is similar to the set of traces of SYSTEM, except for the traces that end with the mark events accept.1 and refuse.1. For instance, the trace ⟨a, y, b, z, accept.1⟩ belongs to traces(PP^SPEC_TP1) but not to traces(SYSTEM). Thus, in the first iteration, the refinement verification yields ts1 = ⟨b, z, refuse.1⟩ as the first counterexample. Using ts1, we have the second selection refinement iteration, which does not hold and yields ts2 = ⟨a, y, b, z, accept.1⟩ as the next counterexample. There are infinitely many test scenarios that can be obtained from S using TP1, since the specification is recursive and has infinite acceptance and refusal traces. Besides the uniformity of expressing the entire theory in terms of a process algebra, there are other potential benefits of this approach.
– As opposed to LTSs, CSP models that represent the specification can naturally evolve in an abstract formalism. This is particularly relevant in a context where complex models are built from simpler ones that model isolated (or individual) features.
– Implementation conformance to a specification and test-case generation can be directly expressed and mechanised in terms of process refinement.
– Test purposes are specified in terms of CSP processes, which allows them to be composed and easily combined and extended.
– The generated test cases (also represented as CSP processes) can be combined to form more coarse-grain test cases, or split to give rise to finer-grain test cases.
Some previous approaches have addressed test generation in the context of CSP [274,294,73]. All these approaches focus on the formalisation of conformance relations. The approach summarised here goes beyond that, addressing guided test selection and generation using refinement checking, the implementation of a tool that supports the approach (the ATG tool), and practical use of the approach through cooperation with Motorola Inc. A related research topic is to explore more elaborate CSP semantic models, such as failures-divergences. Such models enable the definition of more robust implementation relations that, in addition to traces, consider nondeterminism and quiescence (for example, deadlock or livelock) as observations of testing. Another research direction is to explore test-generation approaches that capture component interaction. This might potentially emphasise the nature of an approach entirely based on a process algebra, which offers a rich repertoire of operators to express several patterns of concurrency and communication, contrasting with strategies based on more operational models (like LTSs or finite state machines) that do not record the application architecture.
4 Test-Case Selection Strategies

In this section, we present test-case selection strategies that can be applied to guide an automatic test-case generation process. Even though test-case generation is always based on coverage criteria, mostly structural ones, the number of test cases that can be selected is usually huge. Also, as mentioned before, not all test cases are relevant,
and automatically generated test suites tend to include redundant test cases. The focus here is on test-purpose selection, random selection and similarity-based selection.

4.1 Test Selection Based on Test Purpose

Test purposes describe behaviors that we wish to observe by testing a SUT. This is a widely used strategy, notably developed in the TGV tool [180] and its test-case synthesis algorithms, as mentioned in Section 3.2. Theoretical foundations of test purposes are presented in [332]; these are related to the formal testing framework proposed in [320]. In this section, we give a brief overview of the fundamental concepts of test purposes and of the properties of test cases generated from them, and we also present a tool that supports test-purpose definition and application.

Formal test purposes [332]. Testing for conformance and testing from test purposes, also called property-oriented testing, have different goals [225]. The former aims to accept or reject a given implementation. The latter aims to observe a desired behavior that is not necessarily directly related to a required behavior or to correctness. If the behavior is observed, then confidence in correctness may increase; otherwise, no definite conclusion can be reached. This strategy is usually implemented to reduce the scope of the model from which test cases are generated.

Test purposes are related to implementations that are able to exhibit them by a well-chosen set of experiments. This is defined by the relation exhibits ⊆ IMPS × TOBS, where IMPS is the universe of implementations and TOBS is the universe of test purposes. To reason about exhibition, we also need to consider a test hypothesis by defining the reveal relation rev ⊆ MODS × TOBS, where MODS is the universe of models, so that:

∀ e ∈ TOBS · SUT exhibits e ⇔ i_SUT rev e

with i_SUT ∈ MODS the model of the SUT. A verdict function H_e decides whether a test purpose is exhibited by an implementation: H_e : P(OBS) → {hit, miss}. Then,

SUT hits e by t_e =def H_e(exec(t_e, SUT)) = hit

This is extended to a test suite T_e as

SUT hits e by T_e =def H_e(⋃{exec(t, SUT) | t ∈ T_e}) = hit

where exec is a procedure that represents test-case execution.

A test suite T_e that is e-complete can distinguish between all exhibiting and non-exhibiting implementations, so that SUT exhibits e if, and only if, SUT hits e by T_e. A test suite is e-exhaustive when it can detect only non-exhibiting implementations (that is, SUT exhibits e implies SUT hits e by T_e), whereas a test suite is e-sound when it can detect only exhibiting implementations (SUT exhibits e if SUT hits e by T_e). We note that there is a similarity in purpose between sound test suites and e-sound test suites, even though the implications are relatively inverted: the former can reveal the presence of faults, whereas the latter can reveal intended behavior.
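The definitions above can be read operationally. The sketch below is a simplification for illustration only: the observation format and the particular verdict function H_e are assumptions made for the example (observations are plain traces, and the purpose is simply "event z occurs"); it is not taken from [332].

```python
# Sketch of the "hits" relation: a verdict function H_e maps the set of
# observations collected from executing a test suite to hit or miss.
# The observation format and this particular H_e are assumptions: here an
# observation is a trace (tuple of events), and H_e yields "hit" if any
# observed trace contains the event "z".

def exec_test(test_case, sut):
    """Stand-in for test execution: run the SUT on a test case and return
    the set of observations (here, a single observed trace)."""
    return {tuple(sut(test_case))}

def H_e(observations):
    """Hypothetical verdict function for the purpose e = 'event z occurs'."""
    return "hit" if any("z" in trace for trace in observations) else "miss"

def hits(test_suite, sut):
    """SUT hits e by T_e  iff  H_e( U {exec(t, SUT) | t in T_e} ) = hit."""
    observations = set()
    for t in test_suite:
        observations |= exec_test(t, sut)
    return H_e(observations) == "hit"

# A toy SUT that simply echoes its input trace.
sut = lambda test_case: test_case
print(hits([["a", "y"], ["b", "z"]], sut))   # True: the purpose is hit
print(hits([["a", "y"]], sut))               # False: miss
```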
Conformance and exhibition can be related. The goal is to consider test purposes in test selection so as to obtain test suites that are both sound and e-complete. On the one hand, e-soundness guarantees that a hit result always implies exhibition. On the other hand, e-exhaustiveness guarantees that implementations that exhibit the purpose are not rejected. Soundness provides us with the ability to detect non-conforming implementations. Unlike complete test suites, e-complete test suites are feasible in practice; for instance, an algorithm to construct them for LTSs is presented in [332].

Finally, we observe that a test purpose may be revealed by both conforming and non-conforming implementations. An ideal situation, though not a practical one, would be to consider a test purpose e only when {i | i rev e} ⊇ {i | i passes T}, where T is a test suite and passes relates implementations to the sets of test cases that they pass. In practice, test purposes are chosen so that {i | i rev e} ∩ {i | i imp s} ≠ ∅. In this case, a test execution with a test case T_{s,e} that is both sound and e-complete and that results in fail means non-conformity, since sound test cases do not reject conforming implementations, and e-complete test cases distinguish between all exhibiting and non-exhibiting implementations. Also, if the result is {pass, hit}, confidence in correctness is increased, as the hit provides possible evidence of conformance.

Test-purpose selection. The LTS-BT tool [70] provides a graphical interface from which test purposes can be selected to guide the generation of e-complete test suites. Test purposes are extracted from the ALTS model under consideration, so the tester does not need to write them by hand or deal directly with these models. For each ALTS model, the LTS-BT tool loads the set of possible labelled transitions, that is, conditions, steps and expected results (see Section 3.3). The user chooses a test purpose in order to restrict the ALTS model, obtaining a filtered ALTS model as the result.

The test-purpose notation is a sequence of transitions. In this sequence, the "*" (asterisk) can appear at the beginning or between transitions and stands for any transition. The sequence must finish with either an Accept node (meaning that the user wants all test cases that comply with the test purpose) or a Reject node. Some examples of test purposes for the ALTS test model shown in Figure 11 are given below.

1. *, Go to inbox, *, Accept. This means all test cases that contain "Go to inbox". In this case, the model is not filtered.
2. *, Message Storage is full, *, Accept. This means all test cases that finish with "Message Storage is full".
3. *, Message Storage is full, *, Reject. This means all test cases that do not contain "Message Storage is full".

The use of test purposes is particularly interesting for interruption testing, since the possible number of combinations of interruptions can be huge and make full coverage infeasible. In practice, this kind of testing is only considered for particular interruptions at specific "critical" points of a feature. Therefore, test purposes constitute a valuable tool for model-based interruption test designers.
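The sketch below gives one possible reading, in Python, of the notation just described; it is not the LTS-BT implementation, and the example paths are assumptions made for illustration. A purpose body is a sequence of labels possibly interleaved with "*", ending in Accept or Reject; a path is taken to comply with the purpose if its labels contain the non-asterisk items in order, with "*" standing for any (possibly empty) gap.

```python
# Illustrative matcher for test purposes written as label sequences with
# "*" wildcards, ending in "Accept" or "Reject". One possible reading of
# the notation described above; not the LTS-BT implementation.

def matches(path, purpose_body):
    """True if the labels in `path` contain the non-'*' items of
    `purpose_body` in order; '*' matches any (possibly empty) gap."""
    required = [item for item in purpose_body if item != "*"]
    pos = 0
    for label in path:
        if pos < len(required) and label == required[pos]:
            pos += 1
    return pos == len(required)

def select(paths, purpose):
    """Apply a purpose ending in Accept (keep matches) or Reject (drop them)."""
    *body, verdict = purpose
    if verdict == "Accept":
        return [p for p in paths if matches(p, body)]
    return [p for p in paths if not matches(p, body)]

# Hypothetical paths, labelled as in the examples above.
paths = [
    ["Go to inbox", "Select message", "Message is displayed"],
    ["Go to outbox", "Message Storage is full"],
]
print(select(paths, ["*", "Go to inbox", "*", "Accept"]))
print(select(paths, ["*", "Message Storage is full", "*", "Reject"]))
```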
1. *, "Hot Message" folder is displayed, Send a message from phone 2 to phone under test, *, Accept This test purpose considers all possible behaviors of the interruption after the “"Hot Message" folder is displayed” result. In this case, the model is not filtered. 2. *, "Hot Message" folder is displayed, Send a message from phone 2 to phone under test, *, Reject This test purpose does not consider test cases with the interruption after the “"Hot Message" folder is displayed” result. As a result, we obtain the model in Figure 11. 3. *, "Hot Message" folder is displayed, Send a message from phone 2 to phone under test, *, Message is Displayed, *, Accept This test purpose considers all possible behaviors of the interruption after the “"Hot Message" folder is displayed”. In other words, this test purpose focus on the interruption at a specific point and also in the particular behavior of the interruption in which the message is displayed. 4.2 Random versus Deterministic Test Selection From usage models such as the one presented in Section 3.1, specific test cases can be selected among the possible ones that can be generated. This can be done either based on random walks on the model or by deterministic choice. As mentioned before, usage models can give rise to a huge number of test cases if only structural coverage criteria is considered. However, use profiles can be incorporated into these models by probabilities that are associated with transitions, characterising the most and less frequently traversed paths. Random selection based on this kind of probability distribution can produce test cases that have high probability of causing failures, contributing to operational reliability. On the other hand, crafted, non-random deterministic test cases are also of importance. They are usually scenarios of execution that are of interest to stakeholders whose value may not be captured by the use profile defined. They can often be used to validate the use model and the use profile (probabilities distribution) being considered. However, it is important to remark that deterministic selection alone is not effective to assure coverage: test cases are usually prone to a particular, usually partial, view of the application. In fact, a combination of both random and deterministic strategies is recommended in practice. In the next subsection, an alternative strategy to random selection is presented. It addresses the problem of guaranteeing the best coverage of a model by selecting the less similar test cases, with random choice applied to solve the problem of discarding similar (according to a certain degree) test cases. 4.3 Test Selection Based on Similarity Function The main goal of this selection strategy is to reduce the number of test cases and minimise redundancy [72]. This is inspired by the work of Jin-Cherng Lin and his colleagues [213].
The strategy is based on a target percentage of the total number of test cases that can be generated. A similarity criterion is used to reach this percentage by eliminating similar test cases. Based on the similarity degree between each pair of test cases, the candidates for elimination are the pairs with the highest similarity degree, and one test case of such a pair is eliminated. The one chosen for elimination is the test case with the shortest path, that is, the smallest number of functionalities (to maximise the chances of keeping a good coverage of the model).

From an ALTS model, we build a similarity matrix over the paths. This matrix is defined as follows.

– The matrix is n × n, where n is the number of paths, and each row and column represents one path.
– The elements are defined as a_ij = SimilarityFunction(i, j). This function, presented below, is calculated by observing the number of identical transitions nit, that is, transitions whose source state, target state and label are the same. The function also divides by the average path length avg(|i|, |j|), so that test-case lengths are balanced with respect to similarity; that is, two short test cases are not erroneously considered less similar than two long test cases that have more transitions in common but are proportionally less similar.

SimilarityFunction(i, j) = nit / avg(|i|, |j|)

Figure 16 presents an example of this selection strategy applied to an LTS model. In the example, there are four paths. Figure 17 shows the similarity matrix and the lengths of the paths. Considering that the percentage-of-coverage criterion is 50%, two test cases must be eliminated.
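The similarity function and the elimination step can be prototyped directly from the definitions above. The sketch below is an illustration of the strategy, not the implementation evaluated in [72]; the path data in the usage example is made up. A path is a list of transitions (from-state, label, to-state); the most similar pair is repeatedly reduced by discarding its shorter member (ties broken randomly) until the target percentage of paths remains.

```python
import random

# Sketch of similarity-based selection over ALTS paths. A path is a list
# of transitions (from_state, label, to_state). Illustration only.

def similarity(p1, p2):
    """Number of identical transitions divided by the average path length."""
    common = len(set(p1) & set(p2))
    return common / ((len(p1) + len(p2)) / 2.0)

def select_by_similarity(paths, coverage):
    """Keep `coverage` (e.g. 0.5) of the paths, repeatedly discarding one
    member of the currently most similar pair: the shorter one, with ties
    broken randomly."""
    paths = list(paths)
    target = max(1, round(coverage * len(paths)))
    while len(paths) > target:
        i, j = max(
            ((a, b) for a in range(len(paths)) for b in range(a + 1, len(paths))),
            key=lambda ab: similarity(paths[ab[0]], paths[ab[1]]),
        )
        shorter = i if len(paths[i]) < len(paths[j]) else \
                  j if len(paths[j]) < len(paths[i]) else random.choice([i, j])
        paths.pop(shorter)
    return paths

# Hypothetical paths used only to exercise the sketch.
a = [(0, "x", 1), (1, "y", 2)]
b = [(0, "x", 1), (1, "z", 3)]
c = [(0, "x", 1), (1, "z", 3), (3, "w", 4)]
d = [(0, "q", 5)]
print(select_by_similarity([a, b, c, d], 0.5))   # keeps the two least similar
```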
Fig. 16. Possible paths from an LTS model
Fig. 17. (x) Similarity Matrix, (y) Paths length, (z) Similarity Matrix after elimination
Observing the matrix in Figure 17(x), we note that (b) and (c) have the biggest similarity. As they have the same length (the path lengths can be seen in Figure 17(y)), the one to be eliminated is chosen randomly, for example (b). Figure 17(z) presents the matrix after the elimination of (b). From Figure 17(z), we have that (c) and (d) now have the biggest similarity. As |(c)| > |(d)|, (d) is eliminated. We note that the eliminated test cases are the most similar ones: (b) is very similar to (c), and (c) is very similar to (d). Of course, for this simple example, it is clearly possible to execute all test cases, but in an actual setting (with cost and time restrictions), 100% coverage may be infeasible.

In order to assess this strategy, experiments have been conducted. The results presented in the sequel are from an experiment reported in [72] with three different applications. In this experiment, similarity-based selection was compared with random selection in order to assess the percentage of transition coverage achieved by the two strategies. A path coverage of 50% was fixed, and both strategies were applied 100 times (due to the random choices). The applications chosen are all reactive systems with the following profile.

1. A cohesive feature for adding contacts to a mobile phone's contact list.
2. A message application that deals with embedded items. An embedded item can be a URL, a phone number, or an e-mail address. For each embedded item, it is possible to execute some tasks.
3. The TARGET tool, an application that generates test cases automatically from use-case scenarios.

The results obtained for Application 1 are presented in Figure 18. The "y-axis" represents the number of transitions removed, whereas the "x-axis" represents each of the 100 times both algorithms were executed on the test suite generated by considering the 50% path-coverage goal. For this application, similar test cases of similar length have been found; in this case, the similarity technique also involved random choices. Generally, for Application 1, similarity performed better, with only a few runs in which random choice was equally or more effective.

The results obtained for Application 2 are presented in Figure 19. This application has "disjoint" groups of similar test cases, since it is composed of different features. Therefore, the performance of the similarity technique was better than that of random choice: the least similar test cases have been kept for each group of similar test cases.
Fig. 18. Application 1: Similarity versus Random choice with 50% path coverage
Fig. 19. Application 2: Similarity versus Random choice with 50% path coverage
Finally, for Application 3 (see Figure 20), the performance of the similarity technique was much better. In this case, the application is composed of different features with test cases of different lengths: similarity was applied directly, without the need for random choice between test cases of the same size.

Further experiments have been conducted to assess similarity-based selection more precisely and also to identify the contexts in which it is most indicated. For instance, Cartaxo et al. [71] present a more elaborate experiment that considers both transition and fault-based coverage criteria, with path coverage varying from 5% to 95%. The results show more clearly the benefits of the similarity strategy over random selection with respect to the ability of the selected test suite to detect more faults and cover more transitions.
Fig. 20. Application 3: Similarity versus Random choice with 50% path coverage
5 Test-Model Generation from Use-Case Specifications

Business analysts tend to adopt informal representations, based on natural-language descriptions, to record the results of requirements elicitation. While natural languages seem an appropriate notation for documents at the requirements level, they are also a potential cause of ambiguities and inconsistencies, as is well known from software engineering practice. Before progressing towards analysis and design activities, more precise models of the application need to be developed. Ideally, a formal model of the application should be developed to serve as a consistent reference for requirements validation and for verification of the correctness of candidate implementations, or even to generate correct implementations in a constructive way. Complementarily, formal models can be useful for the automatic generation of test cases, as discussed in the previous sections.

In order to benefit from both a natural-language description (to capture requirements) and a formal model (to serve as a basis for the development process, and particularly for test-case generation), templates are proposed in [65] to write use cases. These templates can capture functionality at several abstraction levels, ranging from the interaction between the user and the application (in a black-box style) to the detailed interaction and synchronisation of system components. The text used to fill the templates obeys a fixed grammar; we call it a Controlled Natural Language (CNL). The CNL can be considered a domain-specific language (for mobile applications), which fixes some relevant verbs and complements; this is the key to allowing automatic processing of the templates and of the text inside them. Besides making it possible to verify whether the text conforms to the defined CNL grammar, this also permits the generation of a formal model from the use-case templates.

In the following sections we introduce the use-case templates in some detail and present a strategy to generate ALTS test models from use cases; finally, we consider some related work.
5.1 Use-Case Templates

As we are concerned with the generation of functional test cases, here we concentrate on user-view templates, which capture user interaction with the system through sentences used to describe user actions, system states and system responses. An example is presented in Figure 21, which describes a use case of a simple phone-book application.

In the domain of mobile applications, use cases are usually grouped to form a feature. Each feature contains an identification number. For instance, the use case in Figure 21 can be considered part of Feature 11111 - My Phonebook. Although this grouping is convenient, it is not imposed by the strategy.

As we can observe in Figure 21, each use case has a number, a name and a brief description. A use case specifies different scenarios, depending on user inputs and actions. Hence, each execution flow represents a possible path that the user can perform. Execution flows are categorised as main, alternative or exception flows. The main execution flow represents the use case's happy path, which is a sequence of steps where everything works as expected. In our example, the main flow captures the successful insertion of a new contact in the phone book.

An alternative execution flow represents a choice situation: during the execution of a flow (typically the main flow) it may be possible to execute other actions. If an action from an alternative flow is executed, the system continues its execution behaving according to the new path specification. Alternative flows can also begin from a step of another alternative flow; this enables reuse of the specification. In our example, the alternative flow allows the insertion of more detailed information related to the contact, in addition to a name and a phone number. The exception flows specify error scenarios caused by invalid input data or unexpected system states. Alternative and exception flows are strictly related to user choices and to system state conditions; the latter may cause the system to respond differently given the same user action. In our example, the exception flow describes the failure to include a contact due to memory overflow.

Each flow is described in terms of steps. The tuple (user action, system state, system response) is called a step, and every step is identified through an identifier, an Id. The user action describes an operation accomplished by the user; depending on the system's nature, it may be as simple as pressing some button or a more complex operation, such as printing a report. The system state is a condition on the actual system configuration just before the user action is executed. Therefore, it can be a condition on the current application configuration (setup) or memory status. The system response is a description of the result of the operation after the user action occurs, based on the current system state. As an example, consider the step identified as 4M in the main flow in Figure 21. The user action concerns the confirmation of the contact creation. However, this depends on the availability of memory. If the condition holds, then the expected result is that the contact is effectively inserted in the phone book.

A user action, a system state or a system response may be related to named system requirements. This is useful for traceability purposes, relating both use cases and the generated test cases with requirements. When requirements change, it is possible to know which use cases might be impacted and, if it is the case, update them.
UC 01 - Creating a New Contact
Description: This use case describes the creation of a new contact in the contact list.

Main Flow
Description: Create a new contact
From Step: START  To Step: END

Step Id | User Action | System State | System Response
1M | Start My Phonebook application. | | My Phonebook application menu is displayed.
2M | Select the New Contact option. | | The New Contact form is displayed.
3M | Type the contact name and the phone number. | | The new contact form is filled.
4M | Confirm the contact creation. [TRS_11111_101] | There is enough phone memory to insert a new contact. | The next message is highlighted.

Alternative Flows
Description: Extended information to the contact
From Step: 3M  To Step: 4M

Step Id | User Action | System State | System Response
1A | Go to context menu and select Extended Information. | | The extended information form is displayed. [TRS_111166_102]
2A | Fill some of the extended information fields. | | Some of the extended information form is filled.
3A | Press the OK softkey. | | The phone goes back to New Contact form. It is filled with the extended information.

Exception Flows
Description: There is not enough memory
From Step: 3M, 3A  To Step: END

Step Id | User Action | System State | System Response
1E | Confirm the contact creation. | There is not enough phone memory. | A dialog is displayed informing that there is not enough memory. [TRS_111166_103]
2E | Select the OK softkey. | | The phone goes back to My Phonebook application menu.

Fig. 21. Example of a user view use case
Test cases related to these use cases can also be updated or regenerated (assuming an automatic approach). Furthermore, requirement information can be used to filter the test-case generation.

There are situations in which a user can choose between different paths; when this happens, it is necessary to define one flow for each path. Every execution flow has one or more starting points, or initial states, and one final state. The starting point is represented by the From step field and the final state by the To step field. The From step field can assume more than one value, meaning that the flow can be triggered from different source steps. When one of the From step items is executed, the first step of the specified execution flow is executed. As an example, consider the exception flow in Figure 21, which may start after step 3M of the main flow or after step 3A of the alternative flow. The To step field references only one step; after the last step of an execution flow is performed, control passes to the step defined in the To step field.

In the main flow, whenever the From step field is defined as START, this use case does not depend on any other, so it can be the starting point of the system usage, as illustrated in our example. Alternatively, the main flow From step field may refer to steps of other use cases, meaning that it can be executed after a sequence of events has occurred in the corresponding use case. When the To step field of any execution flow is set to END, the flow terminates successfully after its last step is executed, as illustrated by the main flow of our example. Subsequently, the user can execute another use case whose From step field is set to START.

The From step and To step fields are essential to define the application navigation, which allows the generation of a formal model such as an LTS [69], as discussed in previous sections. These two fields also enable the reuse of existing flows when new use cases are defined; a new scenario may start from a pre-existent step of some flow. Finally, loops can appear in the specification if direct or indirect circular references between flows are defined; this can result in a livelock situation in the case of infinite loops.

The user-view use case in Figure 21 is an example of a mobile-phone functionality. Nevertheless, the template is generic enough to permit the specification of any application, not only mobile-phone ones. The user-view use case has the main characteristics of other use-case definitions, such as UML use cases [290]. However, our template seems to offer more flexibility. The existence of execution flows starting and ending according to other execution flows makes it possible to associate use cases in a more general way than through regular UML associations such as extend, generalisation and include. References enable the reuse of parts of other use cases' execution flows and the possibility of defining loops, so use cases can collaborate to express more complex functionalities.

5.2 Generating ALTS Models

In this section, we present a strategy for translating use-case templates into an ALTS from which test cases can be generated. The general translation procedure is shown in Figure 22 and is explained below.
Fig. 22. Procedure that translates use case templates to an ALTS
– Each template of the use case, starting from the main flow, is processed sequentially and, from each step, states and transitions are created in the target ALTS according to the order of steps defined in the template. This is controlled by the two for loops.
– currentState represents the state from which transitions are to be created for the current step. This is either: i) the last state created, in case the From Step field is defined as START or this is the first state; or ii) the last state of a given step (defined in the From Step field) of another template.
– From Step and To Step guide the connection of the traces created by each of the templates. For this, the Step Id label is associated with the states that are created from it (see the State constructor).
– User Action, System State and System Response become transitions that are preceded by steps, conditions and expectedresults annotations, respectively.
– States are created as new transitions need to be added. These are incrementally numbered from 0. States and transitions are created by the add operation, but states already created can be reused when connecting the traces of new templates. When possible, addToStep (To Step is different from END), addFromStep (From Step is different from START) and addNewCondition (a new condition is considered based on a user action already added) are used instead.
– Duplicated annotated transitions from the same state are also avoided. This can happen when two or more steps are possible from the same state, or when two or more conditions define expected results for a single user action.
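The following Python sketch gives a much-simplified rendering of this procedure. Like the procedure in Figure 22, it handles a single From Step and To Step per flow; the data layout and names are assumptions for illustration, not the actual implementation, and it does not attempt the state-reuse and duplicate-avoidance optimisations described above.

```python
# Much-simplified sketch of the use-case -> ALTS translation described
# above. A flow is a dict with 'from_step', 'to_step' and a list of steps
# (id, user_action, system_state, system_response). Names and layout are
# assumptions for illustration only.

def build_alts(flows):
    transitions = []           # (source, annotation, label, target)
    before = {}                # Step Id -> state just before the step
    after = {}                 # Step Id -> state just after the step
    counter = [0]

    def fresh():
        counter[0] += 1
        return counter[0]

    for flow in flows:
        src = flow["from_step"]
        current = 0 if src == "START" else after[src]
        steps = flow["steps"]
        for index, (step_id, action, condition, response) in enumerate(steps):
            before[step_id] = current
            s = fresh()
            transitions.append((current, "steps", action, s))
            current = s
            if condition:                       # optional guard on the step
                s = fresh()
                transitions.append((current, "conditions", condition, s))
                current = s
            dest = flow["to_step"]
            last = index == len(steps) - 1
            if last and dest != "END" and dest in before:
                target = before[dest]           # rejoin the referenced flow
            else:
                target = fresh()
            transitions.append((current, "expectedresults", response, target))
            current = target
            after[step_id] = current
    return transitions

# Usage with the first two steps of the main flow of Figure 21.
main = {"from_step": "START", "to_step": "END", "steps": [
    ("1M", "Start My Phonebook application.", None,
     "My Phonebook application menu is displayed."),
    ("2M", "Select the New Contact option.", None,
     "The New Contact form is displayed."),
]}
for t in build_alts([main]):
    print(t)
```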
The ALTS generated from the templates presented in Figure 21 is shown in Figure 23. We observe that, due to the lack of space, some transition labels are not completely presented. The templates are connected at state 13, which corresponds to the next step after the 3M step in the main flow, and at state 15, which corresponds to the state leading from the conditions annotation that combines the main flow and the exception flow. For the sake of simplicity, the procedure presented in Figure 22 only considers the existence of one From Step and one To Step. Therefore, From Step 3A in the exception flow is not directly handled. However, this behavior happens to be included in the ALTS in Figure 23.
Fig. 23. ALTS model generated from the Creating a New Contact use case
5.3 Final Considerations

Rather than building specifications in an ad hoc way, some approaches in the literature have also explored the derivation of formal specifications from requirements. ECOLE [295] is a look-ahead editor for a controlled language called PENG (Processable English), which defines a mapping between English and First-Order Logic in order to verify requirements consistency. A similar initiative is the ACE (Attempto Controlled English) project [128], which is also concerned with natural-language processing for specification validation through logic analysis. The work reported in [174] establishes a mapping between English specifications and finite state machine models. In industry, companies,
such as Boeing [350], use a controlled natural language to write manuals and system specifications, improving document quality. There are also approaches that use natural language to specify system requirements and automatically generate formal specifications in an object-oriented notation [203].

Concerning the format in which requirements are written, use cases describe how entities (actors) cooperate by performing actions in order to achieve a particular goal. There is some consensus regarding use-case structure and writing method [84]: a use case is specified as a sequence of steps forming system-usage scenarios, and natural language is used to describe the actions taken in a step. This format makes use cases suitable for a wide audience.

This section has presented a strategy that automatically translates use cases, written in a Controlled Natural Language, into ALTS specifications. For obvious reasons, it is not possible to allow full natural language as the source; rather, a subset of English with a fixed grammar (the CNL) has been adopted. The context of this work is a cooperation between CIn-UFPE and Motorola, known as the Brazil Test Center, whose aim is to develop strategies to support the testing of mobile-device applications. Therefore, the proposed CNL reflects this application domain.

Unlike the cited approaches, which focus on translation at a single level, the work reported in [65] proposes use-case views possibly reflecting different levels of abstraction of the application specification; however, rather than adopting ALTS as the target test model, that work generates CSP test models. This is illustrated in [65] through a user view and a component view. A refinement relation between these views is also explored; the use of CSP is particularly relevant in this context, since its semantic models and refinement notions allow formal relations between user and component views to be captured precisely. The approach is entirely supported by tools. A plug-in for Microsoft Word 2003 [307] has been implemented to check the adherence of use-case specifications to the CNL grammar. Another tool has been developed to automate the translation of use cases written in CNL into CSP; FDR [285], a CSP refinement checker, is used to check refinement between the user and component views.
6 The TARGET Tool

The aim of this section is to present the TARGET tool, whose purpose is to mechanise a test-case generation strategy that supports the steps presented in the previous sections. In particular, TARGET accepts as input use-case scenarios written in CNL (as addressed in Section 5) and generates test cases, also written in CNL, which include the test procedure, a description and related requirements. Moreover, the tool can generate a traceability matrix relating test cases, use cases and requirements.

The purpose of TARGET is to aid test engineers in the creation of test suites. Three major aspects distinguish TARGET from other model-based testing tools: the use of test purposes, provided by the test engineer, to restrict the number of generated test cases as needed, focusing on test cases that are more critical or relevant to a given task; algorithms for the elimination of similar test cases, reducing the test suite size without significant impact on effectiveness; and the use of natural-language use cases as system input, which is more natural for engineers when compared to formal specification languages.
Internally, TARGET translates CNL use-case scenarios into LTSs in order to generate test cases, possibly guided by test purposes that are also translated into LTSs. Nevertheless, the use of LTSs is completely hidden from test engineers. Some of the facilities provided by TARGET to support test-case generation from use-case scenarios are as follows.

– processing of use-case templates and automatic generation of LTS test models;
– automatic purpose-based generation of test suites, with adequate coverage criteria, from LTS models;
– reduction of test suites by eliminating test cases according to a notion of similarity and user interaction;
– automatic generation of traceability matrices: requirements versus test cases, requirements versus use cases, and test cases versus use cases;
– a friendly user interface for editing test purposes and for guiding the user in generating test cases and reducing the size of test suites.

Although the tool was originally motivated by mobile-phone applications, it does not depend on any particular platform or domain. TARGET has already been used in practice for generating test cases for mobile-phone as well as desktop applications. For instance, it has been used to generate test cases to test the implementation of the tool itself.

6.1 Using the Tool

TARGET allows the user to create, open, close and refresh projects; open, rename and delete artifacts (use-case documents and test suites); import use-case documents; and generate test suites, through the menu options presented in Figure 24, which shows a project with the phonebook feature and two use cases. The left panel contains the use-cases view, in which the use cases are outlined. It is shown in a tree structure, grouping the use cases according to their feature. The bottom panel includes three views:

– The artifacts view outlines the project artifacts. It groups, in two different folders, the use-case documents and the test-suite documents. These artifacts can be edited, removed or renamed.
– The errors view lists all errors in the use-case documents. Each error is composed of the description, the resource where the error occurred, the path of the resource, and the error location inside the resource.
– The search view displays the results of a search operation. It lists all use cases that are found according to the given query. Each result is composed of the use-case name, the use-case identification, the identification of the feature that contains the use case (feature ID) and the name of the document that contains the use case (document name).

The main panel is used to display the artifact contents. It shows use-case documents in HTML format, as the use-case scenario presented in the main panel of Figure 24.
Fig. 24. The main screen of TARGET
The major functionality of TARGET is the automatic generation of test cases. This facility provides the user with two generation features: test suites can be generated for all possible scenarios or for some specific requirements. In either case, TARGET offers two features that allow the user to select the test cases in the output test suite: defining test purposes, choosing the path coverage of the test suite based on similarity, or both.

As explored in the previous section, a test purpose is an abstract description of a subset of the specification, allowing the user to choose behaviors to test and consequently reducing the specification exploration. In TARGET, test purposes are incomplete sequences of steps. Figure 25 illustrates the definition of test purposes. The notation is similar to a regular-expression language. The test purpose in the figure selects all the test cases of use case UC_02 (Searching a Contact) of feature 11111 (My Phonebook) that include steps 1M and 2B. As another example of a test purpose, we can use the sequence 11111#UC_01#1M;*;11111#UC_01#4M to select only the test cases that begin with step 1M and end with step 4M of the use case Creating a New Contact.

As we can observe in Figure 25, a test purpose is composed of a sequence of comma-separated steps. Each step can be chosen from a list of use-case steps, in the Steps field. When a step is chosen, TARGET automatically appends it to the Current Test Purpose field. The user can define one or more test purposes through the window presented in Figure 25. If more than one test purpose is defined, TARGET combines the sets of test cases generated for each purpose. As test purposes are defined, they are inserted in the Created Test Purposes field.
Fig. 25. Test purpose and path coverage selection criteria
Another selection criterion is based on similarity, which is set using the horizontal scale in Figure 25. Setting the scale to 100% means that all test cases will be included in the output test suite; this setting therefore preserves all the test cases selected by the previous selection functionalities (selection by requirements and test purposes). On the other hand, to reduce the test suite by removing, for instance, the 30% most similar test cases, the user must set the scale to 70%, indicating the intention to keep the 70% most distinct test cases. Details about the similarity algorithm have been presented in the previous section.

The generated test-suite files are created with a default name in the test suites folder. Each test case is presented in a template whose fields also obey the CNL rules. An example of a test case in the suite generated for the test purpose of Figure 25 is presented in Figure 26.
Fig. 26. Test case generated by TARGET
The generated test-case format has the following structure:

– a Use Cases field, which lists all use cases related to the test case;
– a Requirements field, which is filled with all requirements covered by the test case;
– an Initial Conditions field, which is filled with all system states covered by the test case;
– a Procedure field, which is a sequence of user actions from the use cases (see the use-case structure);
– an Expected Results field, which is the sequence of system responses for the corresponding actions (as in the use-case templates).

There are additional fields (Setup, Final Conditions and Cleanup) whose contents are not currently generated by TARGET. The first has to do with the mobile-phone configuration for the execution of the test case, the second defines a postcondition on the system state established by the test case, and the third resets the configuration setup after the test-case execution.

The test-suite generation also includes the automatic generation of traceability matrices, which relate requirements, use cases and test cases. The traceability matrices are included in the output Microsoft Excel document as spreadsheets, one spreadsheet for each traceability matrix. TARGET provides the following traceability matrices:
Automatic Test-Case Generation
101
– The Requirements x Use Cases matrix links each requirement to its related use cases.
– The Requirements x Test Cases matrix links each requirement to the test cases that cover it.
– The Use Case x Test Cases matrix links each use case to the related test cases; by related test cases, we mean that at least one step from the use case is contained in the test case.

These matrices are illustrated by Figures 27, 28 and 29.
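Assuming each generated test case records the use cases and requirements it covers (as in the fields listed above), these matrices amount to a straightforward inversion of that information. The sketch below illustrates the derivation; it is not TARGET's spreadsheet-generation code, and the test-case identifiers are made up.

```python
# Sketch: deriving traceability matrices from generated test cases. Each
# test case is assumed to record the use cases and requirements it
# covers; the data below is hypothetical.

def traceability(test_cases):
    req_x_tc, uc_x_tc = {}, {}
    for tc in test_cases:
        for req in tc["requirements"]:
            req_x_tc.setdefault(req, set()).add(tc["id"])
        for uc in tc["use_cases"]:
            uc_x_tc.setdefault(uc, set()).add(tc["id"])
    return req_x_tc, uc_x_tc

tests = [
    {"id": "TC_001", "use_cases": ["UC_01"], "requirements": ["TRS_11111_101"]},
    {"id": "TC_002", "use_cases": ["UC_01"], "requirements": ["TRS_111166_103"]},
]
req_matrix, uc_matrix = traceability(tests)
print(req_matrix)   # {'TRS_11111_101': {'TC_001'}, 'TRS_111166_103': {'TC_002'}}
print(uc_matrix)    # {'UC_01': {'TC_001', 'TC_002'}}
```

The Requirements x Use Cases matrix can be derived analogously from the requirement annotations in the use-case documents.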
Fig. 27. Requirement x Use Case matrix
Fig. 28. Requirement x Test Case matrix
6.2 Final Considerations

TARGET has been developed in Java, based on the Eclipse Rich Client Platform. It depends on some open-source software (Apache Lucene, Jakarta POI and AspectJ) and on the .NET Framework. The input use cases must be written in Microsoft Word 2003. A stable version of the tool has been developed, and four case studies have already been run with Motorola Brazil Test Center teams. These case studies are related to mobile-phone features.
Fig. 29. Use Case x Test Case matrix
As already mentioned, the tool has been used to generate test cases to test the implementation of TARGET itself. All case studies indicated a reduction in test-development effort. A detailed analysis of two of the case studies is presented in [261]. Interruption testing is going to be addressed in the next versions.

Using TARGET, shorter test cycles are expected due to the automation of test design. However, this requires more detailed use cases, which take longer to specify. Still, overall, experience has shown a reduction of around 50% in test-cycle time. A more significant decrease in cycle time is expected for maintenance activities; maintaining the use cases and then generating updated test cases is potentially much more productive than maintaining test cases directly. Instead of having people maintain and create a large number of test cases, the idea is to concentrate efforts on properly describing a much smaller number of use cases and then have the tool generate several test cases from each use case.

The use of TARGET, like that of other test-case generation tools, potentially reduces redundancy and ambiguity, both in use cases and in generated test cases. The tool also helps to achieve better coverage, since more tests can be generated in less time and with less effort, and test selection can be guided by test purposes. The automatic generation of test cases avoids errors in test cases and allows test-case document inspection to focus on more significant semantic issues. TARGET is restricted for use inside Motorola and its partners.
7 Concluding Remarks

This chapter discusses automatic test-case generation in MBT processes. Test models and test-generation strategies, mostly devoted to reactive systems, are presented, including test models such as Markov chains, LTSs and process algebras. Test-model generation is also considered, by introducing a strategy for generating LTSs from use-case specifications.

This chapter has a special focus on MBT in the domain of mobile-phone applications. Even though the use of TARGET is not limited to this domain, the tool has been primarily developed to support work in this context, as a result of a research cooperation between our team and Motorola Inc. The tool has been inspired by the challenges of a real testing setting, where MBT is to be integrated with other testing processes for increasing productivity and effectiveness in testing.
The construction of TARGET, along with experience from the case studies in which the tool has been applied, has confirmed and also uncovered many issues concerning the advantages of MBT and the challenges it still has to face for widespread use. As main advantages, MBT indeed reduces the overall effort of test-case construction and provides adequate coverage of requirements. It is important to remark, however, that test-model generation, rather than construction from scratch, has been a differential: test designers can use a controlled natural language to document requirements and do not need to be concerned with writing models. Another advantage is that test suites can be regenerated by modifying the requirements documentation and by applying the test-selection strategies that best suit a particular testing goal.

As a main drawback, MBT may generate more test cases than are actually needed; for instance, test cases may be redundant. This can be handled by automatic test-selection strategies such as the ones presented in this chapter, which are implemented in TARGET. Also, MBT may not generate important test cases that would have been written by an experienced test designer in the domain. This often happens due to so-called "implicit" domain requirements, which are not usually explicitly modelled as requirements for an application. Both problems have been addressed by our research team, so that effective solutions can be incorporated in future versions of the TARGET tool and its processes.

MBT is currently addressed by a number of research groups worldwide. It is a promising strategy, since it combines the notion of abstraction that is so important in testing with automation and reuse. Also, MBT can make an engineering approach to functional testing possible, where systematic procedures can be undertaken for more reliability, effectiveness and productivity in testing. However, this is not a universal recipe. Finding the right context of application and levels of testing, as well as addressing obstacles such as the ones mentioned above, is determinant for success. Furthermore, even though a number of tools and theories have already been developed, the practice of MBT is still recent and immature. Lessons are still to be learned, and theoretical problems still need to be investigated. As satisfactory progress is made on these, effective practices will emerge.

In this chapter, we have concentrated on test-case generation from operational models, particularly from LTSs. We have also been investigating test-case generation from the CSP process algebra, as briefly addressed in Section 3.4. While for isolated features the results seem very similar, for applications that involve feature interaction a process-algebraic test model might be a promising alternative to explore, since its rich repertoire of operators allows capturing elaborate patterns of component architectures and selecting test cases confined to the interaction of a subset of the components.
Testing a Software Product Line

John D. McGregor

Clemson University, USA
The software product line approach to the development of software-intensive systems has been used by organizations to improve quality, increase productivity and reduce cycle time. These gains require different approaches to a number of the practices in the development organization, including testing. The planned variability that facilitates some of the benefits of the product line approach poses a challenge for test-related activities. This chapter provides a comprehensive view of testing at various points in the software development process and describes specific techniques for carrying out the various test-related tasks. These techniques are illustrated using a pedagogical product line developed by the Software Engineering Institute (SEI).

We first introduce the main challenges addressed by this chapter; Section 2 then overviews basic software product line concepts. Section 3 explains, in a product line context, several testing concepts introduced in Chapter 1. Section 4 complements this by introducing Guided Inspections, a technique that applies the discipline of testing to the review of non-software assets typically found in software product lines. The core of the chapter is Section 5, which describes techniques that can be used to test product lines. Section 6 discusses how we can evaluate the benefits of a product line approach for testing, and Section 7 illustrates some of the presented techniques. Finally, we end with a discussion of related issues and research questions (Section 8) and conclusions (Section 9) about product line testing and its relation to development.
1 Introduction

Organizations are making the strategic decision to adopt a product line approach to the production of software-intensive systems. This decision is often in response to initiatives within the organization to achieve competitive advantage within its markets. The product line strategy has proven successful at helping organizations achieve aggressive goals for increasing quality and productivity and reducing cycle time. The strategy is successful, at least in part, due to its comprehensive framework that touches all aspects of product development.

Testing plays an important role in this strategic effort. In order to achieve the overall goals of increased productivity and reduced cycle time, there need to be improvements to traditional testing activities. These improvements include the following:

– Closer cooperation between development and test personnel
– Increased throughput in the testing process
– Reduced resource consumption
– Additional types of testing that address product-line-specific faults.
There are several challenges for testing in an organization that realizes seventy-five to ninety-five percent of each product from reusable assets. Among these challenges are the following:

– variability: The breadth of the variability that must be accommodated in the assets, including test assets, directly impacts the resources needed for adequate testing.
– emergent behavior: As assets are selected from the inventory and combined in ways not anticipated, the result can be an interaction that is a behavior not present in any one of the assets being combined. This makes it difficult to have a reusable test case that covers the interaction.
– creation of reusable assets: Test cases and test data are obvious candidates for reuse and, when used as is, they are easy to manage. The amount of reuse achieved in a project can be greatly increased by decomposing the test assets into finer-grained pieces that are combined in a variety of ways to produce many different assets. The price of this increased reuse is increased effort for planning the creation of the assets and for managing the increased number of artifacts.
– management of reusable assets: Reuse requires traceability among all of the pieces related to an asset, to understand what an asset is, where it is stored and when it is appropriate for use. A configuration management system provides this traceability through explicit artifacts.

We discuss several activities that contribute to the quality of the software products that comprise the product line, thinking of them as forming a chain of quality in which quality-assuring activities are applied in concert with each production step in the software development process. In addition to discussing the changes in traditional testing processes needed to accommodate the product line approach, we present a modified inspection process that greatly increases the defect-finding power of traditional inspection processes. This approach applies testing techniques to the conduct of an inspection.

We use a continuing example throughout the chapter to illustrate the topics and then summarize the example near the end of the chapter. The Arcade Game Maker product line is an example product line developed for pedagogical purposes (the complete example is available at http://www.sei.cmu.edu/productlines/ppl). A complete set of product line assets is available for this example product line. The product line consists of three games: Brickles, Pong and Bowling. The variation points in the product line include the operating system on which the games run, a choice of an analog, digital or no scoreboard, and whether the product has a practice mode.

The software product line strategy is a business strategy that uses a specific method to achieve its goals. The material in this chapter reflects this orientation by combining technical and managerial issues. We briefly introduce a comprehensive approach to software product line development and provide a state-of-the-practice summary. Then we describe the current issues, detail some experiences and outline research questions regarding the test-related activities in a software product line organization.
2 Software Product Lines

A software product line is a set of software-intensive systems sharing a common, managed set of features that satisfy the specific needs of a particular market segment or
mission and that are developed from a common set of core assets in a prescribed manner [83]. This definition has a number of implications for test strategies. Consider these key phrases from the definition:

– set of software-intensive systems: The product line is the set of products. The product line organization is the set of people, business processes and other resources used to build the product line. The commonalities among the products will translate into opportunities for reuse in the test artifacts. The variabilities among products will determine how much testing will be needed.
– common, managed set of features: Test artifacts should be tied to significant reusable chunks of the products, such as features. These artifacts are managed as assets in parallel with the production assets to which they correspond. This will reduce the effort needed to trace assets for reuse and maintenance purposes. A test asset is used whenever the production asset with which it is associated is used.
– specific needs of a particular market segment or mission: There is a specified domain of interest. The culture of this domain will influence the priorities of product qualities and, ultimately, the levels of test coverage. For example, a medical device that integrates hardware and software requires far more evidence of the absence of defects than the latest video game. Over time, those who work in the medical-device industry develop a different view of testing and other quality assurance activities from workers in the video game domain.
– common set of core assets: The test core assets include test plans, test infrastructures, test cases and test data. These assets are developed to accommodate the range of variability in the product line. For example, a test suite constructed for an abstract class is a core asset that is used to quickly create tests for concrete classes derived from the abstract class.
– in a prescribed manner: There is a production strategy and a production method that define how products are built. The test strategy and test infrastructure must be compatible with the production strategy. A production strategy that calls for dynamic binding imposes similar constraints on the testing technology.

The software product line approach to development affects how many development tasks are carried out. Adopting the product line strategy has implications for the software engineering activities as well as for the technical and organizational management activities. The Software Engineering Institute has developed a Framework for Software Product Line Practice (a service mark of Carnegie Mellon University), which defines 29 practice areas that affect the success of the product line. A list of these practices is included in the Appendix. The practices are grouped into Software Engineering, Technical Management and Organizational Management categories. These categories reflect the dual technical and business perspectives of a product line organization. A very brief description of the relationship between the Testing practice area and the other 28 practices is included in the list in the appendix.

The software product line approach seeks to achieve strategic levels of reuse. Organizations have been able to achieve these strategic levels using the product line approach, while other approaches have failed, because of the comprehensive nature of the product line approach. For example, consider the problem of creating a “reusable”
implementation of a component. The developer is left with no design context, just a general notion of the behavior the component should have, and the manager is not certain how to pay for the 50%–100% additional cost of making the component reusable instead of purpose-built. In a software product line, the qualities and properties required for a product to be a member of the product line provide the context. The reusable component only has to work within that context. The manager knows that the product line organization that owns all of the products is ultimately responsible for funding the development. This chapter presents testing in the context of such a comprehensive approach.

2.1 Commonality and Variability

The products in a product line are very similar but differ from each other (otherwise they would be the same product). The points at which the products differ are referred to as variation points. Each possible implementation of a variation point is referred to as a variant. The set of products that can be built by taking various combinations of the variants defines the scope of the product line. Determining the appropriate scope of the product line is critical to the success of the product line in general and of the testing practice specifically. The scope constrains the possible variations that must be accounted for in testing. Too broad a scope will waste testing resources if some of the products are never actually produced. Too vague a scope will make test requirements either vague or impossible to write.

Variation in a product is nothing new. Control structures allow several execution paths to be specified but only one to be taken at a time. By changing the original inputs, a different execution path is selected from the existing paths. The product line approach adds a new dimension: from one product to another, the paths that are possible change. In the first type of variation, the path taken during a specific execution changes from one execution to the next, but the control flow graph does not change just because different inputs are chosen. In the second type of variation, different control flow graphs are created when a different variant is chosen. This adds the need for an extra step in the testing process, sampling across the product space, in addition to sampling across the data space. It is manageable only because of the use of explicit, pre-determined variation points and automation. Simply taking an asset and modifying it in any way necessary to fit a product will quickly introduce the level of chaos that has caused most reuse efforts to fail.

The commonality among the products in a product line represents a very large portion of the functionality of these products and the largest opportunity to reduce the resources required. In some product lines this commonality is delivered in the form of a platform, which may provide virtually identical functionality to every product in the product line; a selected set of features is then added to the platform to define each product. In other cases, each product is a unique assembly of assets, some of which are shared with other products. These different approaches will affect the choice of test strategy.

The products that comprise a product line have much in common beyond the functional features provided to users. They typically attack similar problems, require similar technologies and are used by similar markets. The product line approach facilitates the
exploitation of the identified commonality even to the level of service manuals and training materials [92]. The commonality/variability analysis of the product line method produces a model that identifies the variation points needed in the product line architecture to support the range of requirements for the products in the product line. Failure to recognize the need for variation in an asset will require custom development and management of a separate branch of code that must be maintained until a future major release. Variability has several implications for testing: – Variation is identified at explicit, documented variation points: Each of these points will impose a test obligation in terms either of selecting a test configuration or test data. Analysis is required at each point to fully understand the range of variation possible and the implications for testing. – Variation among products means variation among tests: The test software will typically have at least the same variation points as the product software. Constraints are needed to associate test variants with product variants. One solution to this is to have automated build scripts that build both the product and the tests at the same time. – Differences in behavior between the variants and the tests should be minimal: We show later that using the same mechanisms to design the test infrastructure as are used to design the product software is usually an effective technique. Assuming the product mechanisms are chosen to achieve certain qualities, the test software should seek to achieve the same qualities. – The specific variant to be used at a variation point is bound at a specific time: The binding time for a variation point is one specific attribute that should be carefully matched to the binding of tests. Binding the tests later than the product assets are bound is usually acceptable but not the reverse. – The test infrastructure must support all the binding times used in the product line: Dynamically bound variants are of particular concern. For example, binding aspects [191] to the product code as it executes in the virtual machine may require special techniques to instrument test cases. – Managing the range of variation in a product line is essential: Every variant possibility added to a variation point potentially has a combinatorial impact on the test effort. Candidate variants should be analyzed carefully to determine the value added by that variant. Test techniques that reduce the impact of added variants will be sought as well. Commonality also has implications for testing: – Techniques that avoid retesting of the common portions can greatly reduce the test effort – The common behavior is reused in several products making an investment in test automation viable. So it is also explored in this chapter.
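One of the implications above is that the test software typically carries the same variation points, bound in step with the product software. The following is a minimal sketch of that idea in Java; the class, the scoreBoard system property and the variant names are assumptions made for illustration only, not artifacts of the case study.

import org.junit.Assert;
import org.junit.Test;

// Illustrative sketch only: one configuration property binds the same
// variation point in the product-side factory and in the test, so the
// test variant always matches the variant the product was built with.
public class ScoreBoardVariationTest {

    interface ScoreBoard { String render(int score); }

    static class DigitalScoreBoard implements ScoreBoard {
        public String render(int score) { return Integer.toString(score); }
    }

    static class AnalogScoreBoard implements ScoreBoard {
        public String render(int score) { return "dial:" + score; }
    }

    // Product-side variation point, bound from a build-time property,
    // for example -DscoreBoard=analog
    static ScoreBoard createScoreBoard() {
        String variant = System.getProperty("scoreBoard", "digital");
        return "analog".equals(variant) ? new AnalogScoreBoard()
                                        : new DigitalScoreBoard();
    }

    @Test
    public void configuredVariantRendersTheScore() {
        // The test reads the same property, so product and test are bound
        // to the same variant at the same time.
        String variant = System.getProperty("scoreBoard", "digital");
        ScoreBoard board = createScoreBoard();
        String expected = "analog".equals(variant) ? "dial:42" : "42";
        Assert.assertEquals(expected, board.render(42));
    }
}

An automated build script that sets the property once and then builds both the product and this test realizes the suggestion above of building product and tests at the same time.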
2.2 Planning and Structuring Planning and structuring are two activities that are important to the success of the product line. A software product line is chartered with a specific set of goals. Test planning takes these high level goals into account when defining the goals for the testing activities. The test assets will be structured to enhance their reusability. Techniques such as inheritance hierarchies, aspect-oriented programming and template programming provide a basis for defining assets that possess specific attributes including reusability. Planning and structuring must be carried out incrementally. To optimize reuse, assets must be decomposed and structured to facilitate assembly by a product team. One goal is to have incremental algorithms that can be completed on individual modules and then be more rapidly completed for assembled subsystems and products using the partial results. Work in areas such as incremental model checking and component certification may provide techniques for incremental testing [334]. A product line organization defines two types of roles: core asset builder and product builder. The core asset builders create assets that span a sufficient range of variability to be usable across a number of products. The core assets include early assets such as the business case, requirements and architecture and later assets such as the code. Test assets include plans, frameworks and code. The core asset builder creates an attached process for each core asset. The attached process describes how to use the core asset in building a product. The attached process may be a written description, provided as a cheatsheet in Eclipse or as a guidance element in the .NET environment, or it may be a script that will drive an automated tool. The attached process adds value by reducing the time required for a product builder to learn how to use the core asset. The core asset builder’s perspective is: create assets that are usable by product builders. A core asset builder’s primary trade-off is between sufficient variability to maximize the reuse potential of an asset and sufficient commonality to provide substantial value in product building. The core asset developer is typically responsible for creating all parts of an asset. This may include test cases that are used to test the asset during development and can be used by product developers to sanity test the assets once they are integrated into a product. The product builders construct products using core assets and any additional assets they must create for the unique portion of product. The product builder provides feedback to the core asset builders about the usefulness of the core assets. This feedback includes whether the variation points were sufficient for the product. The product builder’s perspective is: achieve the required qualities for their specific product as rapidly as possible. The product builder’s primary trade-off is between maximizing the use of core assets in building the product and achieving the precise requirements of their specific product. In particular, product builders may need a way to select test data to focus on those ranges critical to the success of their product. To coordinate the work of the two groups, a production strategy is created that provides a strategic overview of how products will be created from the core assets. A production method is defined to implement the strategy. 
The method details how the core asset developers should build and test the core assets so that product builders can achieve their objectives.
Fig. 1. Test points
A more in-depth treatment of software product lines can be found in [83].
3 Testing Overview As discussed throughout this book, testing is the detailed examination of an artifact guided by specific information. The examination is a search for defects. Program failures signal that the search has been successful. How hard we search depends on the consequences if a defect is released in a product. Software for life support systems will require a more thorough search than word processing software. The purpose of this section is to give a particular perspective on testing at a high level, both recalling and relating to the product line context concepts discussed in Chapter 1. We discuss some of the artifacts needed to operate a test process. After that we present a perspective on the testing role and then briefly describe fault models. 3.1 Testing Artifacts This definition of testing encompasses a number of testing activities that are dispersed along the software development life cycle. We refer to the places where these activities are located as test points. Figure 1, similar to what is illustrated in Chapter 1, shows the set of test points for a high-level view of a development process. This sequence is intended to establish the chain of quality. As mentioned in Chapter 1, the IEEE 829 standard [176] defines a number of testing artifacts, which will be used at each of the test points in the development process. Several of these test assets are modified from their traditional form to support testing in a
product line. To better understand the necessary modifications, we now recall some of the concepts discussed in Chapter 1 and relate them to the product line context: – test plan: A description of what testing will be done, the resources needed and a schedule for when activities will occur. Any software development project should have a high level end-to-end plan that coordinates the various specific types of tests that are applied by developers and system testers. Then individual test plans are constructed for each test that will be conducted. In a software product line organization a distinction is made between those plans developed by core asset builders that will be delivered to the product builders of every product and the plans that the product builder derives from the product line plan and uses to develop their product specific plan. The core asset builders might provide a template document or a tool that collects the needed data and then generates the required plan. – test case: A single configuration of test elements that embodies a use scenario for the artifact under test. In a software product line, a test case may have variation points that allow it to be configured for use with multiple products. – test data: All of the data needed to fully describe a scenario. The data is linked to specific test cases so that it can be reused when the test case is. The test data may have variation points that allow some portion of the data to be included or excluded for a particular use. – test report: A summary of the information resulting from the test process. The report is used to communicate to the original developers, management and potentially to customers. The test report may also be used as evidence of the quantity and quality of testing in the event of litigation related to a product failure. The primary task in testing is defining effective test cases. To a tester, that means finding a set of stimuli that, when applied to the artifact under test, exposes a defect. We discuss several techniques for test case creation but all of them rely on a fault model, as briefly discussed in Chapter 1 and explored in more detail in Chapter 8. There will be a fault model for each test point. We focus on faults related to being in a product line in Section 1. 3.2 Testing Perspective The test activities, particularly the review of non-software assets, are often carried out by people without a traditional testing background. Unit tests are usually carried out by developers who pay more attention to creating than critiquing. A product line organization will provide some fundamental training in testing techniques and procedures but this is no substitute for the perspective of an experienced tester. In addition to training in techniques, the people who take on a testing role at any point in any process should adopt the “testing perspective”. This perspective guides how they view their assigned testing activities. Each person with some responsibility for a type of testing should consider how these qualities should affect their actions: – Systematic: Testing is a search for defects and an effective search must be systematic about where it looks. The tester must follow a well-defined process when they
are selecting test cases so that it is clear what has been tested and what has not. For example, coverage criteria are usually stated in a manner that describes the “system.” The “all branches” level of test coverage means that test cases have been created for each path out of each decision point in the control flow graph.
– Objective: The tester should not make assumptions about the work to be tested. Following specific algorithms for test case selection removes any of the tester’s personal feelings about what is likely to be correct or incorrect. “Bob always does good work, I do not need to test his work as thoroughly as John’s,” is a sure path to failure.
– Thorough: The tests should reach some level of coverage of the work being examined that is “complete” by some definition. Essentially, for some classes of defects, tests should look everywhere those defects could be located. For example, test every error path if the system must be fault tolerant.
– Skeptical: The tester should not accept any claim of correctness until it has been verified by an acceptable technique. Testing boundary conditions every time eliminates the assumption that “it is bound to work for zero.”
3.3 Fault Models
A fault model is the set of known defects that can result from the development activities leading to the test point. Faults can be related to several different aspects of the development environment. For example, programs written in C and C++ are well known for null pointer errors. Object-oriented design techniques introduce the possibility of certain types of defects such as invoking the incorrect virtual method [268,10]. Faults are also the result of the development process and even the organization. For example, interface errors are more likely to occur between modules written by teams that are non-co-located.
The development organization can develop a set of fault models that reflect its unique blend of process, domain and development environment. The organization can incorporate existing models, such as Chillarege’s Orthogonal Defect Classification, into the fault models developed for the various test points [77]. Testers use fault models to design effective and efficient test cases, since the test cases are specifically designed to search for defects that are likely to be present.
Developers of safety-critical systems construct fault trees as part of a failure analysis. A failure analysis of this type usually starts at the point of failure and works back to find the cause. Another type of analysis is conducted as a forward search. These types of models capture specific faults. We want to capture fault types, or categories of faults. The models are further divided so that each model corresponds to specific test points in the test process.
A product line organization develops fault models by tracking defects and classifying them to provide a definition of each defect and the frequency with which it occurs. Table 1 shows some possible faults per test point related to a product line strategy. Others can be identified from the fault trees produced from safety analyses conducted on product families [94,93,218].
Table 1. Faults by test point

Test point | Example faults
Requirements | incomplete list of variations
Analysis | missing constraints on variations
Architecture design | contradictions between variation constraints
Detailed design | failure to propagate variations from subsystem interface
Unit testing | failure to implement expected variations
Integration testing | mismatched binding times between modules
System testing | inability to achieve required configuration
3.4 Summary Nothing that has been said so far requires that the asset under test be program source code. The traditional test points are the units of code produced by an individual developer or team, the point at which a team’s work is integrated with that of other teams, and the completely assembled system. In section 4 we present a review/inspection technique, termed Guided Inspection, that applies the testing perspective to the review of non-software assets. This technique can be applied at several of the test points shown in Figure 1. This adds a number of test points to what would usually be referred to as “testing” but it completes the chain of quality. In section 5 we present techniques for the other test points. In an iterative, incremental development process the end of each phase may be encountered many times so the test points may be exercised many times during a development effort. Test implementations must be created in anticipation of this repetition.
4 Guided Inspection Guided Inspection is a technique that applies the discipline of testing to the review of non-software assets. The review process is guided by scenarios that are, in effect, test cases. This technique is based on Active Reviews by Parnas and Weiss [273]. An inspection technique is appropriate for a chapter about testing in a product line organization because the reviews and inspections are integral to the chain of quality and because the test professionals should play a key role in ensuring that these techniques are applied effectively. 4.1 The Process Consider a detailed design document for the computation engine in the arcade game product line. The document contains a UML model complete with OCL constraints as well as text tying the model to portions of the use case model that defines the product line. A Guided Inspection follows the steps shown in Figure 2, a screenshot from the Eclipse Process Framework Composer. The scenarios are selected from the use case model shown in Figure 3.
Fig. 2. Guided inspection description
Fig. 3. Use case diagram
For this example, we picked “The Player has begun playing Brickles. The puck is in play and the Player is moving the paddle to reflect the puck. No pucks have been lost and no bricks hit so far. The puck and paddle have just come together on this tick of the clock.” as a very simple scenario that corresponds to a test pattern discussed later.
Fig. 4. Central class diagram
The GameBoard shown in the class diagram in Figure 4 is reused, as is, in each game. It serves as a container for GamePieces. According to the scenario, the game is in motion so the game instance is in the moving state, one of the states shown in Figure 8. The sequence diagram shown in Figure 5 shows the action as the clock sends a tick to the gameboard which sends the tick on to the MovableSprites on the gameboard. After each tick the gameboard invokes the “check for collision” algorithm shown in Figure 6. The collision detection algorithm detects that the puck and paddle have collided and invokes the collision handling algorithm shown in Figure 7. In the inspection session, the team reads the scenario while tracing through the diagrams to be certain that the situation described in the scenario is accurately represented in the design model, looking for problems such as missing associations among classes and missing messages between objects. The defects found are noted and in some development organizations would be written up as problem reports. Sufficient scenarios are created and traced to give evidence that the design model is complete, correct and consistent. Coverage is measured by the portions of diagrams, such as specific classes in a class diagram, that are examined as part of a scenario. One possible set of coverage criteria, listed in order of increasing coverage, includes: – a scenario for each end-to-end use case, including “extends” use cases – a scenario that touches each “includes” use case
Fig. 5. Basic running algorithm
Fig. 6. Collision detection algorithm
Fig. 7. Collision handling algorithm
– a scenario that touches each variation point
– a scenario that uses each variant of each variation point.
Guided Inspection is not the only scenario-based evaluation technique that can be employed. The Architecture Tradeoff Analysis Method (ATAM) developed by the SEI also uses scenarios to evaluate the architecture of a product line [28]. That technique looks specifically at the quality attributes, that is, the non-functional requirements, that the architecture is attempting to achieve.
The benefit of a Guided Inspection session does not stop with the many defects found during the inspections. The scenarios created during the architecture evaluation and detailed design will provide an evolution path for the scenarios to be used to create executable integration and system test cases.
4.2 Non-functional Requirements
Software product line organizations are concerned about more than just the functional requirements for products. These organizations want to address the non-functional requirements as early as possible. Non-functional requirements, sometimes called quality attributes, include characteristics such as performance, modifiability and dependability. So and others describe a technique for using scenarios to begin testing for performance early in the life of the product line [284].
Both Guided Inspection and the ATAM provide a means for investigating quality attributes. During the “create scenarios” activity, scenarios elicited from stakeholders describe desirable product behavior in terms of user-visible actions. The ArchE tool
Fig. 8. Simple state diagram
(available for download at http://www.sei.cmu.edu/architecture/arche.html), developed at the SEI, aids in reasoning about these attributes. It provides assistance to the architect in trading off among multiple, conflicting attributes.
5 Testing Techniques for a Product Line The test practice in a product line organization encompasses all of the knowledge about testing necessary to operate the test activities in the organization. This includes the knowledge about the processes, technologies and models needed to define the test method. We first discuss testing as a practice since this provides a more comprehensive approach than just discussing all of the individual processes needed at the individual test points [179]. Then we use the three phase view of testing developed by Hetzel [167]—planning, construction and execution—to structure the details of the rest of the discussion. In his keynote to the Software Product Line Testing Workshop (SPLiT), Grütter listed four challenges to product line testing [150]: – Meeting the product developer’s quality expectations for core assets – Establishing testing as a discipline that is well-regarded by managers and developers – Controlling the growth of variability – Making design decisions for testability. We incorporate a number of techniques that address these challenges. 5.1 Test Method Overview The test practice in an organization encompasses the test methods, coordinated sets of processes, tools and models, that are used at all test points. For example, “test first development” is an integrated development and testing method that follows an agile development process model and is supported by tools such as JUnit. The test method for a product line organization defines a number of test processes that operate independently of each other. This is true for any project but in a product
line organization these processes are distributed among the core asset and product teams and must be coordinated. Often the teams are non-co-located and communicate via a variety of mechanisms.
The test method for a project must be compatible with the development method. The test method for an agile development process is very different from the test method for a certifiable software (FAA, FDA, and others) process. The rhythm of activity must be synchronized between the test and development methods. Tasks must be clearly assigned to one method or the other. Some testing tasks such as unit testing will often be assigned to development staff. These tasks are still defined in the testing method so that expectations for defect search, including levels of coverage and fault models, can be coordinated.
There is a process defined for the conduct of testing at each of the test points. This process defines procedures for constructing test cases for the artifacts that are under test at that test point. The test method defines a comprehensive fault model and assigns responsibility for specific defect types to each test point. The product line test plan should assign responsibility for operating each of these test processes. An example model of the testing practice for a software product line organization is provided elsewhere [243].
5.2 Test Planning
Adopting the software product line approach is a strategic decision. The product line organization develops strategies for producing products and testing them. The strategies are coordinated to ensure an efficient, effective operation. One of the challenges to testing in a product line organization is achieving the same gains in productivity as product production so that testing does not prevent the organization from realizing the desired improvements in throughput. Table 2 shows frequently desired production and product qualities and corresponding testing actions.

Table 2. Product qualities and testing actions

To achieve | Use
faster | automation and iteration
cheaper | automation and differentiated work force
better | more thorough coverage
mass customization | combinatorial testing

Table 3. Organization by test point

Test point | Core asset builders | Product developers
Unit testing | Major responsibility: test to reusability level | Minor responsibility: test for specific functionality
Integration testing | Shared responsibility: n-way interactions tested | Shared responsibility: existing interactions tested
System testing | Minor responsibility: test example products | Major responsibility: test the one product

The faster time to market is achieved by eliminating the human tester in the loop as much as possible. Short iterations from test to debug/fix will keep cycle time shorter. Production is cheaper and faster when the human is out of the loop. The product line strategy also supports having a cheaper, more domain-oriented set of product builders. These product builders could be less technical and paid less if the core assets are designed to support a more automated style of product building and testing. The quality of the products can be made better because more thorough coverage can be achieved over time. Finally, the product line core asset base supports mass customization but requires combinatorial test techniques to cover the range of variability without having the test space explode.
The production planning activity begins with the business goals of the product line and produces a production
plan that describes the production strategy and the production method [74]. The testing activities in the product line organization are coordinated with the product development activities by addressing both production and testing activities during production planning. The plan talks about how core assets will be created to support strategic reuse and how products will be assembled to meet the business goals of the organization. The production method is a realization of the production strategy and defines how products are created from the core assets. The core asset development method is structured to produce core assets that facilitate product building. The production method specifies the variation mechanisms to be used to implement the variation points. These mechanisms determine some of the variation mechanisms used in the test software. Table 3 shows the division of testing responsibilities between the core asset developers and the product builders for code-based assets. In the earlier test points, shown in Figure 1, the core asset developers have the major responsibility while the product builders do incremental reviews. The product line test plan maps the information in Table 3 into a sequence of testing activities. The IEEE standard outline for a test plan provides the entries in the leftmost column in Table 4. We have added comments for some of the items that are particularly important in a product line environment. These comments apply to the product line wide test plan that sets the context for the individual plans for each core asset and product. The plan illustrates the concurrency possible in the test activities where the core asset team is testing new core assets while products built from previous assets are being tested by the product teams. Defining the test method. Once the production strategy has been defined, the product building and test methods can be developed. The method engineer works with testers and developers to specify the processes, models and technologies that will be used during the various test activities. For example, if a state-based language is chosen as the programming language in the production method, then testing focuses on states and the notion of a “switch cover”, a set of tests that trace all the transitions in the design state machine, becomes the unit of test coverage. In a software product line organization all of the other 28 practices impact testing. Some of the testing-related implications of the three categories of practices are shown in Table 5. The appendix contains a more detailed analysis of the relationship of testing to each of the other practice areas. The test method describes the test tools that will be used. We cover these more thoroughly in Section 5.3 but we give one example here. If development will be done in Java, the testers will use JUnit and its infrastructure.
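To make the “switch cover” idea above concrete, the sketch below shows a JUnit test that traces every transition of a small state machine once; the GameClock class and its transitions are invented for this illustration and are not taken from the case study.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Illustrative sketch: a test that traces all the transitions of a design
// state machine, in the spirit of the "switch cover" coverage unit above.
// The GameClock state machine is assumed, not the case-study implementation.
public class GameClockTransitionTest {

    enum State { STOPPED, MOVING, PAUSED }

    static class GameClock {
        private State state = State.STOPPED;
        State state()  { return state; }
        void start()   { if (state == State.STOPPED) state = State.MOVING; }
        void pause()   { if (state == State.MOVING)  state = State.PAUSED; }
        void resume()  { if (state == State.PAUSED)  state = State.MOVING; }
        void stop()    { state = State.STOPPED; }
    }

    @Test
    public void everyTransitionIsTraced() {
        GameClock clock = new GameClock();
        clock.start();  assertEquals(State.MOVING,  clock.state()); // STOPPED -> MOVING
        clock.pause();  assertEquals(State.PAUSED,  clock.state()); // MOVING  -> PAUSED
        clock.resume(); assertEquals(State.MOVING,  clock.state()); // PAUSED  -> MOVING
        clock.stop();   assertEquals(State.STOPPED, clock.state()); // MOVING  -> STOPPED
    }
}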
Table 4. IEEE test plan outline

Introduction: This is the overall plan for product line testing.
Test Items: All products possible from the core assets.
Tested Features: Product features are introduced incrementally. As a product uses a feature it is tested and included in the core asset base.
Features Not Tested (per cycle)
Testing Strategy and Approach: Separate strategies are needed for core assets and products.
Syntax
Description of Functionality
Arguments for tests
Expected Output
Specific Exclusions
Dependencies: The product line test report should detail dependencies outside the organization. Are special arrangements necessary to ensure availability over the life of the product line?
Test Case Success/Failure Criteria: Every type of test should have explicit instructions as to what must be checked to determine pass/fail.
Pass/Fail Criteria for the Complete Test Cycle
Entrance Criteria/Exit Criteria
Test Suspension Criteria and Resumption Requirements: In a product line organization a fault found in a core asset should suspend product testing and send the problem to the core asset team.
Test Deliverables/Status Communications Vehicles: Test reports are core assets that may be used as part of safety cases or other certification procedures.
Testing Tasks (Test Planning, Test Construction, Test Execution and Evaluation)
Hardware and Software Requirements
Problem Determination and Correction Responsibilities: Important that this reflect the structure defined in the CONOPS.
Staffing and Training Needs/Assignments
Test Schedules
Risks and Contingencies: The resources required for testing may increase if the testability of the asset specifications is low. The test coverage may be lower than is acceptable if the test case selection strategy is not adequate. The test results may not be useful if the correct answers are not clearly specified.
Approvals
Table 5. Implications of practice area categories for personnel roles

Category | Core asset development | Product development
Organizational management | Shift traditional emphasis from backend to frontend testing. | Consider the impact of product sequence on development of assets including tests.
Technical management | Coordinate development of core assets with product development. What testing tools are delivered with the core assets? | Provide configuration support for tests as well as development artifacts.
Software engineering | Design for testability. | Use testing to guide integration.
The test method in a data-intensive application development effort might specify that unit testers will create datapools, that is, specially formatted files of data for test cases. The test case then contains commands that retrieve data when the test case is ready to be run. This facilitates sharing test data across different test cases and different software development efforts. In this way datapools help automate the handling of large amounts of data. The method defines techniques and provides examples such as the one below. Listing 4.1 shows the setUp method in a JUnit test class that uses datapools to apply test data to configuration cases. This is a reusable chunk that could be packaged and used by any JUnit test class.
public void setUp() {
  // Initialize the datapool factory
  IDatapoolFactory dpFactory;
  dpFactory = new Common_DatapoolFactoryImpl();
  // Load the datapool file
  IDatapool datapool = dpFactory.load(
      new java.io.File("c:\\courses\\...\\velocityPool.datapool"),
      false);
  // Create an iterator to traverse the datapool
  // (dpIterator is a field of the enclosing test class)
  dpIterator = dpFactory.open(datapool,
      "org.eclipse.hyades.datapool.iterator.DatapoolIteratorSequentialPrivate");
  // Initialize the datapool to traverse the first
  // equivalence class.
  dpIterator.dpInitialize(datapool, 0);
}

Listing 4.1. Datapool snippet
Design for testability. For a given component, if it is implemented in a single-product, custom development effort, we assume that a component C is executed x times in a specified time period. In development where multiple copies of the product are deployed, the same component will now be executed n_c \cdot x times in a specified time period, where n_c is the number of copies. In product line development, the same component is executed

\[ \sum_{i=1}^{n_p} (n_{c_i} \cdot x_i) \]

times in the same time period, where n_p is the number of products in which the component is used, n_{c_i} is the number of copies of a given product and x_i is the number of executions for a given product. In this scenario, assume that the probability of a defect in the component causing a failure is P(d). Obviously the number of failures observed in the product line scenario will likely be greater than in the other two scenarios as long as P(d) remains constant. The expected number of failures can be stated as:

\[ expectedNumFailures = P(d) \cdot \sum_{i=1}^{n_p} (n_{c_i} \cdot x_i) \]

In a product line, a component is used in multiple products, which may have different levels of quality attributes and different types of users. We expect the range of input data presented to a component varies from one product context to another. So, P(d) does not remain constant. If we assume that when a failure occurs in one product, its failure is known to the organization, the number of failures can be stated as:

\[ expectedNumFailures = \sum_{i=1}^{n_p} P_i(d) \cdot (n_{c_i} \cdot x_i) \]
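As a purely illustrative numeric example (the figures below are invented for this sketch, not data from the case study), suppose a component is used in n_p = 2 products, with

\[ n_{c_1} = 1000,\; x_1 = 10,\; P_1(d) = 10^{-4} \qquad \text{and} \qquad n_{c_2} = 50,\; x_2 = 200,\; P_2(d) = 10^{-3}. \]

Then

\[ expectedNumFailures = 10^{-4}(1000 \cdot 10) + 10^{-3}(50 \cdot 200) = 1 + 10 = 11, \]

so the second product dominates the aggregate even though it is deployed in far fewer copies, because its usage profile raises P_2(d).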
We discuss later that this aggregate decrease in testability may require additional testing. Part of developing the production method is specifying the use of techniques to ensure that assets are testable. There are two design qualities that guide these definitions: – Observability: provide interfaces that allow the test software to observe the internal state of the artifact under test. Some languages provide a means of limiting access to a particular interface so that encapsulation and information hiding are not sacrificed. For example, declaring the observation interface to have package visibility in Java and then defining the product and test assets in the same package achieves the objective. A similar arrangement is possible with the C++ friend mechanism. – Controllability: provide interfaces that allow the artifact under test to be placed in a particular state. Use the same techniques as for observability to preserve the integrity of the artifact. Testability is important in a software product line organization because the presence of variability mechanisms makes it more difficult to search through an asset for defects. A level of indirection is introduced by many of the mechanisms. The product line infrastructure may provide tools that provide a test view of the assets. This test view allows test objects to observe the state of the object under test and to set the state to any desired value. A module can provide application behavior through one interface that hides its implementation while providing access to its internals through another, transparent interface. The transparent test interface may be hidden from application modules but visible to test modules. One approach is to use package level visibility and include only the module and its test module. Or, the test interface may be protected by a tool that checks, prior to compilation, for any reference to the test interface other than from a test module.
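A minimal sketch of the package-visibility approach mentioned above follows; the Puck class, its field and the accompanying test are assumptions used only to illustrate the observability and controllability interfaces.

// File: coreAssets/Puck.java
package coreAssets;

public class Puck {
    private int velocity;

    // Application behavior exposed through the public interface.
    public void tick() { /* move by velocity, apply game physics, ... */ }

    // Observability: package-private, reachable only from classes in the
    // same package, such as the test classes bundled with this core asset.
    int observeVelocity() { return velocity; }

    // Controllability: place the object in a chosen state before a test.
    void forceVelocity(int v) { this.velocity = v; }
}

// File: coreAssets/PuckTest.java (same package, so the test view is reachable)
package coreAssets;

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class PuckTest {
    @Test
    public void tickPreservesForcedVelocity() {
        Puck puck = new Puck();
        puck.forceVelocity(3);                   // controllability
        puck.tick();
        assertEquals(3, puck.observeVelocity()); // observability
    }
}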
Test coverage. As discussed in Chapter 1, test coverage is a measure of how thorough the search has been with the test cases executed so far. The test plan defines test coverage goals for each test point. Setting test coverage levels is a strategic decision that affects the reputation of the organization in the long term and the quality of an individual product in the short term. It is also strategic because the coverage goals directly influence the costs of testing and the resources needed to achieve those goals. “Better” is often one of the goals for products in a product line where better refers to some notion of improved quality. One approach to achieving this quality is to allow the many users of the products find defects and report them, so they can be repaired. However, in many markets letting defects reach the customer is unacceptable. An alternative is to require more thorough test coverage of the core assets compared to coverage for typical traditional system modules. The earlier discussion on testability leads to the conclusion that traditional rules of thumb used by testers and developers about the levels of testing to which a component should be subjected will not be adequate for a software product line environment. The increased number of executions raises the likelihood that defects will be exposed in the field unless the test coverage levels for in-house testing are raised correspondingly. This is not to say that individual users will see a decline in reliability. Rather, the increased failures will be experienced as an aggregate over the product line. Help desks and bug reporting facilities will feel the effects. If the software is part of a warranted product, the cost of repairs will be higher than anticipated. The weight of this increase in total failures may result in pressures, if not orders, to recall and fix products [244]. The extra level of complexity in the product line—the instantiation of individual products—should be systematically explored just as values are systematically chosen for parameter values. We refer to these as configuration cases because they are test cases at one level but are different from the dynamic data used during product execution. A product line coverage hierarchy might look something like this: – select configurations so that each variant at each variation point is included in some test configuration, see section 5.3; obey any constraints that link variants – select variants that appear together through some type of pattern even though there is no constraint linking them – select variant values pair-wise so that all possible pairs of variant values are tested together – select higher order combinations. Other coverage definitions can be given for specific types of artifacts. Kauppinen and others define coverage levels for frameworks. The coverage levels are defined in terms of the hook and template implementations in a framework [190]. The coverage is discussed for feature-level tests in terms of an aspect-oriented development approach. 5.3 Test Construction In a product line organization, tests are executed many times: – as the artifact it tests is iteratively refined and – as the artifact it tests is reused across products.
The test assets must be constructed to meet these requirements. This includes: – technologies to automatically execute test cases and – technologies to automatically build configuration cases. The technologies and models used to construct test assets should be closely related to the technologies and models used to construct the product assets. This coordination occurs during production planning. As test software is constructed, the design of the products is considered to determine the design of the test artifacts. There are several types of variation mechanisms. Here is one high level classification of mechanism types [113]: – Parameterization: Mechanisms here range from simply sending different primitive data values as parameters to sending objects as parameters to setting values in configuration files. – Refinement: Mechanisms include inheritance in object-oriented languages and specializers that use partial evaluation. – Composition: Container architectures in which components “live” in a container that provides services. – Arbitrary transformation: Transformations based on underlying meta-models that take a generic artifact and produce an asset. In the following we discuss factors related to these mechanisms. Binding time. The variation mechanisms differ in many ways, one of which is the time at which definitions are bound. There is a dynamic tension between earlier binding that is easier to instrument and verify and later binding which provides greater flexibility in product structure. The choice of binding time affects how we choose to test. Mechanisms that are bound during design or compile time can often be statically checked. Mechanisms that bind later must be checked at runtime and perhaps even over multiple runs since their characteristics may change. In an object-oriented environment, some mechanisms are bound to classes while others are bound to objects. This generally fits the static/dynamic split but not always. It is possible to verify the presence of statically defined objects in languages such as Java. This is a particularly important issue since these static objects are often responsible for instantiating the remainder of the system to test. Test architecture. A software product line organization is usually sufficiently large and long lived that the tools and processes need to be coordinated through an architecture. The test architecture controls the overall design of the test environment. Several research and experience reports point to the fact that the design of the test environment should parallel the design of the product. Cummins Engines, a Software Product Line Hall of Fame member, has experience that reinforces the notion of the product architecture and the test architecture having a similar shape. “Tests must be designed for portability by leveraging the points of variation in the software as well as in the System Architecture” [346]. In fact at a top level they view the test environment architecture as a natural part of the product line architecture.
A product line test architecture must address some specific range of variability and the accompanying variety of binding times. If the range in the product line is very large, it may be reasonable to have multiple architectures and this usually happens between test points. The architecture for unit testing usually needs to have access to the internals of components and will tie into the security model of the programming language. The architecture for a GUI tester will likewise tie into the windowing model. For the very latest binding mechanisms it is necessary to tie into the execution environment such as the JVM for a Java program. Aspect-oriented techniques. Aspect-oriented programming is one technology used to implement variation in core assets. An aspect is a representation of a cross cutting concern that is not the primary decomposition. Research has shown that for code that is defined in terms of aspects, the test software should also be defined in terms of aspects [197]. In that way, the test software is bound at the same time as the product software. We do not give a tutorial on aspect-oriented programming here but we talk about some characteristics of aspects that make them a good variation mechanism. An aspect is a design concern that cuts across the primary decomposition of a design. Canonical examples include how output or security are handled across a product. Aspect-oriented techniques do not require any hooks to exist in the non-aspect code before the aspects are added. The aspect definition contains semantic information about where it should be inserted into the code. An aspect weaver provides that insertion either statically or dynamically depending upon the design. Shown below is the paint method for a DigitalScoreBoard that is being used in products that use the Java Microedition. The microdedition is for small devices such as cellphones. A different paint method is used when the base code for the scoreboard is used in the Java Standard Edition. The rest of the scoreboard code is common. package c o r e A s s e t s ; import j a v a x . m i c r o e d i t i o n . l c d u i . G r a p h i c s ; public aspect DigitalScoreBoardA { p r i v a t e s t a t i c f i n a l i n t c o l o r = 255 < <16 | 255 < <8 | 0 ; v o i d a r o u n d ( S c o r e B o a r d gsb , G r a p h i c s g ) : c a l l ( v o i d S c o r e B o a r d . p a i n t ( G r a p h i c s ) ) && t a r g e t ( g s b ) && a r g s ( g ) {g . s e t C o l o r ( c o l o r ) ; S t r i n g s c o r e S t r = gsb . s c o r e ( ) ; int o f f s e t = ( g . getFont ( ) . stringWidth ( s core Str )) / 2; g . d ra w St ri n g ( s c o r e S t r , gsb . g e t L o c a t i o n ( ) . getRealX ( ) − o f f s e t , gsb . g e t L o c a t i o n ( ) . getRealY ( ) , G r a p h i c s . TOP | G r a p h i c s . LEFT ) ; } }
Aspect-orientation is a useful variation mechanism, particularly for unanticipated variation since no hooks are needed. Implementations of aspect-oriented programming such as AspectJ provide a variety of design constructs including inheritance and composition. An aspect adds a fragment of behavior to the existing behavior of an object. This can be exploited for testing. In figure 9 classes are defined that implement two types of scoreboards for the product line. Each class defines all of the behavior needed to keep score but does not define the method needed to present it on the display. Two types of
Fig. 9. Aspect design including tests
display are supported via aspects. One is for use with the Java Micro Edition and the other is for use with the Java Standard Edition. A test class is defined for each scoreboard class. Then a fragment of behavior is added to the test class that examines the graphics portion of the class behavior. There may be multiple test cases defined in the test aspect.
Assembly. Core assets may be assembled from smaller pieces to incorporate the selected variants. These variants are bound at the time of this assembly. Assets, such as unit tests, may be composed to provide an increase in reuse opportunities [131]. One technology used for this purpose is the XML-based Variant Configuration Language (XVCL). The language provides a set of XML tags that are processed by an engine. Non-control XML tags and all other content are simply streamed out. XVCL can be used to produce files that obey any format.
Table 6 shows a small section of XVCL code. The set statements set variables to hold the names of the two output files. The adapt statements feed the processor with XVCL code fragments. The one XVCL program will output both the asset, to out.txt, and the test asset, to testout.txt. This facilitates traceability by closely associating the product and test code.
The assembly approach requires that a complete implementation be developed and then decomposed into the appropriate chunks, breaking apart into regions of commonality. A set of pieces is then created for each variant. The variant pieces can be produced as needed or produced up front for later use.
Table 6. XVCL segment

<x-frame name="main.xvcl">
  <set-multi var="variationPoints" value="vp1.1,vp1.2"/>
  <set var="codefile" value="c:/out.txt"/>
  <set var="testfile" value="c:/testout.txt"/>
One advantage of XVCL is that it can be used with technologies that have no concept of variation. For example, a user’s manual can be split at points which correspond to variation points in the product. Each piece is captured in an XVCL frame. Then control logic is added to select the correct sub-frames based on variant choices. The hierarchy of frames is resolved based on the variant choices. Each frame that is accepted is written to the output. This approach can handle RTF files, XML files, text files, or any other format. From the example in Table 6 you can see that multiple outputs can result from the specification of one product. More is needed. XVCL will traverse a frame hierarchy but the resulting output needs to be compiled, linked or jarred and perhaps incorporated into an installer package. The Buckminster project of the Eclipse Foundation provides a realization service that will use tools such as XVCL to extract files from a repository and then apply a variety of tools. This tool chain allows us to blend assets to reduce traceability problems between product and test assets, and other assets such as service manuals and other supporting information. Generation. Automatic generation is the most widely used technique. Object code is generated from source code written in a “higher-level” language. Model driven development (MDD) moves this up a level of abstraction by using design models and templates as the “source code.” The model provides the situation-specific information. The templates embody patterns. The generator replaces the variability in a template with information from the model. The output is an instantiation of the pattern in an output format chosen when the template was designed. The advantage of this approach is in the generality of the templates. A template that can be used to generate the shell for any Java class is much more useful than a template that can only be used to generate an algorithm designed to solve a problem in one product. Templates are useful for situations where the time required to abstract out the variabilities is more than compensated for by the time that would be required to complete each individual occurence of the code pattern. For example, intricate algorithms,
like calculating the Doppler effect between a satellite and its ground station can be captured once in template. A second advantage is that a review of the template provides a partial review of the code that results from instantiating the template. The primary disadvantage of generation is the effort required to achieve the appropriate level of abstraction. Failure to attain that level results in a generator that produces very incomplete code that needs much manual work. The template instantiation process produces output that may be source code in a programming, or scripting, language. This must still be converted into an executable form by compiling the output. A second disadvantage is the difficulty of coordinating the application of multiple templates. Test patterns, discussed in Section 5.3, are used as the basis for templates used to generate test software. One pattern is the “Test the product from the perspective of the user.” This can be instantiated in a script for a GUI-based test tool such as Abbot [289]. This activity follows a process like the following: – The initial script is generated by operating the product by hand, in record mode – The resulting script is captured, converted into a template by replacing variation points with variables, and adding control tags – The template can be instantiated multiple times with specific values chosen from a test case model. Generation requires more planning and initial effort than simply writing a JUnit test case. This effort is rewarded across the product line by greatly reduced effort for subsequent products. Combinatorial testing. The variability in a product line leads to an explosion of interactions that would quickly require billions and trillions of test cases to test exhaustively. Combinatorial test design techniques can just as quickly reduce the number of tests needed to achieve effective levels of test coverage to a manageable number, say about 30 to 60. The test designs from a combinatorial design technique are based on defining combinations of parameter values. At the simplest level all pairs of parameter values are selected. Most combinatorial techniques can be adjusted to generate 3-way, 4-way up to n-way possibilities. Most reports have found that pair-wise testing will result in 90%+ test coverage. The usual use of combinatorial tests is in the selection of multiple parameters for unit testing of methods. In a product line, there is a second place where combinatorial testing is useful—configuration cases in which products are instantiated for testing. Combinatorial techniques are used to select the configuration for a product. Then each instantiated product is executed using an instantiated set of data values chosen using combinatorial design. This approach is referred to as Design of Experiments in some literature. The orthogonal array approach of Taguchi has been automated in several tools. Minitab, and other statistical packages, can be used to generate the combinatorial designs [237]. We illustrate this with a short example from the AGM case study: – State the test scenario: Each AGM game is to be able to run on multiple platforms and the optional scoreboard is one of the display items that may expose problems between platforms
Fig. 10. Minitab setup of Taguchi experiment
– Choose the number of factors you want: This will be a 3 factor experiment – Identify and list the factors that are involved: The Game, the operating system and the type of scoreboard – Identify and list the levels for each factor: For Game the levels are Brickles, Pong and Bowling. For operating system: windows, apple and linux. For the scoreboard: digital scoreboard, analog scoreboard and no scoreboard – Choose the appropriate design in minitab: See Figure 10 – Map the design onto the problem data: See Figure 11 Each row in the orthogonal array is a test vector that defines the configuration of a product. Each configuration is then subjected to a set of test data. This table can be outputed as a data file that JUnit uses as a datapool. The Junit test cases read the configuration cases, create an object under test (OUT) for each configuration, execute all the test cases on each OUT and publish the results. Test patterns. Test patterns represent a useful mid-level test coverage criteria. Typically a test pattern represents a portion of the design that cuts across architectural boundaries and is large enough to have interesting behavior and therefore interesting defects. Running test cases that are based on the applicable test patterns uncovers interactions among a set of units.
Fig. 11. Orthogonal array for top level AGM example
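To show the same configuration cases in code rather than in Minitab, the sketch below hardcodes three columns of a standard L9 orthogonal array over the factors listed above; it is an illustrative assumption, not output generated by the case-study tooling, but every pair of factor levels appears in at least one of its nine rows.

// Illustrative sketch: nine configuration cases giving pairwise coverage of
// the 3 x 3 x 3 factor space (game, operating system, scoreboard).
public class AgmConfigurationCases {

    static final String[] GAME  = { "Brickles", "Pong", "Bowling" };
    static final String[] OS    = { "Windows", "Apple", "Linux" };
    static final String[] BOARD = { "digital scoreboard", "analog scoreboard", "no scoreboard" };

    // Each row is a test vector of indexes into GAME, OS and BOARD;
    // the rows are three columns of a standard L9 orthogonal array.
    static final int[][] L9 = {
        { 0, 0, 0 }, { 0, 1, 1 }, { 0, 2, 2 },
        { 1, 0, 1 }, { 1, 1, 2 }, { 1, 2, 0 },
        { 2, 0, 2 }, { 2, 1, 0 }, { 2, 2, 1 }
    };

    public static void main(String[] args) {
        for (int[] row : L9) {
            System.out.printf("configure: game=%s, os=%s, scoreboard=%s%n",
                    GAME[row[0]], OS[row[1]], BOARD[row[2]]);
        }
    }
}

Each printed line corresponds to one row of the orthogonal array in Figure 11: a product configuration that is then exercised with its own combinatorially chosen test data.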
There are numerous sources of test patterns. Every design pattern provides the opportunity to define a corresponding test pattern. A number of test patterns have been discovered [246,41]. The Mock Object test pattern has been implemented in several frameworks that generate the mock objects automatically [226]. The pattern approach is also useful for automating test generation. Generation usually calls for a template that is instantiated with variant choices. The template is nothing more than a pattern captured in a templating language such as Java Emitter Templates (JET) [331]. For example, the Test Observer pattern is associated with the Observer design pattern. In many instantiations, the Observer is supposed to initiate some action upon receipt of an event of a specific type. The Observer is attached to what it is observing. The TestObserver pattern is shown in Figure 12.
Fig. 12. TestObserver pattern
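The prose below describes how the pattern works; as a minimal Java sketch of the same structure (every type name here is assumed for illustration), a test observer subscribes to the same event as the observer under test and checks the reaction when the event arrives.

import static org.junit.Assert.assertEquals;
import java.util.ArrayList;
import java.util.List;
import org.junit.Test;

// Illustrative sketch of the Test Observer pattern; all names are assumed.
public class TestObserverSketch {

    interface ScoreListener { void scoreChanged(int newScore); }

    // Subject being observed.
    static class ScoreBoardModel {
        private final List<ScoreListener> listeners = new ArrayList<>();
        void addListener(ScoreListener l) { listeners.add(l); }
        void setScore(int score) {
            for (ScoreListener l : listeners) l.scoreChanged(score);
        }
    }

    // Observer under test: reacts to the event by updating its display text.
    static class ScoreLabel implements ScoreListener {
        String text = "";
        public void scoreChanged(int newScore) { text = "Score: " + newScore; }
    }

    // Test observer: registered for the same event, verifies the reaction.
    static class ScoreLabelTestObserver implements ScoreListener {
        private final ScoreLabel observed;
        ScoreLabelTestObserver(ScoreLabel observed) { this.observed = observed; }
        public void scoreChanged(int newScore) {
            // Query the observer under test once the shared event arrives.
            assertEquals("Score: " + newScore, observed.text);
        }
    }

    @Test
    public void labelReactsToScoreEvents() {
        ScoreBoardModel model = new ScoreBoardModel();
        ScoreLabel label = new ScoreLabel();
        model.addListener(label);                               // observer under test first,
        model.addListener(new ScoreLabelTestObserver(label));   // then the test observer
        model.setScore(7);
        model.setScore(12);
    }
}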
Fig. 13. Abbot tester
The Test Observer registers for the same event as the Observer it is testing. When it receives the expected event it queries the Observer to determine that it has reacted as expected. It may also interrogate the object being observed. The Test Observer tests the interaction of the Observer, its observed object and any objects that the Observer is supposed to notify. A product line organization will identify some patterns that it is using to implement specific behavior. Developing test cases from these patterns will ensure that the patterns are being implemented correctly. The organization may decide to automate the generation of code from some of the more widely used patterns. There are many techniques for constructing test artifacts. We have only presented a few because most of these techniques are not specific to product lines. With the wide range of test points in a product line effort most existing test techniques are applicable at some points. 5.4 Test Execution and Evaluation Tests should be executed early and often. The results of executing the tests should be accumulated so that a history of data used and pass/fail rates can be used to support the
maintenance of the individual core assets. This data can also be used as a measure of the health of the core asset base. Depending on the development method used, the unit tests may be executed beginning on the first day of code development or they may not even be created until the unit is largely implemented. Writing the tests can be a way of testing one’s understanding of the requirements on the unit and can raise questions early that will reduce the need to rewrite code later. The development environment should support the rapid execution of tests and concise reporting of results. Advances in testing graphical user interfaces allow increased automation. Figure 13 shows a test run with the Brickles product. Much of the implementation of this test suite can be automatically generated or the record and playback mechanism can be used. Listing 4.2 shows a fragment of the Abbot test script. The Abbot tester can test GUI-based programs using a record/play back approach. The tester plays the game in record mode and successfully completes a sequence of moves. The session can then be played back after changes are made to the code to ensure it still behaves correctly. Certain changes to the graphics would require that the script be altered before being used again. This code can be generalized and then automatically generated by taking a recorded session and rewriting it as a JET template. Listing 4.3 shows an example of the JET template script.
6 The Business of Testing
A software product line organization is founded on a sound business case and any major activity is usually justified in the same manner. The Structured, Intuitive Model of Product Line Economics (SIMPLE) is one technique for quantifying the justification [45,46]. The model defines four cost functions, two of which are fixed costs and two are per product costs. The model also defines a benefits function. The basic formula is:

\[ \sum_{j=1}^{m} \mathit{Benefits}_j \;-\; \Big( C_{org} + C_{cab} + \sum_{i=1}^{\mathit{NumProducts}} \big( C_{unique_i} + C_{reuse_i} \big) \Big) \]
Each of these functions is applied to all of the product line assets. In the discussion that follows we focus on the testing assets. The SIMPLE approach supports thought experiments about alternative techniques. In the following we give the meaning of each function with respect to testing assets. Corg - The cost of resolving organizational issues related to testing. All personnel who assume a testing role must be trained in that role. Organizational management will participate in developing the production strategy and the organization’s test strategy. Ccab - The cost of the core assets for testing. The test infrastructure and other testing tools must be acquired and perhaps adapted. This test infrastructure supports all codebased test points. The core asset team will extend the basic infrastructure to handle specific test issues. There is on-going expense as requirements evolve and defects are reported. Test case software is written as new classes are defined in response to new features and bug fixes and as new combinations of product assets are aggregated into new subsystems.
Cuniquei - The cost of the unique test software that must be built for a specific product or other single-use test situation. Some portion of every product is unique relative to the other products. Test cases, and perhaps additional test drivers, must be created and must be executed. Creusei - The cost of getting ready to reuse the testing core assets. Despite efforts to build good assets, there will be some learning curve before the testers can effectively use the core assets. The attached processes are intended to reduce this learning curve.
extends ComponentTestFixture {

    public void testTestMethod1_PlayGame() {
        // brickles.TODO();
        assertTrue(true);
        // assertTrue(brickles.TODO == true);
    }

    public void testTestMethod2_HitBrick() {
        brickles = new BricklesView();
        brickles.preinit(brickles);
        System.out.println("Start Test 1");
        int counter = 0;
        int brickGone = -1;
        while (brickles.getPuck()._kaput == false) {
            System.out.print(" ");
            brickles.moveMouse(brickles.getPuck().getPosition());
            BrickPile pile = brickles.getBrickPile();
            counter = 0;
            for (int i = 0; i < pile.getSize(); i++) {
                Brick b = pile.getBrickAt(i);
                if (b.isHit()) {
                    brickGone = i;
                }
                counter++;
            }
            if (brickGone != -1)
                break;
        }
        assertTrue("Brick is gone",
            brickles.getBrickPile().getBrickAt(brickGone).isHit() == true);
        System.out.println(counter);
        assertTrue(counter - 1 == 11);
    }
Listing 4.2. Abbot test script
Benefits - The benefits realized as a result of incurring the costs. In the case of testing, these benefits are not easy to quantify. Benefits such as customers experiencing higher quality, reputation retained, litigation avoided, and other risks reduced can be estimated. Finally, m is the number of benefit expressions. Obviously, reducing the per-product costs is the best way to reduce overall costs. Automation of test creation and execution will reduce Creuse. A thorough commonality and variability analysis will reduce the variability to the minimum and will identify many of the variation points. This will reduce Cunique. Ganesan and others used a variation on the basic SIMPLE formula to compare testing strategies [132]. They compared a strategy in which each product was tested individually from scratch to a strategy in which an infrastructure was tested once and the product
testing was limited to the product-specific behavior. Their study showed a 13% savings in test costs. They did not consider a number of known reduction techniques discussed in this chapter. They did use Monte Carlo simulations, which have been used in a number of economic studies related to product lines to compensate for the uncertainty of the estimates of model parameters. The longevity of a product line would magnify inaccuracies. They ran thousands of trials, varying the input parameters. The 13% savings forecast is an estimate based on those simulation runs.
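To make the shape of such a comparison concrete, consider a purely hypothetical product line of three products, with all figures invented for illustration: let the benefits sum to 300 (in some cost unit), Corg = 40, Ccab = 120, and, for every product, Cunique = 20 and Creuse = 10. Then

\[ 300 - \big( 40 + 120 + 3 \times (20 + 10) \big) = 300 - 250 = 50. \]

The same structure supports the thought experiments mentioned above: if better attached processes halved Creuse to 5 per product, the net benefit would rise from 50 to 65, whereas a change to the fixed Ccab investment affects every product at once.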
public void te s tT e s tM e thod
( ) { / / b r i c k l e s .TODO ( ) ; c : when > b r i c k l e s = new B r i c k l e s V i e w ( ) ; brickles . preinit ( brickles ); System . o u t . p r i n t l n ( " S t a r t T e s t 1 " ) ; i n t c o u n t e r = 0 ; i n t b r i c k G o n e = −1; w h i l e ( b r i c k l e s . g e t P u c k ( ) . _ k a p u t == f a l s e ) { System . o u t . p r i n t ( " " ) ; b r i c k l e s . moveMouse ( b r i c k l e s . g e t P u c k ( ) . getPosition ( ) ) ; BrickPile pile = brickles . getBrickPile ( ) ; c o u n t e r =0; f o r ( i n t i = 0 ; i < p i l e . g e t S i z e ( ) ; i ++) { Brick b = p i le . getBrickAt ( i ) ; i f ( b . i s H i t ( ) ) { brickGone = i ; } c o u n t e r ++; } i f ( b r i c k G o n e != −1) break ; } c : when > c : c h o o s e > c : i t e r a t e > a s s e r t T r u e ( true ) ; / / a s s e r t T r u e ( b r i c k l e s . TODO == / / ) ; c : when > a s s e r t T r u e ( " B r i c k i s gone " , b r i c k l e s . g e t B r i c k P i l e ( ) . getBrickAt ( brickGone ) . i s H i t ( ) == t r u e ) ; System . o u t . p r i n t l n ( c o u n t e r ) ; a s s e r t T r u e ( c o u n t e r − 1 == ) ; c : when > c : c h o o s e > c : i t e r a t e > } c : i t e r a t e >
Listing 4.3. JET template script
Another SIMPLE-based study that has implications for testing showed that over time the core asset base needs to be refreshed, which requires an infusion of funding. Ganesan et al. considered the notion of a product line generation [133]. When a new generation is created, the same test obligation will exist as during the initial creation. Through the use of incremental techniques the effort can be reduced, but it must be planned as part of the overall product line planning.
7 Case Study
The Arcade Game Maker product line from AGM is a set of three nostalgic video games: Brickles, Pong, and Bowling. AGM wants to use an iterative, incremental approach in which the three products are released to three different markets over a relatively short period of time. Besides the obvious variation in the types of game, there is also variation in platform. The first increment was a .NET implementation; the second and third were Java-based. The second was for cellphones and PDAs and was based on J2ME. The third was a more generic version and was based on J2SE. During production planning, AGM decided to set a goal of finding 90% of the defects no later than unit test, 9% at integration test, and the remaining 1% at system test time. As their approach matures, they would expect to find an increasing portion of requirements defects during the review process, prior to any testing.
7.1 Guided Inspection
AGM has used a review process for many years, but a detailed study revealed that too many design defects were being coded because they escaped detection in design reviews. AGM decided that, along with the new product line strategy, they would implement a more disciplined review process. Their informal review process was not finding things that were not there; that is, the review process did not define sufficient context to allow for identifying missing pieces. The inspection team was trained to first develop a set of scenarios that capture the essential features of the artifact to be reviewed. In many cases the scenarios could be derived from the use cases being developed as part of the requirements effort. The scenarios are prioritized based on risk, and as many as possible are applied.
7.2 Core Asset Development
During core asset development, guided inspection was used extensively to evaluate the requirements, architecture, and detailed design models. The team found that the guided inspection process was very effective and eliminated a large number of trivial defects as well as several classes of major ones. They had greater confidence that the models used to guide the development of code were accurate and would result in good code.
Unit tests. In order to achieve their goal of 90% of defects found, the test plan specified a coverage level termed "every branch" coverage for structural tests and an "every variant" level of coverage for parameters.
Integration tests. AGM used a number of design patterns in defining subsystems. The integration test team used test patterns that correspond to the design patterns used in the design. These tests exercised variant combinations among the pieces being integrated. A number of defects were found that resulted from a failure to make the architecture sufficiently specific.
Sample system tests. AGM used a combinatorial test design to select specific test configurations for testing at the core asset level. Not every product had all the features it would have once the product team added the unique portion, but each had the core functionality. This testing gave the core asset team valuable feedback about their assets and generated a To-Do list of enhancements for the second release.
7.3 Product Development
The product development team benefited from all the testing done by the core asset team. The product team used guided inspection where the unique features made up a significant portion of the product. The product team found only a few defects in the core assets as they integrated their product-unique parts. Most of the defects were within the unique code. In a few cases there was a mismatch between the core assets and the product-unique piece. These were mostly due to conflicting assumptions that did not show up in testing, since the same assumptions were applied there as well.
8 Issues
In this section we present more questions than answers, but we illustrate some of the issues on the cutting edge of research.
8.1 Core Assets
What is a core asset? Traditional product line approaches think of a core asset as discipline-related, such as a requirements document; but what if the core assets are divided functionally? Is a core asset then the set of hardware, software, test documentation and results, and other items related to a functional subsystem that will be reused as a unit? In the first case, the set of all test suites for a test point in the product line is a core asset. In this scheme, traceability must be maintained between a specific test suite and the software it is used to test. In the second case, a set of test suites is grouped with what it tests. In this scheme, traceability must be maintained between a test suite and any other test suites it uses or is derived from. A related issue is what constitutes a configuration item and a configuration. In some product line organizations, the core asset base as a whole is one configuration item. Each release is a release of all the core assets. This usually results in a lengthy wait for the next release and a long wait for fixes to identified problems. A finer-grained approach can be used. Some natural breakpoint is chosen, such as the architectural subsystem level or module level. A release is a new version of a subsystem or module.
create OUT : GameElements from IN : MOF;
-- get BarrierBehavior from original file in a Sequence
helper context MOF!EPackage def : getBB() : Sequence(MOF!EClass) =
    MOF!EClass.allInstances()->asSequence()
        ->iterate(e; acc : Sequence(MOF!EClass) = Sequence{} |
            if e.eSuperTypes.toString() = 'Sequence{IN!BarrierBehavior}'
            then acc.append(e)
            else acc
            endif
        );
-- getRule from original ecore file in a Sequence
helper context MOF!EPackage def : getRule() : Sequence(MOF!EClass) =
    MOF!EClass.allInstances()->asSequence()
        ->iterate(e; acc : Sequence(MOF!EClass) = Sequence{} |
            if e.eSuperTypes.toString() = 'Sequence{IN!Rule}'
            then acc.append(e)
            else acc
            endif
        );
Listing 4.4. ATL code snippet
This approach results in a faster cycle time, with releases of bug fixes happening more often. The implication for testing is that the faster the cycle time, the faster the testing process must react. Testing must be as agile as the development work. How do we do agile testing?
8.2 Software Certification
A related issue: how do government regulators regard reuse of certified assets? What has to be done to ensure that the regulations are met in an acceptable way? Certification of software products by regulatory agencies usually requires specific evidence, including test results. The familiar phenomenon of emergent behavior confounds the easy answer of testing each subsystem to a high degree of coverage and then assuming that a product assembled from these subsystems is of equally high quality. This complicates the product line approach of maintaining an inventory of core assets that are assembled to form products. One possible approach is to store a fragment of a verification model with each asset. When the assets are assembled, the fragments of the verification model are assembled. Some form of incremental model checking algorithm might be possible. Another approach is to structure the product line as a completed supersystem from which optional pieces are removed when not needed, but that does not work well when we have alternative features. Is it possible to have emergent behavior in the remaining system when pieces are taken away from that system?
8.3 Meta-data
"Meta-data" is data about data. This data can be used to reason about the data described by the meta-data. Our research is considering what meta-data should be collected about
Fig. 14. Toolchain for development/test
tests. Current examples include only keeping the usual data, such as the name of the person who constructed the test case and when it was constructed [250]. We believe that an approach similar to Java annotations, in which the meta-data are directly associated with the source code segments, can be used by tools to manage variability and automate processes. Meta-data conforms to a meta-model. The meta-model defines the allowable structure of the meta-data. Our current research is focusing on defining the appropriate meta-model for test data. Current efforts at automation, discussed below, are based on existing meta-models.
8.4 Automation
Although we have talked about a number of ways to automate, this is still an issue with much work left to be done. Our research group is currently investigating a tool chain that would provide a more comprehensive approach to automation [178]. Our premise is that the constrained scope of a product line provides a sufficiently narrow context that a domain-specific language (DSL) can be profitably used. We take practices from the framework and exploit their relationships with each other. We begin by Understanding Relevant Domains through the development of an ontology. We use the Web Ontology Language (OWL) [333] to capture the concepts in the domain. The ontology is transformed into a DSL by adding constraints, cardinality, and other semantic information. The Atlas Transformation Language (ATL) [185] is used to transform the output of one tool into the input suitable for another tool. Meta-models are defined for both the source and target of the transformation. Then functions, or rules as shown in Listing 4.4, map an element from the source to elements in the target. In the create statement of the ATL code, notice that the OUT is producing a model called GameElements while the IN is
taking a model written in MOF. Rules such as the helper rule define what to do with each legal symbol from the IN model. ATL provides extensive facilities for transforming a model written in one modeling language into a different modeling language. The Ecore meta-model is accompanied by a set of tools that apply general templates to generate utility applications to work with the model. We can automatically build editors from the Ecore model that will support writing use cases for the product line or specific products. The use cases follow the stimulus and response pattern. A similar editor can be generated that supports the development of test cases. A mapping is established between the concepts in the DSL and the methods in the implementations of the concepts. This mapping is exploited both for developing the product functionality and the test case semantics. The mapping is supplemented with patterns, both standard design patterns and domain specific patterns, which are captured in templates. Figure 14 shows the comprehensive flow. Yet to be determined is exactly how much of the process can be automated.
9 Conclusion The software product line development strategy provides the opportunity for significant advances in product production. The strategic levels of reuse make it profitable to spend more time on assets. We can (and in some cases must) reach for more thorough levels of test coverage to compensate for the wider range of inputs that a module will experience across all the products. The range of inputs depends on the variation among the products. The scope of the product line provides a constraint on the variability. This constraint allows us to make assumptions that facilitate automation. To achieve some of the aggressive goals of a product line organization, more of the development steps must be automated. The initial cost of this increased automation is amortized over the multiple products. The automation comes from specific models and tools that are used. Using meta-models to capture abstractions that define the product line provides a basis for creating specific product models rapidly. From those models much of the product can be generated. Tests as well as product behavior can be generated, sometimes from the same models. Our current research is taking us deeper into representations and restrictions. We continue to get experience with product line creation and sustainment. In particular we are looking at the relationships between development and testing and how representations can unify the two.
Parameterized Unit Testing with Pex: Tutorial Nikolai Tillmann, Jonathan de Halleux, and Wolfram Schulte Microsoft Research One Microsoft Way, Redmond WA 98052, USA {nikolait,jhalleux,schulte}@microsoft.com
This hands-on tutorial teaches the principles of Parameterized Unit Testing in Visual Studio with Pex, an automatic test input generator. A parameterized unit test (PUT) is simply a method that takes parameters, calls the code under test, and states assertions. Given a PUT written in a .NET language, Pex automatically produces a small test suite with high code and assertion coverage. Moreover, when a generated test fails, Pex can often suggest a bug fix. To do so, Pex performs a systematic program analysis, similar to path-bounded model checking. Pex learns the program behavior by monitoring execution traces, and uses a constraint solver to produce new test cases with different behavior. At Microsoft, this technique proved highly effective in testing even an extremely well-tested component. From a specification, the developer (1) writes parameterized unit tests in C# to reflect the specification, and (2) develops code that implements the specification. The tutorial outlines key aspects to make this methodology successful in practice, including how to write mock objects, as well as the theoretical foundations on which Pex is built. This document is separated into two main parts. The first part provides detailed walkthrough exercises on unit testing in Section 2, the methodology of Parameterized Unit Testing in Section 3, the usage of the Pex tool in Section 4, and ways to deal with the environment in Section 5. The second part is for the advanced reader. It provides a background on white box testing techniques in Section 6, and discusses in detail various aspects of the Pex tool in Section 7. Section 8 gives an overview of related work. Finally, Section 9 concludes. This tutorial refers to Pex version 0.18. The latest version of Pex can be obtained from http://research.microsoft.com/Pex/.
1 Introduction
Unit tests are becoming increasingly popular. A recent survey at Microsoft indicated that 79% of developers use unit tests [325]. Unit tests are written to document customer requirements, to reflect design decisions, to protect against changes, but also, as part of the testing process, to produce a test suite with high code coverage that gives confidence in the correctness of the tested code. The growing adoption of unit testing is due to the popularity of methods like XP ("extreme programming") [31], test-driven development (TDD) [30], and test execution frameworks like JUnit [186], NUnit [322] or MbUnit [90]. XP does not say how
and which unit tests to write. Moreover, test execution frameworks automate only test execution; they do not automate the task of creating unit tests. Writing unit tests by hand can be a laborious undertaking. In many projects at Microsoft there are more lines of code for the unit tests than for the implementation being tested. Are there ways to automate the generation of good unit tests? We think that Parameterized Unit Testing is a possible answer, and this is the topic of the tutorial. We describe how to design, implement and test software using the methodology of Parameterized Unit Testing [314,315], supported by the tool Pex [276,263]. Pex, an automated test input generator, leverages dynamic [141] symbolic execution [194] to test whether the software under test agrees with the specification. As a result, software development becomes more productive and the software quality increases. Pex produces a small test suite with high code coverage from Parameterized Unit Tests. In effect, we combine two kinds of testing introduced in Chapter 2: 1) Testing for functional properties: Just as unit tests, Parameterized Unit Tests usually serve as specifications of functional properties. 2) Structural testing: We analyze such tests with dynamic symbolic execution, a structural testing technique. To facilitate unit testing, mock objects are often used to isolate the test from the environment. We extend this notion to parameterized mock objects, which can be viewed as a model of the environment. Writing such parameterized mock objects, and generating tests with them, is in effect a form of model-based testing (see Chapter 3). The effectiveness of a test suite, whether written by hand, or generated from parameterized unit tests, can be measured with mutation testing (see Chapter 8). Parameterized Unit Tests are algebraic specifications [38] written as code. Another name for this concept is theories [291,293] in the JUnit test framework. They appear as row tests in MbUnit [90], and under other names in various other unit test frameworks. While Pex is a tool that can generate test inputs for .NET code, many other research and industrial tools [141,297,67,66,296,142] exist that can generate test inputs in a similar way for C code, Java code and x86 code. We introduce the concepts and illustrate the techniques with some examples. We assume deterministic, single-threaded applications. An earlier and shorter version of this tutorial on Parameterized Unit Testing with Pex can be found in [91]. More documentation can be found on the Pex website [276].
2 Unit Testing Today
A unit test is a self-contained program that checks an aspect of the implementation under test. A unit is the smallest testable part of the program. One can partition each unit test into three parts: (1) exemplary data, (2) a method sequence and (3) assertions.
– exemplary data can be considered as the test input that is passed to the methods as argument values,
– method sequence: based on the data, the developer builds a scenario that usually involves several method calls to the code-under-test,
– assertions encode the test oracle of a unit test. The test fails if any assertion fails or an exception is thrown but not caught.
Many unit test frameworks have special support for expected exceptions, which can often be annotated with custom attributes.
The program in Listing 5.1 is an example of a unit test that checks the interplay among some operations of .NET’s ArrayList class. The example is written in C#, omitting the class context, as we will do often for brevity.
public void AddTest()
{
    // exemplary data
    int capacity = 1;
    object element = null;
    // method sequence
    ArrayList list = new ArrayList(capacity);
    list.Add(element);
    // assertions
    Assert.IsTrue(list[0] == element);
}

Listing 5.1. A unit test for the ArrayList class
The AddTest method first sets up its state, by picking the values 1 and null for capacity and value, respectively. Then, the test method performs a sequence of method calls, starting by creating an instance of the ArrayList class with the selected capacity. An array list is a container whose size may change dynamically. Internally, it uses a fixed-length array as backing storage. The array list’s capacity is the allocated length of its current backing storage. Next, the test adds the element to the array list. Finally, the test checks an assertion, making sure that the array list at position 0 contains element. This is the test oracle. There is often more than one way to partition a unit test into the three different parts (exemplary data, method sequence, assertions). For example, Listing 5.2 is a very similar test, in which the input data consists of more complex objects, including an object for the element, and the initial instance of the array list itself. In the unit testing frameworks NUnit [259,322] and Visual Studio Unit Test in Visual Studio 2008 Professional [249], a parameterless method such as AddTest is decorated with a custom attribute like [TestMethod] to designate it as a unit test. The class in which unit tests are defined is decorated with an attribute like [TestClass]. Usually, each unit test explores a particular aspect of the behavior of the class-under-test. 2.1 Benefits of Unit Testing Software developers (and sometimes testers) write unit tests for different purposes. – Design and specification: developers translate their understanding of the specification into unit tests and/or code. Developers following the test-driven development write the unit tests before code, and therefore use the unit tests to drive the design. Nonetheless, unit tests may be written in all phases of the software development process. Also, developers capture exemplary customer scenarios as unit tests.
public void AddTest2()
{
    // exemplary data
    object element = new object();
    ArrayList list = new ArrayList(1);
    // method sequence
    list.Add(element);
    // assertions
    Assert.IsTrue(list[0] == element);
}

Listing 5.2. A typical unit test
– Code coverage and regression testing: developers or testers may write unit tests to increase their confidence in the correctness of code that they have already written. It is well known that a test suite which achieves high code coverage and checks many assertions is a good indicator of code quality. In this way, unit tests represent a safety net that developers can use when refactoring the code. Unit tests are usually small tests that run fast and give quick feedback on the effects of code changes. Additionally, several tools exist to automatically execute a suite of unit tests on each code change that is committed to the source code repository.
– Short feedback loop: as mentioned above, unit tests are usually written by the developers themselves before or after writing the product code. When a unit test fails and exposes a bug, the feedback loop to get the bug fixed is very short.
– Documentation: the resulting unit tests are commonly considered as documentation of correct program behavior.
Therefore, writing unit tests is increasingly becoming an essential part of software development processes.
2.2 A Critique of Unit Testing
Unit testing faces several challenges:
– Quality of unit tests: the quality of the unit tests is mostly dependent on the time the developer is willing to invest in them.
– Amount of unit tests: writing more unit tests does not necessarily increase the code coverage. Therefore, the size of the test suite is not an indicator of the code quality.
– New code with old tests: while the unit tests are actively edited when the developer is implementing the code, such tests are usually not updated later on (besides syntactic corrections when APIs are refactored). Therefore, if the developer changes the code implementation, for example by adding more special cases in the code, but does not update the unit tests, he might introduce a number of new untested behaviors.
– Hidden integration test: ideally, a unit test should test the code in isolation. This means that all environment dependencies, e.g., database or file I/O, must be hidden behind an abstraction layer. During testing, the abstraction layer provides a fake implementation, also referred to as a mock. In practice, it is very easy to leak such dependencies.
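The last point can be made concrete with a small sketch of such an abstraction layer in C#; the sketch is ours and all names in it are illustrative, not taken from the tutorial.

// Hiding an environment dependency (file I/O) behind an interface so that a unit
// test can substitute a fake implementation.
public interface IReportStore
{
    void Save(string name, string content);
}

// Production implementation: talks to the real file system.
public class FileReportStore : IReportStore
{
    public void Save(string name, string content)
    {
        System.IO.File.WriteAllText(name, content);
    }
}

// Fake used by unit tests: records what would have been written.
public class InMemoryReportStore : IReportStore
{
    public readonly System.Collections.Generic.Dictionary<string, string> Saved =
        new System.Collections.Generic.Dictionary<string, string>();

    public void Save(string name, string content)
    {
        Saved[name] = content;
    }
}

A unit test hands an InMemoryReportStore to the code under test and asserts on its Saved dictionary, so no file is ever touched; if the code under test called File.WriteAllText directly, the dependency would leak and the unit test would silently become an integration test.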
Despite these potential difficulties, continuous research efforts are addressing such issues.
2.3 Measurement of Test Quality: Code Coverage and Assertions
What is a good test suite? When do you have enough unit tests to ensure a minimum level of quality? Those are hard questions that developers face. Our experience within Microsoft and from the industry indicates that a test suite with high code coverage and high assertion density is a good indicator for code quality. Code coverage alone is generally not enough to ensure a good quality of unit tests and should be used with care. The lack of code coverage, on the contrary, clearly indicates a risk, as many behaviors are untested. A statement is covered when at least one unit test executes this statement. The code coverage is then usually computed by executing the entire unit test suite and computing the ratio of covered statements. Different notions of code coverage exist [32], including the following:
– Basic Block Coverage: This coverage is based on the basic-block representation of the program's control flow graph. A "basic block" is a sequence of instructions, here the MSIL instructions of .NET, that has one entry point, one exit point, and no branches within the block. It is commonly used in the industry.
– Branch Coverage: This coverage is computed by analyzing the coverage of explicit arcs. An arc is a control transfer from one basic block to another in the program control flow graph.
– Implicit Branch Coverage: This is an extension of the arc coverage where all explicit and implicit arcs are considered. Implicit arcs occur for the exceptional behavior of instructions; for example, when accessing a field, the arc that throws a null-dereference exception is implicit.
Pex employs these notions of coverage, all of which rely on the control flow of the program.
2.4 Unit Testing in .NET
This section gives a quick introduction to unit testing in .NET. If you are familiar with the techniques and tools, you might as well skip it.
Unit test frameworks. Several frameworks exist in .NET to help developers effectively author and run unit tests. Although each framework has its own particularities, they all provide a core set of services:
– a custom-attribute-based system for tagging methods as unit tests,
– automatic detection and execution of such unit tests,
– a runner with reporting capabilities. The runner might be a simple console application or an integrated GUI.
In this document, we will use the Visual Studio Unit Test framework that comes with Visual Studio.
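Before turning to the exercises, a small example may help make the coverage notions from Section 2.3 concrete; the method below is ours, invented only for illustration.

public static int SafeLength(string s, bool trim)
{
    if (trim)          // explicit arcs: the true and the false outcome of this branch
        s = s.Trim();  // implicit arc: s.Trim() throws NullReferenceException if s is null
    return s.Length;   // implicit arc: s.Length throws NullReferenceException if s is null
}

A single call such as SafeLength("a", true) already executes every basic block. Branch coverage additionally requires a call with trim == false, so that the arc skipping the Trim call is taken. Implicit branch coverage also counts the exceptional arcs, which are only exercised by calls in which s is null.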
2.5 Exercises Exercise 1. Getting started with Visual Studio 2008 Professional (or better) In this exercise, we go through the steps to create a new test project, author unit tests and run them. This section is targeted to users who are new to Visual Studio Unit Test projects in Visual Studio. Part 1: Creating a New Test Project 1. Go to File|New|Project....
2. On the left pane, select Visual C#|Test, then select the Test Project item. Select a location for the new project and click Ok.
The content of the window shown above might vary depending on your Visual Studio installation. 3. Delete the sample files that were generated by the project wizard (AuthoringTests.txt, ManualTest1.mht, UnitTest1.cs). (You can right-click a file name, and select Delete in the context menu.)
Part 2: Creating a Passing Unit Test 1. Right-click on the project node and select Add|New Test.
2. In the main pane, select the Unit Test item, update the test name to HelloWorldTest.cs and hit Ok.
3. Open HelloWorldTest.cs and clean the generated code to have an empty class definition. This is a test class, tagged with a [TestClass] attribute. Such a test class is often called a test fixture.

using System;
using System.Text;
using System.Collections.Generic;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace TestProject1
{
    [TestClass] // this class contains unit tests
    public class HelloWorldTest {}
}
4. We start by adding a test that will pass. Add a new public instance method to the HelloWorldTest class, tagged with the TestMethodAttribute, that writes "Hello World" to the console:
[TestMethod] // this is a test
public void PassingTest()
{
    Console.WriteLine("hello world");
}
5. In the Test View window (Test|Windows|Test View), select the PassingTest test, then click on Run Selected icon (upper left).
6. The test result window displays the status of the current run. Each selected test is represented as a row in the report. In this case, PassingTest succeeded and you can review the details of that particular test by double-clicking on the row.
7. The test details view gives various metrics about the test run (including its duration) as well as the console output.
Part 3: Creating a Failing Unit Test
1. In the following, we add a test that fails. Add a new public instance method to the HelloWorldTest class that throws an exception.

[TestMethod]
public void FailingTest()
{
    throw new InvalidOperationException("boom");
}
Open the Test View and execute both tests,
2. The test result window now contains two test results; one for PassingTest and one for FailingTest.
3. Go to the FailingTest method and hit F9 to set a debugger breakpoint.
4. To start debugging the failing tests, go to the test result view and click on the Debug original tests menu item
5. The debugger will automatically stop on the breakpoint set previously. If you are not familiar with the Visual Studio debugger, this is a good time to get some experience. The yellow line shows the statement that will be executed next.
Part 4: Creating a Negative Unit Test
1. Visual Studio Unit Test supports a special attribute, ExpectedExceptionAttribute, that specifies that the test must throw an exception of a particular type. You can use it to write unit tests that check that parameter validation code works properly.

[TestMethod]
[ExpectedException(typeof(InvalidOperationException))]
public void ExpectedExceptionTest()
{
    throw new InvalidOperationException("boom");
}
After running the three tests, you can see that the ExpectedExceptionTest test was marked as a passing test since it threw the expected exception,
Part 5: Enabling Code Coverage
(This exercise requires Visual Studio 2008 Test Edition, or better. Visual Studio 2008 Professional is not sufficient.)
1. Visual Studio Unit Test in Visual Studio comes with built-in code coverage support. To enable this feature, go to Test|Edit Test Run Configurations|Local Test Run.
2. On the left pane, select Code Coverage then select the TestProject1.dll assembly to be instrumented and click Close. Execute PassingTest and FailingTest, leaving aside ExpectedExceptionTest.
3. In the Test Results window, click on the code coverage icon (last icon on the right).
4. In the Code Coverage Results window, enable source code coloring.
5. Covered code is colored in light blue, while uncovered code is colored in red. In this example, we did not run the ExpectedExceptionTest. That is why this method is colored in red.
To summarize, in this exercise, we learned
– how to create a new Visual Studio Unit Test test project in Visual Studio,
– how to author, execute and debug unit tests,
– how to enable code coverage and analyze the results.
Exercise 2. Unit testing the Luhn Algorithm using Test Driven Development
In this exercise, we will implement the Luhn validation algorithm using a test driven development (TDD) approach [30]. The TDD cycle consists of the following short steps:
1. Add a test.
2. Run it and watch it fail.
3. Change the code as little as possible such that the test should pass.
4. Run the test again and see it succeed.
5. Refactor the code if needed.
We apply the TDD steps in the following.
Part 1: Credit Card Number Validation Specification
Most credit card companies use a "check digit" encoding scheme [344]. A check digit is added to the original credit card number, at the beginning or the end, and is used to validate the authenticity of the number. The most popular encoding algorithm is the Luhn algorithm [345], which can be computed by the following steps:
1. Double the value of alternate digits of the primary account number, beginning with the second digit from the right (the first right-hand digit is the check digit).
2. Add the individual digits comprising the products obtained in Step 1 to each of the unaffected digits in the original number.
3. The total obtained in Step 2 must be a number ending in zero (30, 40, 50, and so on) for the account number to be validated.
Now that we have a specification for the algorithm, we can start working on the implementation.
Part 2: Add a Failing Test
1. Right-click on the solution node and select Add|New Project.
2. On the left pane, select Visual C#|Windows then select the Class Library item. Change the project name to Creditar.
3. In the test project, right-click on the References node and select Add References.
4. Select the Projects tab and double-click the Creditar project row to add it as a reference.
5. We start by writing a first unit test for the Validate method, before writing or declaring the Validate method itself. Add a first unit test that verifies that the Validate method throws ArgumentNullException when it receives a null reference.

[TestMethod]
[ExpectedException(typeof(ArgumentNullException))]
public void NullNumberThrowsArgumentNullException()
{
    LuhnAlgorithm.Validate(null);
}
6. Right-click on the Creditar project node and select Add|Class.
7. Add a minimal implementation of the Validate method such that the projects will compile.

public static class LuhnAlgorithm
{
    public static bool Validate(string number)
    {
        return false;
    }
}
8. Execute the unit test and make sure that it fails.
Part 3: Run the Unit Test and Watch It Pass
1. Make a minimal change to the Validate implementation such that the unit test will pass.

public static class LuhnAlgorithm
{
    public static bool Validate(string number)
    {
        if (number == null)
            throw new ArgumentNullException("number");
        return false;
    }
}
2. Repeat the steps above to ensure that when a non-digit character is passed to the Validate method, the implementation throws an ArgumentException.

[TestMethod]
[ExpectedException(typeof(ArgumentException))]
public void AThrowsArgumentException()
{
    LuhnAlgorithm.Validate("a");
}
Interestingly, the minimum change to get this test to pass is not really what one would call a "correct" implementation, or even a useful implementation.

public static class LuhnAlgorithm
{
    public static bool Validate(string number)
    {
        if (number == null)
            throw new ArgumentNullException("number");
        if (number == "a")
            throw new ArgumentException("number");
        return false;
    }
}
To write such an implementation really means to follow the incremental idea of the test-driven development methodology. We could continue writing more unit tests that will fail, in order to refine our implementation.
Part 4: Continue the iteration
1. Now that we have a passing test suite, we can refactor the code into a smarter implementation that checks for any non-digit character:

public static class LuhnAlgorithm
{
    public static bool Validate(string number)
    {
        if (number == null)
            throw new ArgumentNullException("number");
        foreach (var c in number)
            if (!Char.IsDigit(c))
                throw new ArgumentException("number");
        return false;
    }
}
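Continuing the iteration, a possible next failing test (hypothetical, not prescribed by the tutorial) asserts that a known-valid Luhn number is accepted, which forces the implementation to finally compute the check sum:

[TestMethod]
public void KnownValidNumberIsAccepted()
{
    // 49927398716 is a commonly used valid Luhn test number
    Assert.IsTrue(LuhnAlgorithm.Validate("49927398716"));
}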
The rest of the unit testing of the Validate method is left as an exercise. Do not forget to use the code coverage view to ensure that the unit tests reach a minimum level of basic block code coverage. This usually means that executing the unit test suite yields a certain percentage of basic block coverage. In the case of this exercise, 80% is a reasonable goal.
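For reference, one possible shape of a completed implementation is sketched below. It follows the three steps of the specification in Part 1, but it is only a sketch under our own assumptions (for example, how the empty string is treated is a design decision the exercise leaves open), not necessarily the intended solution.

public static bool Validate(string number)
{
    if (number == null)
        throw new ArgumentNullException("number");
    int sum = 0;
    bool doubleIt = false;          // the rightmost digit (the check digit) is not doubled
    for (int i = number.Length - 1; i >= 0; i--)
    {
        char c = number[i];
        if (!Char.IsDigit(c))
            throw new ArgumentException("number");
        int digit = c - '0';
        if (doubleIt)
        {
            digit *= 2;
            if (digit > 9)
                digit -= 9;         // same as adding the two digits of the product
        }
        sum += digit;
        doubleIt = !doubleIt;
    }
    return number.Length > 0 && sum % 10 == 0;
}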
Tip: Where to get valid credit card numbers?
To help you validate your implementation, you can use this number generator [345]:

public static int[] CreateNumber(int length)
{
    Random random = new Random();
    int[] digits = new int[length];
    // Set all but the last digit to a random number;
    // the last digit remains zero
    for (int i = 0; i < length - 1; i++)
    {
        digits[i] = random.Next(10);
    }
    int sum = 0;
    bool alt = true;
    for (int i = length - 2; i >= 0; i--)
    {
        if (alt)
        {
            int temp = digits[i];
            temp *= 2;
            if (temp > 9)
            {
                temp -= 9;
            }
            sum += temp;
        }
        else
        {
            sum += digits[i];
        }
        alt = !alt;
    }
    int modulo = sum % 10;
    if (modulo > 0)
    {
        digits[length - 1] = 10 - modulo;
    }
    return digits;
}
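A hypothetical way to combine the generator with the exercise: every number it produces should pass the Luhn check, so a test can tie the two together.

int[] digits = CreateNumber(16);
string number = string.Join("", Array.ConvertAll(digits, d => d.ToString()));
Assert.IsTrue(LuhnAlgorithm.Validate(number));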
3 Parameterized Unit Testing The unit test in Listing 5.2 specifies the behavior of the array list by example. Strictly speaking, this unit test only says that by adding a new object to an empty array list, this object becomes the first element of the list. What about other array lists and other objects?
public void AddSpec2(
    // parameters
    ArrayList list, object element)
{
    // assumptions
    PexAssume.IsTrue(list != null);
    // method sequence
    int len = list.Count;
    list.Add(element);
    // assertions
    PexAssert.IsTrue(list[len] == element);
}

Listing 5.3. A parameterized unit test for the ArrayList class
Traditional unit tests do not take inputs. A straightforward extension is to allow parameters. The result is a Parameterized Unit Test (PUT), which one can partition into four parts: (1) parameters, (2) assumptions, (3) a method sequence and (4) assertions.
– parameters represent the test input which is later passed on to other methods as their argument values,
– assumptions over the parameters can be used to shape legal test inputs,
– the method sequence specifies a scenario (as before), and
– assertions encode the test oracle of a unit test (as before).
Listing 5.3 is a parameterized version of the array list unit test that describes the normal behavior of the Add method with respect to two observers, the property Count and the indexing operator []. Under the condition that a given array list is not null, this PUT asserts that after adding an element to the list, the element is indeed present at the end of the list. This test is more general than the original test. PUTs like this one can be called with various input values, perhaps drawn from an attached database. Unit testing frameworks that support PUTs sometimes refer to them as data-driven tests (for example in [2]). Instead of stating the exemplary data values explicitly, a PUT may state assumptions about what valid input data must look like. Here, we assume that the list is not null. PUTs are more general specifications than traditional unit tests: PUTs state the intended program behavior for entire classes of program inputs, and not just for one exemplary input. And yet PUTs are still easy to write since they merely state what the system is supposed to do, and not how to accomplish the goal. Unlike many other forms of specification documents, PUTs are written on the level of the actual software APIs, in the programming language of the software project. This allows PUTs to evolve naturally with the code against which they are written.
3.1 Separation of Concerns
Splitting the specification and test cases by parameterized unit testing is a separation of concerns:
public class ArrayList ... {
    private Object[] _items = null;
    private int _size, _version;
    ...
    public virtual int Add(Object value)
    {
        if (_size == _items.Length)
            EnsureCapacity(_size + 1);
        _items[_size] = value;
        _version++;
        return _size++;
    }
}

Listing 5.4. ArrayList implementation in .NET
– Firstly, we specify the intended external behavior of the software as PUTs; only human beings can perform this specification task. – Secondly, a tool like Pex can automatically create a test suite with high code coverage by determining test inputs which exercise different execution paths of the implementation. 3.2 Coverage through Test Input Generation Adding parameters to a unit test improves its expressiveness as a specification of intended behavior, but we lose concrete test cases. We can no longer execute a parameterized test by itself. We need actual parameters. But which values must be provided to ensure sufficient and comprehensive testing? Which values can be chosen at all? Consider the code of Listing 5.4 that implements Add and the indexing operator in the .NET base class library. There are two cases of interest. One occurs when adding an element to an array list that already has enough room for the new element (when the array list’s capacity is greater than the current number of elements in the array list). The other occurs when the internal capacity of the array list must be increased before adding the element. We can assume that the library methods invoked by the ArrayList implementation are themselves correctly implemented (EnsureCapacity guarantees that the items array is resized so its length is greater or equal size + 1), and we do not consider possible integer overflows. Then we only need to run two test cases to check that the assertion embedded in AddSpec2 holds for all array lists and all objects given the existing .NET implementation. Two test cases are needed as there are only two execution paths through the Add method shown above; accordingly, all inputs can be partitioned into two equivalence classes: one where size == items.Length holds, and one where it does not hold. Each of the two test cases below is a representative of one of the two equivalence classes.
[TestMethod]
public void TestAddNoOverflow()
{
    AddSpec2(new ArrayList(1), new object());
}

[TestMethod]
public void TestAddWithOverflow()
{
    AddSpec2(new ArrayList(0), new object());
}
No other inputs are needed to test all behaviors of Add, since any other input will execute exactly the same paths as the two inputs mentioned above.
3.3 Theory of Parameterized Unit Tests
By adding parameters we turn a closed unit test into a universally quantified conditional axiom that must hold for all inputs under specified assumptions. Intuitively, the AddSpec2(...) method asserts that for all array lists a and all objects o, the following holds:

∀ ArrayList a, object o. (a ≠ null) → let i = a.Count in (a.Add(o) ⨟ a[i] == o)

where "⨟" represents sequential composition from left to right: (f ⨟ g)(x) = g(f(x)).¹ See [314] for more background information on PUTs, and [38] for an overview of the theory and practice of algebraic specifications.
3.4 Patterns for Parameterized Unit Testing
As PUTs are really just a way to write algebraic specifications as code, many standard patterns for algebraic specifications can be applied in the context of parameterized unit testing. We have collected a set of such patterns in the context of Pex, which can be found in the documentation section of the Pex website [276].
3.5 Test Driven Development by Parameterized Unit Testing
Test Driven Development [30] (TDD) is the activity of programming where all written code is preceded by writing tests which specify the intended functionality of the code. In TDD, the main purpose of writing tests is to drive the design of the API of the code. It is simply a side effect that the result is a test suite that can serve as documentation and specification of the intended behavior of the code. It is straightforward to extend TDD to PUTs. In fact, it is usually more expressive to state the intended properties of an API with PUTs instead of closed unit tests with exemplary data.
¹ The axiom becomes more complicated when we specify side effects of sequential code precisely. We explain later how to model references to objects on the heap, see Section 7.3.
When writing PUTs, the TDD process is as follows:
1. The developer writes or changes a parameterized unit test (PUT), or code that implements the behavior described by already existing PUTs.
2. The developer runs Pex on the PUT.
3. If Pex finds errors, the developer goes back to step 1, in order to change the PUT or fix the code, possibly with Pex's "Fix-It" feature.
4. The developer keeps the generated tests for future regression testing.
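As a hypothetical illustration of step 1, here is what a PUT for the Luhn exercise from Section 2.5 might look like. It states, for all digit sequences, that appending the computed check digit yields a number that Validate accepts; the method name and the exact assumptions are ours, not prescribed by the tutorial.

[PexMethod]
public void AppendingComputedCheckDigitValidates(int[] digits)
{
    PexAssume.IsTrue(digits != null && digits.Length > 0);
    for (int i = 0; i < digits.Length; i++)
        PexAssume.IsTrue(digits[i] >= 0 && digits[i] <= 9);

    // compute the check digit exactly as the specification prescribes
    int sum = 0;
    bool alt = true;
    for (int i = digits.Length - 1; i >= 0; i--)
    {
        int d = digits[i];
        if (alt) { d *= 2; if (d > 9) d -= 9; }
        sum += d;
        alt = !alt;
    }
    int check = (10 - sum % 10) % 10;
    string number = string.Join("", Array.ConvertAll(digits, d => d.ToString())) + check;

    PexAssert.IsTrue(LuhnAlgorithm.Validate(number));
}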
4 Selecting Test Inputs for Parameterized Unit Tests Classic unit tests are methods without parameters, parameterized unit tests are methods with parameters. In fact, parameterized unit tests have been around for a while now. They were already supported by Visual Studio Unit Test in Visual Studio (referred to as data-driven tests), and by MbUnit (referred to as row tests) starting from version one, and they were recently added to JUnit [186] (referred to as theories). Yet, it used to be the case that the user had to provide the input parameters for those tests, as ranges, spreadsheet or database of some sort. Improper choice of inputs would lead to missed corner cases or a hugely redundant test suite. With Pex, things change for the better: the user does not have to provide any input to the parameterized unit tests. By analyzing the program behavior at runtime, Pex can generate inputs that matter, in the sense that those inputs will increase the coverage of the test suite (advanced readers can refer to Section 7 for further details). 4.1 Automated Exploratory Testing Selecting meaningful test inputs requires a certain understanding of the code under test, which in turn requires an understanding of what the relevant parts of the (potentially huge) code under test are. Exploratory Testing [189,17] (ET) is an incremental process during which the tester learns more and more about the actual behavior of the code. Another characterization of ET is that it is test design and test execution at the same time. Together with experience and creativity the tester can craft more and better tests. Pex uses dynamic symbolic execution [141], a technique that works in a way similar to ET. The tool executes the code multiple times and learns about the program behavior by monitoring the control and data flow. After each run, it picks a branch that was not covered previously, builds a constraint system (a predicate over the test inputs) to reach that branch, then uses a constraint solver to determine new test inputs, if any. The test is executed again with the new inputs, and this process repeats. On each run, Pex might discover new code and dig deeper into the implementation. In this way, Pex explores the behavior of the code. We also refer to this process as Automated Exploratory Testing. For the advanced reader, Section 7 discusses dynamic symbolic execution. Exploratory testing example. We will apply exploratory testing to test a simple method (see Listing 5.5) that takes two integers as an input and prints different strings to the console based on those values. We will manually go through all the steps involved in the analysis. (Pex would perform a similar analysis, only fully automatically.)
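Before stepping through the example by hand, the loop just described can be sketched schematically. The following is our own illustration, not Pex's implementation; the two delegates stand in for the instrumented execution and the constraint solver, and constraints are represented as plain strings for simplicity.

using System.Collections.Generic;
using System.Linq;

// Runs the code under test on the given inputs and returns the branch conditions observed.
public delegate List<string> RunAndTrace(int[] inputs);
// Asks the constraint solver for inputs satisfying a path condition; null if infeasible.
public delegate int[] TrySolve(IList<string> pathCondition);

public static class ExplorationSketch
{
    public static List<int[]> Explore(RunAndTrace run, TrySolve solve, int[] seed, int maxRuns)
    {
        var tests = new List<int[]> { seed };
        var pending = new Queue<List<string>>();
        var attempted = new HashSet<string>();
        pending.Enqueue(run(seed));

        while (pending.Count > 0 && tests.Count < maxRuns)
        {
            List<string> trace = pending.Dequeue();
            // For each branch taken in this run, try to cover the other outcome:
            // keep the prefix of the path condition and negate the branch condition itself.
            for (int k = trace.Count - 1; k >= 0; k--)
            {
                var target = trace.Take(k).Concat(new[] { "!(" + trace[k] + ")" }).ToList();
                string key = string.Join(" && ", target.ToArray());
                if (!attempted.Add(key))
                    continue;                 // this flipped path was already tried
                int[] inputs = solve(target);
                if (inputs == null)
                    continue;                 // no inputs reach that branch
                tests.Add(inputs);            // a new test with different behavior
                pending.Enqueue(run(inputs)); // and a new trace to learn from
            }
        }
        return tests;
    }
}

The walkthrough below performs exactly these steps manually for the Bar method of Listing 5.5.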
1  void Bar(int i, int j) {
2      if (i < 0) {
3          Console.WriteLine("line 3");
4          if (j == 123)
5              Console.WriteLine("line 5");
6      }
7      else
8          Console.WriteLine("line 8");
9  }
One way to explore the possible behaviors of this method is to throw different values at Bar and analyze the output to see what is happening.

Iteration 1: pick an arbitrary value. Let us create a unit test that does exactly that, and step into the debugger. Since we do not really know anything about the Bar method yet, we simply pick 0 for i and j.

[TestMethod]
void Zero() { Foo.Bar(0, 0); }
When we reach the statement Console.WriteLine("line 8"); on line 8, we can figure out that we took this branch because the condition i < 0 on line 2 evaluated to false. With this we let the execution continue (and the test finishes successfully). We remember the following:

– line 2, i < 0, uncovered branch

Iteration 2: flip the last condition. In the previous run, we remembered that some code was not covered on line 3. We also know that this code path was not covered because the condition i < 0 evaluated to false. At this point, we usually figure out a suitable value of i in our head to make the condition true; in this case, we need to find i such that i < 0. Let us pick −1.

[TestMethod]
void MinusOne() { Foo.Bar(-1, 0); }
We run the test under the debugger. As expected, the condition on line 2 evaluates to true, and the program takes the other branch (compared to the one taken in the previous test). The program continues and reaches line 4, where another if-statement is evaluated. The condition j == 123 evaluates to false, so we remember the following:

– line 4, j == 123, uncovered branch

The program continues to run and finishes.
Iteration 3: path condition + flipped condition. There are still uncovered branches in the method, guarded by the condition at line 4. To be able to cover this code, we need two things:
1. reach line 4: i < 0
2. make the condition on line 4 evaluate to true: j == 123

So to cover the last statement in the method, we need to find parameter values such that i < 0 ∧ j == 123. Let us pick i = −1 and j = 123.

[TestMethod]
void MinusOneAndOneTwoThree() { Foo.Bar(-1, 123); }
The test executes and prints line 5, as we wanted. At this point, we have fully covered the behavior of Bar.

4.2 Exercises

Exercise 3. Getting started with Pex in Visual Studio

Part 1: Adding Pex to a Project
1. Add a reference to the Microsoft.Pex.Framework.dll assembly to the test project. In the Add Reference dialog, select the .NET pane, then scroll down to Microsoft.Pex.Framework.
Part 2: Creating a Parameterized Unit Test
1. In the HelloWorldTest, add a new public instance method ParameterizedTest that takes an int parameter. Mark this method with the PexMethodAttribute, written as [PexMethod] in C#.

[PexMethod]
public void ParameterizedTest(int i) {
    if (i == 123)
        throw new ArgumentException("i");
}
Tip: Adding Using Clauses Automatically. Visual Studio can automatically add the missing using clause: move the editing caret onto the [PexMethod] attribute. You will see a little red rectangle at the bottom right of PexMethod. Click on it to get a context menu with two entries: the first adds a using clause to your source code, and the second embeds the fully qualified namespace into this occurrence of PexMethod. Just press Enter to insert the using clause. The keyboard shortcut for opening this context menu is Ctrl + . (period).
Tip: Using Snippets. Pex also provides snippets to create a new parameterized unit test skeleton. In the editor, type pexm and then press the Tab key to expand the snippet.
Part 3: Run the Parameterized Unit Test 1. Move the mouse cursor inside the ParameterizedTest method, right-click and select the Run Pex Exploration menu item.
2. Pex automatically displays the Pex Results window. Most of your interactions with Pex will be through this window.
3. Each row in the table corresponds to a generated test for the current exploration. Each row contains
   – an icon describing the status of the test (passing, failing),
   – a number indicating how often Pex had to execute the parameterized unit test with different input values in order to arrive at this test,
   – a summary of the exception that occurred, if any.
Pex also automatically logs the values of the input parameters of the test. Note that Pex often runs the parameterized unit test several times before it outputs the next test. The rationale behind this behavior is that Pex explores different execution paths of the program, but it only outputs a new test when this test increases the coverage (arc coverage, to be precise; see Section 7.9 for more details). Many execution paths may result in the same coverage.
4. When exploring the ParameterizedTest that we wrote earlier, Pex generates two unit tests. Each unit test can be accessed by selecting the corresponding row and clicking on the Go to generated test link.

[TestMethod]
[PexGeneratedBy(typeof(TestClass))]
public void ParameterizedTest01() {
    this.ParameterizedTest(0);
}

[TestMethod]
[PexGeneratedBy(typeof(TestClass))]
[PexRaisedException(typeof(ArgumentException))]
public void ParameterizedTest02() {
    this.ParameterizedTest(123);
}
The TestMethod attribute indicates that the generated methods are unit tests, the PexGeneratedBy attribute indicates that the test was generated by Pex by exploring the parameterized unit tests in a particular test class, and the PexRaisedException attribute indicates that this test raised an (unexpected) exception.

Exercise 4. Using Pex from the Command Line

In Section 4.2 we learned how to use Pex in the Visual Studio environment. In this exercise, we will learn how to drive Pex from the command line and how to read the HTML reports generated by Pex.

Before running Pex, you should have a .NET assembly (a .NET library or executable, ending in .dll or .exe) that contains a class annotated with the PexClassAttribute, containing a public instance method annotated with the PexMethodAttribute. We will use Listing 5.6, which corresponds to the example that we created with Visual Studio in the previous exercise. You can build it with the standalone C# compiler csc.exe, Visual C# 2005/2008 Express Edition, or any other C# development environment. In any case, you need to reference the Microsoft.Pex.Framework assembly.

Tip: Other test frameworks. Pex works best with the unit test framework of Visual Studio, but Pex can also generate tests for other unit test frameworks. Pex detects the intended unit test framework by inspecting the referenced assemblies in your test project.
We run Pex from the command line as follows.
1. Open a command prompt window.
2. Go to the build folder of the test project.
using System;
using Microsoft.Pex.Framework;

[PexClass]
public partial class TestClass {
    [PexMethod]
    public void ParameterizedTest(int i) {
        if (i == 123)
            throw new ArgumentException("i");
    }
}

Listing 5.6. Self-contained parameterized unit test code
Tip: Getting the path of a document in Visual Studio. One can get the full path of an open document in Visual Studio by right-clicking on its tab and selecting Copy Full Path.
3. Apply the Pex command line program, pex.exe, to the test project assembly. This will run all the parameterized unit tests in that assembly (you may need to change the name of the .dll file as appropriate):

pex bin\Debug\TestProject1.dll
Pex runs and logs its activity to the console. At the end of the run, it automatically opens a detailed HTML report.

Part 1: Command Line Output

The console output contains different aspects. The precise output may vary, depending on your version of Pex and .NET.

1. Pex inspects the test project and launches a second process. (In this process, Pex acts as a profiler in order to instrument code, so that Pex can monitor a taken execution path precisely.)

   Microsoft Pex v0.18 -- http://research.microsoft.com/pex
   Copyright (c) Microsoft Corporation 2007-2009. All rights reserved.

   instrumenting...
2. At this point, the instrumented process is running. Pex is loading the parameterized tests. Any issue in the test metadata will be logged during this phase.

   launched Pex x86 Edition on .NET v2.0.50727
   [reports] report path: reports\TestProject1.71115.170912.pex
   00:00:00.0> starting execution
   00:00:00.1> reflecting tests
3. In the next phase, Pex explores all parameterized unit tests.

   00:00:02.8> TestProject1
   00:00:02.8> TestProject1.TestClass
   00:00:12.3> TestClass.ParameterizedTest(Int32)
4. For each new test, Pex writes a line that describes the test; in the output below, the test ParameterizedTest01 was generated on the second run and raised an exception.

   [test] (run 1) ParameterizedTest00 (new)
   [coverage] coverage increased from 0 to 2 blocks
   [test] (run 2) ParameterizedTest01 (new), Exception: Exception of type 'System.Exception' was thrown.
   [coverage] coverage increased from 2 to 4 blocks
Tip: How does Pex name generated tests? The generated test names are created from the name of the exploration method plus a sequence number.
5. At the end of the exploration, Pex also gives a basic block coverage summary.

   [dynamic coverage] 4/4 block (100.00 percent)
6. Once all parameterized unit tests have been processed, Pex displays some statistics and then renders the HTML report. In this run, two tests were generated, one of which was considered a failing test:

   00:00:13.2> [finished] -- 2 generated tests, 1 failing, 2 new

   No errors or warnings were logged by the Pex infrastructure:

   -- 0 critical errors, 0 errors, 0 warnings
Every time you run Pex, it generates a new folder in the reports directory to hold all the generated files (and by default, the folders get recycled after a while; in other words, the generated reports have only a limited lifetime, and you should copy a report that you want to preserve). The generated folder name is constructed from the test assembly name and a time stamp. Tip: Where are the generated tests? Pex creates a subdirectory, here it would be reports\TestProject1.71115.170912.pex\tests, which contains the generated tests as code.
   [reports] generating reports...
   [reports] html report:
     reports\TestProject1.71115.170912.pex\pex.html

7. Finally, Pex displays the overall verdict, success or failure.

   EXPLORATION SUCCESS
Part 2: Command Line Filters

The command line supports a rich set of filters to select which explorations to run: by namespace, by type name, by method name, by category, and so on.
– by namespace, pex /nf:TestProject ...
– by type name, pex /tf:HelloWorld ...
– by method name, pex /mf:Add ...
– by test suite, pex /sf:checkin ...
– the explorations that had errors in the last runs, pex /lef ...
– of course, you can combine all those filters together, pex /nf:TestProject /tf:HelloWorld /mf:Add ...
By default, a partial case-insensitive string match is used. To have a precise match, add a bang (!) at the end of the filter: pex /tf:HelloWorldTest!
Part 3: Using the Pex HTML Report

The HTML report gives a tabular view of each assembly/fixture/exploration. Overall statistics, including the number of fixtures and generated tests, are highlighted in the title.
1. Each exploration has a little menu to display associated data.
Tip: Do I have access to the HTML reports in Visual Studio? To enable HTML report generation, go to Tools | Options, locate the Pex - General category, and enable the Reports option.
You need to run Pex again to actually generate a report. Afterwards, to open the HTML report, click on the Open Report link in the yellow bar.
2. The parameter values item displays a table where each row contains the values of the parameters for a particular test. This table is very similar to the Pex Results view in Visual Studio.
3. The log item opens a detailed activity log of the exploration process. The log is a very important tool to understand in detail which events Pex encountered, and you will get familiar with it soon.
4. The log of this test starts with a repro command that can be used to execute that particular exploration from the command line. Then, the generated tests are logged.
5. The coverage link opens the code coverage report document.
6. You can browse through the coverage data by assemblies, fixtures, and methods.
7. You can also view the coverage by source files.
8. Each source file is colored according to the legend found at the top of the page.
   – user code under test (covered): covered statements of the code-under-test.
   – user code under test (uncovered): statements of the code-under-test that were not covered.
   – user code or test (covered): statements that were covered, but that were not part of the code-under-test.
   – user code or test (uncovered): statements that were neither covered nor under test.
   – tagged: statements which were covered, and which were tagged with some interesting information during the exploration process. Hover with the mouse over a tagged statement to see more information.
We will later discuss how to configure which parts of the code are under test.
9. In this particular case, the ParameterizedTest method was fully covered.
Exercise 5. Parameterized Unit Testing for the Luhn Algorithm

We revisit the Luhn algorithm that we tested in Section 2.5 and write parameterized unit tests for it.
Part 1: Setting Up the Code Instrumentation
1. Add a parameterized unit test that specifies that a valid credit card number, according to the Luhn algorithm, does not contain any non-digit characters.

[PexMethod]
public void ContainsNumbers(string number) {
    PexAssume.IsTrue(LuhnAlgorithm.Validate(number));
    PexAssert.TrueForAll(number, delegate(char c) {
        return char.IsDigit(c);
    });
}
2. Run the test and analyze the results. As you may see, Pex only generated a single (trivial) test, with a null reference as the number string.
3. The problem is that Pex did not analyze the Validate method. (Pex can only generate relevant test inputs for those portions of the code that are instrumented and can be monitored; since instrumentation is quite expensive, Pex does not instrument all code by default.) You need to tell Pex which assemblies or types should be instrumented. This is a process that is usually done once. In our case, the product assembly is not instrumented, so Pex could not build the constraints needed to explore the behavior of the Validate method, which eventually leads to poor code coverage of the generated test suite.
4. Pex shows important issues, such as calls to methods that were not instrumented, in a red bar.
5. Click on the link Uninstrumented Methods to see the list of relevant methods that were not instrumented. When you click on a method in the list, Pex offers several actions that can be taken to get rid of the warning. Since we want to test the Validate method, click on the Instrument assembly link.
6. This action will add a custom attribute in the test project that tells Pex that the product assembly should be instrumented:
[assembly: PexInstrumentAssembly("Creditar")]
Pex adds a special file, PexAssemblyInfo.cs, to hold the assembly-level attributes. (Pex persists all project-specific settings as attributes.)
7. Run Pex again and analyze the results. The coverage should be much better.

Part 2: Binding the Tests and the Code Under Test

You should tell Pex what the code under test is, so that Pex can focus its analysis on that code and show a relevant dynamic coverage graph. Pex provides several ways to specify this information.
1. The PexClassAttribute has a special constructor that takes a type. This constructor can be used to specify the type under test of a particular test fixture. This information is used by Pex to prioritize the exploration. Update the LuhnAlgorithmTest class to specify that it is testing the LuhnAlgorithm class.

[TestClass, PexClass(typeof(LuhnAlgorithm))]
public partial class LuhnAlgorithmTest {
2. Because we know that the project Creditar.Tests is the test project for Creditar, we can add a custom attribute that provides this information to Pex.

[assembly: PexAssemblyUnderTest("Creditar")]
This is an assembly-level attribute; it must be placed before the namespace declaration in C#.

Part 3: Using the Wizard to Get Started

Pex provides a code generation wizard that can automatically produce parameterized test stubs for the entire public API of a class. These stubs can be used as a starting point for writing more elaborate scenarios.
1. Right-click inside the LuhnAlgorithm class and select Pex | Create Parameterized Unit Test Stubs.
2. The Pex wizard will compile and analyze the LuhnAlgorithm class to produce the test stubs. The generated files will automatically be added to the test project, using the information provided by the PexAssemblyUnderTestAttribute.
Exercise 6. Instrumentation Configuration

Pex can only generate a test suite with high code coverage if it monitors the relevant parts of the code. Therefore, it is most important to configure correctly which types Pex should instrument. Consider the following parameterized unit test.

[PexMethod(MaxBranches = 2000)]
public void Test(string s) {
    DateTime dt = DateTime.Parse(s);
    PexObserve.ValueForViewing("dt", dt);
}
The MaxBranches setting makes sure that Pex does not stop too early; we will explain these bounds later (see Section 7.10). When Pex generates tests, it will only generate a single test at first. However, we do get a warning that some methods were not instrumented.
When you click on Uninstrumented Methods, you see in the log view the list of uninstrumented methods.
You can select one of them and click on the link “Instrument type” to tell Pex that it should instrument this type in the future. (If you know that the code is irrelevant to the code that we want to cover, you can also click on “Ignore uninstrumented method”.) Pex will insert custom attributes such as the following for you.

using Microsoft.Pex.Framework.Instrumentation;

[assembly: PexInstrumentType("mscorlib", "System.DateTimeParse")]
[assembly: PexInstrumentType("mscorlib", "System.__DTString")]
After you instruct Pex to instrument a type, you have to re-run Pex to see the effect of the added instrumentation. In turn, you might get more uninstrumented-method warnings. Can you determine the set of types that Pex must instrument in order to generate a valid DateTime string? (It is usually easier when you test your own code, whose structure you know.) Hint: work in a loop: add one type whose name contains the word DateTime to the list of instrumented types, and then re-run Pex to see what changed.
Tip: Coverage filtering. The type DateTime is defined in mscorlib. Pex normally does not report coverage data achieved in this basic library. We can tell Pex to include coverage data with the following assembly-level attribute:

using Microsoft.Pex.Framework.Coverage;
using Microsoft.ExtendedReflection.Coverage;

[assembly: PexCoverageFilterAssembly(
    PexCoverageDomain.UserCodeUnderTest, "mscorlib")]
Exercise 7. Regular Expressions (optional)

Try out your favorite regular expression and see whether Pex can generate an input string that matches that expression:

[PexMethod]
public void MatchRegex([PexAssumeNotNull] string value) {
    if (Regex.IsMatch(value, @"^[0-9a-fA-F]+[\r\n]*$"))
        Console.WriteLine("found it!");
}
The [PexAssumeNotNull] attribute indicates that Pex should not pass null for the value parameter. This is a simple case of an assumption on test inputs. See Section 7.7 for more details.

Exercise 8. Binary Encoding (optional)

An encoding is a reader-writer between byte arrays and char arrays. A simple example of an encoding just copies the bits of a char into two bytes, and vice versa. The .NET Encoding class is defined in the System.Text namespace and has four abstract methods that need to be overridden. Using the following PUT, write an implementation for GetChars and GetCharCount.

[PexMethod]
public void CharCountMatching([PexAssumeNotNull] Encoding encoding,
    byte[] bytes, int index, int count) {
    int charCount = encoding.GetCharCount(bytes, index, count);
    char[] chars = encoding.GetChars(bytes, index, count);
    Assert.AreEqual(charCount, chars.Length);
}
You can start from Listing 5.7, which contains a sloppy implementation of those methods.

Exercise 9. Compression and Decompression (optional)

Write a run-length encoding (RLE) compression and decompression algorithm that takes arrays of bytes as input and output. Listing 5.8 describes the signature of the RLE implementation.
public class SloppyBinaryEncoding : Encoding {
    const int BytesPerChar = 2;

    public override int GetCharCount(byte[] bytes, int index, int count) {
        return count / 2;
    }
    public override int GetChars(byte[] bytes, int byteIndex, int byteCount,
        char[] chars, int charIndex) {
        int j = charIndex;
        for (int i = 0; i < byteCount; i += 2) {
            chars[j++] = (char)(bytes[byteIndex + i] << 8
                | bytes[byteIndex + i + 1]);
        }
        return j - charIndex;
    }
    // remaining Encoding members omitted
}

Listing 5.7. A sloppy binary encoding class
public class RleCompressor {
    public static byte[] Compress(byte[] input) {
        throw new NotImplementedException();
    }
    public static byte[] Decompress(byte[] input) {
        throw new NotImplementedException();
    }
}

Listing 5.8. Signatures of a run-length-encoding compressor
The parameterized unit test should simply state the roundtrip property: compression followed by decompression should yield the original sequence of bytes. For example, you can write the following PUT:

[PexMethod]
public void Roundtrip(byte[] data) {
    PexAssume.IsNotNull(data, "data");
    byte[] compressed = RleCompressor.Compress(data);
    byte[] uncompressed = RleCompressor.Decompress(compressed);
    // assertions
    PexAssert.IsNotNull(uncompressed, "uncompressed");
    Assert.AreEqual(data.Length, uncompressed.Length);
    for (int i = 0; i < data.Length; i++)
        PexAssert.AreEqual(data[i], uncompressed[i]);
}
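If you want something concrete to compare your own solution against, the following sketch is one possible RleCompressor implementation. It assumes a naive format that stores every run as a (count, value) pair with the count limited to 255; this is enough to make the Roundtrip property above hold, but it is neither the only nor the most compact RLE scheme.

using System.Collections.Generic;

public class RleCompressor {
    public static byte[] Compress(byte[] input) {
        var output = new List<byte>();
        int i = 0;
        while (i < input.Length) {
            byte value = input[i];
            int count = 1;
            // split runs longer than 255, since the count is stored in a single byte
            while (count < 255 && i + count < input.Length && input[i + count] == value)
                count++;
            output.Add((byte)count);
            output.Add(value);
            i += count;
        }
        return output.ToArray();
    }

    public static byte[] Decompress(byte[] input) {
        var output = new List<byte>();
        // each (count, value) pair expands back into count copies of value
        for (int i = 0; i + 1 < input.Length; i += 2) {
            int count = input[i];
            byte value = input[i + 1];
            for (int k = 0; k < count; k++)
                output.Add(value);
        }
        return output.ToArray();
    }
}

Running Pex on the Roundtrip test against such an implementation should, among other inputs, exercise the empty array, single-byte arrays, and arrays with repeated values.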
Exercise 10. String Interview Question (optional)

A classic theme in programming job interviews is manipulating strings. Here is one problem that we could test with Pex: given an array of characters with spaces, replace the spaces with %20. You have extra space at the end of the array. (Use no additional memory and make it as close to O(n) as possible.)
– Write an implementation of EscapeSpaces.
– Write a PUT and use Pex to validate your implementation (one possible sketch of both follows).
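The following is only a sketch of a possible solution; the method signature is an assumption (the problem statement does not fix one), taking the logical length of the content as a second parameter, and the PUT checks a simple property, namely that the escaped region contains no spaces.

// Sketch of an in-place, O(n) implementation: count the spaces first,
// then write the expanded content backwards so that nothing is
// overwritten before it has been read.
public static void EscapeSpaces(char[] str, int trueLength) {
    int spaces = 0;
    for (int i = 0; i < trueLength; i++)
        if (str[i] == ' ')
            spaces++;

    int write = trueLength + 2 * spaces - 1;
    for (int read = trueLength - 1; read >= 0; read--) {
        if (str[read] == ' ') {
            str[write--] = '0';
            str[write--] = '2';
            str[write--] = '%';
        } else {
            str[write--] = str[read];
        }
    }
}

[PexMethod]
public void EscapedRegionHasNoSpaces(char[] str, int trueLength) {
    PexAssume.IsNotNull(str, "str");
    PexAssume.IsTrue(trueLength >= 0 && trueLength <= str.Length);
    int spaces = 0;
    for (int i = 0; i < trueLength; i++)
        if (str[i] == ' ')
            spaces++;
    // the array must provide enough extra space at the end
    PexAssume.IsTrue(trueLength + 2 * spaces <= str.Length);

    EscapeSpaces(str, trueLength);

    for (int i = 0; i < trueLength + 2 * spaces; i++)
        Assert.IsTrue(str[i] != ' ');
}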
5 Dealing with the Environment

Each unit test, whether parameterized or not, should test a single feature, so that a failing unit test identifies the broken feature as concisely as possible. Also, the fewer system components a test interacts with, the faster it will run. However, in practice it is often difficult to test features in isolation: the code may take a file name as its input and use the operating system to read in the contents of the file, or the test may need to connect to another machine to fulfill its purpose.

The first step towards making the code testable is to introduce abstraction layers. For example, the following Parse method is not testable in isolation, since it insists on interacting with the file system.

public void Parse(string fileName) {
    StreamReader reader = File.OpenText(fileName);
    string line;
    while ((line = reader.ReadLine()) != null) {
        ...
    }
}
The parser in the following code is easier to test, since the actual parsing logic can be driven from any StreamReader, not just one obtained from the file system. In particular, it is no longer necessary to go through the file system to test the main code.

public void Parse(string fileName) {
    this.Parse(File.OpenText(fileName));
}
public void Parse(StreamReader reader) {
    string line;
    while ((line = reader.ReadLine()) != null) {
        ...
    }
}
Abstraction from the environment is necessary if you want to have a Pex-testable design.
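For example, with the refactored overload above, a parameterized unit test can exercise the parsing logic entirely in memory. The following sketch assumes the two Parse methods live in a (hypothetical) Parser class; the bytes chosen by Pex are wrapped in a MemoryStream, so no file system access takes place.

using System.IO;

[PexMethod]
public void ParseAnyContent([PexAssumeNotNull] byte[] content) {
    // wrap the Pex-generated bytes in an in-memory stream instead of a file
    using (var stream = new MemoryStream(content))
    using (var reader = new StreamReader(stream)) {
        new Parser().Parse(reader); // Parser stands for the class holding Parse
    }
}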
5.1 Mocks, Stubs

When the code is written with abstraction layers, mock objects [226] can be used to substitute parts of the program that are irrelevant for a tested feature. Mock objects answer queries with fixed values, similar to those that the substituted program would have computed. Today, developers usually define the behavior of mock objects by hand. By behavior, we mean the return values of mocked methods, the exceptions they should throw, and so on.

Several frameworks [343] exist which provide stubs. Stubs are trivial implementations of all methods of an interface or a class. Stubs do not perform any actions by themselves and usually just return some default value. The behavior of the stubs must still be programmed by the developer. (A capture-replay approach is used in [292] to distill the actual behavior of an existing program into mock objects.)

5.2 Stubs Framework in Pex

Pex comes with its own Stubs framework to make writing stubs easier. This framework also makes it possible to detour legacy code that communicates with the environment and does not provide an encapsulated abstraction layer. In the remainder of this tutorial, we will not use any framework for mocks or stubs, not even the one that comes with Pex; instead, we will define everything required by hand to illustrate the general mechanisms. Tutorials and more documentation on how to leverage the stubs framework that comes with Pex can be found on the Stubs website [277].

5.3 Parameterized Mock Objects

When manually writing mock objects, some of the main questions are: What values should the mock object return? How many versions of a mock object do I need to write to test my code thoroughly?

We have seen earlier how parameterized unit tests are a way to write general tests that do not have to state particular test inputs. In a similar way, parameterized mock objects are a way to write mock objects that do not have just one particular, fixed behavior.

Consider the method AppendFormat of the StringBuilder class in the .NET base class library. Given a string with formatting instructions and a list of values to be formatted, it computes a formatted string. For example, formatting the string "Hello {0}!" with the single argument "World" yields the string "Hello World!".

public StringBuilder AppendFormat(
    IFormatProvider provider, string format, object[] args) { ... }
The first parameter of type IFormatProvider “provides a mechanism for retrieving an object to control formatting” according to the MSDN documentation:
Parameterized Unit Testing with Pex: Tutorial
177
public interface IFormatProvider {
    object GetFormat(Type fmtType);
}
A non-trivial test calling AppendFormat needs an object that implements IFormatProvider. While the Stubs framework [277] that comes with Pex can generate implementations of interfaces automatically, we illustrate in the following how to write such an implementation by hand, while leaving it to Pex how it should behave.

public class MockFormatProvider : IFormatProvider {
    public object GetFormat(Type fmtType) {
        var call = PexChoose.FromCall(this);
        return call.ChooseResult<object>();
    }
}
The mock method GetFormat obtains from the global PexChoose a handle called call that represents the current method call. The PexChoose provides the values that define the behavior of the mocked methods, for example their return values. When the test case is executed, ChooseResult will initially return some simple value, for example null for reference types. Pex' symbolic analysis will track how the value obtained from ChooseResult is used by the program, just as Pex tracks all other test inputs. Depending on the conditions that are checked on the value obtained from ChooseResult, Pex will execute the test case multiple times, trying other values that are different from null.

You may change the code of the mock type to allow more diverse behavior, for example adding the choice to throw an exception, perform a callback, or change some accessible state. For example, you can insert the following lines in the GetFormat method.

var call = PexChoose.FromCall(this);
if (call.ChooseThrowException())
    throw call.ChooseException();
return call.ChooseResult<object>();
The choice whether to throw an exception will cause the exploration to consider two execution paths. Furthermore, if the caller of GetFormat distinguishes different exception types, for example by having several catch statements, Pex may explore even more execution paths. As mentioned before, Pex will track the usage of the values obtained from ChooseResult, and may execute the program with several different values to this end. The following call to GetFormat occurs in AppendFormat after checking provider != null:

cf = (ICustomFormatter)provider.GetFormat(typeof(ICustomFormatter));
Depending on the result of GetFormat, the cast to ICustomFormatter might fail. Pex understands this type constraint, and Pex generates a test case with the following mock object behavior:
MockFormatProvider m = new MockFormatProvider();
PexChoose.NewTest()
    .OnCall(0, "MockFormatProvider.GetFormat(Type)")
    .Returns(m);
Here, Pex creates a mock object and instructs the oracle that, during the execution of the unit test, the first call to m.GetFormat should return the mock object itself! (The test cases that Pex generates are always minimal; this is an example of how Pex tries to use as few objects as possible to trigger a particular execution path.) This particular mock object behavior will cause the cast to fail, since MockFormatProvider does not implement ICustomFormatter.

5.4 Parameterized Mock Objects with Assumptions

Unconstrained mock objects can cause the code to behave in unexpected ways. Just as you can state assumptions on the arguments of parameterized unit tests, you can state assumptions on the results of mock object calls. For example, the author of the IFormatProvider interface probably had the following contract in mind:

public object GetFormat(Type fmtType) {
    var call = PexChoose.FromCall(this);
    object result = call.ChooseResult<object>();
    // constraining the result
    PexAssume.IsTrue(result != null);
    PexAssume.IsTrue(fmtType.IsAssignableFrom(result.GetType()));
    return result;
}
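Putting the two ideas together, a parameterized unit test can hand AppendFormat both symbolic arguments and the hand-written parameterized mock, so that Pex chooses the test inputs and the mock's answers at the same time. The following is only a sketch; the test name and the [PexAssumeNotNull] assumptions are ours, not part of the original example.

using System.Text;

[PexMethod]
public void AppendFormatWithMockProvider(
    [PexAssumeNotNull] string format,
    [PexAssumeNotNull] object[] args) {
    var provider = new MockFormatProvider(); // the hand-written mock from above
    var sb = new StringBuilder();
    // Pex picks format and args, and also the results of provider.GetFormat
    sb.AppendFormat(provider, format, args);
}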
5.5 Dependency Injection

Dependency Injection [122], also known as Inversion of Control, is a design pattern that helps in building components that are mockable. Let us illustrate this concept with the implementation of a client application that validates a credit card number.

Credit card validation client. The client is a simple graphical user interface that lets the user input a credit card number and query for its validity. The client should then query the web service to validate the credit card number and display the result to the user.
A first attempt. The dialog window has a Show method to pop up the dialog, a Number property to access the number string value, and a Status property to update its display:

public class CreditarDialog {
    public bool Show() ...
    public string Number ...
    public string Status ...
}
– A first attempt at writing the validator would be the following:

public class CreditarClient {
    public void Execute() {
        CreditarDialog dialog = new CreditarDialog();
        if (dialog.Show()) {
            bool valid = LuhnAlgorithm.Validate(dialog.Number);
            if (valid)
                dialog.Status = "validated";
            else
                dialog.Status = "invalidated";
        }
    }
}
The implementation of the Execute method causes several issues when it comes to testing it.
– The control flow depends on the Show method of the CreditarDialog dialog window:

CreditarDialog dialog = new CreditarDialog();
if (dialog.Show()) {
This means that the test will pop up a dialog window, and we would need complex automated tools to artificially click on the buttons of the dialog. In that sense, we would spend a large amount of effort to test functionality that is not directly related to the CreditarClient method.
– A similar problem arises with the validation of the credit card number, where we directly call the Validate method of the LuhnAlgorithm type:

bool valid = LuhnAlgorithm.Validate(dialog.Number);
To work around these problems, we need to add an abstraction layer between the CreditarClient class and its dependencies, the user interface and the validator service.
Extracting dependencies. We do this by extracting an interface for each dependency (we already have an interface for the credit card validation service):

public interface ICreditarDialog {
    bool Show();
    string Number { get; }
    string Status { get; set; }
}
Each dependency is then injected into the CreditarClient:

public partial class CreditarClient {
    ICreditarDialog dialog;
    ICreditCardValidator validator;

    public void Execute() {
        if (this.dialog.Show()) {
            bool valid = this.validator.Validate(dialog.Number);
            if (valid)
                this.dialog.Status = "validated";
            else
                this.dialog.Status = "invalidated";
        }
    }
}
Injecting dependencies. The last problem that remains is that we need a way to set the dialog and validator fields in the CreditarClient class. There are many ways to implement this; the following are common patterns:

Constructor Injection. Each dependency is passed to the class constructor, hence the "inversion of control" name of the pattern:

public class CreditarClient {
    ICreditarDialog dialog;
    ICreditCardValidator validator;

    public CreditarClient(ICreditarDialog dialog,
        ICreditCardValidator validator) {
        this.dialog = dialog;
        this.validator = validator;
    }
    ...
}
Service Locator. The class instance is hosted in a container that can be queried for particular services. In .NET, this pattern is implemented in the System.ComponentModel namespace; a Component can query for a service using the GetService method:
public class CreditarClient : Component {
    public void Execute() {
        ICreditarDialog dialog =
            (ICreditarDialog)this.GetService(typeof(ICreditarDialog));
        ICreditCardValidator validator =
            (ICreditCardValidator)this.GetService(typeof(ICreditCardValidator));
        ...
    }
}

int Complicated(int x, int y) {
    if (x == Obfuscate(y))
        throw new RareException();
    return 0;
}

int Obfuscate(int y) {
    return (100 + y) * 567 % 2347;
}

Listing 5.9. Code that is difficult to analyze
6 Background on White Box Software Testing (Advanced)

In general, all program analysis techniques fall somewhere between the following two ends of a spectrum:

– Static analysis techniques verify that a property holds on all execution paths. Since the goal is program verification, these techniques are usually overly conservative and flag possible violations as errors ("false positives").
– Dynamic analysis techniques verify that a property holds on some execution paths. Testing is a dynamic technique that aims at detecting bugs, but it cannot usually prove the absence of errors. Thus, these techniques often fail to detect all errors.

It is often not possible to detect bugs precisely when applying only static analysis, or when employing a technique that is not aware of the code structure. Consider Listing 5.9.

– Static analysis techniques tend to be overly conservative, and the non-linear integer arithmetic present in Obfuscate causes most static analysis techniques to give up and issue a warning about a potential error in Complicated.
– Random testing techniques have very little chance of finding a pair of x and y values that triggers the exception.

Pex implements an analysis technique that falls between these two extremes: dynamic symbolic execution, a white box test generation technique. Similar to static analysis techniques, Pex proves that for most feasible paths (those within specified exploration bounds) a property holds. Similar to dynamic analysis techniques, Pex only reports real errors, and no false positives.
6.1 Specification Techniques

All program analysis techniques try to prove and/or falsify certain specified properties of a given program. There are different ways to specify program properties:

– API Contracts (including Spec# [23] and Eiffel [248]) specify the behavior of individual API actions from the implementation's point of view. Their goal is to guarantee robustness, in the sense that operations do not crash and data invariants are preserved. A common problem of API contracts is their narrow view on individual API actions, which makes it hard to describe system-wide protocols. Complex extensions like model fields are necessary to raise the abstraction level.
– Unit Tests (JUnit [186], NUnit [322], and so on) are exemplary usage scenarios from the point of view of a client of the API. Their goal is to guarantee functional correctness, in the sense that the interplay of several operations behaves as intended. A common problem of unit tests is that they are detached from the details of the implementation; measuring the code coverage achieved by the unit tests gives only a rough idea of how adequate they are.

Pex enables Parameterized Unit Testing, which unites both viewpoints. Supported by a test-input generation tool like Pex, this methodology combines the client and the implementation point of view: the functional correctness properties (parameterized unit tests) are checked on most corner cases of the implementation (test input generation).

6.2 The Testing Problem

Starting from parameterized unit tests as specification, we can state the testing problem as follows: given a sequential program P with statements S, compute a set of program inputs I such that for every reachable statement s in S there exists an input i in I such that P(i) executes s.

Remarks:

– By sequential we mean that the program is single-threaded.
– We consider failing an assertion, or violating an implicit contract of the execution engine (for example raising a NullReferenceException when null is dereferenced), as special statements.
– Since reachability is not decidable in general, we aim for a good approximation in practice: high coverage of the statements of the program. Instead of statement coverage, other coverage metrics like arc coverage can be used.

6.3 White-Box Test Input Generation

White box testing means leveraging information about how a software system is implemented in order to validate or falsify certain properties. White box testing may involve the analysis of data flow, control flow, exception handling, or other details of the implementation. In order to obtain the necessary information, white box testing requires
access to the software's source code or the compiled binary. Pex is a white-box analysis tool that analyzes the compiled instruction sequence of the program (the .NET MSIL instructions). The opposite of white box testing is black box testing, which usually amounts to using randomly generated test data.

6.4 Symbolic Execution

Pex implements a white box test input generation technique that is based on the concept of symbolic execution. Pex' goal is to automatically and systematically produce the minimal set of actual parameters needed to execute a finite number of finite paths.

In general, symbolic execution works as follows: for each formal parameter, a symbolic variable is introduced. When a program variable is updated to a new value during program execution, this new value is often expressed as an expression over the symbolic variables. When a statement in the program has more than one possible successor, execution is forked into two paths. For each code path explored by symbolic execution, a path condition is built over the symbolic variables. For example, the Add method of the ArrayList implementation shown in Listing 5.4 contains an if-statement whose condition is this.items.Length == this.size (where the field items denotes the array holding the array list's elements and size denotes the number of elements currently contained in the array list). Symbolic execution conjoins this condition to the path condition of the then-path and the negated condition to the path condition of the else-path. In this manner, all constraints are collected which are needed to deduce what inputs cause a code path to be taken. A constraint solver or automatic theorem prover is usually used to decide the feasibility of individual execution paths, and to obtain concrete test inputs as representatives of individual execution paths.

Analysis of all paths cannot always be achieved in practice. When loops and recursion are present, an unbounded number of code paths may exist, and loops and recursion are then usually analyzed only up to a specified number of unfoldings. Even if the number of paths is finite, solving the resulting constraint systems is sometimes computationally infeasible, depending on the employed constraint solver or automatic theorem prover.

Symbolic execution in its original form [194] is a static program analysis, since it does not actually execute the program but merely analyzes possible execution paths.

6.5 Dynamic Symbolic Execution

Applying symbolic execution as described above to a real-world program is problematic, since such a program's interaction with a stateful environment cannot be "forked". Pex explores the behaviors of a parameterized unit test using a technique called dynamic symbolic execution [141,67]. This test generation technique consists in executing the program, starting with very simple inputs, while simultaneously performing a single-path symbolic execution to collect symbolic constraints on the inputs, obtained from predicates in branch statements along the execution, and then using a constraint solver to infer variants of the previous inputs in order to steer future program executions along alternative program paths. In this way, all program paths will be exercised eventually.
Step 0: Set J := ∅ (intuitively, J is the set of already analyzed program inputs)
Step 1: Choose a program input i ∉ J (stop if no such i can be found)
Step 2: Output i
Step 3: Execute P(i); record the path condition C (in particular, C(i) holds)
Step 4: Set J := J ∪ C (viewing C as the set {i | C(i)})
Step 5: Go to Step 1.

Algorithm 6.1. Dynamic symbolic execution
Operations that are implemented by the external environment are not tracked symbolically; instead, the actually observed input/output values become part of the constraints. As a result, dynamic symbolic execution extends static symbolic execution [194] with additional information that is collected at runtime, which makes the analysis more precise [141,139]. By construction, all analyzed execution paths are feasible, which avoids the problem of spurious error reports common to overly conservative static analysis techniques. While the additional information is collected on the level of individual execution traces which characterize individual execution paths, knowing the structure of the program still enables the analysis of many execution paths at once [141,11,49].

Algorithm 6.1 shows the general dynamic symbolic execution algorithm. The choice of the new inputs in Step 1 decides in which order the different execution paths of the program are visited. Pex uses several heuristics that take into account the structure of the program and the already covered branches when deciding on the next program inputs. The goal of Pex' strategy is to achieve high statement coverage fast. As an effect, the user just has to set a time limit or another rough analysis bound. Most other tools [141,139,67] explore the execution paths in a fixed search order, and they require the user to define detailed bounds on the size and structure of the input data.

6.6 Example

In the following, we explain how Pex handles Listing 5.9, starting from the method Complicated. In order to run the code for the first time, Pex needs to supply some argument values to Complicated, for example x = 572 and y = 152 (arbitrary values). Pex monitors the execution of the code, following the execution into and out of method calls. With randomly chosen values, it is very unlikely that we trigger the rare exception. Pex remembers that we did not cover the throw statement. Pex also remembers all conditions that were evaluated; here, it remembers that x ≠ (100 + y) ∗ 567 mod 2347.

Knowing that we have not yet covered the throw statement, Pex looks at the program to find the branch that guards that statement. Last time, we had x ≠ (100 + y) ∗ 567 mod 2347. So, in order to reach the other branch of the if-statement, Pex builds the negated condition, x = (100 + y) ∗ 567 mod 2347, and hands it to a constraint solver. In this case, it is quite simple to solve the constraint system, since one just has to pick any value for y and compute the corresponding x.
Pex runs the code again with the new inputs, say x = (100 + 152) ∗ 567 mod 2347 and y = 152. This time the throw statement is covered. Since all statements have been covered now, Pex will stop.
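In practice, you do not perform these steps by hand: wrapping Complicated in a parameterized unit test, as in the following minimal sketch (assuming Complicated is reachable from the test class), is enough to let Pex carry out the exploration described above automatically.

[PexMethod]
public void ComplicatedTest(int x, int y) {
    // Pex chooses x and y; it eventually generates inputs that cover
    // both the normal return and the RareException path.
    Complicated(x, y);
}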
7 Understanding Pex (Advanced)

This section explains various aspects of Pex, the tool that is used to craft inputs for the parameterized unit tests.

7.1 Why Dynamic Symbolic Execution

Symbolic execution was originally proposed [194] as a static program analysis technique, that is, an analysis that only considers the source code of the analyzed program. This approach works well as long as all decisions can be made on the basis of the source code alone. It becomes problematic when the program contains constructs that make reasoning hard (for example, accessing memory through arbitrary pointers), and when parts of the program are actually unknown (for example, when the program communicates with the environment, for which no source code is available and whose behavior has not been specified). Many .NET programs use unsafe .NET features such as arbitrary memory accesses through pointers for performance reasons, and most .NET programs interact with other unmanaged (non-.NET) components for legacy reasons.

While static symbolic execution algorithms do not have or use any information about the environment into which the program is embedded, dynamic symbolic execution leverages dynamic information that it observes during concrete program executions: information about the actually taken execution paths, and the data that is passed around between the analyzed program and the environment. Knowing the actually taken execution paths allows Pex to prune the search space. When the program communicates with the environment, Pex builds a model of the environment from the actual data that the environment receives and returns. This model is an under-approximation of the environment, since Pex does not know the conditions under which the environment produces its output. The resulting constraint systems that Pex builds may then no longer accurately characterize the program's behavior, and as a consequence Pex prunes such paths. Thus, Pex always maintains an under-approximation of the program's behavior, which is appropriate for testing.

Exercise 11. Understanding the Path Condition

Part 1: Solving constraints of Listing 5.9

Call Listing 5.9 from a parameterized unit test, and inspect which values Pex uses as test inputs. Do the values match your expectations? Would another set of values be more appropriate for testing? If so, why; if not, why not?

Part 2: Playing with Listing 5.9

We revisit Listing 5.9 and see how Pex handles this example.

1. Add a new test to the project, copy the source of Listing 5.9, and turn it into a parameterized test.
2. One way to gain more insight into what is happening is the GetPathConditionString method of the PexSymbolicValue class, which returns the current path condition. The following code adds a column to the parameter table and fills it in with the path condition.

int Complicated(int x, int y) {
    if (x == Obfuscate(y))
        throw new RareException();
    // logging the path condition
    string pc = PexSymbolicValue.GetPathConditionString();
    PexObserve.ValueForViewing("pc", pc);
    return 0;
}
If you run the above code as part of a parameterized unit test, you will notice that the path condition is only added to the table when no exception is thrown. To make sure that the path condition is added in any case, you can embed the logging code in a try-finally block.

3. Execute Pex and analyze the results. Do they match your understanding?
4. Another way to look into Pex' symbolic analysis is the ToString method of the PexSymbolicValue class, which returns a string that represents Pex' current knowledge about how the given value was derived from the inputs:

int Complicated2(int x, int y) {
    if (x == Obfuscate(y)) {
        // logging how Obfuscate(y) was derived from the inputs
        string obfuscateY = PexSymbolicValue.ToString(Obfuscate(y));
        PexObserve.ValueForViewing("obfuscate(y)", obfuscateY);
        throw new RareException();
    }
    return 0;
}
Part 3: Additional Path Condition Samples

Additionally, you can try out the GetPathConditionString and ToString methods of the PexSymbolicValue class with other examples.

1. Dealing with loops:

public void PathConditionLoop(int x, int y) {
    for (int i = 0; i < x; ++i)
        if (x == y + 1) {
            string pc = PexSymbolicValue.GetPathConditionString();
            PexObserve.ValueForViewing("pc", pc);
        }
}
2. Dealing with strings:

void PathConditionSubstring(string s, int x, string y) {
    if (s.Substring(x) == y) {
        string pc = PexSymbolicValue.GetPathConditionString();
        PexObserve.ValueForViewing("pc", pc);
    }
}

void PathConditionStringMatch(string s) {
    if (s == "Hello") {
        string pc = PexSymbolicValue.GetPathConditionString();
        PexObserve.ValueForViewing("pc", pc);
    }
}
7.2 Monitoring by Instrumentation

Pex monitors the execution of a .NET program through code instrumentation. Pex plugs into the .NET profiling API: before any method is compiled from the Microsoft Intermediate Language (MSIL) to native machine code by the Just-In-Time compiler of .NET, the Pex profiler gets a callback to inspect and rewrite the instructions. All .NET language compilers translate into this instruction set, and Pex operates on the level of these instructions; Pex does not care in which high-level language the program was written. (However, when Pex gives feedback to you in the form of code snippets, Pex only supports C# syntax at this time.)

The instrumented code drives a "shadow interpreter" in parallel to the actual program execution. The "shadow interpreter"

– constructs symbolic representations of the executed operations over logical variables instead of the concrete program inputs;
– maintains and evolves a symbolic representation of the entire program's state at any point in time;
– records the conditions over which the program branches.

Pex' interpreter models the behavior of all verifiable .NET instructions precisely, and models most unverifiable instructions (those involving unsafe memory accesses) as well.

7.3 Symbolic State Representation

A symbolic program state is a predicate over logical variables together with an assignment of locations to expressions over logical variables, just as a concrete program state is an assignment of locations to values. The locations of a state may be static fields, instance fields, method arguments, locals, and positions on the operand stack. Pex' expression constructors include constants for all basic .NET data types (integers, floating point numbers, object references), and functions over those basic types representing particular machine instructions, for example addition and multiplication. Pex uses tuples to represent .NET value types ("structs"), and maps to represent instance
fields and arrays, similar to the heap encoding of ESC/Java [121]: an instance field of an object is represented by a single map which associates object references with field values. Constraints over the .NET type system and virtual method dispatch lookups are also encoded in expressions. Predicates are represented by Boolean-valued expressions.

Pex implements various techniques to reduce the overhead of the symbolic state representation. Before building a new expression, Pex always applies a set of rules which compute a normal form. A simple example of a reduction rule is constant folding, for example 1 + 1 is reduced to 2. All logical connectives are transformed into a BDD representation with if-then-else expressions [52]. All expressions are hash-consed, which means that only one instance is ever allocated in memory for all structurally equivalent expressions. Pex also employs independent constraint optimization [67]. Based on the already accumulated path condition, expressions are further simplified; for example, if it is already established that x > 0, then x < 0 simplifies to false. While strings can be represented in a way similar to arrays of characters, Pex represents certain string constraints as expressions using specialized string functions [43].

When running the following example with Pex, it will print how the state is represented symbolically, including the path condition, at a particular program point.

static int Global;

[PexMethod]
public void SymbolicStateExample(int x, int y) {
    Global = 1 + x + x * x + x * x - 1;
    if (y > Global) {
        Console.WriteLine("Global=" + PexSymbolicValue.ToRawString(Global));
        Console.WriteLine("x=" + PexSymbolicValue.ToRawString(x));
        Console.WriteLine("y=" + PexSymbolicValue.ToRawString(y));
        Console.WriteLine("pc=" + PexSymbolicValue.GetRawPathConditionString());
    }
}

ToRawString and GetRawPathConditionString return expressions representing
symbolic values and the path condition, formatted as S-expressions. Here, it will print the following.

Global=(Add (Mul (Exp x 2) 2) x)
x=x
y=y
pc=(Clt (Add (Mul (Exp x 2) 2) x) y)
Add, Mul, and Exp are binary functions representing addition, multiplication, and exponentiation, and Clt is a predicate that tests whether the left operand is less than the right operand. On the right-hand sides, x and y represent the symbolic test inputs. Note that
the expressions have been normalized by Pex.
Exercise 12. Heap Constraints
Consider the following type

public class C { public int F; }
that is used in the following method. p u b l i c v o i d A l i a s C h a l l e n g e (C x , C y ) { i f ( x != n u l l ) i f ( y != n u l l ) { x . F = 42; y . F = 23; / / i f ( x . F == 4 2 ) t h r o w new E x c e p t i o n ( " boom " ) ; } }
How many execution paths will Pex have to explore? (Pex only generates different test inputs when they exercise different execution paths.) How many are there if we include the commented code? Again, we can inspect the constraints that Pex collects internally: GetPathConditionString shows us a pretty-printed string that presents Pex’ internal representation in a C#-like syntax. Try out GetRawPathConditionString instead to get a view on the raw expressions that Pex uses internally. They are formatted as S-expressions. In this example, you will see functions called select and update. These functions operate on maps which describe the evolution of and accesses to the heap [242].

7.4 Search Strategy

Pex builds the (reachable portion of the) execution tree of the program from all previously discovered execution paths. In each step of the test generation algorithm, Pex picks an outgoing unexplored branch of the tree, representing an execution path that has not been discovered yet. Pex implements a fair choice between all such unexplored branches. Pex includes various fair strategies which partition all branches into equivalence classes, and then picks a representative of the least often chosen class. The equivalence classes cluster branches by mapping them
– to the branch statement in the program of which the execution tree branch is an instance (each branch statement may give rise to multiple execution tree branches, for example when loops are unfolded),
– to the stack trace at the time the execution tree branch was created,
– to the depth of the branch in the execution tree.

Pex combines all such fair strategies into a meta-strategy that performs a fair choice between the strategies. See [359] for a more detailed description of the search strategies in Pex.

Exercise 13. Search Frontiers
Consider the following parameterized unit test. It has two loops, each of which can make 32 separate decisions.
using Microsoft.Pex.Framework.Strategies;

[PexMethod, PexSearchFrontierDepthFirst]
public void TestWithLoops(int x, int y) {
    int xbits = 0;
    for (int i = 0; i < 32; i++)
        if (((1 << i) & x) != 0) xbits++;
    int ybits = 0;
    for (int i = 0; i < 32; i++)
        if (((1 << i) & y) != 0) ybits++;
    PexObserve.ValueForViewing("xbits", xbits);
    PexObserve.ValueForViewing("ybits", ybits);
}
1. Try the different search frontiers. Additional search frontier implementations can be found under the Microsoft.Pex.Framework.Strategies namespace.
2. Can you explain what you see? Note that, by default, Pex does not emit a test for each path it explores.

7.5 Constraint Solving

For each chosen unexplored branch, Pex builds a formula that represents the condition under which this branch may be reached. Pex employs Z3 as its constraint solver. Z3 is a Satisfiability Modulo Theories (SMT) solver, an automated satisfiability checker for first-order logic with several built-in theories. This makes Z3 an efficient automatic theorem prover; historically, SMT solvers have been used mainly to perform program correctness proofs, for example as part of ESC/Java [121] and Spec# [23]. In addition to deciding satisfiability, Z3 can also compute a model (satisfying assignments) for a satisfiable formula, which makes it an ideal constraint solver for Pex. Pex faithfully encodes all constraints arising in safe .NET programs such that Z3 can decide them with its built-in decision procedures for propositional logic, fixed-size bitvectors, tuples, arrays, and quantifiers. Pex also has a specialized string solver [43] that is integrated with Z3. Arithmetic constraints over floating point numbers are approximated by a translation to rational numbers, and heuristic search techniques are used outside of Z3 to find approximate solutions for floating point constraints. Pex encodes the constraints of the .NET type system and virtual method dispatch lookup tables as universally quantified formulae.

Exercise 14. Constraint Solving
You can use Pex to solve constraint systems written in C#. For example, Pex can find the two prime factors of 52605271 with a test like the following. Random testing would most likely not be able to solve this problem.
[PexMethod]
public void DetermineFactors(int x, int y) {
    if (x > 1 && x < 10000 &&
        y > 1 && y < 10000 &&
        x * y == 52605271)
        throw new Exception("found it");
}
– What are the two prime factors?
– Can you write other interesting constraint systems and solve them with Pex?

7.6 Implicit Branches

Pex treats all possible deterministic exceptional control flow changes in the code like all other explicit branches: it tries to explore the successful case as well as the exceptional outcome. The following exceptions that the CLR execution engine may throw fall into this category:
– NullReferenceException
– IndexOutOfRangeException
– OverflowException
– InvalidCastException
– DivideByZeroException
Pex does not systematically try to throw the following exceptions that the execution engine may raise non-deterministically: OutOfMemoryException, StackOverflowException, ThreadAbortException. (In fact, by default Pex tries hard to avoid them so that only perfectly deterministic execution paths are explored.)

Exercise 15. Implicit Branches
The following method has two execution paths.

public void ImplicitNullCheck(int[] a) {
    int x = a.Length;
}
How many paths will Pex explore in the following method? (Note that Pex checks for each possible exception type separately, and considers checks for different exception types as different branches.)

public void ImplicitIndexOutOfRangeCheck(int[] a) {
    int x = a[0];
}
Pex understands checked code as well. Pex finds input that will cause the following method to throw an OverflowException.

public void ImplicitOverflowCheck(int x, int y) {
    int z = checked(x + y);
}
Can you write a parameterized unit test that could cause an InvalidCastException?
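One possible shape of such a test (a sketch of our own, not the tutorial's official solution) casts an arbitrary object to a more specific type, so that Pex has to find an input whose runtime type makes the cast fail:

[PexMethod]
public void ImplicitInvalidCastCheck(object o) {
    // Throws InvalidCastException when o is non-null and not a string.
    string s = (string)o;
}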
7.7 Assumptions and Assertions

You can use assumptions and assertions to express preconditions (assumptions) and postconditions (assertions) of tests. While Pex tries different argument values during the exploration of the code, Pex might inadvertently violate an assumption. When that happens, Pex remembers the condition that caused the assumption violation, so that Pex will not generate another test that violates the assumption. The test case itself that violated the assumption is silently pruned. The concept of assertions is well known in unit test frameworks. Pex understands the built-in Assert classes provided by each supported test framework. However, most frameworks do not provide a corresponding Assume class. For that case, Pex provides the PexAssume class. If you do not want to use an existing test framework, Pex also has the PexAssert class.

using Microsoft.Pex.Framework;

[PexClass]
public partial class MyTests {
    [PexMethod]
    public void Test1(object o) {
        // precondition: o should not be null
        PexAssume.IsNotNull(o);
        ...
    }
}
The non-nullness assumption can also be encoded as a custom attribute:

using Microsoft.Pex.Framework;

public partial class MyTests {
    [PexMethod]
    public void Test2([PexAssumeNotNull] object o)
    // precondition: o should not be null
    {
        ...
    }
}
When you write an assertion, Pex will not only passively detect assertion violations, but Pex will in fact actively try to compute test inputs that will cause the assertion to fail. The reason is simply that the assert statement is implemented as a conditional branch, which throws an exception in the failure case, similar to the following code:

public class PexAssert {
    public static void IsTrue(bool condition) {
        if (!condition)
            throw new PexAssertionViolationException();
    }
}
Just like for any other branch in the code, Pex will build constraint systems which aim at invalidating the asserted condition for each calling context. When Pex can solve the constraint system for the negated condition, we get test inputs that exhibit an error. Just as an assertion might throw a PexAssertFailedException, Pex uses a PexAssumeFailedException internally to stop a test case when an assumption fails.

7.8 When Does a Test Case Fail?

Pex considers the exceptional behavior of a test (whether the test throws an exception that is not caught) to decide whether a test case fails or succeeds:
– If the test does not throw an exception, it succeeds.
– If the test throws an exception,
  • that is a PexAssumeFailedException, it succeeds. (But it is usually filtered out, unless the TestEmissionFilter is set to All.)
  • that is a PexAssertFailedException, or any other assertion violation exception of another unit test framework, it fails.
  • that is neither an assumption nor an assertion violation exception, it depends on further annotations whether the test passes or fails.

You can annotate the test (or the test class, or the test assembly) with one of the following attributes to indicate which exception types may or must be thrown by the test in order to be considered successful.
– The PexAllowedExceptionAttribute indicates that a test method, or any other method that it calls directly or indirectly, may throw a particular type of exception for some test inputs.
– The PexAllowedExceptionFromTypeAttribute indicates that any method of a specified type may throw a particular type of exception for some test inputs.
– The PexAllowedExceptionFromTypeUnderTestAttribute indicates that any method of the designated type under test may throw a particular type of exception for some test inputs.
– The PexAllowedExceptionFromAssemblyAttribute indicates that any method located in a specified assembly may throw a particular type of exception for some test inputs.

7.9 When Does Pex Emit a Test Case?

Pex supports different filters that decide when generated test inputs will be emitted as a test case. You can configure these filters with the TestEmissionFilter property that you can set, for example, in the PexMethod attribute. Possible values are the following.
– All - Emit every generated test input as a test case, including those which cause assumption violations.
– FailuresAndIncreasedBranchHits (default) - Emit tests for all unique failures, and whenever a test case increases coverage, as controlled by the TestEmissionBranchHits property (see below).
– FailuresAndUniquePaths - Emit tests for all failures Pex finds, and also for each test input that causes a unique execution path.
– Failures - Emit tests for failures only.
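As a hedged illustration (the exact name of the filter enum, assumed here to be PexTestEmissionFilter, may differ between Pex releases), the filter can be selected directly on a parameterized unit test:

using System;
using Microsoft.Pex.Framework;

public partial class EmissionFilterTests {
    // Only emit generated inputs that lead to failures for this exploration.
    [PexMethod(TestEmissionFilter = PexTestEmissionFilter.Failures)]
    public void FilteredTest(int x) {
        if (x == 123456789)
            throw new InvalidOperationException();
    }
}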
Regarding increased branch coverage, the TestEmissionBranchHits property controls whether Pex should just consider whether a branch was covered at all (TestEmissionBranchHits=1), or whether a test covered it either once or twice (TestEmissionBranchHits=2), and so on. TestEmissionBranchHits=1 will give a very small test suite that will cover all branches Pex could reach. (In particular, this test suite will also cover all reached basic blocks and statements.) The default for this option is TestEmissionBranchHits=2, which generates a more expressive test suite that is also better suited to detect future regression errors.

Exercise 16. Test Case Emission Filters
Consider the following max methods.

int max(int x, int y) {
    if (x > y)
        return x;
    else
        return y;
}

int max(int a, int b, int c, int d) {
    return max(max(a, b), max(c, d));
}
Consider the following PexMethod.

[PexMethod]
public void MaxTest(int a, int b, int c, int d) {
    int e = max(a, b, c, d);
    PexObserve.ValueForViewing("max", e);
}
How many execution paths does it have? (And can you make Pex emit test cases for all execution paths?) How many tests will Pex emit with TestEmissionBranchHits=1? What is the relation between those two numbers?

7.10 When Does Pex Stop?

If the code under test does not contain loops or unbounded recursion, Pex will typically stop quickly because there is only a (small) finite number of execution paths to analyze. However, most interesting programs contain loops and/or unbounded recursion. In such cases the number of execution paths is (practically) infinite, and it is in general undecidable whether a statement is reachable. In other words, Pex would take forever to analyze all execution paths of the program. In order to make sure that Pex terminates after a reasonable amount of time, there are several exploration bounds. All bounds have predefined default values, which you can override to let Pex analyze more and longer execution paths. The bounds are parameters of the PexMethod, PexClass and PexAssemblySettings attributes.
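The following sketch (our own illustration; exactly which bounds are accepted at each of the three levels is an assumption) shows how such bounds can be attached at the assembly, class, and method level. The individual bounds are described next.

using System;
using Microsoft.Pex.Framework;

[assembly: PexAssemblySettings(TimeOut = 300)]

[PexClass(MaxBranches = 20000)]
public partial class BoundedTests {
    [PexMethod(MaxRuns = 50)]
    public void BoundedTest(int x) {
        if (x > 100)
            throw new ArgumentException();
    }
}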
There are different kinds of bounds:

Constraint Solving Bounds apply to each attempt of Pex to determine whether an execution path is feasible. Pex may need several constraint solving attempts to compute the next test inputs.
– ConstraintSolverTimeOut: seconds the constraint solver has to figure out inputs that will cause a different execution path to be taken.
– ConstraintSolverMemoryLimit: megabytes the constraint solver may use to figure out inputs.

Exploration Path Bounds apply to each execution path that Pex executes and monitors. These bounds make sure that the program does not get stuck in an infinite loop or recursive method.
– MaxBranches: maximum number of branches that may be taken along a single execution path.
– MaxCalls: maximum number of calls that may be made during a single execution path.
– MaxStack: maximum size of the stack at any time during a single execution path, measured in number of active call frames.
– MaxConditions: maximum number of conditions over the inputs that may be checked during a single execution path.

Exploration Bounds apply to the exploration of each parameterized unit test.
– MaxRuns: maximum number of runs that will be tried during an exploration (each run uses different test inputs; not every run will result in the emission of a new test case).
– MaxRunsWithoutNewTests: maximum number of consecutive runs without a new test being emitted.
– MaxRunsWithUniquePaths: maximum number of runs with unique execution paths that will be tried during an exploration.
– MaxExceptions: maximum number of exceptions that may be found over all discovered execution paths combined.
– MaxExecutionTreeNodes: maximum number of conditions over the inputs that may be checked during all discovered execution paths combined.
– MaxWorkingSet: maximum size of the working set in megabytes.
– TimeOut: seconds after which exploration stops.

The following example shows a parameterized test that involves a loop. The loop bound depends on the test inputs, and the lurking exception can only be triggered if the loop is unfolded a certain number of times. Here, we used an explicit bound of 10 runs (MaxRuns = 10) to let Pex finish quickly. However, with this bound, Pex will most likely not be able to trigger the exception, as Pex will not unroll the loop sufficiently many times.
[PexMethod(MaxRuns = 10)]
public void TestWithLoop(int n) {
    var sum = 0;
    for (int i = 0; i < n; i++)
        sum++;
    if (sum > 20)
        throw new Exception();
}
In the Pex tool bar, you will see that the boundary button is now enabled.
When you click on the button, Pex shows which bounds were exceeded, and it offers several actions to increase the bounds.
Go ahead and select Set MaxRuns=20 to double the bound. This will update the specified bounds in the source code. Run Pex again. If Pex is still not able to trigger the exception, you might have to double the bound again. The following example shows another parameterized test that involves a loop, but this loop does not depend on the test inputs. Here, we used an explicit bound of 10 branches (MaxBranches = 10) to let Pex finish quickly. However, with this bound, Pex cannot execute the code even once from beginning to end, as executing the embedded loop will cause more than 10 branches to be executed.

[PexMethod(MaxBranches = 10)]
public void TestWithFixedLoop(int j) {
    var sum = 0;
    for (int i = 0; i < 15; i++)
        sum++;
    if (j == 10)
        throw new Exception();
}
In those cases where a particular run exceeded some path-specific bounds, we get a special row with the words “path bounds exceeded” in the table of all generated test cases. When we click on the row, we see more detailed information about the exceeded bound, and there is a button “Set MaxBranches=...” to increase the bounds. (The button in the tool bar is enabled as well.)
1. Pex determines the point in the execution trace of the test case where the failure manifested itself (where an exception was thrown that was not caught).
2. Pex looks back in the execution trace for the last public method call.
3. Pex computes a condition under which the failure happened, relative to the last public method call.
4. If the condition only involves parameters, Pex will suggest the negation of the condition as a missing precondition.
5. Otherwise, Pex will try to express the condition in terms of the fields of the class of the last public method call. Pex will suggest the negation of the condition as a missing invariant (a condition over the fields of the class which should always hold). Furthermore, Pex will execute the test again, and try to find which public method of this class first left the object behind in a state which violates the suggested invariant. If Pex finds such a method, Pex will start over at 3.

Algorithm 7.1. Fix It
Click on the button, increase the bounds, and run Pex again. If Pex still cannot find the exception, you might have to increase the bounds again.

7.11 How Does Pex Suggest Fixes for Errors?

Sometimes, when a test run fails, Pex can suggest a potential change to the source code to prevent the same failure from happening again. Algorithm 7.1 shows how Pex tries to locate the failure cause. This algorithm terminates because in each iteration, Pex tries to find the failure cause at an earlier point in the execution trace. If Pex suggests a missing precondition or invariant, following Pex’ advice is guaranteed to prevent the same failure from happening again, under the proviso that Pex could monitor all relevant parts of the program. If this proviso is violated, Pex could suggest a precondition or an invariant which is inconsistent, which basically means that Pex suggests never running this code again.

7.12 Creating Complex Objects

Pex monitors the executed instructions when it runs a test and the program-under-test. In particular, it monitors all field accesses. It then uses a constraint solver to determine new test inputs, including objects and their field values, such that the test and the program-under-test will behave in other interesting ways. Thus, Pex needs to create objects of certain types and set their field values. If the class is visible and has a visible default constructor, Pex can create an instance of the class. If all the fields of the class are visible, Pex can set the fields automatically.
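For instance, a type like the following (a minimal example of our own) needs no extra help, because the class, its default constructor, and its fields are all public:

public class Point {
    // Pex can construct a Point and assign both fields directly
    // while searching for interesting test inputs.
    public int X;
    public int Y;
}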
If the type is not visible, or the fields are not visible, Pex needs help to create objects and bring them into interesting states to achieve maximal code coverage. There are two ways to help Pex.
– The first is for the user to provide parameterized factories for the complex objects such that Pex can explore different object states, starting from factory methods.
– The second is to define the invariants of the object’s private fields, such that Pex can manufacture different object states directly.

7.13 Explorables

Consider the following class.

public class MyCounter {
    private int _count;

    public MyCounter(int initialCount) {
        if (initialCount < 0)
            throw new ArgumentException();
        this._count = initialCount;
    }

    public MyCounter() {
        this._count = 0;
    }

    public void Increment() {
        this._count++;
    }

    public int Count {
        get { return this._count; }
    }
}
Consider the following test where something bad happens when the counter reaches the number 99.

[PexMethod]
public void TestMyCounter([PexAssumeNotNull] MyCounter counter) {
    if (counter.Count == 99)
        throw new Exception();
}
When you run Pex it may not find the exception, but instead Pex might show you a warning such as the following:

Pex guessed how to create instances of MyCounter: new MyCounter()
This means that, not knowing all ways to construct meaningful counters, Pex chose the default constructor that takes no parameters. Of course, with that constructor we would have to call the Increment() method 99 times in order to hit the exception, which Pex would do only after a very long exploration time. We can improve the situation by telling Pex to use the other constructor. You can define a factory method that Pex can use to create instances of a given type. In fact, when you use Pex from Visual Studio, you get the option to create a factory method when necessary. The default code and attributes for such a factory are similar to the following.
public static class Factory {
    [PexFactoryMethod(typeof(MyCounter))]
    public static MyCounter CreateMyCounter() {
        return new MyCounter();
    }
}
You can change this factory method as desired. You can add parameters, and then Pex will determine which values are relevant. The following is a factory method that creates MyCounter instances by calling the constructor that takes one parameter.

[PexFactoryMethod(typeof(MyCounter))]
public static MyCounter CreateMyCounter(int x) {
    return new MyCounter(x);
}
With these annotations in place, when you run Pex again, it will explore the constructor that takes an initial count, and Pex will find a test case such as the following:

[TestMethod]
[PexRaisedException(typeof(Exception))]
public void TestMyCounter_MyCounter_71115_194002_0_02() {
    MyCounter mc0 = new MyCounter(99);
    this.TestMyCounter(mc0);
}
As an alternative to factory methods, you can indicate in a declarative way which constructor to use with the following attribute:

[assembly: PexExplorableFromConstructor(typeof(MyCounter), typeof(int))]
The first argument indicates the type to construct; the following arguments are the parameter types for the desired constructor.

7.14 Invariants

Even when you tell Pex which constructor to use, or even when you write a factory method by hand, Pex may have a hard time exploring all combinations of the constructor and the mutator methods. It is important to emphasize that the section about invariants is for advanced developers. It is often sufficient, and much easier, if you can get Pex to create objects through the explorables that we described in Section 7.13. The example that we showed in Section 7.13 had a nice property: all possible and legal configurations of the MyCounter class could be constructed by calling the constructor. The constructor was written to throw an exception when an attempt is made to configure an invalid object. When a class does not have such a constructor, we can always write such a special constructor for testing purposes: a constructor that allows the object to be configured in all possible and legal ways. To describe what is possible and legal, you have to explicitly
define the condition under which the fields of an object are properly configured. This condition is called the class invariant. You can write an invariant as a private parameterless instance method that returns bool. For example, the invariant of the array list class can be expressed as follows.

private bool Invariant() {
    return this._items != null &&
           this._size >= 0 &&
           this._items.Length >= this._size;
}
Now you can define the special public constructor for testing purposes. It simply receives all field values as parameters. When the supplied parameter values do not satisfy the invariant, the constructor is aborted by throwing an exception. One way to make sure that this constructor is only used for testing purposes is to define it conditionally.

#if DEBUG
public ArrayList(object[] items, int size) {
    this._items = items;
    this._size = size;
    if (!this.Invariant())
        throw new InvalidOperationException();
}
#endif
Another way to make sure this constructor is only used for testing purposes is to define it as internal, and use the InternalsVisibleToAttribute attribute to expose it to the assembly that contains the tests. Now you can tell Pex to use this constructor, as we discussed in Section 7.13:

[assembly: PexExplorableFromConstructor(typeof(ArrayList), typeof(object[]), typeof(int))]
Pex will explore the constructor and the conditions in the Invariant method, filtering out all configurations where the invariant does not hold. As an alternative to defining DEBUG-only constructors, Pex can also leverage designated invariant methods that can be specified in conjunction with the Code Contracts framework [13]. How to write and leverage invariants with code contracts is documented in the paper on patterns for parameterized unit testing in the documentation section of the Pex website [276].

7.15 Limitations

There are certain situations in which Pex cannot analyze the code properly:
– Nondeterminism: Pex assumes that the analyzed program is deterministic. If it is not, Pex will prune non-deterministic execution paths, or it may go in cycles until it hits exploration bounds.
– Concurrency: Pex does not handle multithreaded programs. (It might work in a scenario where the main thread execution is deterministic, independent of the behavior
of other spawned threads.) CHESS [255] is a project that is similar to Pex, but instead of exploring the non-determinism induced by input values, CHESS explores the non-determinism induced by different thread-interleavings in multi-threaded applications.
– Native Code or .NET code that is not instrumented: Pex does not understand native code (x86 instructions called through the P/Invoke mechanism of .NET). Pex does not know how to translate such calls into constraints that can be solved by a constraint solver. And even for .NET code, Pex can only analyze code it instruments. However, even if some methods are uninstrumented, Pex will still try to cover the instrumented code as much as possible.
– Language: in principle, Pex can analyze arbitrary .NET programs, written in any .NET language. However, the Visual Studio AddIn and Pex code generation only support C#.
– Symbolic Reasoning: Pex uses an automatic constraint solver to determine which values are relevant for the test and the program-under-test. However, the abilities of the constraint solver are, and always will be, limited. In particular, Z3 cannot reason precisely about floating point arithmetic.

Despite these limitations, the tool is being constantly evolved and some of these issues will be addressed in future versions.
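To make the symbolic-reasoning limitation concrete, the following sketch (our own illustration, not taken from the Pex documentation) states a path condition over floating point arithmetic that the solver can typically only approximate, so Pex may or may not find an input that reaches the exception:

using System;
using Microsoft.Pex.Framework;

public partial class LimitationTests {
    [PexMethod]
    public void FloatingPointChallenge(double x) {
        // The equality below involves Math.Sqrt over a symbolic double;
        // such constraints are only handled approximately.
        if (Math.Sqrt(x + 1.25) == 1234.5)
            throw new Exception("hard constraint for the solver");
    }
}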
8 Related Work

Testing with algebraic specifications was started by Gaudel et al. [38]. They use axioms in various ways: to describe the test purpose; to obtain concrete data, which is necessary for the instantiations of the axioms; and to derive new theorems, which when tested should uncover errors in state not captured by the model. For deriving those theorems they introduced regularity assumptions. Parameterized Unit Tests [314] are a way to write algebraic specifications as code. Another name for them is theories in the JUnit test framework, where Saff et al. [293] found them to be effective. They appear as row tests in MbUnit [90], and under other names in various other unit test frameworks. It has also been investigated how such algebraic axioms can be synthesized automatically [165,313,87] from a given implementation, possibly supported by a given test suite [112]. In order to generate test inputs for parameterized unit tests, Pex [263] uses dynamic symbolic execution. A recent overview on the combination of static and dynamic analysis can be found in Godefroid et al. [140]. The basic idea of symbolic execution [194] was introduced more than three decades ago. By combining it later with other work on dynamic test generation [198,199], DART [141] was the first practical tool implementing dynamic symbolic execution for C programs. Many other implementations have been recently developed as well: CUTE [297], EXE [67], and KLEE [66] for C programs; jCUTE [296] for Java programs; and SAGE [142] for x86 code. Related to dynamic symbolic execution is model-checking of programs. XRT [145] is a model checker for .NET programs, JPF [12] for Java programs. Both JPF and XRT
have extensions for symbolic execution. However, both can only perform static symbolic execution, and they cannot deal with stateful environment interactions. Many other approaches to automatic test generation exist. For example, the tool Randoop [271] generates new test-cases by composing previously found test-case fragments, supplying random input data.
9 Conclusion

We presented a methodology of parameterized unit testing [314], a generalization of traditional unit testing. Parameterization allows the separation of two concerns: the specification of the behavior of the system, and the test cases to cover a particular implementation. Dynamic symbolic execution and constraint solving allow the automatic generation of the test cases in many cases. To this end, we used Pex [263] to automatically generate test inputs. We studied how to write parameterized unit tests for various programs, how to measure coverage of the code under test, how to test programs that interact with the environment, and how path constraints are collected and solved in dynamic symbolic execution.
Software Tool Issues
Michael Hennell
LDRA Ltd, UK
This chapter relies on experience with tool development gained over the last thirty years. It shows that there are a large number of techniques that contribute to any successful project, and that formality is always the key: a modern software test tool is based on a firm mathematical foundation. After a brief introduction, Section 2 recalls and extends the terminology of Chapter 1. Section 3 discusses the design of different sorts of static and dynamic analysis tools. Nine important issues to be taken into consideration when evaluating such tools are presented in Section 4. Section 5 investigates the interplay between testing and proof. In Section 6, we call for developers to take their own medicine and verify their tools. Finally, we conclude in Section 7 with a summary of our main messages, emphasising the important role of testing.
1 Introduction

Tools are an essential aid in all engineering activities, and Software Engineering is no exception. So it is not a surprise to find that they are a key to success there too: they perform tasks that are either too boring or too complex for humans. For example, they are used to capture requirements, capture designs, generate code, and verify systems. It is common practice to make exaggerated claims for tools and techniques. Myth blowing (1): formal methods are widely considered to be an over-sold, overly expensive, and ineffective technique [24]. A formal specification and proof of refinement is all that is needed to establish the correctness of a program. Myth blowing (2): in contrast, testing is totally accepted and regarded as applicable to, and effective for, all software products. Running a program with test data and checking its outputs is all that is needed to establish correctness of a program. Of course, in practice neither of these claims is true. These lectures set out to show that there are a large number of techniques that contribute to any successful project, and that formality is always the key. A modern software test tool is based on a firm mathematical foundation. In fact, many such tools have formal methods built in, but customers are usually unaware of this fact, and hence there is no fear factor. The trend is for ever-closer links with more powerful mathematical techniques. See [316] for a discussion of invisible formal methods.
2 Preliminaries

Consider a numerical calculation to compute the binding energy of three nucleons [34]. The problem is fully specified by a system of partial differential equations and boundary conditions. It is, at least in principle, solvable by standard methods. But testing a program to solve this problem is infeasible, because we cannot compare the output with a value from the requirements: it is unknown. The problem of establishing criteria that will guarantee that the delivered system is the right system is not a trivial activity. It may be necessary to demonstrate what the system cannot do. The question is not “What is the system supposed to do?”, but rather “How can we identify that it is doing it correctly?” Coincidental correctness is a huge problem: the system produces the expected output, but for the wrong reason [170,335] (see the discussion in Chapter 1). The situation is particularly aggravated in the case of automated test-data generation, in which we cannot rely on the expertise of a tester. The goal in software verification is to ensure that there are no failures during execution of the system (recall the terminology of Chapter 1). A failure is a departure from the actual correct functionality (that is, from the actual requirements, not the stated requirements). A failure is caused by a system fault. Not all faults lead to failures: the fault may lead to a failure only under specific conditions, but removing all faults will ensure that no failures remain. Faults are, in general, application dependent, and identifying faults is another non-trivial task. A defect is a violation of some additional rule or constraint placed upon the software. Faults are often identifiable as defects or are closely associated with defects. Faults can be classified into two categories: technical faults and application faults. Technical faults are due to considerations such as:
– Language defects:
  • Undefined or unspecified behaviour
  • Implementation-dependent behaviour
  • Erroneous features
  • Unexpected effects
– Compiler faults
– Processor characteristics:
  • Length, bus widths, storage issues
– Optimisation effects
– Ignorance by software engineers

Application faults are those that can be found only from knowledge of the application or by observing anomalous behaviour. For example, in a drive-by-wire car, turning the steering wheel anti-clockwise and then turning right would be anomalous. In general, the common approach is to remove all defects from the software. In a sense, this makes the software better than it really needs to be. The detectable defects in the main (but not exclusively) eliminate the technical faults.
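As a small illustration of a technical fault of this kind (our own example), silent integer wrap-around is an unexpected effect that follows from the platform semantics rather than from the application requirements:

// In C#, arithmetic in an unchecked context wraps around silently
// instead of failing: when secondsSinceEpoch lies within a day of
// int.MaxValue, the sum below becomes a large negative number, which
// is very unlikely to be what the application actually requires.
public static int AddOneDay(int secondsSinceEpoch) {
    return unchecked(secondsSinceEpoch + 86400);
}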
3 Verification and Validation Tools

Tools for this activity must have at least some knowledge of the syntax and semantics of the representation. Note that the actual semantics of the application are defined by the actual platform, that is, the processors and other devices that are driven by the binaries. In general, they are unknown, but available. The first tool in the chain is usually a word processor, such as MS Word. Today, these tools are extremely good at producing documents that have few defects: they check spelling, and even have knowledge of grammar. Future tools may have specialist lexicons and the capability to detect ambiguity. The next tools are theorem provers, model checkers, and automated testing tools. Proof techniques may not work because approximate numerical techniques must be used. Such programs are checked by a variety of different techniques, such as partial proving, computing sub-problems with known answers, etc. A mixture of proof and testing is required. The principal enemies of software developers are ignorance of actual requirements, complacency and over-confidence, unknown and unmanaged complexity, and coincidental correctness. Not every project requires the same level of diligence; not every part of a project requires the same level of diligence. In this case, partitioning the system becomes an issue. Checking components to different standards is a complex management issue.

3.1 Checking Tools

The simplest checking tool utilising knowledge of the syntax of the representation (the programming language) is a programming-rules checker. Such a tool can detect huge numbers of defects. It can also produce metrics.
– Count the number of lines of code: if the code is unformatted, then these metrics are largely useless.
– Count the number of executable lines of code.
– Count the number of lines of code per procedure.

In this chapter, two types of sophisticated tool are distinguished:
– Static analysis tools: a completely workable (but possibly imprecise) definition is that these tools do not involve the execution of the code with test data. This definition essentially then includes all formal methods tools.
– Dynamic analysis tools: these measure actual properties of the executing program.

In the following discussion, the structure will be to show how increasingly sophisticated analysis techniques can be built into tools, and to indicate how this trend will evolve in future years. Static analysis tools will be discussed first, because all dynamic analysis tools depend on a preliminary static analysis.

3.2 Control-flow Modelling

Complexity-control rules can be applied, such as:
– Procedure size
– Number of procedural parameters
– Number of global variables
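As a rough sketch of how a rule of the second kind might be checked mechanically (a toy illustration of our own; production rule checkers work on source code or an intermediate representation rather than on compiled assemblies):

using System;
using System.Linq;
using System.Reflection;

public static class ComplexityRules {
    // Flags every method in an assembly whose parameter count exceeds
    // a project-defined limit.
    public static void CheckParameterCounts(Assembly assembly, int maxParameters) {
        var offenders =
            from type in assembly.GetTypes()
            from method in type.GetMethods(BindingFlags.Public | BindingFlags.NonPublic |
                                           BindingFlags.Instance | BindingFlags.Static |
                                           BindingFlags.DeclaredOnly)
            where method.GetParameters().Length > maxParameters
            select type.Name + "." + method.Name + ": " +
                   method.GetParameters().Length + " parameters";

        foreach (var offender in offenders)
            Console.WriteLine("Rule violation: " + offender);
    }
}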
In applying rules of this type, what is happening is a trade-off between different types of complexity. The biggest step forward in sophistication is to construct a control-flow model. Control flow is changed by conditional statements and control-flow jumps. A basic block is a sequence of statements such that, if one statement is executed, then all the statements in the block are executed. Basic blocks are linked together by control-flow statements, procedure calls, and unconditional statements. The model is usually in the form of a directed graph. A related form is a call graph showing the calling structure. Once a control-flow graph is available, we can look at activities such as:
– Counting paths: these are usually infinite, because loops will have no termination criterion, due to the fact that the graph has (as yet) no variable information.
– Computing path length: these may be in terms of various units: lines of code, numbers of basic blocks.
– Generating paths: usually the start node and each termination node are of most interest. These may be the shortest or longest paths; a path starting with initialisation, or finishing with termination; a path leading forwards from a particular point, or one leading backwards.

The graph-theoretic nature of a control-flow model suggests that we measure the number of paths through a program. For example, McCabe’s cyclomatic complexity measure [241] directly measures the number of linearly independent paths through a program’s source code. The cyclomatic complexity of a structured program is defined over a directed graph G containing the basic blocks of a program, with an edge between two basic blocks if control may pass from the first to the second. The cyclomatic complexity of the program is then defined as

V(G) = e − n + 2

where e is the number of edges of the graph and n is the number of nodes of the graph. For example, a program consisting of a single if-then-else (an entry block, two branch blocks, and an exit block) has n = 4 and e = 4, giving V(G) = 2. It is assumed that each exit point is connected back to the entry point, and the graph is then described as strongly connected. V(G) calculates the number of linearly independent cycles that exist in the graph, that is, those cycles that do not contain other cycles within themselves, so that V(G) measures the size of a planar directed-graph. Call graph depth and width can also be measured, and again complexity trade-off decisions can be made.

3.3 Data-Flow Modelling

If the control-flow model is annotated with the actions on variables, then it is possible to perform data-flow analysis. The objective is to discover defects in the way variables are used, and specific rules-of-usage are invoked.
In its simplest form, consider the actions performed on a given variable. First it is declared but has no value, denoted by the symbol U. Then it is given a defined value, denoted by the symbol D. Then it is used (referenced), denoted by R. Finally it goes out of scope, denoted U again. If we consider one variable, then the sequence of actions performed on that variable as a control-flow path is traversed is:

U D R R R D R R U

Such a sequence of operations, called a path expression, is generated for every variable in the program, and for every path. The path expressions for each variable are checked for anomalous behaviour:
– U R indicates that a variable has been given no value, but has been used in a computation.
– D D indicates that a variable has been given a value and then is immediately given another value.
– D U indicates that a variable is given a value that is never used.

These events are termed anomalies, because we cannot be sure that they are not deliberate, and indeed, they are usually benign. The U R anomalies are most likely to be faults. The D D anomalies are also usually benign, but might be due to editing errors. The D U anomalies are usually due to badly constructed code, for example, use of a while loop instead of a repeat-until loop. D U anomalies can always be removed with a gain in efficiency and in code clarity. Programmers always kick against removal of these anomalies until they have had some experience, after which they wonder what the fuss was all about. Data-flow analysis is fast and efficient, and more importantly is valid for all paths. Loops are handled by traversing the zero-trip path and then going around each loop just twice. It is performed first over each block for all the variables. Any in-block anomalies can be reported; then the first and last operations for each variable in the block are remembered and the others discarded. Then if two blocks connect, we look at the last operation in the first block and the first operation in the second block. These two operations are either anomalous or they are not. When a block has no operations in it for a given variable, it uses those of the blocks to which it connects, that is, the operations are projected up and down. At procedure boundaries, the operations on the actual parameters are mapped to the formal parameters, and vice versa. This activity enables the whole of the procedure interface to be accurately documented. Parameters and global variables can be classified as to whether they are input, output, or both input and output. Further anomalies can now be identified: input and output variables that have clear paths. If an output variable gets a value on some paths through a procedure, but does not get a value on at least one other path, then the output variable is said to have a clear path. A procedure that has clear paths on its output variables is a potential cause of integration errors, because it is possible that future processing assumes that the variable always has a value. In total, data-flow analysis can identify up to about 20 different problems at procedure interfaces. For example, it is good at finding global variable aliasing effects.
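The per-path anomaly scan at the heart of this analysis needs very little machinery once the path expressions have been built. A minimal sketch of our own, with the path expression reduced to a plain string of U, D, and R symbols:

using System;
using System.Collections.Generic;

public static class PathExpressionCheck {
    // Scans a path expression such as "UDRRRDRRU" for adjacent pairs
    // corresponding to the UR, DD and DU anomalies described above.
    public static IEnumerable<string> Anomalies(string pathExpression) {
        for (int i = 0; i + 1 < pathExpression.Length; i++) {
            string pair = pathExpression.Substring(i, 2);
            if (pair == "UR" || pair == "DD" || pair == "DU")
                yield return pair + " at position " + i;
        }
    }
}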
Data-flow analysis is normally accomplished bottom-up, but can be applied systemwide. It is an all-paths technique. In total, between 50 and 100 types of defects can be identified.

3.4 Information-Flow Analysis

In this technique, the data-flow graph is also annotated with the input/output operations. The analysis proceeds by starting from one output statement and following all the paths back to the start node. It lists the dependencies on other variables that are encountered as the paths are traversed. The process eliminates all intermediate variables and ends with a list of just input variables. Loops are handled by traversing them until the dependencies do not change. Usually this needs to be done twice, and is always finite except for some cases of recursion [117]. The number of paths to be traversed can, however, become prohibitive if the software is carelessly written, that is, it has many infeasible paths [163]. The dependence of the output variables on the input variables can be classified in at least eight different ways.
– SDF (strongly defined): The output variable always has a value, that is, on all paths to the output point.
– WDF (weakly defined): The output variable has a value on only some of the paths. This in general indicates a defect.
– SDP (strongly dependent): The output variable always depends on the listed input variable, that is, on all paths to the output point. The output variable may be either weakly or strongly defined.
– WDP (weakly dependent): The output variable depends on the listed input variable on a subset of the paths, so the output sometimes depends on the input. This might indicate a problem, so the reasons why need to be explored.
– SCDF (strongly conditionally defined): The output variable is always conditionally dependent on the listed input variable on all paths to the output point.
– WCDF (weakly conditionally defined): The output variable that does not always have a value is sometimes conditionally dependent on the listed input variable, that is, on some paths to the output point. Again this may suggest a cause for concern.
– SCDP (strongly conditionally dependent): The output variable always depends on the listed input variables but only through different path choices.
– WCDP (weakly conditionally dependent): The output variable sometimes depends on the listed input variable but only because it influences the path choice. This may also indicate a cause for concern.

This technique can benefit from external information. For example, from the requirements analysis, specification, or design it should be possible to state what the dependencies should be, and these predictions can be compared with the actual dependencies. This technique can therefore detect application-dependent faults.

3.5 Dynamic Analysis

This technique is designed to measure the quality and efficacy of the test data. The quantification of the test data is performed in terms of structural units of the code under test. The test data is generated in many different ways depending on the ultimate objectives.
In some projects it is enough to generate test data just to achieve some structural metric. In other projects the test data must be generated directly from requirements, and hence any deficiencies in this data reflect directly on the analysis of the requirements. This is the avionics position. Just executing the program with the test data is by itself not enough; it is also required that the resultant program outputs be checked for correctness. The mechanism that performs this checking can be human or automatic. The test data and expected test outcomes therefore form a related pair. When a test fails there are a number of possibilities: the program is incorrect, the test data and expected output do not correspond, or the oracle is faulty. If the required coverage metrics are not achieved, then the possibilities are:
– There is unnecessary code in the program.
– The untested code is infeasible.
– The test data is inadequate and needs improvement.

Usually it is the test data that needs improvement, but infeasibility is a serious problem. There are two types of dynamic analysis tool: intrusive probes and non-intrusive probes. An intrusive probe system puts software probes (counts or procedure calls) into the code being analysed (high-level language or assembler). These probes record information about the execution process: execution histories. It is essential to be able to demonstrate that the presence of the probes does not change the functionality of the instrumented code. This is normally demonstrated by showing that compiler validation suites are not affected by the instrumentation process. A compiler validation suite is designed to show that a compiler performs correct computations. A non-intrusive system obtains the same or similar information directly from the processor, for instance via a JTAG port (an IC debug or probing port). The dynamic analysis tool then relates this low-level information back to the original representation (high-level language or assembler). Unfortunately, it is not always possible to establish this relationship unambiguously. Effects that make this relationship difficult are compiler optimisations, pipelining in the processor, multi-threading, and concurrency. The execution history may be a simple record of which basic blocks have been hit (executed), or it may be a complete trace of exactly which statements have been executed and in what sequence, together with the values of specific variables. Note that this is not the same as debugging, but is similar technology applied systematically and subject to analysis.

3.6 Metrics in Common Use

Chapter 1 mentions a fundamental question about the nature of testing: “How much testing is enough?” Coverage metrics can be used to gauge how much has been achieved. The most basic metric is the proportion of statements executed (the All-Statements criterion of Chapter 1).

Ter1 = (number of statements executed by the test data) / (total number of executable statements)
Higher level metrics can be generated, in particular [358]:

Ter2 = (number of control-flow branches executed by the test data) / (total number of control-flow branches)

Ter3 = (number of LCSAJs executed by the test data) / (total number of LCSAJs)

A linear code sequence and jump (LCSAJ) consists of the following three items: the start of the linear sequence (basic block) of executable statements, the end of the linear sequence, and the target line to which control flow is transferred at the end of the linear sequence. These metrics satisfy a pure hierarchy, that is,
(Ter3 = 1) ⇒ (Ter2 = 1) ⇒ (Ter1 = 1)

The first two metrics were in use in the 1940s and the third dates from the 1970s. (Ter2 = 1) became an industrial norm during the late 1960s, and projects of many millions of lines of code were tested to this level in the 1960s. This was the level selected for the DO-178B avionics standard [106]. It was supplemented by the additional MCDC (modified condition/decision coverage) requirement that applies to conditional expressions containing the logical conjunction and disjunction operators. This requires each subcondition to be executed independently of the other subconditions. Probably thousands of projects have been tested to this level. Higher levels, such as (Ter3 = 1), have been mandated for many other projects, including aerospace, telephony, and banking. Languages of the C family require short-circuit evaluation to be performed for the logical conjunction and disjunction operators. Therefore, for these languages MCDC is simply subsumed into branch testing: the requirement to hold the other subconditions constant becomes just part of the equivalent test-set issue. For example, under short-circuit evaluation a decision such as "if (a && b)" is compiled as two nested branches, so covering both outcomes of each generated branch already exercises each subcondition independently. From a dynamic analysis point of view MCDC adds essentially nothing for these languages. However, from the literature one would be excused for believing that this is a draconian additional requirement. Many organisations (such as code generator vendors) claim that it will dramatically increase their costs to test to this level.

3.7 Unit Testing

One of the most common forms of dynamic analysis techniques is unit testing. In this technique, procedures or collections of procedures are tested in isolation from the complete system in order to establish that they satisfy specific requirements. The requirements are wider than those of the project so that, for example, boundary-condition testing can be performed. Testing is performed in the actual target environment using the actual semantics rather than alternatives (on a host or simulator) or models, as in formal methods. The interface to each procedure is tested for input values that may be excluded by higher-level procedures. Therefore, generality is being explored. Additionally, otherwise-infeasible protective code components can be tested. Some instances of coincidental correctness can be removed, that is, in the bigger system, a procedure may be called when it should not be, or is not called when it should be, and this may hide defects. Because we are dealing with a smaller component, it is easier to observe incorrect behaviour and hence detect faults, that is, the oracle is simpler and more sensitive.
The exploration of housekeeping code can be accomplished more easily. In extreme cases, the unit is purely performing housekeeping tasks. Semantic sensitivity is increased, that is, the sensitivity of the oracle to coincidental correctness and the insensitivity of the system outputs to internal defects. The issue of how to handle procedures called by the unit under test is highly complex. It is possible to use the actual procedures or alternatively replace them by a stub. A stub usually has a significantly simplified functionality, often just a table relating inputs to outputs. A given procedure cannot usually be tested until the appropriate preconditions on the data items have been satisfied; for example, the initialisations of the global variables and actual parameters have to be performed first. This often requires a sequence of other procedures to be called in order to perform this task. In fact most unit tests will require such a sequence to be executed. Similarly, often another sequence must be performed afterwards in order to perform certain checks on the outcomes. This is a major contributor to the demonstration that the code implements no unnecessary requirements. The whole of the global interface to the unit is accurately known and any defects in that interface can be detected. With the full interface being tested, this is one technique that can detect logic faults. Thus missing paths can be detected. Probably the biggest contributory factor is the fact that the interface to the units has to be accurately known, and hence this forces a very careful design process.

3.8 Dynamic Conformance Analysis

From the design (or earlier) it is possible, and sometimes required, that assertions be made about particular characteristics of the software:
– Pre- and post-conditions for procedures
– Loop variants and invariants

These assertions are typical of those that are required for formal methods. However, if they are embedded in the code as annotations (special forms of comments), then they can be translated automatically into code and they become part of the executing program. Tools to handle this (even some compilers) have been available since the 1960s. A special tool will provide the accompanying housekeeping code that generates the messages, collects statistics and produces reports. The advantage of this approach over the corresponding static analysis method is that it will always work regardless of dialect issues. It also works with the actual semantics of the project.

3.9 Coincidental Correctness

One of the major problems with computer software is coincidental correctness, that is, getting the right results for the wrong reasons (as mentioned above and in Chapter 1). Dynamic analysis is one way in which it can be established that the computations went through the correct code. Essentially this means that the requirements can be traced through the test data to the actual code and hence provide complete visibility. This is a requirement of DO-178B [106].
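A tiny example of our own makes the danger concrete: a test can pass even though the computation it exercises is wrong.

public static class SquareExample {
    // Coincidentally correct for the test input 2: the buggy addition
    // returns 4, which equals the expected square, so a single test
    // with x = 2 passes for the wrong reason.
    public static int Square(int x) {
        return x + x;   // fault: should be x * x
    }
}

Structural coverage of the faulty statement alone would not reveal the fault; only a test input such as 3, for which the two computations differ, exposes it.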
4 Tool Issues

1. Does the tool produce many false positive reports? These occur when the tool reports faults that are not actually present. This happens for a number of reasons, typically because the algorithm is unable to be sure that the fault does not exist, and so warns in case it is significant. The number of false positives is not necessarily a reflection of a poor tool (that is, a bug), but rather reflects the complexity of the task and the inadequate information available due to poor language design. An example is array checking in C, where arrays can decay to pointers and then, after some pointer arithmetic, reappear as arrays.

2. Does the tool produce false negative reports? This is the converse, where the tool fails to report a defect when one is present. Again the problem is not only bugs in the tool, but is more likely to be due to the difficulty of designing algorithms that work for all the language complexity.

3. Does the tool identify too many faults? This can be a major problem. If the code is not carefully written (as in legacy code) then the number of messages can be overwhelming. It can be of the order of ten messages per statement.

4. Does the tool take too long? In general the issue of time is not significant unless, of course, it is excessive. The faster the tool, the more often it can be used. In general the time taken depends largely on how complex the checks being performed are.

5. Does the tool work across the complete system? This is an important issue because some faults can only be detected when the whole system is analysed. There are two major problems when working with systems composed of multiple files (translation units). It is desirable to analyse each file separately so that changes to one file do not require the whole system to be re-analysed.

6. Is there inter-procedural recursion? Even in a single file, this causes problems because neither procedure can be analysed fully until the other one has been analysed. The way out is that the termination paths have to be identified and used in a first pass and then the whole must be re-analysed. If the procedures are in different files then all the files must be analysed at least twice. The second form of problem occurs when procedural dependencies are across file boundaries. Let file A contain procedures P and Q. Let file B contain procedures R and S. If procedure P calls procedure R, then file B should be analysed before file A. But if procedure S calls procedure Q, then the converse is true and a conflict occurs. Resolution of this problem requires some complex analysis algorithms. Poor placement of procedures can increase the time taken for analysis.

7. What dialect dependencies does the tool possess? Most compilers implement their own version of the language. Therefore, in order to analyse the resultant code, the analysis tool must have the appropriate grammar. At least one tool has an adaptive grammar that can handle a wide class of similar grammars. The same problem occurs with the semantics. On the whole, test tools are less sensitive to the semantic variations. An example of a dramatic variation occurs in the updating of a global variable that is also passed as a parameter. Some compilers update when the parameter changes, others update only on exit from the procedure.
8. What limitations does the tool possess? All tools possess limitations. This can involve the amount of code that can be analysed, the depth of blocks that can be handled, the bracket nesting depth permitted, the symbol table size, etc.

9. How will the tool be incorporated into the development process? Many organisations give no prior thought as to how verification tools can be incorporated into the development process. They require that the tools fit into their development scheme without difficulty. An aerospace company bought a verification tool in order to show evidence for DO-178B certification [106]. They wrote the software, loaded it into a test rig and ran tests over several months. Satisfied that the code was right, they then performed the analysis. The code failed. The code had to be modified and the test rig tests had to be re-run!
5 Software Testing and Proof

There is a widespread belief that formal methods, with the associated concept of proof, are totally orthogonal to testing techniques. In fact this is not true: under specific conditions a test activity can constitute a formal proof. If a path implements a polynomial of degree n, then n + 1 distinct sets of test data that get the right results establish the correctness of the path. If this is performed for all paths, then the whole program is rigorously proved correct. Dynamic analysis can establish whether this has been achieved. Experiments have shown that most program paths compute very simple polynomial results.

The first down-side to this is that almost nobody tests every path. In practice, not every path needs to be executed, because every path is composed from smaller units that are common to all paths. For example, a basic block is also a basic path. An LCSAJ is also a basic path that is larger than a basic block. The necessary algebra for this has never been undertaken. The second down-side is that, whilst the paths as implemented have been shown to be correct, there is no evidence that the full range has been successfully implemented. This aspect can be established only by analysing the ranges covered by the test data. This explains why it has become normal practice to use systematic test data generators or unit test tools that explore the entire range of the inputs.
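The degree argument can be made concrete with a small worked example, assuming the path is known to compute a polynomial of degree at most 2: agreement with the intended polynomial on 3 distinct inputs then forces the two polynomials to coincide. The path and the oracle below are invented for illustration.

# Worked illustration: a path known to compute a polynomial of degree <= 2 is
# proved equal to the intended polynomial by 3 successful test points, since two
# degree-2 polynomials agreeing on 3 points differ by a polynomial with 3 roots.

def path_under_test(x):
    # Code as written along one program path; algebraically 3*x*x - 2*x + 1.
    return (3 * x - 2) * x + 1

def oracle(x):
    return 3 * x**2 - 2 * x + 1

test_points = [0, 1, 2]            # n + 1 = 3 distinct inputs for degree n = 2
assert all(path_under_test(x) == oracle(x) for x in test_points)
print("path shown equal to the intended degree-2 polynomial")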
6 Tool Verification Software tools are software systems like any other: they can be verified just the same as any other. In particular, they can be subjected to both static and dynamic analysis. This includes compilers, verification tools, proof engines, etc. It is notable that in DO-178B it is recognised that compiler vendors are not going to demonstrate the quality of their products, so the user must provide at least a minimal demonstration of correct functionality [106]. The main reliance in DO-178B, however, is on proving the efficacy of the binaries. For DO-178C it has been proposed that auto-code tools should be verified to the level of the software that they produce. There is huge resistance to this concept. Verification tools are not required to be verified to the same level as the code they are analysing. This is because they are not in the direct line of the product.
7 Conclusions

Modern test tools are highly sophisticated and incorporate many advanced algorithms. Beware of statements such as the following.
– "X technique is cheaper than testing." The problem is what is being defined as "testing"? How are costs being calculated?
– "X technique finds more faults than testing." Again the question is how is this claim established? What is being compared?
– "Testing cannot show the absence of errors." It manifestly can show the complete absence of wide classes of errors.
– "Testing cannot prove the software is correct." Testing can prove that software is correct; the difficulty is with completeness.

No matter how the software is verified it can still fail in use. The reasons are: misunderstood requirements, missing requirements, and hardware issues, such as sensor drift, that change the environment.
Software Testing Based on Formal Specification

Marie-Claude Gaudel
Université de Paris-Sud, LRI, F-91405 Orsay Cedex, France
[email protected]
It is well accepted that formal specifications can be useful bases for software testing; some pioneering papers date back to the seventies. Given a specification and a system under test, any testing activity is, explicitly or not, based on a satisfaction relation (often called a conformance relation): does the system under test satisfy the specification? Tests are derived from the specification on the basis of the satisfaction relation, and often on the basis of some additional knowledge of the system under test and of its operational environment, called the testability hypothesis. The verdict about the success or not of a test execution depends on the observations that can be made on the system under test, and it is based on the satisfaction relation. This course presents a generic framework for developing testing methods based on formal specifications, and its specialization to several formalisms: Finite State Machines, Algebraic Specifications, Input-Output Transition Systems and Transition Systems with Priorities. After presenting some motivation and context, Section 2 illustrates the principles sketched above in the case of algebraic specifications, for which they were developed originally. Section 3 goes back in time and slightly reformulates the work on testing based on FSM in this new framework. Section 4 presents Input-Output labelled Transition Systems (IOTS), the associated ioco conformance relation, defined originally in [318], and extended to IOTS with data types in [209]. Section 5 discusses problems that arise when introducing some notion of priority between inputs and outputs, extending IOTS models and the ioco conformance relation.
1 Introduction

Deriving test cases from specifications is now recognized as a major application of formal methods to software development; we refer to [79,134,51,55,36,204] among many other pioneering papers and surveys. Even though it has been recognized for a while that formal models can bring much to software testing, embedding implementation testing within a formal framework is far from being obvious. One tests a system: a system is not a formula, even if it can be (partially) described as such. Thus, testing is related to, but very different from, proof of correctness based on the program text using, for example, an assertion technique. Similarly, testing is different from model checking, where verifications are based on a known model of the system: when testing, the model corresponding to the actual system under test is unknown. If it were known, testing would not be necessary. Moreover, it is sometimes difficult to observe the state of the system under test [79,135,204].
These points have been successfully circumvented in several testing methods that are based on formal specifications (or models) and on conformance relations that precisely state what it means for a system under test to satisfy a specification [89,104,108,318,164,137,224,206]. The gap between systems and models is generally taken into account by explicit assumptions on the systems under test [79,36,204], which are called testability hypotheses in [36] or test hypotheses in [56]; this notion has been briefly introduced in Chapter 1. An extreme example of a non-testable system would be a demonic one that would behave well during the test experiments and change its behavior afterward. Such testability assumptions are fundamental in the proof that the success of the test suite derived from the specification establishes the conformance relation. Moreover, they provide hints on complementary tests or proofs that may be necessary to ensure this equivalence. This chapter presents a general approach to testing based on formal specifications. In this approach, assuming some testability hypothesis on the implementation under test, a notion of exhaustive test suite is derived from the semantics of the formal notation and from the definition of a correct implementation. Then, a finite test suite can be selected from the exhaustive one via some selection hypotheses. A word of warning before going further: it is definitely not the claim of this chapter that specification based testing is the unique solution to the validation and verification of software. It is clear that diverse software fault detection techniques need to be combined [75,216]. The aim here is to show that formal specifications are quite fundamental to a rigorous and systematic approach to software testing, and to propose a general framework for black-box testing based on formal specifications, applicable to arbitrary formalisms.
2 Algebraic Specifications

Most approaches to deriving test cases from specifications are based on behavioral descriptions: for instance the control graph of the program, or some finite state machine or labelled transition system. In such cases, it is rather natural to base the selection of test scenarios on some coverage criteria of the underlying graph in order to check mandatory and forbidden behaviors in the SUT (System Under Test). Algebraic specifications are different: abstract data types are described in an axiomatic way. There is a signature Σ, composed of a finite set S of sorts and a finite set F of function names over the sorts in S, and there is a finite set of axioms Ax. The correctness requirement is no longer, as above, the ability (or the impossibility) for the SUT to exhibit certain behaviors: what is required by such specifications is the satisfaction of the axioms by the implementation of the functions of F. As a consequence, a natural way of testing some SUT is to choose some instantiations of the axioms (or of some consequences of them) and to check that, when computed by the SUT, the terms occurring in the instantiations yield results that satisfy the corresponding axiom (or consequence). This approach was first proposed by Gannon et al. [134], and Bougé et al. [50,51], and then developed and implemented by Bernot et al. [36]. Since these foundational works, testing from algebraic specifications has been investigated a lot. Numerous works have addressed different aspects. Some authors, as in
[19] or [80] focus on a target programming language (Ada or Haskell) where there is a notion of abstract data type. Testing from algebraic specifications has also been successfully adapted for testing object-oriented systems [76,108,275]. Besides, methods inspired by this approach have been applied to some other kinds of specifications like model-based specifications, first by Dick et al. [104], and more recently by Aichernig et al. [88]. Some other works explore links between test and proof [57,110]. Some tools [57,80,232], based either on resolution procedures or on specialized tactics in a proof engine, have been developed and used. More recently, some contributions [107,223,224] have taken into account structured or modular specifications aiming at defining structured test cases and at modelling the activity of both unit testing and integration testing. An issue with algebraic specifications is the abstraction gap between the abstract specification level and the concrete implementation. This raises problems for interpreting the results of test experiments with respect to the specification. This characteristic is shared with other formal methods that allow the description of complex datatypes in an abstract way as, for instance, VDM, Z and their object oriented extensions. As a consequence, in the area of testing based on algebraic specifications, a special emphasis has been put on the oracle problem [35,130,222,362], which concerns the difficulty of defining reliable decision procedures to compare values of terms computed by the SUT. Actually, implementations of abstract data types may have subtle or complex representations by a combination of concrete values, and the interface of the concrete datatypes is not systematically equipped with an equality procedure to adequately compare such value aggregates. In practice, only some basic datatypes provide a reliable decision procedure to compare values. They are said to be observable. The only way to define (partial) decision procedure for abstract data types independently of the implementation is to observe them by applying some (composition of) functions yielding an observable result: they are called observable contexts. Observational approaches of algebraic specifications bring solutions to define an appropriate notion of correctness taking into account observability issues. This section is organized as follows: Subsection 2.1 presents some necessary basic notions of algebraic specifications; Subsection 2.2 gives the basic definitions of test and test experiment against an algebraic specification; Subsection 2.3 introduces in a progressive way the notions of exhaustive test suite and testability hypothesis in a simple case. Then Subsection 2.4 addresses the issue of the selection of a finite test suite via the so-called uniformity and regularity selection hypotheses. For more developments on the case where there are observability problems, the interested reader may consult [136]. 2.1 Preliminaries on Algebraic Specifications Algebraic specifications of data types, sometimes called axiomatic specifications, provide a way of defining abstract data types by giving the properties (axioms) of their operations. There is no explicit definition of each operation (no pre- and post-condition, no algorithm) but a global set of properties that describes the relationship between the operations. This idea comes from the late seventies [143,151]. 
It has been the origin of numerous pieces of work that have converged on the definition of CASL, the Common Algebraic Specification Language [39].
from OUR/NUMBERS/WITH/BOOLS version 0.0 get NAT, BOOL
spec CONTAINERS = NAT, BOOL then
  generated type Container ::= [] | :: (Nat; Container)
  op isin : Nat × Container → Bool
  op remove : Nat × Container → Container
  ∀ x, y: Nat; c: Container
  • isin(x, []) = false                                        %(isin empty)%
  • eq(x, y) = true ⇒ isin(x, y :: c) = true                   %(isin 1)%
  • eq(x, y) = false ⇒ isin(x, y :: c) = isin(x, c)            %(isin 2)%
  • remove(x, []) = []                                         %(remove empty)%
  • eq(x, y) = true ⇒ remove(x, y :: c) = c                    %(remove 1)%
  • eq(x, y) = false ⇒ remove(x, y :: c) = y :: remove(x, c)   %(remove 2)%
end

Fig. 1. An algebraic specification of containers of natural numbers
An example of an algebraic specification is given in Figure 1; it is a CASL specification of containers of natural numbers: a data structure that contains possibly duplicated numbers with no notion of order. This specification states that three sorts of values are used for this example, namely Natural Numbers, Booleans and Containers. Among the operations, there is, for instance, a function named isin which, given two values of sorts natural number and container, respectively, returns a boolean value. The operations must satisfy the axioms that are the formulas itemized by big bullets. The sorts, operation names, and profiles of the operations are part of the signature of the specification. The signature gives the interface of the specified data type. Moreover, it declares some sorted variables that are used for writing the axioms.

An (algebraic) signature Σ = (S, F, V) consists of a set S of sorts, a set F of function names each one equipped with an arity in S* × S, and a set of variables V, each of them being indexed by a sort. In the sequel, a function f with arity (s1 . . . sn, s), where s1 . . . sn, s ∈ S, is declared as f : s1 × . . . × sn → s. In Figure 1, the sorts of the signature are Nat and Bool (specified in the OUR/NUMBERS/WITH/BOOLS specification, not given here), and Container; the functions are [] (the empty container), :: (addition of a number to a container), isin that checks for the membership of a number in a container, and remove that takes away one occurrence of a number from a container; the variables are x, y of Nat sort, and c of Container sort.

Given a signature Σ = (S, F, V), TΣ(V) is the set of terms with variables in V freely generated from variables and functions in Σ, and preserving the arity of functions. Such terms are indexed by the sort of their result. We denote by TΣ(V)s the subset of TΣ(V) containing exactly those terms indexed by s. The abbreviation TΣ is the set TΣ(∅) of the ground terms, and we write TΣ,s to denote the set of ground terms of sort s. Considering the CONTAINERS specification, an example of a ground term t of sort Container is 0 :: 0 :: []. An example of a term t with variables is isin(x, 0 :: c), which is of Bool sort.
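To fix ideas, ground and non-ground terms over this signature can be given a very direct concrete representation; the nested-tuple encoding below is only one illustrative choice, not part of CASL.

# Illustrative encoding of terms over the CONTAINERS signature as nested tuples
# ("function name", arguments...). This is just one possible representation.

EMPTY = ("[]",)                                   # the empty container, sort Container

def cons(n, c):                                   # :: : Nat x Container -> Container
    return ("::", n, c)

def isin(n, c):                                   # isin : Nat x Container -> Bool
    return ("isin", n, c)

# The ground term 0 :: 0 :: [] of sort Container:
t_ground = cons(0, cons(0, EMPTY))

# A term with variables, isin(x, 0 :: c), written with explicit variable markers:
x, c = ("var", "x"), ("var", "c")
t_open = isin(x, cons(0, c))

print(t_ground)   # ('::', 0, ('::', 0, ('[]',)))
print(t_open)     # ('isin', ('var', 'x'), ('::', 0, ('var', 'c')))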
A substitution is any mapping ρ : V → TΣ(V) that preserves sorts. Substitutions are naturally extended to terms with variables. The result of the application of a substitution ρ to a term t is called an instantiation of t, and is denoted by tρ. In the example, let us consider the substitution σ : { x → 0, y → 0, c → y :: [] }; the instantiation tσ is the term with variable isin(0, 0 :: y :: []).

Σ-equations are formulae of the form t = t′ with t, t′ ∈ TΣ(V)s for s ∈ S. An example of an equation on containers is remove(x, []) = []. A positive conditional Σ-formula is any sentence of the form α1 ∧ . . . ∧ αn ⇒ αn+1 where each αi is a Σ-equation (1 ≤ i ≤ n + 1). Sen(Σ) is the set of all positive conditional Σ-formulae.

A (positive conditional) specification SP = (Σ, Ax, C) consists of a signature Σ, a set Ax of positive conditional formulae often called axioms, and some constraints C, which may restrict the interpretations of the declared symbols (some examples are given below). When C is empty, we write SP = (Σ, Ax) instead of SP = (Σ, Ax, ∅). Specifications can be structured as seen in the example: a specification SP can use some other specifications SP1, . . . , SPn. In such cases, the signature is the union of signatures, and there are some hierarchical constraints that require the semantics of the used specifications to be preserved (for more explanations see [348]). In the CONTAINERS specification, there are six axioms, named isin empty, isin 1, isin 2, remove empty, remove 1, and remove 2, and there is a so-called generation constraint, expressed at the line beginning by generated type, that all the containers are computable by composition of the functions [] and ::. Such constraints are also called reachability constraints. The functions [] and :: are called the constructors of the Container type. In some algebraic specification languages, axioms can be formulae of first-order logic, as in CASL. However, in this section we mainly consider positive conditional specifications. The reason is that most tools and case studies we present have been performed for and with this kind of specification. An extension of this approach to first-order logic, with some restrictions on quantifiers, was proposed by Patricia Machado in [223].

A Σ-algebra A is a family of sets As of concrete values, each of them being indexed by a sort; these sets are equipped, for each f : s1 × . . . × sn → s ∈ F, with a mapping f A : As1 × . . . × Asn → As. A Σ-morphism μ from a Σ-algebra A to a Σ-algebra B is a mapping μ : A → B such that for all s ∈ S, μ(As) ⊆ Bs, and for all f : s1 × . . . × sn → s ∈ F and all (a1, . . . , an) ∈ As1 × . . . × Asn, μ(f A(a1, . . . , an)) = f B(μ(a1), . . . , μ(an)). Alg(Σ) is the class of all Σ-algebras.

Intuitively speaking, an implementation of a specification with signature Σ is a Σ-algebra: it means that it provides some sets of values named by the sorts, and some way of computing the functions on these values without side effects. The set of ground terms TΣ can be extended into a Σ-algebra by providing each function name f : s1 × . . . × sn → s ∈ F with the application f TΣ : (t1, . . . , tn) → f(t1, . . . , tn). In this case, the function names of the signature are simply interpreted as the syntactic constructions of the ground terms.
Given a Σ-algebra A, we denote by A : TΣ → A the unique Σ-morphism that maps any f (t1 , . . . , tn ) to f A (t1A , . . . , tnA ). A Σ-algebra A is said reachable if A is surjective. A Σ-interpretation in A is any mapping ι : V → A. It is just an assignment of some values of the Σ-algebra to the variables. Given such an interpretation, it is extended to terms with variables: the value of the term is the result of its computation using the values of the variables and the relevant f A . A Σ-algebra A satisfies a Σ-formula ϕ : ∧1≤i≤n ti = ti ⇒ t = t , denoted by A |= ϕ, if and only if for every Σ-interpretation ι in A, if for all i in 1..n, ι(ti ) = ι(ti ) then ι(t ) = ι(t ). Given a specification SP = (Σ, Ax , C ), a Σ-algebra A is a SPalgebra if for every ϕ ∈ Ax , A |= ϕ and A fulfils the C constraint. Alg(SP) is the subclass of Alg(Σ) exactly containing all the SP-algebras. A Σ-formula ϕ is a semantic consequence of a specification SP = (Σ, Ax ), written as SP |= ϕ, if and only if for every SP-algebra A, we have A |= ϕ. 2.2 Testing against an Algebraic Specification Let SP be a positive conditional specification and SUT a System Under Test. In dynamic testing, we are interested in the properties of the computations by SUT of the functions specified in SP. SUT provides some procedures for executing these functions. The question is whether they satisfy the axioms of SP. Given a ground Σ-term t , we denote by t SUT the result of its computation by SUT . Now we define how to test SUT against a Σ-equation. Definition 1. Given a Σ-equation , and SUT which provides an implementation for every function name of Σ, – a test for is any ground instantiation t = t of ; – a test experiment of SUT against t = t consists in the evaluation of t SUT and t SUT and the comparison of the resulting values. As a simple example of a test, consider the following. Example 1. One test of the isin empty equation in the C ONTAINERS specification of Figure 1 is isin(0, []) = false. The generalization of this definition to positive conditional axioms is straightforward. In the following, we say that a test experiment is successful if it concludes to the satisfaction of the test by the SUT , and we denote it SUT passes τ , where τ is the test, a ground formula. We generalize this notation to test suites: SUT passes TS means that ∀ τ ∈ TS • SUT passes τ . Deciding whether SUT passes τ is the oracle problem mentioned in the introduction. In the above example it is just a comparison between two boolean values. However, such a comparison may be difficult when the results to be compared have complex data types. One can find a discussion on the way it can be realized in such cases in [136]. In this chapter we deal with the case where this decision is possible for all sorts, meaning that they are all “observable”.
Remark 1. Strictly speaking, the definition above defines a tester rather than a test datum: a test t = t′ is nothing else than the abstract definition of a program that evaluates t and t′ via the relevant calls to the SUT and compares the results; a test experiment is an execution of this tester linked to the SUT.

We can now introduce a first definition of an exhaustive test of an implementation against an algebraic specification. A natural notion of correctness, when all the data types of the specification are observable, is that the SUT satisfies the axioms of the specification. Thus we start with a first notion of exhaustive test inspired by the notion of satisfaction as defined in Section 2.1.

2.3 A First Presentation of Exhaustivity and Testability

Definition 2. Given a positive conditional specification SP = (Σ, Ax), an exhaustive test suite for SP, denoted ExhaustSP, is the set of all well-sorted ground instantiations of the axioms in Ax:
ExhaustSP = { φρ | φ ∈ Ax, ρ ∈ V → TΣ }
An exhaustive test experiment of some SUT against SP is the set of all the test experiments of the SUT against the formulas of ExhaustSP.

As said above, this definition is very close to (and is derived from) the notion of satisfaction of a set of Σ-axioms by a Σ-algebra. In particular, the fact that each axiom can be tested independently comes from this notion. However, an implementation passing once all the tests in the exhaustive test suite does not necessarily mean that it satisfies the specification: first, this is true only if the SUT is deterministic; second, considering all the well-sorted ground instantiations is, a priori, not the same thing as considering all the Σ-interpretations in the values of the SUT. It may be the case that some values are not expressible by ground terms of the specification. In other words, the above test suite is exhaustive with respect to the specification, but may be not with respect to the values used by the program. Thus some testability hypotheses on the implementation under test are necessary: the success of the exhaustive test suite ensures the satisfaction of the specification by the implementation only if this implementation behaves as a reachable Σ-algebra (cf. Section 2.1). Practically, it means that:
– There is a realisation of every function of Σ, which is supposed to be deterministic; the results do not depend on some hidden, unspecified, internal state.
– The implementation is assumed to be developed following good programming practices; any computed value of a data type must always be a result of the specified operations of this data type.
– There is a comparison procedure for the values of every sort of the signature.

Note that, explicitly or not, all testing methods make assumptions on the SUT: a totally erratic system, or a demonic one, may pass some test suite and fail later on. Testing methods based on Finite State Machine descriptions rely on the assumption that the
SUT behaves as an FSM with the same number of states as the specification; similarly, methods based on IO-automata or IO-Transition Systems assume that the SUT behaves as an IO-automaton; consequently, it is supposed input-enabled: always ready to accept any input. In the case of algebraic specifications these hypotheses are static properties of the program. Some of them are (or could be) checkable by some preliminary static analysis of the source code.

Definition 3. Given a signature Σ, an SUT is Σ-testable if it defines a reachable Σ-algebra ASUT. Moreover, for each τ of the form t = t′, there exists a way of deciding whether it passes or not. The Σ-testability of the SUT is called the minimal hypothesis Hmin on the SUT.

Let us denote by Correct(SUT, SP) the correctness property that a given SUT behaves as a reachable SP-algebra (implying that the axioms are satisfied and all the values are specified). The fundamental link between exhaustivity and testability is given by the following formula:
Hmin(SUT) ⇒ ((∀ τ ∈ ExhaustSP • SUT passes τ) ⇔ Correct(SUT, SP))

ExhaustSP is obviously not usable in practice since it is generally infinite. Actually, the aim of the definitions of ExhaustSP and Hmin is to provide frameworks for developing theories of black-box testing from algebraic specifications. Practical test criteria (those which correspond to finite test suites) will be described as stronger hypotheses on the implementation. This point is developed in Section 2.4 and in [136].

Before addressing the issue of the selection of finite test suites, let us come back to the definition of ExhaustSP. As it is defined, it may contain useless tests, namely those instantiations of conditional axioms where the premises are false: such tests are always successful, independently of the fact that their conclusion is satisfied by the SUT or not. Thus they can be removed.

Example 2. Assuming that eq(0, 0) = true is a semantic consequence of the OUR/NUMBERS/WITH/BOOLS specification, we could derive the following useless test for the remove 2 conditional axiom in the CONTAINERS specification of Figure 1:
eq(0, 0) = false ⇒ remove(0, 0 :: 0 :: []) = 0 :: remove(0, 0 :: [])
Conversely, we can derive an equational test for the remove 1 conditional axiom. This test is simply the ground equation: remove(0, 0 :: 0 :: []) = 0 :: [].

In the example of Figure 1, we have distinguished a subset of functions as constructors of the Container type (namely [] and ::). Under some conditions, the presence of constructors in a specification makes it possible to characterize an equational exhaustive test suite. A signature with constructors is a signature Σ = ⟨S, F, V⟩ such that a subset C of elements of F are distinguished as constructors. Let Ω = ⟨S, C, V⟩ stand for the corresponding sub-signature of Σ, and TΩ for the corresponding ground terms. A specification SP = ⟨Σ, Ax⟩ where Σ is a signature with constructors C is complete with respect to its constructors if and only if both of the following conditions hold:
– ∀ t ∈ TΣ • ∃ t′ ∈ TΩ • SP |= t = t′
– ∀ t, t′ ∈ TΩ • SP |= t = t′ ⇒ ⟨Σ, ∅⟩ |= t = t′, meaning that t and t′ are syntactically identical

Example 3. The CONTAINERS specification of Figure 1 is complete with respect to the constructors C = { [], :: } of the Container sort: from the axioms, any ground term of Container sort containing some occurrence of the (non-constructor) remove function is equal to some ground term containing only occurrences of [] and ::. Moreover, there is only one such ground term.

For such specifications and under some new hypotheses on the SUT, it is possible to demonstrate that the set of ground conclusions of the axioms is exhaustive. When removing premises satisfied by the specification, we should be careful not to remove some other premises that the SUT could interpret as true, even if they are not consequences of the specification. A sufficient condition is to suppose that the SUT correctly implements the constructors of all the sorts occurring in the premises. Let us introduce the new testability hypothesis Hmin,C for that purpose. Intuitively, Hmin,C means that the SUT implements data types with a syntax very close to their abstract denotation. It may seem to be a strong hypothesis, but in fact, it only applies to basic types, often those provided by the implementation language. As soon as the data type implementation is subtle or complex, the data type is then encapsulated.

Definition 4. SUT satisfies Hmin,C iff SUT satisfies Hmin and:
∀ s ∈ S • ∀ u, v ∈ TΩ,s • SUT passes u = v ⇔ SP |= u = v

A notion of exhaustive test suite can then be defined as follows.

Definition 5. EqExhaustSP,C = { ερ | ∃ α1 ∧ . . . ∧ αn ⇒ ε ∈ Ax, ρ ∈ V → TΩ, SP |= (α1 ∧ . . . ∧ αn)ρ }

Under Hmin,C, and for specifications complete with respect to their constructors, EqExhaustSP,C is an exhaustive test suite. A proof can be found in [129] or in [9]. Its advantage over ExhaustSP is that it is made of equations. Thus the test experiments are simpler.

Some other approaches to the definitions of exhaustivity and testability are possible. For instance, as suggested in [37] and applied by Dong and Frankl in the ASTOOT system [108], a different possibility is to consider the algebraic specification as a term rewriting system, following a "normal-form" operational semantics. Under the condition that the specification defines a ground-convergent rewriting system, it leads to an alternative definition of the exhaustive test suite: Exhaust′SP = { t = t↓ | t ∈ TΣ }, where t↓ is the unique normal form of t. The testability hypothesis can be weakened to the assumption that the SUT is deterministic (it no longer needs to be reachable). In [108], an even bigger exhaustive test suite was mentioned (but not used), which contained for every ground term the inequalities with other normal forms, strictly following the definition of initial semantics.
Actually, this is an example of a case where the exhaustive test suite is not built from instantiations of the axioms, but more generally from an adequate set of semantic consequences of the specification. Other examples are shown in [136].

2.4 Selection Hypotheses: Uniformity, Regularity

A black-box testing strategy can be formalized as the selection of a finite subset of some exhaustive test suite. In the sequel, we work with EqExhaustSP,C, but what we say applies as well to the numerous possible variants of exhaustive test suites. Let us consider, for instance, the classical partition testing strategy (which, more precisely, should be called the sub-domain testing strategy). It consists in defining a finite collection of (possibly non-disjoint) subsets that covers the exhaustive test suite. Then one element of each subset is selected and submitted to the implementation under test. The choice of such a strategy corresponds to stronger hypotheses than Hmin on the implementation under test. We call such hypotheses selection hypotheses. In the case of partition testing, they are called uniformity hypotheses, since the implementation under test is assumed to behave uniformly on some test subsets UTSi (Uniformity Test Subsets):
UTS1 ∪ . . . ∪ UTSp = EqExhaustSP,C, and
∀ i = 1, . . . , p • (∀ τ ∈ UTSi • SUT passes τ ⇒ SUT passes UTSi)

Various selection hypotheses can be formulated and combined depending on some knowledge of the program, some coverage criteria of the specification and, ultimately, cost considerations. Another type of selection hypothesis is the regularity hypothesis, which uses a size function on the tests and has the form "if the subset of EqExhaustSP,C made up of all the tests of size less than or equal to a given limit is passed, then EqExhaustSP,C also is". As noticed by several authors [108,75], and from our own experience [235], such hypotheses must be used with care. It is often necessary to choose this limit taking into consideration some "white-box knowledge" on the implementation of the datatypes: array bounds, and so on.

All these hypotheses are important from a theoretical point of view because they formalize common test practices and express the gap between the success of a test strategy and correctness. They are also important in practice because exposing them makes clear the assumptions made on the implementation. Thus, they give some indication of complementary verifications, as used by Tse et al. in [75]. Moreover, as pointed out by Hierons in [168], they provide formal bases to express and compare test criteria and fault models.

As said above, the choice of the selection hypotheses may depend on many factors. However, in the case of algebraic specifications, the text of the specification provides useful guidelines. These guidelines rely on coverage of the axioms and composition of the cases occurring in the premises of the axioms via unfolding, as stated first in [36], and extended recently in [9]. We recall that axioms are of the form α1 ∧ . . . ∧ αn ⇒ αn+1 where each αi is a Σ-equation ti = t′i (1 ≤ i ≤ n + 1).
From the definition of EqExhaustSP,C, a test of such an axiom is some αn+1 ρ where ρ ∈ V → TΣ is a well-typed ground substitution of the variables of the axiom such that the premise of the axiom, instantiated by ρ, is true: it is a semantic consequence of the specification (SP |= (α1 ∧ . . . ∧ αn)ρ). One natural basic testing strategy is to cover each axiom once, choosing for every axiom one adequate substitution ρ0 only; the uniformity hypothesis is:
∀ ρ0 ∈ V → TΣ such that SP |= (α1 ∧ . . . ∧ αn)ρ0 •
SUT passes αn+1 ρ0 ⇒ (∀ ρ ∈ V → TΣ • SP |= (α1 ∧ . . . ∧ αn)ρ ⇒ SUT passes αn+1 ρ)
It defines a so-called uniformity sub-domain for the variables of the axiom, that is, the set of ground Σ-terms characterized by SP |= (α1 ∧ . . . ∧ αn).

Example 4. In the example of Figure 1, covering the six axioms requires six tests, for instance the following six ground equations:
– isin(0, []) = false, with the whole Nat sort as uniformity sub-domain;
– isin(1, 1 :: 2 :: []) = true, with pairs of Nat such that eq(x, y) = true and the whole Container sort as uniformity sub-domain;
– isin(1, 0 :: 3 :: []) = isin(1, 3 :: []), with pairs of Nat such that eq(x, y) = false and the whole Container sort as uniformity sub-domain;
– remove(1, []) = [], with the Nat sort as uniformity sub-domain;
– remove(0, 0 :: 1 :: []) = 1 :: [], with pairs of Nat such that eq(x, y) = true and the Container sort as uniformity sub-domain;
– remove(1, 0 :: []) = 0 :: remove(1, []), with pairs of Nat such that eq(x, y) = false and the Container sort as uniformity sub-domain.

Such uniformity hypotheses are often too strong. A method for weakening them, and getting more test cases, is to compose the cases occurring in the axioms. In the full general case, it may involve tricky pattern matching on the premises and conclusions, and even some theorem proving. However, when the axioms are in a suitable form one can use the classical unfolding technique defined by Burstall and Darlington in [62]. It consists in replacing a function call by its definition. Thus, for unfolding to be applicable, the axioms must be organized as a set of function definitions: every function is defined by a list of conditional equations such as:
∧1≤i≤m αi ⇒ f(t1, . . . , tn) = t
where the domain of the function must be covered by the disjunction of the premises of the list.

Example 5. In the example of Figure 1, the isin function is defined by:
• isin(x, []) = false %(isin empty)%
• eq(x, y) = true ⇒ isin(x, y :: c) = true %(isin 1)%
• eq(x, y) = false ⇒ isin(x, y :: c) = isin(x, c) %(isin 2)%
It means that every occurrence of isin(t1 , t2 ) can correspond to the three following sub-cases:
– t2 = []: in this case isin(t1, t2) can be replaced by false;
– t2 = y :: c and eq(t1, y) = true: here it can be replaced by true;
– t2 = y :: c and eq(t1, y) = false: here it can be replaced by isin(t1, c).

A way of partitioning the uniformity sub-domain induced by the coverage of an axiom with some occurrence of f(t1, . . . , tn) = t is to introduce the sub-cases stated by the definition of f, and, of course, to perform the corresponding replacements in the conclusion equation to be tested. This leads to a weakening of the uniformity hypotheses.

Example 6. Let us consider the isin 2 axiom. Its coverage corresponds to the uniformity sub-domain "pairs of Nat such that eq(x, y) = false" × "the Container sort". Let us unfold in this axiom the second occurrence of isin: isin(x, c). It leads to three sub-cases for this axiom:
– c = []: eq(x, y) = false ∧ c = [] ⇒ isin(x, y :: []) = isin(x, []), which is false;
– c = y′ :: c′ and eq(x, y′) = true: eq(x, y) = false ∧ c = y′ :: c′ ∧ eq(x, y′) = true ⇒ isin(x, y :: y′ :: c′) = isin(x, y′ :: c′), which is true;
– c = y′ :: c′ and eq(x, y′) = false: eq(x, y) = false ∧ c = y′ :: c′ ∧ eq(x, y′) = false ⇒ isin(x, y :: y′ :: c′) = isin(x, y′ :: c′), which is isin(x, c′).
The previous uniformity sub-domain is partitioned into three smaller sub-domains characterized by the three premises above. Covering these sub-cases leads to testing bigger containers, and to checking that isin behaves correctly independently of whether the searched number was the last to be added to the container or not. Applying the same technique to the remove 2 axiom leads to testing that, in the case of duplicates, only one occurrence is removed.

Of course, unfolding can be iterated: the last case above can be decomposed again into three sub-cases. Unbounded unfolding generally leads to infinite test suites. Actually, as it is described here, unbounded unfolding yields an infinite set of equations very close to the exhaustive test suite. The only remaining variables are those that are operands of functions without definitions, namely, in our case, constructors. Limiting the number of unfoldings is often sufficient for ensuring the finiteness of the test suite. Experience has shown (see, for instance, [89,19]) that in practice one or two levels of unfolding are sufficient for ensuring what test engineers consider as a good coverage and a very good detection power. In some rare cases, this limitation of unfolding does not suffice for getting a finite test suite: then, it must be combined with regularity hypotheses, limiting the size of the ground instantiations.

Unfolding has been implemented by Bruno Marre within the tool LOFT [36,231,232] using logic programming. There are some conditions on the specifications manipulated by LOFT:
– they must be complete with respect to constructors;
– when transforming the specification into a conditional rewriting system (by orienting each equation t = t′ occurring in an axiom from left to right, t → t′), the resulting conditional rewrite system must be confluent and terminating;
– each equation t = t′ that is the conclusion of an axiom must be such that t may be decomposed as a function f, not belonging to the set of constructors, applied to a tuple of terms built on constructors and variables only.

Under these conditions, the LOFT tool can decompose any uniformity domain into a family of uniformity sub-domains. It can also compute some solutions within a given uniformity sub-domain. These two steps correspond respectively to the computation of the uniformity hypotheses based on unfolding sub-domains and to the generation of an arbitrary test case for each computed sub-domain. The unfolding procedure is based on an equational resolution procedure involving some unification mechanisms. Under the conditions on the specifications given above, the unfolding procedure computes test cases such that sub-domains are included in the domain they are issued from (soundness), and the decomposition into sub-domains covers the split domain (completeness).
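As a hand-written illustration of this selection process, the sketch below enumerates the three sub-domains of Example 6 and submits one arbitrary ground test per sub-domain to an illustrative list-based implementation; it only mimics, for this single axiom, what a tool such as LOFT derives automatically, and the chosen instances are arbitrary.

# Hand-written illustration of selection by unfolding (Example 6): the uniformity
# sub-domain of axiom isin 2 is split into three sub-domains, and one arbitrary
# ground test (a pair lhs = rhs) is executed in each against a list-based SUT.

class ListContainerSUT:
    def empty(self):      return []
    def cons(self, n, c): return [n] + c
    def isin(self, n, c): return n in c

sut = ListContainerSUT()
e = sut.empty

tests = [
    # c = []                        : isin(1, 2 :: [])      = isin(1, [])
    (sut.isin(1, sut.cons(2, e())),                 sut.isin(1, e())),
    # c = y'::c', eq(x, y') = true  : isin(1, 2 :: 1 :: []) = isin(1, 1 :: [])
    (sut.isin(1, sut.cons(2, sut.cons(1, e()))),    sut.isin(1, sut.cons(1, e()))),
    # c = y'::c', eq(x, y') = false : isin(1, 2 :: 3 :: []) = isin(1, 3 :: [])
    (sut.isin(1, sut.cons(2, sut.cons(3, e()))),    sut.isin(1, sut.cons(3, e()))),
]

for i, (lhs, rhs) in enumerate(tests, 1):
    print(f"sub-domain {i}: SUT passes" if lhs == rhs else f"sub-domain {i}: SUT fails")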
3 Testing Based on Finite State Machines

Finite State Machines have been the first kind of formal description to be used as a basis for testing. Actually, the very first results, in the early sixties, were targeted at hardware descriptions and testing. Their generalizations to software, first suggested by Chow in 1978 [79], have been, and are still, extremely influential. FSM are behavioral specifications of software systems. They rely on very different principles from algebraic specifications. Thus the notions of tests and test experiments, as well as the testability hypotheses, are very different. The most general definition of an FSM is given below.

Definition 6. A Finite State Machine (FSM) is a structure T = (S, s0, I, O, R) where S is a finite set of states, s0 is an initial state, I is an input alphabet, O is an output alphabet and R ⊆ S × I × O × S.

An output word t ∈ O* is produced by an input word w ∈ I* of the FSM if there is a run on w: a sequence of states s0, s1, . . . , sn such that (si, wi, ti, si+1) ∈ R for i = 0, . . . , n − 1. The input/output relation is the pair (w, t) when t is produced by w. Informally it means that there is a finite set of states. Transitions between states are triggered by inputs of a given alphabet and yield both an output of another alphabet and a change of state. The definition above considers nondeterministic FSM since transitions are modelled as a relation R. As in Lee and Yannakakis' survey [204], we first consider testing methods based on deterministic FSM: instead of T = (S, s0, I, O, R) with R ⊆ S × I × O × S, we have M = (S, I, O, δ, λ), where δ and λ are functions from S × I into S, and from S × I into O, respectively. There is not always an initial state. Functions δ and λ can be extended in a canonical way to sequences of inputs: δ* is from S × I* into S, and λ* is from S × I* into O*.

3.1 Testing Based on Deterministic FSM

The testing problem addressed in this subsection is: given a deterministic specification FSM A, and a system under test (SUT) that is supposed to behave as some unknown deterministic FSM B, how to test that B is equivalent to A via inputs submitted to the SUT and outputs observed from the SUT?
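A deterministic FSM M = (S, I, O, δ, λ) and the extensions δ* and λ* have a very direct concrete rendering; the two-state machine below is an arbitrary example chosen for illustration.

# Concrete rendering of a deterministic FSM M = (S, I, O, delta, lambda) and of
# the canonical extensions delta* and lambda* to input sequences.

delta = {("s0", "a"): "s1", ("s0", "b"): "s0",
         ("s1", "a"): "s0", ("s1", "b"): "s1"}
lam   = {("s0", "a"): "0",  ("s0", "b"): "1",
         ("s1", "a"): "1",  ("s1", "b"): "0"}

def delta_star(s, word):
    """delta*: S x I* -> S, the state reached after the whole input word."""
    for x in word:
        s = delta[(s, x)]
    return s

def lambda_star(s, word):
    """lambda*: S x I* -> O*, the output word produced along the way."""
    out = []
    for x in word:
        out.append(lam[(s, x)])
        s = delta[(s, x)]
    return "".join(out)

print(delta_star("s0", "aab"))    # s0 -> s1 -> s0 -> s0
print(lambda_star("s0", "aab"))   # outputs "011"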
Note that the specification FSM must be strongly connected, with a path between every pair of states: this is necessary for designing test experiments that reach every specified state. Equivalence of FSM is defined as follows. Two states si and sj are equivalent if and only if for every input sequence the FSM will produce the same output sequence; more precisely, for every input sequence σ, λ*(si, σ) = λ*(sj, σ). M and M′ are equivalent if and only if for every state in M there is a corresponding equivalent state in M′, and vice versa. When M and M′ have the same number of states, this notion is the same as isomorphism. Given an FSM, there are well-known polynomial algorithms for constructing a minimized (reduced) FSM, which is equivalent to the given FSM and where there are no equivalent states. The reduced FSM is unique up to isomorphism. The specification FSM is supposed to be reduced before any use for testing.

Classical testability hypotheses, when the test is based on FSM, are:
– The SUT behaves as some (unknown) finite state machine.
– This implementation machine does not change during the experiments.
– It has the same input alphabet as the specification FSM.
– It has the same number of states as the specification FSM.
However, this last, strong, hypothesis can be weakened: it is possible to develop testing methods assuming that the number of states is greater, but known. A major problem when testing is the observation of the states of the SUT both before and after some action in order to decide if the behavior of the SUT is satisfactory. The SUT is a running system. The only means of observation is by submitting other inputs and collecting the resulting outputs. Such observations are generally destructive: they change the observed state. Given an FSM with n states, we define the following notions of state observation.

Definition 7. State observations.
– A Separating Family is a collection { Zi }i=1,..,n of sets of input sequences whose output sequences make it possible to distinguish the state si from any other state.
– If a separating family has the same set Z for all the states (Z = Zi for all i), Z is called a Characterizing Set.
– A (preset) Distinguishing Sequence σ is such that for every pair of distinct states si, sj, λ*(si, σ) ≠ λ*(sj, σ).

Every reduced FSM has a separating family and a characterising set containing at most n − 1 sequences, each of length less than or equal to n − 1. For some classes of FSM, it is possible to use more efficient ways of observing states. A distinguishing sequence is a special case of a characterising set where Z is a singleton. Unfortunately, deciding whether a given FSM has a distinguishing sequence is PSPACE-complete. However, it is polynomial for adaptive distinguishing sequences (input trees where choices of the next input are guided by the outputs of the SUT), and
it is possible to construct one of quadratic length. There are variants of these notions, depending on the considered FSM [204]. Tests and exhaustive test suites are modelled by the notion of checking sequence. A checking sequence is a sequence of inputs that distinguishes the description FSM from any other FSM with the same number of states. The construction of checking sequences depends on two important characteristics of the specification FSM: the existence of a reliable reset and the existence of a distinguishing sequence. A reliable reset is a specific input symbol that leads an FSM from any state to the same state: for every state s, δ(s, reset) = s0. For FSM without a reliable reset, the so-called homing sequences are used to start the checking sequence. A homing sequence is an input sequence σh such that, from any state, the output sequence produced by σh uniquely determines the arrival state: for every pair of distinct states si, sj, λ*(si, σh) = λ*(sj, σh) ⇒ δ*(si, σh) = δ*(sj, σh). Every reduced FSM has a homing sequence of polynomial length, constructible in polynomial time.

Definition 8. Let A be a specification FSM with n states and initial state s0. A checking sequence for A is an input sequence σcheck such that for every FSM B with initial state s′0, the same input alphabet, and at most n states, that is not isomorphic to A, λ*B(s′0, σcheck) ≠ λ*A(s0, σcheck).

An example of a checking sequence is a sequence of inputs leading to a trace that traverses every transition and, for each transition, all the elements of the separating set of its arrival state. Let p be the size of the input alphabet. For FSM with a reliable reset, there is a polynomial-time algorithm, in O(p·n³), for constructing a checking sequence of polynomial length, also in O(p·n³) [79]. For FSM with a distinguishing sequence there is a deterministic polynomial-time algorithm to construct a checking sequence of length polynomial in the length of the distinguishing sequence. In other cases, checking sequences of polynomial length also exist, but finding them requires more involved techniques such as randomized algorithms. More precisely, one can construct with high probability in randomized polynomial time a checking sequence of length O(p·n³ + n⁴·log n). The only known deterministic complexity of producing such a polynomial-size sequence is exponential, which just consists in derandomizing the above algorithm (note that one can check that a sequence is a checking sequence in deterministic exponential time, by enumerating all possible adversary FSM B). The above definitions and results generalize to the case where FSM B has more states than FSM A, and their number of states is known.

3.2 Nondeterminism

The concepts presented so far are suitable when both the specification FSM and the SUT are deterministic.
Depending on the context, and on the particular approach, a nondeterministic specification FSM A can have different meanings: it may be understood as describing a class of acceptable deterministic implementations; or it can be understood as describing some nondeterministic acceptable implementations. In both cases, the notion of equivalence of the specification FSM A and of the implementation FSM B is no longer an adequate basis for testing. Depending on the approach, the required relation between a specification and an implementation is called the "satisfaction relation" (B satisfies A) or the "conformance relation" (B conforms to A). Generally it is not an equivalence, but a preorder. A natural definition for this relation could be the so-called trace inclusion relation: any trace of the implementation must be a trace of the specification. Unfortunately, this definition accepts, as a conforming implementation of any specification, the idle implementation, with an empty set of traces. Several more elaborate relations have been proposed, either for FSM or for Labelled Transition Systems and their variants. The most well known are the conf relation, between Labelled Transition Systems [55], and the ioco relation for Input-Output Transition Systems [318], which is presented in Section 4.2. The intuition behind these relations is that when a trace σ (including the empty one) of a specification A is executed by some SUT B, after σ, B can be idle only if A may be idle; otherwise B must perform some action performable by A after σ. For Finite State Machines, it can be rephrased as: an implementation FSM B conforms to a specification FSM A if all its possible responses to any input sequence could have been produced by A.

Not surprisingly, nondeterminism introduces major complications when testing. First, checking sequences are no longer adequate since some traces of the specification may not be executable by the SUT. One has to define adaptive checking sequences (which, actually, are covering trees of the specification) in order to let the SUT choose among the possible behaviors. Second, in the case of nondeterministic implementations, passing a test once is no longer sufficient since the SUT may exhibit different behaviors in the presence of the same input sequence [137,169]. The so-called "complete testing assumption" is a frequently used testability hypothesis which states that there is an upper bound to the number of test executions required for observing all the possible behaviors of a nondeterministic SUT for a given test.

3.3 Symbolic Traces and Constraint Solvers

Finite state machines (or finite transition systems) have a limited description power. In order to address the description of realistic systems, various notions of extended finite state machines (EFSM) or symbolic labelled transition systems (SLTS) are used. They are the underlying semantic models in a number of industrially significant specification techniques, such as SDL or Statecharts, to name just a few. To make a long story short, such models are enriched by a set of typed variables that are associated with the states. Transitions are labelled as in FSM or LTS, but, in addition, they have associated guards and actions, that is, conditions and assignments on the variables. There are numerous variants of EFSM. As an example, we present here the original definition of [204], slightly enriched with a notion of input parameter.
Definition 9. An Extended Finite State Machine (EFSM) is a structure (S, s0, I, IP, O, T, V, v0) where S is a finite set of states with initial state s0, I is a set of input values and IP is a set of input parameters (variables), O is a set of output values, T is a finite set of symbolic transitions, V is a finite list of variables and v0 is a list of initial values of the variables. Each symbolic transition t in T is a 6-tuple t = (st, s′t, it, ot, Gt, At) where st, s′t are respectively the current state and the next state of t; it is an input value or an input parameter; ot is an output expression that can be parameterized by the variables and the input parameter. Gt is a predicate (guard) on the current variable values and the input parameter, and At is an update action on the variables that may use values of the variables and of the input.

Initially, the machine is in an initial state s0 with initial variable values v0. Each association of a state and variable values is called a configuration. An EFSM operates as follows: being in some configuration, it receives some input and computes the guards that are satisfied for the current configuration. The satisfied guards identify enabled transitions. A single transition among those enabled is fired. When executing the chosen transition, the EFSM
– updates the variables according to the action of the transition,
– moves from the initial to the final state of the transition,
– produces some output, which is computed from the values of the variables and of the input via the output expression of the transition.
Transitions are atomic and cannot be interrupted.

Given an EFSM, if each variable has a finite number of values (Boolean variables, for instance), then there is a finite number of configurations, and hence there is an equivalent (ordinary) FSM with configurations as states. Therefore, an EFSM with finite variable domains is a compact representation of an FSM. However, constructing this FSM is not easy because of the reachability problem, and particularly the issue of determining whether a configuration is reachable from the initial state. This is undecidable if the variable domains are infinite, and PSPACE-complete otherwise. As said above, there are numerous variants of the notions of EFSM and SLTS. The complexity of their analysis (and thus of their use as a basis for black-box testing) is strongly dependent on the types of the variables and on the logic used for the guards.

A symbolic trace t1, . . . , tn of an EFSM is a sequence of symbolic transitions such that st1 = s0 and, for i = 1, . . . , n − 1, s′ti = sti+1. A trace predicate is the condition on inputs which ensures the execution of a symbolic trace. Such a predicate is built by traversing the trace t1, . . . , tn in the following way:
– the initial index of each variable x is 0, and for each variable x there is an equation x0 = v0;
– for i = 1, . . . , n, given transition ti with guard Gi and action Ai:
• guard Gi is transformed into the formula G̃i, where each variable of Gi has been indexed by its current index, and the input parameter (if any) is indexed by i;
• each assignment in Ai of an expression expr to some variable x is transformed into an equation xk+1 = expri, where k is the current index of x and expri is the expression expr in which each variable is indexed by its current index, and the input parameter (if any) is indexed by i;
• the current indexes of all assigned variables are then incremented;
– the trace predicate is the conjunction of all these formulae.

A symbolic trace is feasible if its predicate is satisfiable, meaning that there exists some sequence of input values which ensures that, at each step of the trace, the guard of the symbolic transition is true. Such a sequence of inputs characterizes a trace of the EFSM. A configuration is reachable if there exists a trace leading to it. EFSM testing methods must perform reachability analysis: to compute some input sequence that exercises a feature (trace, transition, state) of a given EFSM, a feasible symbolic trace leading to and covering this feature must be identified and its predicate must be solved. Depending on the kind of formula and expression allowed in guards and actions, different constraint solvers may be used. Some tools combine them with SAT solvers, model-checking techniques, and symbolic evaluation methods, including abstract interpretation, to eliminate some classes of clearly infeasible symbolic traces. The notion of EFSM is very generic. The corresponding test generation problem is very similar to test generation for programs in general. The current methods address specific kinds of EFSM or SLTS. There are still many open problems in improving the levels of generality and automation.
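To make the indexing scheme above concrete, the following Python sketch builds a trace predicate as a conjunction of indexed formulae. It is only an illustration under simplifying assumptions (single-letter variable names, guards and actions given as strings, the token IN standing for the input parameter of a transition); it is not the chapter's algorithm, and in practice the resulting predicate would be passed to a constraint or SMT solver.

def trace_predicate(transitions, variables, initial_values):
    """transitions: list of (guard, assignments); assignments maps a variable to an expression."""
    idx = {x: 0 for x in variables}
    formulas = [f"{x}0 == {v}" for x, v in initial_values.items()]
    for k, (guard, assignments) in enumerate(transitions, start=1):
        def index(expr):
            for x in variables:
                expr = expr.replace(x, f"{x}{idx[x]}")   # index each variable with its current index
            return expr.replace("IN", f"i{k}")           # the input parameter of step k
        formulas.append(index(guard))                    # the indexed guard of transition k
        new_eqs = [(x, index(rhs)) for x, rhs in assignments.items()]
        for x, rhs in new_eqs:
            idx[x] += 1                                  # increment indexes of assigned variables
            formulas.append(f"{x}{idx[x]} == {rhs}")
    return " and ".join(formulas)

# A counter that accepts a positive input while below a threshold and accumulates it, twice.
print(trace_predicate(
    transitions=[("IN > 0 and c < 10", {"c": "c + IN"})] * 2,
    variables=["c"],
    initial_values={"c": 0}))
# c0 == 0 and i1 > 0 and c0 < 10 and c1 == c0 + i1 and i2 > 0 and c1 < 10 and c2 == c1 + i2

Any values for i1, i2 satisfying this predicate yield an input sequence driving the EFSM along the symbolic trace; if the predicate is unsatisfiable, the trace is infeasible.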
4 Transition Systems with Inputs and Outputs

In classical labelled transition systems, interactions between processes are modeled by undirected synchronizations. More realistic models of communication between processes are provided by IOTS (Input Output Transition Systems), or IOSM (Input Output State Machines), where the origin of an interaction is taken into account in the semantics. IOTS are a development of the Input Output Automata proposed by Lynch in [219]. Test methods based on such models have been proposed in [318] and improved in [209]. In Chapter 3, IOLTS (Input Output Labelled Transition Systems) have been introduced. In the current chapter, an IOTS is an IOLTS that satisfies some properties, as defined in Section 4.1. This section is organized as follows. Subsection 4.1 introduces the models that will be used for implementations and specifications. Subsection 4.2 gives the definition of the implementation relation ioco. Subsection 4.3 gives the testability hypotheses and the exhaustive test suite for ioco. An algorithm which generates the exhaustive test suite, and a proof of validity and unbias for the defined test suite and hypotheses, can be found in [209].

4.1 Models with Inputs and Outputs

Classically, specifications of communicating systems are based on Labelled Transition Systems (LTS). As previously discussed, an LTS is made of a set of states S, a set of transitions R between them labelled with actions belonging to an alphabet L, and an initial state s0.
Definition 10. A Labelled Transition System on a finite alphabet L is a structure of the form (S, s0, L, R) where S is a finite set of states, s0 is an initial state, and the set of labelled transitions between states is given by R ⊆ S × (L ∪ {τ}) × S. A run on a word w ∈ (L ∪ {τ})∗ is a sequence of states s0, s1, . . . , sn such that (si, wi, si+1) ∈ R for i = 0, . . . , n − 1.

Some actions are considered as not observable by the environment (they are also called hidden, or internal). Most authors use for them the generic name τ. Sequences of observable actions corresponding to a run of a transition system are called its traces. Due to the possible nondeterminism of these systems (R is a relation), a trace can lead to several states. The following classical notations are used:
– s −a→ s′, where s, s′ ∈ S and a ∈ L, when (s, a, s′) ∈ R, or when there exists s′′ ∈ S such that (s, τ, s′′) ∈ R and s′′ −a→ s′;
– s −a→/ when there is no state s′ such that (s, a, s′) ∈ R and no state s′′ ∈ S such that (s, τ, s′′) ∈ R and s′′ −a→ s′.

To test implementations modeled by LTS, one studies the sequences of observable actions they can perform or not. One uses tests (or more exactly tester processes) built on the same language of actions as the specification. Test execution consists in running the implementation and the test in parallel. During this execution, the implementation and the test synchronize on each action of L. There are several variants of parallel compositions of LTS. In this section we use the totally synchronized one, denoted by ||, where the two processes must be ready to perform some transition with the same label to progress concurrently. Let (S1, s10, L, R1) and (S2, s20, L, R2) be two LTS with the same labels. Given s1 ∈ S1 and s2 ∈ S2, the following transition rules are presented as illustration:
(s1 −a→ s1′ and s2 −a→ s2′) =⇒ s1 || s2 −a→ s1′ || s2′
s1 −a→/ =⇒ s1 || s2 −a→/
s2 −a→/ =⇒ s1 || s2 −a→/
s1 −τ→ s1′ =⇒ s1 || s2 −τ→ s1′ || s2
s2 −τ→ s2′ =⇒ s1 || s2 −τ→ s1 || s2′

LTS have proven to be very useful models, especially at the specification level. However, when a system interacts with its environment, there is a difference between inputs and outputs. In this case implementations are often modeled by IOTS, which differ from LTS in that the set of observable actions is divided into two subsets: LI, the set of input actions, and LU, the set of output actions. Some examples of IOTS can be found in Chapter 3 (Section 3.2). There, as already mentioned, the notation used is IOLTS. An IOTS is an IOLTS that satisfies some properties, as further discussed in the remainder of this section. Parallel composition is defined on pairs of IOTS where the set of input actions of the first one is the set of output actions of the second one, and conversely. Input actions synchronize with output actions of the same name, and vice versa.
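Returning to the totally synchronized composition || introduced above for plain LTS, the following Python sketch (an illustration only; the set-of-triples encoding and the example machines are assumptions, not material from this chapter) computes the transition relation of the product of two finite LTSs.

TAU = "tau"

def parallel(R1, R2):
    """Transition relation of LTS1 || LTS2 over pairs of states."""
    states1 = {s for (s, _, _) in R1} | {t for (_, _, t) in R1}
    states2 = {s for (s, _, _) in R2} | {t for (_, _, t) in R2}
    product = set()
    for (s1, a, t1) in R1:
        if a == TAU:                            # internal moves of either side interleave
            product |= {((s1, s2), TAU, (t1, s2)) for s2 in states2}
    for (s2, a, t2) in R2:
        if a == TAU:
            product |= {((s1, s2), TAU, (s1, t2)) for s1 in states1}
    for (s1, a, t1) in R1:
        for (s2, b, t2) in R2:
            if a == b and a != TAU:             # observable actions must be performed jointly
                product.add(((s1, s2), a, (t1, t2)))
    return product

R1 = {(0, "coin", 1), (1, "coffee", 0)}
R2 = {("u0", "coin", "u1"), ("u1", "coffee", "u0"), ("u1", TAU, "u1")}
print(sorted(parallel(R1, R2)))

A label on which one side has no transition simply does not appear in the product, which corresponds to the refusal rules above.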
A specificity of IOTS is that a trace can lead to some state in which no output action can be performed. As already explained in Chapter 3, such a state is called a quiescent state: the IOTS remains idle until some input action is executed by its environment. IOTS considered in the current chapter are those that are strongly convergent: they cannot execute an infinite sequence of internal actions. Without this restriction, the observation of quiescence of implementations would not be possible. The difference between input actions and output actions is modeled by the way the IOTS interacts with its environment. IOTS have the input-enabledness property: input actions are controlled by the environment and cannot be refused, while output actions are controlled by the system and cannot be refused by the environment. Input-enabledness is a very strong assumption: it means that in any state the system accepts any input. It is not adequate for many systems, such as interactive ones: many systems refuse some inputs in some situations. Such impossibilities may come from the logic of the system (for instance, it is meaningless to paste if nothing has been cut or copied before), or from some physical constraints. In both cases, it may be of interest to specify and test these impossibilities. This is not feasible using IOTS and ioco. In [208] and [210], Lestiennes developed a new kind of model named RIOLTS (Restricted Input Output LTS) and a conformance relation rioco for which an exhaustive test suite is stated and proved. The semantics of systems with restricted inputs is a bit complex since it must deal both with quiescence and deadlock.

4.2 The ioco Conformance Relation

The ioco relation relates specifications, which are expressed as LTS, and implementations, which are modeled as IOTS. This choice comes from the fact that using IOTS at the specification level turns out to be tedious. Input-enabledness requires describing the reaction of the system to any input, even in the case where the input is ineffective or forbidden. Modelling the observation of quiescent states of the SUT raises a problem. Practically, it is done by a time-out mechanism which checks that the system remains inactive when there is no input. At the model level, this idleness is described by a virtual action, classically denoted δ, which loops on quiescent states. Thus among the output actions of IOTS models there is always this special fictive action δ, which actually models the fact that there is no output. In an IOTS where quiescent states are decorated with δ-labelled loops, we have:

((s, δ, s) ∈ R) =⇒ (¬ ∃ s′ ∈ S, o ∈ LU such that (s, o, s′) ∈ R)

The introduction of this virtual output loop on quiescent states makes the definition of the ioco relation easier. Let LTS(L) stand for the class of labelled transition systems with labels in L, and IOTS(I, O) for the class of IOTS with input actions in I and output actions in O.

Definition 11. The ioco relation is defined over the domain IOTS(LI, LU ∪ {δ}) × LTS(L) with L = LI ∪ LU, LI ∩ LU = ∅ and δ ∉ L. Let Imp ∈ IOTS(LI, LU ∪ {δ}) and SP ∈ LTS(L).

Imp ioco SP ⇐⇒ ∀ σ ∈ Straces(SP) • Out(Imp after σ) ⊆ Out(SP after σ)
The following notations are used:
– Straces(SP) is the set of traces of SP enriched by δ actions after quiescent states. For instance, a system that iteratively reads i and then writes o has among its Straces δ∗ i o δ∗ . . .
– Given an LTS or an IOTS M and a trace (or an Strace) σ, M after σ is the set of possible states of M after σ (in Definition 10, R is a relation between states, thus LTS and IOTS may be nondeterministic).
– Out(S), where S is a set of states, is the set of possible outputs from this set of states, δ being considered as an output. Note that if δ ∈ Out(S), there is at least one quiescent state in S. Moreover, because of the introduction of δ in the models, Out(M after σ), where σ ∈ Straces(M), cannot be empty.

The introduction of δ makes it possible to express in a compact way that Imp must never issue an output that is not specified, and that Imp can reach a quiescent state only when SP can.

4.3 Tests, Test Experiments and Testability

As said before, to test a process specified as an LTS, one uses tests (or more exactly tester processes) built on the same language of observable actions as the specification. As for the model of the implementation, this language is divided into two subsets of actions: the output actions, corresponding to the input actions of the implementation, and the input actions, corresponding to the output actions of the implementation. Moreover, the special action δ of the specification corresponds to some time-out mechanism in the test (which, for simplicity, is also denoted by δ). Test execution consists in running the implementation and the test in parallel. During this execution, the implementation and the test synchronize on each action of L. This corresponds to the synchronization of the input actions of the implementation with the output actions of the test, and vice versa. A verdict about the success of a test execution is given, depending on the implementation relation and the observations made (actions or quiescence) during the test execution. This section gives some testability hypotheses and then an exhaustive test suite for ioco. Namely, for a specification SP, the test suite exhaustiveioco(SP) is defined. Although the hypotheses are presented first, they were defined concurrently with the test suite. The definitions of the exhaustive test suites for ioco are based on the following testability hypotheses.
1. One makes the assumption that the implementation is input-enabled (see Section 4.1). Thus, all the input actions must be implemented and possible in any state.
2. When testing a nondeterministic implementation, one has to make the assumption that after a sufficient number of executions of the same test, all the paths corresponding to this test in the implementation have been taken. This classical assumption is known as the “complete testing assumption” (cf. Section 3.2). It ensures that the implementation will not have any other behaviors than those observed during the
test executions. Thus each test t is performed several times; t is passed successfully by an implementation if and only if the verdict of all these executions is a success.
3. For implementations including some internal parallel executions, actions are supposed to be atomic, or at least observable as if they were atomic. Moreover, they are supposed to conform to the interleaving model of parallelism. Therefore, such implementations can be tested as if they were sequential.
4. A hypothesis always assumed when the tests are a set of independent tests (as in this chapter) is that the system can be correctly reset, to be able to execute the tests starting from the initial state.
5. As said in Section 4.1, implementations have to be strongly convergent, to be able to check quiescence.

These five hypotheses characterize the class of implementations for which the test suites built in the next sections give a correct result. These hypotheses will be called HypoI/O in the sequel. According to the definition of ioco, one can draft exhaustiveioco to stand for a first exhaustive test suite for a specification SP in LTS(LI ∪ LU), which is defined as the set below of tests in IOTS(LU, LI):

exhaustiveioco(SP) = { σ; a; stop | σ ∈ Straces(SP), a ∈ LU ∪ {δ} }

The syntax used to describe a test is similar to LOTOS: the symbol “;” is used for action prefixing and “stop” is a special symbol indicating termination. Moreover, the symbol [ ] is used later for the choice between behaviors. A sequence of observed actions is written using the symbol “.”. The verdicts corresponding to this test suite are:
– if σ.a cannot be performed by the implementation, then success;
– else, if σ.a can be performed by the specification, then success, else failure.

These tests wait for an output a after a trace σ. If the SUT performs this output, it must be allowed by the specification. The first kind of verdict above corresponds to two cases:
– if σ has been performed and a is refused, it is not a failure, since ioco only requires that Out(Imp after σ) ⊆ Out(SP after σ), which means inclusion and not strict equality;
– if σ has not been performed, it may come from the fact that with ioco some Straces of the specification may not be executable by the implementation. Remember that ioco allows the implementation to be more deterministic than the specification. Thus it is not a failure when some Strace cannot be executed.

From the definition of ioco, the only failures correspond to some unspecified output after an Strace of the specification: this corresponds to the second kind of verdict above. None of these tests ends with an action belonging to LI. This comes from the first hypothesis: as soon as it is assumed that any input action can be performed by the implementation in any state, such tests would be useless. The test suite above can be improved. Let us consider the tests corresponding to an Strace σ.a of the specification. Due to the second case of the verdict, the associated test
executions will lead to a success whatever the way the implementation behaves. Thus these tests can be removed. Moreover, testing the quiescence twice consecutively, or more, is useless because once an implementation is idle, it remains idle until an input action is executed by its environment. Thus tests corresponding to an Strace with two or more consecutive δ actions can also be removed from the exhaustive test suite. The test suite resulting from the suppression of these two kinds of tests and some factorisation is called exhaustive′ioco:

exhaustive′ioco(SP) = { σ; [ ]a∈A a; stop | σ ∈ Straces(SP), A = (LU ∪ {δ}) \ Out(SP after σ), and ¬(∃ σ′, σ′′ • σ; a = σ′; δ; δ; σ′′) }

The verdicts corresponding to these tests are:
– if no σ; a can be performed by the implementation, then success;
– else failure.

The intuition behind the tests of exhaustive′ioco and the associated verdicts is that after each correct behavior, one tries to provoke an incorrect output (or quiescence). Thus the execution of a test of exhaustive′ioco results in a failure if and only if it arrives at the end of the test. The definition of ioco naturally leads to such tests: specified outputs are not mandatory, but unspecified ones are forbidden. In [209] the interested reader will find a nondeterministic algorithm which generates the tests belonging to exhaustive′ioco, and the proof of the validity and unbias of the pair (HypoI/O, exhaustive′ioco), namely

HypoI/O(Imp) ⇒ ((∀ T ∈ exhaustive′ioco(SP) • Imp passes T) ⇔ Imp ioco SP)
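The definition of ioco can be checked directly, for finite models and up to a bounded trace length, by exploring the Straces of the specification and comparing output sets. The following Python sketch is only an illustration under strong simplifying assumptions (explicit finite transition sets, δ-loops already added to quiescent states, internal actions already removed); it is not the algorithm of [209].

DELTA = "delta"

def after(R, states, action):
    return {t for s in states for (s0, a, t) in R if s0 == s and a == action}

def out(R, states, outputs):
    return {a for s in states for (s0, a, _) in R if s0 == s and a in outputs | {DELTA}}

def ioco_up_to(spec, imp, inputs, outputs, s0_spec, s0_imp, bound):
    """Explore Straces(spec) breadth-first; report the first violating Strace, if any."""
    frontier = [((), {s0_spec}, {s0_imp})]
    for _ in range(bound + 1):
        next_frontier = []
        for sigma, sp_states, imp_states in frontier:
            if imp_states and not out(imp, imp_states, outputs) <= out(spec, sp_states, outputs):
                return False, sigma
            for a in inputs | outputs | {DELTA}:
                sp_next = after(spec, sp_states, a)
                if sp_next:                      # sigma.a is still an Strace of the specification
                    next_frontier.append((sigma + (a,), sp_next, after(imp, imp_states, a)))
        frontier = next_frontier
    return True, None

# Specification: after input ?a the system may output !x. Implementation: it outputs !y instead.
spec = {("p0", "?a", "p1"), ("p1", "!x", "p0"), ("p0", DELTA, "p0")}
imp = {("q0", "?a", "q1"), ("q1", "!y", "q0"), ("q0", DELTA, "q0"), ("q1", "?a", "q1")}
print(ioco_up_to(spec, imp, {"?a"}, {"!x", "!y"}, "p0", "q0", bound=3))
# (False, ('?a',)): after ?a the implementation can produce !y, which the specification forbids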
5 Enriching IOTS and ioco with Priorities

This section presents some enrichment of the notion of IOTS in order to take into account priorities between actions: in many cases, nondeterminism is not sufficient for modeling and testing realistic systems; different priority levels are associated with possible actions (for instance, interruptions, as in Chapter 2). In some states, some output actions must have high priority (for instance, some alarms), or it may be the case that input actions always have priority over output ones (examples are screen-savers), or some specific actions (for instance, emergency stop) may have priority over any other one in any state. A few approaches have been proposed for modelling such systems, mainly by extending the CCS process algebra. We mention two recent ones. In [278], Phillips enriches the choice operator with priority guards. Actions do not have a global priority level: priorities are local to a specific choice. The approach presented in [82] by Cleaveland et al. is different: some actions are “prioritized”, and some are “unprioritized”. Special operators make it possible to change the priority of an action. These works define bisimulation equivalences, and in the second case, behavioral equivalence. They do not address the issue of testing based on the proposed models.
Since testing is the focus of this chapter, and IOTS, ioco and some variants have been developed and used in this context, we sketch here how to develop extensions of IOTS and ioco to take priorities into account. An essential point is the necessity to introduce not only a new kind of model for the SUT and its specification, but also another kind of model for the testers, where it is possible to specify that several actions are triggered simultaneously. Moreover, a new parallel composition between these models needs to be defined (note that [278] and [82] had to redefine choice and parallel composition as well). On these bases, we define tests and verdicts for a class of systems modelled by IOTS with priorities. We give some hints on the definition of an exhaustive test suite.

5.1 Models with Priorities

The models described below consider two levels of priority. This is sufficient for most practical systems. The generalisation to more levels is still to be studied, but should not raise big technical difficulties. The levels of priority of possible actions may vary depending on the states. This leads to a classification of the labelled transitions with the same initial state: those with low priority, and those with high priority.

Definition 12. An IOTS with priorities (IOTSwp) is a structure (S, s0, LI, LU, TLP, THP) where S is a finite set of states, s0 is an initial state, LI is the set of observable input actions, and LU is the set of observable output actions. TLP ⊆ S × (LI ∪ LU ∪ {δ} ∪ {τ}) × S is the set of transitions with low priority. THP ⊆ S × (LI ∪ LU) × S is the set of transitions with high priority. τ is the internal action, which is always associated with low priority transitions.

It must be noted that by construction IOTSwp are not input-enabled. In a state where there is an output action with high priority, input actions with low priority are refused. However, there is still a notion of quiescent state, where all the outgoing transitions are labelled by input actions. We keep the convention of decorating such states with a loop labelled by the fictive output action δ: it makes the definition of the conformance relation easier. Those loops always have low priority. The following property is required on the transitions: TLP ∩ THP = ∅. Other reasonable restrictions could be introduced on transitions: for instance, uniqueness of the high-priority transition from a given state; and one can imagine several other ones depending on the systems and environments that are considered. In this first approach, we have made the choice to remain as general as possible. When there are several high-priority transitions from one state, one of them must be nondeterministically chosen. However, it will turn out in the sequel that, for reasons of observability, some restriction on the nondeterminism at the specification level and at the implementation level must be introduced in order to get correct verdicts. An example of such a model is given in [211]. It gives the specification of a car speed-control system where, when the control is activated, braking is an input with high priority, and when the speed is too high, slowing down is an output with high priority. Let us define: Prio(s) = { a ∈ LI ∪ LU | ∃ s′ ∈ S, (s, a, s′) ∈ THP } and Prio(S), where S is a set of states, the set of actions associated with high priority transitions from these states. We also use:
NonPrio(s) = { a ∈ LI ∪ LU ∪ {δ} | ∃ s′ ∈ S, (s, a, s′) ∈ TLP } and NonPrio(S), where S is a set of states. As a direct consequence of these definitions, there are sets S such that the following condition holds: Prio(S) ∩ NonPrio(S) ≠ ∅. A simple example is S = {s1, s2} with Prio(s1) = {a}, Prio(s2) = {b}, NonPrio(s1) = {b}, NonPrio(s2) = {a}.

The parallel composition operator and the choice operator must be revised in order to ensure the preemption of high priority transitions:
– s1 −a→ s1′ and s2 −a→ s2′ and (a ∈ Prio(s1) ∩ Prio(s2) or Prio(s1) ∩ Prio(s2) = ∅) =⇒ s1 || s2 −a→ s1′ || s2′
– s1 −a→ s1′ and s2 −a→ s2′ and a ∉ Prio(s1) ∩ Prio(s2) and Prio(s1) ∩ Prio(s2) ≠ ∅ =⇒ s1 || s2 −a→/
– (s1 −a→ s1′ and (a ∈ Prio(s1) or Prio(s2) = ∅)) =⇒ (s1 [ ] s2 −a→ s1′ and s2 [ ] s1 −a→ s1′)

These rules are the main ones; we do not give the full set for conciseness.

5.2 Priorities and Conformance

The following definition of conformance keeps the main requirement of ioco and adds the requirement that the SUT implements exactly all the transitions with high priority of the specification.

Definition 13. The iocowp relation is defined over the domain IOTSwp(LI, LU ∪ {δ}) × IOTSwp(LI, LU ∪ {δ}) with LI ∩ LU = ∅ and δ ∉ LI ∪ LU. Let Imp and SP ∈ IOTSwp(LI, LU ∪ {δ}), then

Imp iocowp SP ⇐⇒ ∀ σ ∈ Straces(SP) • Out(Imp after σ) ⊆ Out(SP after σ) ∧ Prio(Imp after σ) = Prio(SP after σ)

Testing the first requirement can be done by the tests defined in Section 4.3, namely those in exhaustive′ioco(SP), which are modelled by IOTS(LU, LI). However, the second requirement leads to new developments, since the observation of priorities demands the application of simultaneous actions to the SUT.

5.3 Models for Priority Test Drivers

For actually testing priority requirements, one needs test environments that can simultaneously emit several inputs toward the SUT, and that are simultaneously receptive to any output of the SUT. Therefore, a new sort of model is required, which we call Concurrent Input Output LTS, CIOLTS for short. In these models, transitions may be labelled by sets of simultaneous actions.

Definition 14. A CIOLTS is a structure (S, s0, LI, LU, T) where S is a finite set of states, s0 is the initial state, and LI and LU are respectively the sets of observable input actions and observable output actions. A CIOLTS is input-enabled and T obeys the following property: T ⊆ S × (LI ∪ LU ∪ {δ} ∪ P+(LU)) × S.
In the definition above, given a set of actions L, P+(L) is the set of non-empty subsets of L: some transitions can be labelled by non-empty sets of output actions. CIOLTS are input-enabled in order to avoid any bias in the observation of the output actions of the IOTSwp under test. Actually, in CIOLTS models, in addition to the specified labels, any transition is implicitly labelled by the set of all the input actions, which will turn out, when performing tests, to be the output actions of the SUT. Such an extension of IOTS implies quite a tricky revision of the operators of the underlying process algebra. Modelling the concurrency of CIOLTS by action interleaving is no longer possible, since the classical rules that express parallel composition || as choice [ ] are no longer adequate. Some specificities of multi-labelled transitions are illustrated below. A multi-labelled transition, when synchronised with a choice of uniquely labelled transitions, performs one of those whose label belongs to the set of its labels:
– s −{a1,...,an}→ s′ and s1 −a→ s1′ and a ∈ {a1, . . . , an} =⇒ s || (s1 [ ] s2) −a→ s′ || s1′
– s −{a1,...,an}→ s′ and s2 −a→ s2′ and a ∈ {a1, . . . , an} =⇒ s || (s1 [ ] s2) −a→ s′ || s2′
When synchronising with an environment where there is some concurrency, all the possible actions of the set of labels are performed concurrently:
– s −{a1,...,an}→ s′ and s1 −a→ s1′ with a ∈ {a1, . . . , an}, s2 −b→ s2′ with b ∈ {a1, . . . , an} =⇒ s || (s1 || s2) −{a,b}→ s′ || (s1′ || s2′)
– s −{a1,...,an}→ s′ and s1 −a→ s1′ with a ∈ {a1, . . . , an}, s2 −b→ s2′ with b ∉ {a1, . . . , an} =⇒ s || (s1 || s2) −a→ s′ || (s1′ || s2)
– ...

The complete development of an operational semantics for CIOLTS is not of primary interest here. It would be similar to that of Petri nets, where execution steps correspond to multisets of simultaneous transitions. Since CIOLTS are introduced in order to formalise the requirements for test environments of IOTSwp, the issue is the parallel composition of an IOTSwp with those CIOLTS that allow priorities to be tested. Given WP = (SWP, s0WP, LI, LU, THPWP, TLPWP) ∈ IOTSwp(LI, LU) and C = (SC, s0C, LU, LI, TC) ∈ CIOLTS(LU, LI), we consider the various synchronisation cases for
sWP −a→ sWP′, where sWP, sWP′ ∈ SWP and a ∈ LI ∪ LU, and
sC −{a1,...,an}→ sC′, where sC, sC′ ∈ SC and {a1, ..., an} ⊆ LI.
The intuition behind the rules below is that when a is a high priority input and it belongs to {a1, ..., an}, it may be performed. When a is a high priority output, it may always be performed, since CIOLTS are input-enabled. When a is a low priority input and it belongs to {a1, ..., an}, it may be performed only if there is no high priority action that may be performed. If a is a low priority output, it may be performed only if there is no high priority action that may be performed.
– sWP −a→ sWP′ and sC −{a1,...,an}→ sC′ and a ∈ Prio(sWP):
• a ∈ {a1, ..., an} ∪ LU =⇒ sWP || sC −a→ sWP′ || sC′
• a ∉ {a1, ..., an} ∪ LU =⇒ sWP || sC −a→/
– sWP −a→ sWP′ and sC −{a1,...,an}→ sC′ and a ∈ NonPrio(sWP):
• Prio(sWP) ∩ ({a1, ..., an} ∪ LU) ≠ ∅ =⇒ sWP || sC −a→/
• Prio(sWP) ∩ ({a1, ..., an} ∪ LU) = ∅ and a ∈ {a1, ..., an} ∪ LU =⇒ sWP || sC −a→ sWP′ || sC′
• Prio(sWP) ∩ ({a1, ..., an} ∪ LU) = ∅ and a ∉ {a1, ..., an} ∪ LU =⇒ sWP || sC −a→/

5.4 Testing Priorities

Given some IOTSwp(LI, LU) SP, what are the relevant tests for checking conformance to SP w.r.t. iocowp? Among the CIOLTS(LU, LI) that are used for the description of test drivers, the interesting ones for testing priorities are based on those Straces σ of SP that lead to sets of states such that Prio(SP after σ) ≠ ∅. Moreover, the inputs of the SUT are under the control of the test environment, but this is not the case for its outputs. This leads to different forms of tests for inputs with high priority and for outputs with high priority. Let Sa be a set of actions; we write {Sa} for the simultaneous output actions of Sa in a CIOLTS. Given an Strace σ of SP such that Prio(SP after σ) ≠ ∅:
– For an input action a ∈ Prio(SP after σ) ∩ LI, a test of the priority of a is: σ; { {a} ∪ (NonPrio(SP after σ) ∩ LI) }; stop
– For output actions a ∈ Prio(SP after σ) ∩ LU, a test of their priority is: σ; { NonPrio(SP after σ) ∩ LI }; stop

In the first case, after the execution of σ, the driver simultaneously sends to the SUT the input a and the inputs with low priority after σ. a must be accepted, but it is also acceptable that an output with high priority is performed by the SUT. No action with low priority must be performed by the SUT. In the second case, after the execution of σ, the driver simultaneously sends to the SUT the inputs with low priority after σ. a, or any other output with high priority, must be performed. No action with low priority must be performed by the SUT. The various cases are summarized below (see the explanation of the verdict in Subsection 4.3):
– if the test execution stops before the end of σ: success;
– if a is performed after σ: success;
– if a is not performed, but some output action in Prio(SP after σ) is: success;
– if a is not performed, and another action not in Prio(SP after σ) is: failure.
Considering these tests for all Straces(SP) such that Prio(SP after σ) ≠ ∅ is a good basis for the definition of an exhaustive test suite. However, the verdict raises difficulties due to the fact that there may be several states reachable after σ. This verdict is correct only if Prio(SP after σ) ∩ NonPrio(SP after σ) = ∅ (see the remark in Subsection 5.1 above). This implies that SP must satisfy the following restriction concerning nondeterminism:
∀ σ ∈ Straces(SP) • ∀ s, s′ ∈ SP after σ • Prio(s) = Prio(s′) ∧ NonPrio(s) = NonPrio(s′)

This condition is satisfied by a rather large class of systems. Determinism of SP is a sufficient condition. Moreover, it is satisfied by specifications where the priorities of actions are fixed independently of the considered transition, as was the case in the very first version of CCSprio [81]. For instance, nondeterministic systems with interruptions can be tested. Another important requirement on the nondeterminism of the SUT is the complete testing assumption mentioned in Section 3.2, namely the fact that after a sufficient number of experiments all the possible behaviors associated with a test are observed. It is especially important for the observation of the SUT outputs, which are not controllable from the test driver.
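The derivation of the two kinds of priority tests described in Section 5.4 is straightforward once Prio and NonPrio are available for the states reached by an Strace. The Python sketch below is a loose illustration (all identifiers and the speed-control-style example are assumptions, not the chapter's tool):

def priority_tests(sigma, reached_states, prio, nonprio, inputs):
    """Return (sigma, stimulus set) pairs: the simultaneous inputs the driver sends after sigma."""
    prio_after = set().union(*(prio[s] for s in reached_states))
    nonprio_after = set().union(*(nonprio[s] for s in reached_states))
    low_inputs = nonprio_after & inputs
    tests = []
    for a in prio_after & inputs:          # high-priority input: send it together with the low ones
        tests.append((sigma, {a} | low_inputs))
    if prio_after - inputs:                # high-priority outputs expected: send only the low inputs
        tests.append((sigma, low_inputs))
    return tests

prio = {"s": {"?brake"}}                   # after sigma, braking is a high-priority input
nonprio = {"s": {"?accelerate", "!speed"}}
print(priority_tests(("?on",), {"s"}, prio, nonprio, inputs={"?brake", "?accelerate"}))
# one test: after ?on, send ?brake and ?accelerate simultaneously; only ?brake (or a
# high-priority output) may then be taken by the SUT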
6 Conclusion

The approach presented here provides a generic way of using formal specifications for deriving test cases, independently of the formalism used. Moreover, the fact that the testability and selection hypotheses are made explicit is a significant advantage: it helps in choosing sound strategies. It gives some guidelines on the kind of development constraints or verification activities which must complement the submission of the derived test cases. In addition to the formalisms presented here (Algebraic Specifications, FSM, IOTS, and IOTS with priorities), this approach has been applied to other formalisms: COOPN2 by Cécile Peraire and Didier Buchs at EPFL [275], Lustre by Bruno Marre [233,234], and recently CSP, with Ana Cavalcanti [73]. A generalization of this last work to the Circus specification language [356] is in progress.
A Systematic Introduction to Mutation Testing in Unifying Theories of Programming Bernhard K. Aichernig 1
Institute for Software Technology, Graz University of Technology, Graz, Austria 2 International Institute for Software Technology United Nations University (UNU-IIST), Macao S.A.R. China
This chapter presents a theory of testing that integrates into Hoare and He’s Unifying Theories of Programming (UTP). We give test cases a denotational semantics by viewing them as specification predicates. This reformulation of test cases allows for relating test cases via refinement to specifications and programs. Having such a refinement order that integrates test cases, we develop a testing theory for fault-based testing. As discussed in Chapter 1, fault-based testing uses test data designed to demonstrate the absence of a set of pre-specified faults. A well-known fault-based technique is mutation testing. In mutation testing, first, faults are injected into a program by altering (mutating) its source code. Then, test cases that can detect these errors are designed. The assumption is that other faults will be caught, too. In this chapter, we apply the mutation technique to both specifications and programs. Using our theory of testing, two test-case generation laws for detecting injected (anticipated) faults are presented: one is based on the semantic level of UTP design predicates, the other on the algebraic properties of a small programming language. The chapter is structured as follows. After a general introduction, Section 2 introduces UTP [173], the mathematical framework used throughout this article. In particular, an overview of the theory of designs is presented. Section 3 briefly discusses the testing terminology and, based on the theory of designs, a series of formal definitions of concepts like test cases, test equivalence classes, faults, and a criterion for finding faults is presented. It is this section that highlights the important role of refinement in our testing theory. The next two sections include the main technical ideas of this chapter. Section 4 contains a construction for test cases that will find anticipated errors in a design. This test-case generation method works on the semantic level of designs. In Section 5 a purely algebraic (syntax-oriented) test-case generation algorithm is presented. It is based on the algebraic properties of a small, but non-trivial, programming language. Next, Section 6 discusses tool building and application domains. Finally, in Section 7 we discuss the results, give a review of related work, and present an overview of further research directions.
1 Introduction

A theory of programming explores the principles that underlie the successful practice of software engineering. Consequently, a theory of programming should not lack a theory
of testing. Understanding of the fundamentals of software testing enables the experience gained in one language or application domain to be generalised rapidly to new applications and to new developments in technology. It is the aim of this chapter to show how a theory of (mutation) testing integrates into Hoare & He’s Unifying Theories of Programming [173]. Any testing process needs to be based on solid foundations. In the case of software testing, formal program semantics and its associated mathematical tool-set provide these foundations. This chapter illustrates this connection between semantics and testing by developing a theory of mutation testing. The formal semantics is used to define concepts like failure and conformance, and leads to new techniques for test-case generation. The added value of such a theoretical account of mutation testing is that the ideas and techniques can be easily transferred to other programming languages or application domains. The concrete languages and applications will change, but the theory stays the same. This generality is even more emphasized by formulating the testing theory in the Unifying Theories of Programming (UTP). UTP was developed by Hoare and He to convincingly explain a broader range of phenomena in computer software. It is unifying in the sense that it links a set of existing theories without aiming to replace them. Simplicity was a major motivation of the UTP. Hence, a basic knowledge of first-order predicate logic is sufficient for studying it. The same holds for the testing theory formulated in UTP: specifications, programs and their associated test cases are interpreted as predicates, implication defines conformance, and non-conformance serves as the basis for the fault-models. The UTP-theory on mutation testing clearly demonstrates that research on testing ranges from practical to theoretical aspects, both being equally important. The theory is designed to be a complement to the existing body of knowledge. Traditionally, theories of programming focus on semantic issues, like correctness, refinement and the algebraic properties of a programming language. A complementary testing theory should focus on the dual concept of fault. As already mentioned in Chapter 1, the main idea of a fault-centered testing approach, also called fault-based testing, is to design test data to demonstrate the absence of a set of pre-specified faults. Our fault-based testing approach is based on a rather old technique from program testing called mutation testing, which was presented in Chapter 1. We extend this technique by applying it to first-order predicates denoting program behavior. This lifts mutation testing to the specification level. By using UTP’s design predicates, that is, pairs of predicates that define a precondition and a postcondition, we have the full theory of programming available, including notions of correctness based on the widely accepted concept of refinement. We use this theory to define criteria for finding or even generating the test cases that would detect the injected errors (mutations). Hence, rather than inventing test cases and assessing them, we are interested in constructive methods of generating the adequate test cases. By test cases we mean program stimuli (inputs) together with predicted reactions (outputs). An example will clarify the general idea. 
The Object Constraint Language (OCL) of UML 2.x has been chosen to highlight the relevance of our technique, but any specification language for expressing pre-postcondition specifications would do.
context Ttype(a: int, b: int, c: int): String
pre: a < (b+c) and b < (a+c) and c < (a+b)
post: if((a = b) and (b = c)) then result = "equilateral"
      else if ((a = b) or (a = c) or (b = c)) then result = "isosceles"
           else result = "scalene"
      endif endif

Fig. 1. Original specification of triangle types in OCL

context Ttype(a: int, b: int, c: int): String
pre: a < (b+c) and b < (a+c) and c < (a+b)
post: if((a = a) and (b = c)) then result = "equilateral"
      else if ((a = b) or (a = c) or (b = c)) then result = "isosceles"
           else result = "scalene"
      endif endif

Fig. 2. A mutant triangle specification with a faulty conditional
Example 1. We consider the well-known Triangle example, specified in Figure 1. The operation Ttype returns the type of a triangle represented by three given side-lengths a, b, c. The precondition restricts the problem specification to cases where the values of a, b, c form a valid triangle. The specification can be mutated (altered) in several ways. One common mistake made by programmers is to mess up variables in a condition. Such a mistake can be represented as early as on the specification level by mutating a specification variable as shown in the mutant in Figure 2. Note that this faulty version would pass a type checker, because the correct variable name has been replaced by a valid identifier. In general, we exclude mutations from our testing strategy that can be trivially detected by static type checking. The aim of our mutation testing strategy is to design test cases that detect implementations of the faulty specification (mutant). The following test cases covering each case of the original demonstrate that simple case analysis (branch coverage) is too weak.

a = 2, b = 2, c = 1, result = isosceles
a = 2, b = 3, c = 4, result = scalene
a = 1, b = 1, c = 1, result = equilateral

None of these test cases would be able to distinguish the original from the mutant, since the predicted results (result) of the test cases and the actual results of a faulty implementation of the mutant would coincide. In contrast, the following test case is able to distinguish the original triangle operation from its mutant.
a = 1, b = 2, c = 2, result = isosceles

In this case, the predicted result is in conflict with the actual outcome of the mutant, that is, isosceles ≠ equilateral. This chapter is devoted to techniques for finding such fault-detecting test cases systematically. Note that the strategy of mutation testing is not based on the syntactical structure of the source code, like statement or branch-coverage techniques. Its aim is to cover anticipated faults. This is why it is called a fault-based testing technique. It is this fundamentally different philosophy of our fault-based testing theory that adds a further dimension to the theories of programming (UTP). Rather than doing verification by testing, a doubtful endeavour anyway, here we focus on falsification. It is falsification, because the tester gains confidence in a system by designing test cases that would uncover an anticipated error. If the falsification (by running such tests) fails, it follows that a certain fault does not exist. The fascinating point is that the notion of program refinement, usually applied for rigorous development or verification by proof, plays a key role in our theory of testing. However, due to the concentration on faults, we are interested in the cases where refinement does not hold: again falsification rather than verification. The interesting questions that arise from focusing on faults are: Does a mistake made by a designer or programmer lead to an observable failure? Do my test cases detect such faults? How do I find a test case that uncovers a certain fault? What are the equivalent test cases that would uncover such a fault? Finally and most importantly: how to automatically generate test cases that will reveal certain faults? All these questions are addressed in this chapter. They have been addressed before, but rarely on a systematic and scientifically defendable basis linking theories of testing and programming. The testing theory presented here has been published in [6]. For this chapter, we have added a discussion of applications and possible implementation techniques leading to automated test-case generators.

Exercise 1. Find another mutation of the Triangle example in Figure 2 that cannot be distinguished by the four test cases presented in Example 1. Add an additional test case that is able to detect this new mutant.

Exercise 2. Some mutants cannot be distinguished by external observation. For such equivalent mutants no discriminating test case exists. Find an equivalent mutant for the triangle example in Figure 2.
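The triangle example can be played through directly in code. The following Python sketch (an illustration only; it is not part of the chapter, and the function names are assumptions) implements the original specification of Figure 1 and the mutant of Figure 2 and reports which of the four test cases above kill the mutant, that is, make the two versions disagree.

def ttype(a, b, c):
    assert a < b + c and b < a + c and c < a + b      # precondition: the sides form a triangle
    if a == b and b == c:
        return "equilateral"
    if a == b or a == c or b == c:
        return "isosceles"
    return "scalene"

def ttype_mutant(a, b, c):
    assert a < b + c and b < a + c and c < a + b
    if a == a and b == c:                             # the injected fault of Figure 2
        return "equilateral"
    if a == b or a == c or b == c:
        return "isosceles"
    return "scalene"

for t in [(2, 2, 1), (2, 3, 4), (1, 1, 1), (1, 2, 2)]:
    killed = ttype(*t) != ttype_mutant(*t)
    print(t, ttype(*t), ttype_mutant(*t), "killed" if killed else "not killed")
# Only (1, 2, 2) kills the mutant: the mutant wrongly classifies it as "equilateral".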
2 Unifying Theories of Programming The present theory of testing is based on the work of Hoare and He on Unifying Theories of Programming originally published in [173]. In the following we present a brief introduction to the UTP and give motivations for its relevance to testing.
2.1 Unification of Theories

In every scientific discipline phases of specialisation and unification can be observed. During specialisation scientists concentrate on some narrowly defined phenomenon and aim to discover the laws which govern it. Over time, it is realized that the laws are special cases of a more general theory and a unification process starts. The aim is a unifying theory that clearly and convincingly explains a broader range of phenomena. A proposed unification of theories often receives spectacular confirmation and reward by the prediction of new discoveries or by the development of new technologies. However, a unifying theory is usually complementary to the theories that it links, and does not seek to replace them. In [173] Hoare and He are aiming at unification in computer science. They saw the need for a comprehensive theory of programming that
– includes a convincing approach to the study of a range of languages in which computer programs may be expressed,
– must introduce basic concepts and properties which are common to the whole range of programming methods and languages,
– must deal separately with the additions and variations which are particular to specific groups of related programming languages,
– should aim to treat each aspect and feature in the simplest possible fashion and in isolation from all the other features with which it may be combined or confused.
Our theory of testing originated out of these motivations for unification. With the advent of formal methods, formal verification and testing split into separate areas of research. Over time both areas advanced. In the 1990s a unification process started in the area of verification [135]. As a result, most of today's formal method tools have a test-case generator, acknowledging the fact that every proof needs systematic testing of its underlying assumptions. Furthermore, researchers in testing investigate the role of formal models in the automation of black-box testing. However, many results in testing and formal methods remain unrelated. In this work, we aim to contribute to a further unification. The notion of testing and test cases is added to the existing UTP. It is remarkable that the concept of refinement is used to relate test cases with a theory of specifications and programs.

2.2 Theories of Programming

An essential key to the success of the natural sciences was the ability to formulate theories about observable phenomena. These observables are described in a specialised language that names and relates the outcomes of experiments. Often equations or inequations are used, in general mathematical relations. The same holds for the science of programming, where the descriptions of observables are called by their logical term: predicates. In the following theory of testing, predicates are used in the same way as in a scientific theory, to describe the observable behavior of a program when it is executed by a computer. In fact, we will define the meaning of a program, as well as the meaning of its test cases, as a predicate in first-order logic.
Every scientific theory contains free variables that represent measurable quantities in the real world. In our theory of testing these free variables stand for program variables, conceptual variables representing a system's state, or observable input-output streams. The chosen collection of names is called the alphabet. In engineering, predicates are not solely used to describe existing phenomena, but also to specify desired properties. Such requirements specifications describe the behavior of a device in all possible circumstances and are a starting point for developing a product. In addition, the test cases are derived from these requirements descriptions. It will be seen that test cases are actually a special form of requirements specification, designed for experimentation (or, in computer science terms, for execution).

2.3 A Theory of Designs

In UTP, by convention, observations made at the beginning of an experiment are denoted by undecorated variables (x, y), whereas observations made on later occasions are decorated (x′, y′). As already mentioned, the set of these observation-capturing variables is called the alphabet. During experiments it is usual to wait for some initial transient behavior to stabilise before making any further observation. In order to express this, a Boolean variable ok and its decorated version ok′ are introduced. Here, a true-valued variable ok stands for a successful initialisation and start of a program, and ok′ = true denotes its successful termination. In the theory of programming not every possible predicate is useful. It is necessary to restrict ourselves to predicates that satisfy certain healthiness conditions; for example, a predicate describing a program that produces an output without being started (like, for instance, ¬ok ∧ x′ = 1) should be excluded. In addition, the results of the theory must match the expected observations in reality. For example, a program that fails to terminate, sequentially composed with any program, must always lead to non-termination of the whole composition (this is the technical motivation for introducing ok, ok′). We call the subset of predicates that meet our requirements designs. The following definitions and theorems are a reproduction of the original presentation of UTP [173].

Definition 1 (Design). Let p and Q be predicates not containing ok or ok′, and p having only undecorated variables.

p ⊢ Q =df (ok ∧ p) ⇒ (ok′ ∧ Q)

A design is a relation whose predicate is (or could be) expressed in this form.

This is a non-standard, more direct definition of designs. Our designs satisfy healthiness conditions H1–H3 in [173]. As can be seen, a design predicate represents a pre-/postcondition specification, a concept well known from VDM [182], RAISE [149], B [1] and, more recently, OCL [336].

Example 2 (Square Root). The following contract is a design of a square root algorithm using a program variable x for input and output. A constant e specifies the precision of the computation.

(x ≥ 0 ∧ e > 0) ⊢ (−e ≤ x′² − x ≤ e)
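Read operationally, a design is just a predicate over the observation variables. The following Python sketch (an illustration under assumptions, not UTP notation) encodes the square-root contract of Example 2 as such a predicate, with the dashed variable x′ passed explicitly and e fixed to 0.01.

E = 0.01

def sqrt_design(ok, x, ok_, x_):
    pre = x >= 0 and E > 0
    post = -E <= x_ * x_ - x <= E
    return (not (ok and pre)) or (ok_ and post)       # (ok ∧ p) ⇒ (ok' ∧ Q)

print(sqrt_design(True, 2.0, True, 1.41421))   # True: the observation satisfies the contract
print(sqrt_design(True, 2.0, True, 1.5))       # False: 1.5**2 - 2 exceeds the precision e
print(sqrt_design(True, -1.0, False, 0.0))     # True: outside the precondition anything is allowed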
Every program can be expressed as a design. This makes the theory of designs a tool for expressing specifications, programs, and, as it will be shown, test cases. In the following, some basic programming constructs are presented.

Definition 2 (Assignment). Given a program variable x and an expression e,

x := e =df (wf(e) ⊢ x′ = e ∧ y′ = y ∧ · · · ∧ z′ = z)

with wf being the predicate defining the well-formedness (definedness) of the expression e.

Definition 3 (Conditional).

P ◁ b ▷ Q =df (wf(b) ⊢ (b ∧ P ∨ ¬b ∧ Q))

with wf being the predicate defining the well-formedness of the Boolean expression b.

In the further discussion we will maintain the simplifying assumption that all program expressions are everywhere well-formed, and thus wf = true. A sequential composition P; Q is defined in the obvious way, via the existence of an intermediate state v0 of the variable vector v. Here the existential quantification hides the intermediate observation v0. In addition, the output alphabet (outαP) and the input alphabet (with all variables dashed, inα′Q) of P and Q must be the same.

Definition 4 (Sequential Composition).

P(v′); Q(v) =df ∃ v0 • P(v0) ∧ Q(v0), provided outαP = inα′Q = {v′}
Nondeterministic, demonic choice is defined as logical or.

Definition 5 (Demonic Choice). P ⊓ Q =df P ∨ Q

The UTP provides a series of theorems and lemmas expressing the basic algebraic properties of such programming constructs. Examples are given below.

Theorem 1.
(p1 ⊢ Q1) ⊓ (p2 ⊢ Q2) = (p1 ∧ p2 ⊢ Q1 ∨ Q2)
(p1 ⊢ Q1) ◁ b ▷ (p2 ⊢ Q2) = (p1 ◁ b ▷ p2 ⊢ Q1 ◁ b ▷ Q2)
(p1 ⊢ Q1); (p2 ⊢ Q2) = (p1 ∧ ¬(Q1; ¬p2) ⊢ Q1; Q2)

What keeps the theory simple and elegant is the fact that in the UTP correctness is represented by logical implication. Hence, implication establishes a refinement order (actually a lattice) over designs. Thus, more concrete implementations imply more abstract specifications.
Definition 6 (Refinement). D1 ⊑ D2 =df ∀ v, w, · · · ∈ A • D2 ⇒ D1, for all D1, D2 with alphabet A.

Alternatively, using square brackets to denote universal quantification over all variables in the alphabet, we write [D2 ⇒ D1], or simply, in refinement calculus [18,252] style, D1 ⊑ D2. Obviously, this gives the well-known properties that under refinement, preconditions are weakened and postconditions are strengthened (become more deterministic).

Theorem 2 (Refinement of Designs).

[(P1 ⊢ Q1) ⇒ (P2 ⊢ Q2)]  iff  [P2 ⇒ P1] and [(P2 ∧ Q1) ⇒ Q2]
In our theory, the worst of all programs is a non-terminating one, sometimes called Abort or Chaos.

Definition 7 (Abort). ⊥ =df true = (false ⊢ true) = (false ⊢ false) = (false ⊢ Q)
The observations of non-terminating programs are completely unpredictable; anything is possible. Therefore programmers aim to deliver better programs. In our theory of designs the notion of “better” is captured by the implication ordering. Thus every program P that terminates is better than a non-terminating one. ∀ v, w, · · · ∈ A • P ⇒ ⊥ ,
for all P with alphabet A.
This refinement ordering defines a complete lattice over designs, with ⊥ as the bottom element and ⊤, representing a non-implementable, magic-like program, as the top element.

Definition 8 (Magic). ⊤ =df (true ⊢ false)

The program ⊤ is called magic since it magically refines (implements) every possible design D: [D ⇐ ⊤], for all D with alphabet A. Another interpretation of magic is that it can never be started, since ⊤ = ¬ok. From the definitions above it follows that the meet operator of this lattice is the demonic choice ⊓, and its dual join operator is defined as follows.

Definition 9 (Join). P ⊔ Q =df P ∧ Q

The following theorem shows the effect of joining designs.

Theorem 3. (p1 ⊢ Q1) ⊔ (p2 ⊢ Q2) = (p1 ∨ p2 ⊢ ((p1 ⇒ Q1) ∧ (p2 ⇒ Q2)))

The new design comprises two cases defined by a precondition. In the case of overlapping preconditions, the postconditions must not contradict each other, otherwise we have a magic design. We use the join operator for composing test suites out of test cases.
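For small finite alphabets and domains, the refinement order of Definition 6 can be checked by brute force. The sketch below is illustrative only (the encoding of designs as Python predicates over (ok, x, ok′, x′) and the example contracts are assumptions):

from itertools import product

DOMAIN = range(0, 4)
BOOL = (False, True)

def design(p, Q):
    """The predicate (ok ∧ p(x)) ⇒ (ok' ∧ Q(x, x')) of a design p |- Q."""
    return lambda ok, x, ok_, x_: (not (ok and p(x))) or (ok_ and Q(x, x_))

def refines(d1, d2):
    """d1 ⊑ d2: [d2 ⇒ d1] over every valuation of the alphabet."""
    return all(d2(ok, x, ok_, x_) <= d1(ok, x, ok_, x_)   # boolean <= is implication
               for ok, x, ok_, x_ in product(BOOL, DOMAIN, BOOL, DOMAIN))

spec = design(lambda x: x > 0, lambda x, x_: x_ <= x)      # if x is positive, do not increase it
impl = design(lambda x: True, lambda x, x_: x_ == x - 1)   # always decrement x by exactly one
print(refines(spec, impl))   # True: precondition weakened, postcondition strengthened
print(refines(impl, spec))   # False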
Finally, iteration is expressed by means of recursive definitions. Since designs form a complete lattice and the operators are monotonic, the weakest fixed point exists. This ensures that the result of any recursion is still a design.

Exercise 3. Use the UTP definitions of the programming statements above in order to prove the following properties.
– P ◁ b ▷ P = P
– P ◁ b ▷ (Q ◁ c ▷ R) = (P ◁ b ▷ Q) ◁ c ▷ (P ◁ b ▷ R)
– (P ◁ b ▷ Q); R = (P; R) ◁ b ▷ (Q; R)
– x := a; x := b = x := b
Exercise 4. Given a specification S =df x′ + y′ = 2(x + y), show that the implementation I =df (x := y − x) ⊔ (y := 3x + y), modeling a parallel assignment of x and y, is a valid refinement of S. Note that the preconditions of both S and I are true and are therefore omitted.

Exercise 5. Translate the triangle specification of Example 1 into a UTP design.

Exercise 6. Prove that the triangle and the mutant of Example 1 are non-equivalent using the UTP theory of designs.
3 Modelling Faults in Designs

In this section, we relate test cases via refinement to designs and programs. This is possible, since we give test cases a denotational semantics by viewing them as specification predicates. The result is a test-case generation technique based on the theory of refinement.

As pointed out in Chapter 1, the vocabulary of computer scientists is rich with terms for naming the unwanted: bug, error, defect, fault, failure, and so on, are commonly used without great care. However, in a discussion on testing it is necessary to differentiate between them in order to prevent confusion. We adopt the standard terminology as recommended by the Institute of Electrical and Electronics Engineers (IEEE) Computer Society [177] (see Chapter 1).

In this work we aim to design test cases on the basis of possible mistakes made during the design of software. Examples of such mistakes might be a missing or misunderstood requirement, a wrongly implemented requirement, or simple coding mistakes. In order to represent these mistakes we will introduce faults into formal design descriptions. The faults will be introduced by deliberately changing a design, resulting in wrong behavior that may cause a failure.

3.1 Test Cases

What distinguishes the following theory from other testing theories is the fact that we define all test artifacts as designs. This means that we give test cases, test suites and even test equivalence classes a uniform (predicative) semantics in the UTP. The fundamental insight behind this approach is that all these artifacts represent descriptions of a system
to be built (or under test). They simply vary with respect to information content. A test case, for example, can be seen as a specification of a system's response to a single input. Consequently a test suite, being a collection of test cases, can also be considered as a (partial) specification. The same holds for test equivalence classes that represent a subset of a system's behavior. Viewing testing artifacts as designs results in a very simple testing theory in which test cases, specifications and implementations can be easily related via the notion of refinement.

A consequence of this semantic interpretation is that test cases are actually abstractions of a system specification. This often seems strange, since a test case is experienced as something very concrete. However, from the point of view of information content, a test case is perfectly abstract: the behavior is defined only for a given stimulus (input). It is this limited information that makes test cases so easily understandable. The following example demonstrates the refinement and abstraction of designs, the latter leading to special designs representing test cases.

Example 3. Consider a design from a set-theoretic point of view. A design represents a relation between the pre- and poststates of a system. The precondition represents a set of prestates for which the postcondition relation is defined. Outside the precondition the relation is completely underspecified, relating the prestate with every possible poststate. The figure below describes an example graphically.
[Figure: a design as a relation between prestates v and poststates v′, defined over the precondition pre; outside pre every poststate is related.]
According to Theorem 2 one possible form of refinement is to strengthen the postcondition, which is the reduction of nondeterminism.
[Figure: the same relation with a strengthened postcondition, relating fewer poststates within pre.]
Another form of refinement according to Theorem 2 is the weakening of the precondition, which is the extension of the set of prestates for which the relation is defined. For example, we may extend the precondition to the complete set of the domain, giving a total relation.
[Figure: the relation with the precondition pre weakened to the complete domain, giving a total relation.]
Technically, abstraction is the opposite process of refinement. Therefore, Theorem 2 points us to two ways of abstraction, namely, postcondition weakening and precondition strengthening. The latter is our main concern. Proceeding with our example, we abstract the relation by undoing the last refinement step.
[Figure: the relation after undoing the last refinement step; the precondition pre is restricted again.]
Reducing the precondition is a form of abstraction, where we narrow down the scope of our predicted behavior. Further precondition strengthening gives the following.
[Figure: the relation after further precondition strengthening; pre covers only a few prestates.]
Finally, we may abstract to a point where the precondition is limited to a single prestate. This gives us a design that represents a test case.
[Figure: the precondition pre restricted to a single prestate; the resulting design is a test case.]
The behavior is defined for a single prestate. Viewing the prestate as the input of a test case, the last design relates a single input to a set of predicted outputs. Hence, we have derived a test case by abstraction from a specification design.
Hence, we take the point of view that test cases are specifications that define, for a given input, the expected output. Consequently, we define test cases as a sub-theory of designs.

Definition 10 (Test Case, deterministic). Let i be an input vector and o be an expected output vector, both being lists of values, having the same length as the variable lists v and v′, respectively. Then, a test case T is defined as a design predicate:

T(i, o) =df (v = i ⊢ v′ = o)

Sometimes test cases have to take nondeterminism into account, therefore we define nondeterministic test cases as follows.

Definition 11 (Test Case, nondeterministic).

T(i, c) =df (v = i ⊢ c(v′))

where c is a condition on the after-state space defining the set of expected outcomes. Obviously, nondeterministic test cases having the same input can be compared regarding their strength. Thus, we have the following result.

Theorem 4. [T(i, c) ⇒ T(i, d)]
iff
[c ⇒ d ]
This shows that nondeterministic test cases form a partial order. The proof of this theorem, and of other theorems and lemmas in the sequel, can be found in [4,6]. If we fix the input i, the test case T(i, true) is the smallest test case and T(i, false) the largest. However, the question arises as to how to interpret these limits, and whether they are useful as test cases. T(i, true) is a test case without any output prediction. It is useful in robustness testing, where i lies outside the specified input domain, or where one is just interested in exploring the reactions to different inputs. T(i, false) is equivalent to ¬(ok ∧ v = i) and means that such programs cannot be started with input i; such tests are infeasible.

Definition 12 (Explorative Test Case). T?(i) =df T(i, true)

Definition 13 (Infeasible Test Case). T∅(i) =df T(i, false)

We get the following order of test cases.

Theorem 5 (Order of Test Cases). For a given input vector i, output vector o and condition c

⊥ ⊑ T?(i) ⊑ T(i, c) ⊑ T(i, o) ⊑ T∅(i),
provided c(o) holds.
As defined in Chapter 1, a collection of test cases is called a test suite, and we define it as the least upper bound of its test cases.
Definition 14 (Test Suite). Given a set s of test cases t1, . . . , tn,

TS(s) =df t1 ⊔ . . . ⊔ tn

The definition coincides with our intuition: an implementation has to pass all test cases in a test suite. In the case of contradicting test cases, this is impossible, which is expressed by the test suite being equal to magic (⊤). From lattice theory it follows that adding test cases is refinement.

Theorem 6. Let T1, T2 be test cases of any type

Ti ⊑ T1 ⊔ T2,    i ∈ {1, 2}

Given a program under test, we can talk about an exhaustive test suite, covering the whole input and output domain.

Definition 15 (Exhaustive Test Suite). Let D be a design; its set of exhaustive test suites is defined as

TSexhaustive =df {TS(s) | TS(s) = D}

In this definition the notion of exhaustiveness is based on designs, not on the program under test. Thus, an exhaustive test suite only needs to cover the defined (specified) input domain. The following theorem states this more explicitly.

Theorem 7. For a design D = p ⊢ Q and one of its exhaustive test suites tsexhaustive, that is, a test suite tsexhaustive ∈ TSexhaustive

tsexhaustive ⊔ T?(i) = tsexhaustive,
provided p(i) holds.
This theorem says that explorative test cases for specified behavior do not add new information (test cases) to the set of exhaustive test cases. We note, however, that it might be useful to add explorative test cases with inputs outside the precondition p for exploring the unspecified behavior of a program. The last theorem leads us to a more general observation: adding an extra test case to an exhaustive test suite is redundant.

Theorem 8. Given a design D, an exhaustive test suite tsexhaustive ∈ TSexhaustive, and a test case t ⊑ D expressing the fact that t has been derived from D, then

tsexhaustive ⊔ t = tsexhaustive

Having clarified the relations between different test cases, in the following their relation to specifications and implementations is rendered more precisely. Previous work of the author [3] has shown that refinement is the key to understanding the relation between test cases, specifications and implementations. Refinement is an observational order relation, usually used for stepwise development from specifications to implementations, as well as to support substitution of software components. Since we view a test case as (a special form of) specification, it is obvious that a correct
implementation should refine its test cases. Thus, test cases are abstractions of an implementation if, and only if, the implementation passes the test cases. This view can be lifted to the specification level. When test cases are properly derived from a specification, then these test cases should be abstractions of the specification. Formally, we have the following terminology.

Definition 16. Let T be a test suite, S a specification, and I an implementation, all being designs, such that T ⊑ S ⊑ I. We define
– T as a correct test suite with respect to S,
– all test cases in T as correct test cases with respect to S,
– that the implementation I passes the test suite (test case) T,
– that the implementation I conforms to the specification S¹.
The following theorem makes the relation of input and output explicit.

Theorem 9.
T(i, o) ⊑ D iff v := o ⊑ (v := i; D)
T(i, c) ⊑ D iff c(v′) ⊑ (v := i; D)
So far the discussion has focused on correctness, that is, on implementations passing their test cases. However, the aim of testing is to find faults. In the following, we concentrate on faults and discuss how they are modelled in our theory of testing, leading to a fault-based testing strategy.

3.2 Faults

As explained in Chapter 1, faults represent mistakes. These mistakes can be introduced during the whole development process in all artifacts created. Consequently, faults appear on different levels of abstraction in the refinement hierarchy, ranging from requirements to implementations. Obviously, faults introduced early are the most dangerous (and most expensive) ones, since they may be passed on during the development process; or formally, a faulty design may be correctly refined into an implementation. Again, refinement is the central notion for discussing the roles and consequences of certain faults, and design predicates are most suitable for representing faults.

Definition 17 (Faulty Design). Let D be a design, and D m its mutated version, meaning that D m has been produced by slightly altering D. Furthermore, let the mutation represent a fault model. Then, the mutated design D m is defined to be a faulty design (or a faulty mutation) of D, if D ⋢ D m (that is, ¬(D ⊑ D m)).

¹ In testing, refinement between a specification and an implementation under test is called conformance.
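Operationally, a deterministic test case T(i, o) fixes one input and one predicted output, a nondeterministic test case T(i, c) fixes an input and a set of admissible outputs, and a test suite is their join: an implementation passes the suite iff it refines every test case in it, and the suite kills a faulty mutant when the mutant fails at least one of them. The following Python sketch makes this concrete; it is a hypothetical illustration, with min_impl and min_mut as stand-ins for an implementation under test and one of its mutants, not code from this chapter.

def min_impl(x, y):          # implementation under test
    return x if x <= y else y

def min_mut(x, y):           # mutant: comparison operator flipped
    return x if x >= y else y

# A test suite as the join of test cases: input vector -> set of predicted outputs.
suite = {
    (2, 5): {2},             # deterministic test case T((2, 5), 2)
    (7, 7): {7},             # T((7, 7), 7)
    (4, 1): {1},             # T((4, 1), 1)
}

def passes(program, suite):
    # A program passes the suite iff its result lies in every predicted output set.
    return all(program(*inputs) in outputs for inputs, outputs in suite.items())

print(passes(min_impl, suite))   # True:  the implementation conforms to the suite
print(passes(min_mut, suite))    # False: the suite kills the mutant, e.g. on input (2, 5)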
Not all mutations (changes) of a design lead to observable failures. In mutation testing, mutants that behave equivalently to the original are called equivalent mutants. These mutants are excluded from the set of faulty designs. Strictly speaking, equivalent mutants are not observationally equivalent and should therefore be named refining mutants. For example, they may produce additional observations outside the precondition of the original design D. However, since we ignore test cases outside the precondition of the original design, this additional behavior cannot be detected by testing. In any case, we stick to the common notion of equivalent mutant, keeping in mind its refinement semantics.

Exercise 7. Prove that a test suite of two contradicting test cases is equivalent to an infeasible test case.

T(i, o1) ⊔ T(i, o2) = T∅(i),
provided o1 ≠ o2
Exercise 8. Let Tri be the triangle specification of Exercise 5. Show in UTP that

result := equilateral ⊑ (a, b, c := 5, 5, 5); Tri

Exercise 9. Formalize the test cases of Example 1 as a test suite and show that the triangle specification (see Exercise 5) is a refinement of this test suite.

Exercise 10. Prove that the mutant of Example 1 is a faulty design.
4 Designing Test Cases

It is common knowledge that exhaustive testing of software cannot be achieved in general. Therefore, the essential question of testing is the selection of adequate test cases. What is considered adequate depends highly on the assumptions made — the test hypothesis. Typical types of test hypotheses are regularity and uniformity hypotheses. An example of the former is the assumption that if a sorting algorithm works for sequences of up to 10 entries, it will also work for more; an example of the latter is the assumption that certain input (or output) domains form equivalence partitions, and that consequently only one test case per partition is sufficient. In general, the stronger the hypothesis, the fewer test cases are necessary.

The test hypothesis is closely related to the notion of test coverage. It defines a unit to measure the adequacy of a set of test cases based on a test hypothesis. Traditionally, test coverage has been defined based on program text, like statement, branch, and dataflow coverage. For example, aiming for statement coverage is based on the uniformity hypothesis that it is sufficient to execute every statement in a program once — a rather strong assumption. Here, we take a fault-based approach: test cases will be designed according to their ability to detect anticipated faults.

4.1 Fault-Based Testing

In fault-based testing a test designer does not focus on a particular coverage of a program or its specification, but on concrete faults that should be detected. The focus on
possible faults enables a tester to incorporate his expertise in both the application domain and the particular system under test. In testing the security or safety of a system, typically a fault-based test design strategy is applied.

Perhaps the most well-known fault-based strategy is mutation testing, where faults are modelled as changes in the program text. Mutation testing was introduced by Hamlet [153] and DeMillo et al. [101]. Often it is used as a means of assessing test suites. When a program passes all tests in a suite, mutant programs are generated by introducing small errors into the source code of the program under test. The suite is assessed in terms of how many mutants it distinguishes from the original program. If some mutants pass the test suite, additional test cases are designed until all mutants that reflect errors can be distinguished. The number of mutant programs to be generated is defined by a collection of mutation operators that represent typical mistakes made by programmers. A hypothesis of this technique is that programmers only make small mistakes.

In previous work [3,4] we have extended mutation testing to the notion of contracts in Back and von Wright's refinement calculus [18]. In this section, we first translate these results to the theory of designs. Then, we provide a more constructive rule for designing fault-based test cases. First, the following theorem links the existence of non-passing (or incorrect) tests to faulty designs.

Theorem 10. Given a design D, and a faulty design D m, there exists a test case t, with t ⊑ D, such that t ⋢ D m.

Finding such a test case t is the central strategy in fault-based testing. For example, in classical mutation testing, D is a program and D m a mutant of D. Then, if the mutation in D m represents a fault, a test case t should be included to detect the fault. Consequently, we can define a fault-detecting test case as follows.

Definition 18 (Fault-detecting Test Case). Let t be either a deterministic or nondeterministic input-output test case. Furthermore, D is a design and D m its faulty version. Then, t is a fault-detecting test case when

(t ⊑ D) and (t ⋢ D m)

We say that a fault-detecting test case detects the fault in D m. Alternatively, we can say that the test case distinguishes D and D m. In the context of mutation testing, one says that t kills the mutant D m.

It is important to point out that in the case of a nondeterministic D m, there is no guarantee that the fault-detecting test case will definitely kill the mutant. The mutant might always produce an output consistent with the test case. However, the test case ensures that whenever a wrong output is produced, this will be detected. This is a general problem of testing nondeterministic programs.

We also want to remind the reader that our definitions rely solely on the lattice properties of designs. Therefore, our fault-detecting testing strategy scales up to other lattice-based test models as long as an appropriate refinement definition is used. More precisely, this means that the refinement notion must satisfy the same algebraic laws. It is this lattice structure that enabled us to translate our previous results into the
theory of designs. In [4] we came to the same conclusions in a predicate transformer semantics, with refinement defined in terms of weakest preconditions.

The definition of a fault-detecting test case states a property that could be exploited by constraint solvers to search for such a test case in a finite domain. However, although feasible in principle, this is not the most efficient way to find such test cases. The reason is that the definition, because of its generality, does not exploit the refinement definition in the concrete test model. In the following we present a more constructive way to generate test cases for designs.

4.2 Fault-Detecting Test Equivalence Classes

A common technique in test-case generation is equivalence class testing — the partitioning of the input domain (or output range) into equivalence classes (see, for example, [32,183]). The motivation is the reduction of test cases, by identifying equivalently behaving sets of inputs. The rationale behind this strategy is a uniformity hypothesis assuming an equivalence relation over the behavior of a program. A popular equivalence class testing approach for formal specifications is DNF partitioning — the rewriting of a formal specification into its disjunctive normal form (see, for example, [104,308,164]). Usually DNF partitioning is applied to relational specifications, resulting in disjoint partitions of the relations (although disjointness of the input domain is not guaranteed in DNF partitioning). We call such relational partitions test equivalence classes. In general, for a test equivalence class T∼ and its associated design D, refinement holds: T∼ ⊑ D.

Definition 19 (Test Equivalence Class). Given a design D = (p ⊢ Q), we define a test equivalence class T∼ for testing D as a design of the form T∼ = d⊥; D such that [d ⇒ p]. The condition d is called the domain of the test equivalence class.

The definition uses the assertion operator b⊥ =df (true ⊢ v′ = v) ◁ b ▷ ⊥, leading to a design which has no effect on the variables v if the condition b holds (skip), and behaves like abort (non-termination) otherwise. We observe that here a test equivalence class is a design denoting an input-output relation. It is defined via a predicate d that itself represents an equivalence class over input values. Given the definitions above, a design is obviously a refinement of an associated test equivalence class.

Theorem 11. Given a design D = p ⊢ Q and one of its test equivalence classes, then

T∼ ⊑ D

Obviously, DNF partitioning can be applied to design predicates. However, in the following we focus on fault-detecting test equivalence classes. A fault-detecting test equivalence class is one in which all test inputs are able to detect a certain kind of error.

Definition 20 (Representative Test Case). A test case t = T(i, c) is a representative test case of a test equivalence class T∼ = d⊥; D, with D = p ⊢ Q, if, and only if,

d(i) ∧ p(i) ∧ [Q(i) ≡ c]

This definition ensures that the output condition of a representative test case is not weaker than the test equivalence class specifies.
The following theorem provides an explicit construction of a test equivalence class that represents a set of test cases able to detect a particular fault in a design. The rationale behind this construction is the fact that, for a test case to be able to distinguish a design D from its faulty sibling D m, refinement between the two must not hold. Furthermore, for designs one may observe two causes for refinement to be violated, related to the precondition and the postcondition. The domain d of T∼ represents these two classes of test inputs, related to the two sources of violation. The first class are test inputs that work for the correct design, but cause the faulty design to abort. The second class are test inputs that produce different output values.

Theorem 12 (Fault-detecting Equivalence Class). We consider a design D = p ⊢ Q and its faulty design D m = p m ⊢ Q m with D ⋢ D m. For simplicity, we assume that Q ≡ (p ⇒ Q). Then every representative test case of the test equivalence class

T∼ =df d⊥; D,    with d = ¬p m ∨ ∃ v′ • (Q m ∧ ¬Q),

is able to detect the fault in D m.
Proof. We first show that a representative test case t = T(i, c) is a correct test case with respect to D.

t ⊑ D
= [(v = i) ⇒ p] ∧ [((v = i) ∧ Q) ⇒ c]
= true ∧ [((v = i) ∧ Q) ⇒ c]    [since p(i) holds by Definition 20]
= true ∧ true    [since [Q(i) ≡ c] holds by Definition 20]
= true

Next, we prove that a representative test case t covers the fault in the mutant: t ⋢ D m. From the definition of the test equivalence class, we see that we have two cases for t.

Case 1: t = T(i, c) and (¬p m)(i):

t ⋢ D m
= ¬[(v = i) ⇒ p m] ∨ ¬[((v = i) ∧ Q m) ⇒ c]    [by definition of the representative test case]
= ∃ v • ((v = i) ∧ ¬p m) ∨ ∃ v, v′ • ((v = i) ∧ Q m ∧ ¬c)
= true ∨ ∃ v, v′ • ((v = i) ∧ Q m ∧ ¬c)    [i is a witness to the first disjunct, since (¬p m)(i) holds]
= true

Case 2: t = T(i, c) and (∃ v′ • (Q m ∧ ¬Q))(i):

t ⋢ D m
= ¬[(v = i) ⇒ p m] ∨ ¬[((v = i) ∧ Q m) ⇒ c]    [by definition of the representative test case]
= ∃ v • ((v = i) ∧ ¬p m) ∨ ∃ v, v′ • ((v = i) ∧ Q m ∧ ¬c)
= ∃ v • ((v = i) ∧ ¬p m) ∨ ∃ v, v′ • ((v = i) ∧ Q m ∧ ¬Q(i))    [since [Q(i) ≡ c]]
= ∃ v • ((v = i) ∧ ¬p m) ∨ true    [i is a witness to the second disjunct]
= true

The above proof gives an example of how our theory supports formal reasoning about a central property of mutation testing. That this is not an academic exercise, but an essential step towards the building of trustworthy tools, is shown in Section 6. However, before discussing tool building, we extend our mutation testing techniques to the realm of executable designs, that is, programs.
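Before turning to programs, the following Python sketch shows how the domain d of Theorem 12 can be computed by brute force over a small finite domain; a constraint solver would do the same job symbolically (cf. Section 6). The design (p, Q) and its mutant (pm, Qm) are hypothetical toy examples introduced only for this illustration, not taken from the chapter.

# Original design D = (p |- Q) and a mutated design Dm = (pm |- Qm),
# given as Python predicates over one input x and one output z.
p  = lambda x: 0 <= x <= 9
Q  = lambda x, z: z == x * x
pm = lambda x: 1 <= x <= 9                              # precondition wrongly strengthened
Qm = lambda x, z: z == x * x + (1 if x == 3 else 0)     # postcondition wrong for x == 3

INPUTS  = range(-2, 12)
OUTPUTS = range(0, 120)

def d(x):
    # Theorem 12: d = not pm  \/  exists z . (Qm /\ not Q)
    return (not pm(x)) or any(Qm(x, z) and not Q(x, z) for z in OUTPUTS)

# Representative test cases: inputs inside the original precondition that satisfy d,
# together with the outputs predicted by the original design.
tests = {x: {z for z in OUTPUTS if Q(x, z)} for x in INPUTS if p(x) and d(x)}
print(tests)   # {0: {0}, 3: {9}} -- x = 0 hits the precondition fault, x = 3 the wrong output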
5 Testing for Program Faults

So far our discussion on testing has focused on the semantic model of designs. In this section we turn from semantics to syntax. The motivation is to restrict ourselves to a subclass of designs that are expressible, or at least implementable, in a certain programming language. Thus, we define a program as a predicate expressed in the limited notation (syntax) of a programming language. From the predicate semantics of the programming language operators, algebraic laws can be derived (see [173]). In the following, we will use this algebra of programs as a means to reason about faults in a program on a purely syntactical basis. The result is a test-case generation algorithm for fault-based testing that works solely on the syntax of a programming language. We define the syntax as follows.

program ::= true
          | variable list := expression list
          | program ◁ Boolean Expression ▷ program
          | program ; program
          | program ⊓ program
          | recursive identifier
          | μ recursive identifier • program

The semantics of the operators follows the definitions in Section 2.3. The recursive statement, using the least fixed-point operator μ, will be discussed separately in Section 5.4.

5.1 Finite Normal Form

Algebraic laws, expressing familiar properties of the operators in the language, can be used to reduce every expression in the restricted notation to an even more restricted notation, called a normal form. Normal forms play an essential role in an algebra of programs: they can be used to compare two programs, as well as to study properties of existing semantics given by equations. Our idea is to use a normal form to decide if two programs, the original one and the faulty one (also called the mutant), can be distinguished by a test case. When the normal
forms of both are equivalent, then the error does not lead to an (observable) fault. This solves the problem of equivalent mutants in mutation testing. Furthermore, the normal form will be used to derive test equivalence classes on a purely algebraic (syntactic) basis. Our normal form has been designed for this purpose: in contrast to the normal form in [173], we push the conditions outwards. The following assignment normal form is taken from [173], though.

Definition 21 (Assignment Normal Form). The normal form for assignments is the total assignment, in which all the variables of the program appear on the left-hand side in some standard order.

x, y, . . . , z := e, f , . . . , g

The assignments v := g or v := h(v) will be used to express the total assignment; thus the vector variable v is the list of all variables and g and h denote lists of expressions. A non-total assignment can be transformed into a total assignment by (1) addition of identity assignments (a, . . . := a, . . . ) and (2) reordering of the variables with their associated expressions. The law that eliminates sequential composition between normal forms is

(v := g; v := h(v)) = (v := h(g))    (L1)

where h(g) is calculated by substituting the expressions in g for the corresponding variables in v (see [173]). Since our language includes nondeterminism, we translate conditionals to nondeterministic choices of guarded commands.

Theorem 13 (Conditional Elimination). (P ◁ c ▷ Q) = (c ∧ P) ⊓ (¬c ∧ Q)

With this elimination rule at hand we are able to define a nondeterministic normal form.

Definition 22 (Nondeterministic Normal Form). A nondeterministic normal form is defined to be a nondeterministic choice of guarded total assignments.

(g1 ∧ v := f ) ⊓ (g2 ∧ v := g) ⊓ . . . ⊓ (gn ∧ v := h)

Let A be a set of guarded total assignments; then we write the normal form as ⊓ A.

The previous assignment normal form can be easily expressed in this new normal form as a disjunction over the unit set

v := g = ⊓{(true ∧ v := g)}

The easiest operator to eliminate is disjunction itself (see [173]).

(⊓ A) ⊓ (⊓ B) = ⊓(A ∪ B)    (L2)
The conditional is eliminated by splitting each guarded assignment into two cases.

(⊓ A) ◁ d ▷ (⊓ B) = (⊓{((d ∧ b) ∧ P) | (b ∧ P) ∈ A}) ⊓ (⊓{((¬d ∧ c) ∧ Q) | (c ∧ Q) ∈ B})    (L3)

Sequential composition is reduced by

(⊓ A); (⊓ B) = ⊓{((b ∧ (P; c)) ∧ (P; Q)) | (b ∧ P) ∈ A ∧ (c ∧ Q) ∈ B}    (L4)

Here, all nondeterministic combinations of sequential composition are formed. The following lemma shows that our nondeterministic normal form is a design, given the fact that total assignments are designs.

Lemma 1.

⊓_i (gi ∧ (pi ⊢ Qi)) = (⋀_i (gi ⇒ pi)) ⊢ (⋁_i (gi ∧ Qi)),
provided ⋁_i gi = true
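As an operational illustration of these normal forms, the following Python sketch represents a nondeterministic normal form as a list of guarded total assignments and implements conditional elimination in the style of L3 as a simple list transformation. It is a minimal sketch, assuming guards and state updates are given as Python callables; it is not the chapter's algebra of programs.

# A nondeterministic normal form is a list of guarded total assignments:
# pairs (guard, update), where guard maps a state to a Boolean and update
# maps a state to the complete after-state.

def conditional(nf_then, nf_else, d):
    # Law L3 in effect: split every guarded assignment by the condition d.
    return ([(lambda s, g=g, d=d: d(s) and g(s), a) for g, a in nf_then] +
            [(lambda s, g=g, d=d: (not d(s)) and g(s), a) for g, a in nf_else])

def choice(nf_a, nf_b):
    # Law L2: demonic choice is the union of the guarded assignments.
    return nf_a + nf_b

def run(nf, state):
    # All after-states offered by the enabled guarded assignments.
    return [update(state) for guard, update in nf if guard(state)]

# Min as a normal form:  z := x  <|  x <= y  |>  z := y
assign_z_x = (lambda s: True, lambda s: {**s, 'z': s['x']})
assign_z_y = (lambda s: True, lambda s: {**s, 'z': s['y']})
min_nf = conditional([assign_z_x], [assign_z_y], lambda s: s['x'] <= s['y'])

print(run(min_nf, {'x': 2, 'y': 5, 'z': 0}))   # [{'x': 2, 'y': 5, 'z': 2}]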
The program constant true is not an assignment and cannot in general be expressed as a finite disjunction of guarded assignments. Its introduction into the language requires a new normal form.

Definition 23 (Nontermination Normal Form). A nontermination normal form is a program represented as a disjunction b ∨ P, where b is a condition for non-termination and P a nondeterministic normal form.

Any previous normal form P that terminates can be expressed as false ∨ P, and the constant true as true ∨ (v := v). The other operators between the new normal forms can be eliminated by the following laws.

(b ∨ P) ⊓ (c ∨ Q) = (b ∨ c) ∨ (P ⊓ Q)    (L5)
(b ∨ P) ◁ d ▷ (c ∨ Q) = ((b ∧ d) ∨ (c ∧ ¬d)) ∨ (P ◁ d ▷ Q)    (L6)
(b ∨ P); (c ∨ Q) = (b ∨ (P; c)) ∨ (P; Q)    (L7)
Laws L5 and L7 are taken from [173]. The occurrences of each operator on the right-hand side can be further reduced by the laws of the previous sections. Again, for reducing (P; c) an additional law is needed; this time for the previous nondeterministic normal form.

(⊓ A); c = ⋁{(g ∧ (P; c)) | (g ∧ P) ∈ A}    (L8)
The algebraic laws above allow any non-recursive program in our language to be reduced to a finite normal form:

b ∨ ⊓{(gi ∧ v := ei) | 1 ≤ i ≤ n}

The following lemmas show that our non-termination normal form is a design, given the fact that the nondeterministic normal form is a design.

Lemma 2. b ∨ (p ⊢ Q) = (¬b ∧ p) ⊢ Q

Lemma 3. If ⋁_i gi = true, then

b ∨ ⊓{(gi ∧ v := ei) | 1 ≤ i ≤ n} = (¬b ∧ ⋀_i (gi ⇒ wf(ei))) ⊢ (⋁_i (gi ∧ v′ = ei))
Next, it is shown how this normal form facilitates the generation of fault-detecting test cases. The technique is to introduce faults into the normal form and then search for test cases that are able to detect these faults.

5.2 Introducing Faults

In the discussion so far, we have always assumed that faults are observable, that is, D ⋢ D m. However, a well-known practical problem is the introduction of mutations that do not violate refinement. In mutation testing of programs this is called the problem of equivalent mutants. In our framework, the problem of equivalent mutants can be simplified by reducing any non-recursive program into our finite normal form. More precisely, both the original program and the mutated one (the mutant) are transformed into normal form. Then, refinement can be checked by the following laws.

For assignments that are deterministic, the question of refinement becomes a simple question of equality. Two assignment normal forms are equal if, and only if, all the expressions in the total assignment are equal (see [173]).

(v := g) = (v := h) iff [g = h]    (L9)
The laws that permit detection of refining mutants for the nondeterministic normal form are as follows.

R ⊑ (⊓ A) iff ∀ P : P ∈ A • (R ⊑ P)    (L10)

((g1 ∧ P1) ⊓ . . . ⊓ (gn ∧ Pn)) ⊑ (b ∧ Q) iff [∃ i • ((gi ∧ Pi) ⇐ (b ∧ Q))]    (L11)

[(g ∧ v := f ) ⇐ (b ∧ v := h)] iff [b ⇒ (g ∧ (f = h))]    (L12)
The first law (see [173]) enables a nondeterministic normal form to be split into its component guarded assignments, which are then checked individually by the second law. We note that L12 is not decidable, in general. However, a combination of symbolic simplifiers and constraint solvers may deal with this reduced problem in practice.
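As a sketch of how a constraint solver can discharge such reduced problems in practice (assuming the z3-solver Python package; this is an illustration, not the chapter's tool chain), the deterministic Min example worked out below can be handled by asking the solver for a prestate on which the original and the mutated total assignments assign different values — exactly the kind of check that L9–L12 produce.

from z3 import Ints, If, Solver, sat

x, y = Ints('x y')

# Final value of z in the assignment normal forms (x and y are unchanged):
min_orig = If(x <= y, x, y)   # z := x  <|  x <= y  |>  z := y
min_mut1 = If(x >= y, x, y)   # mutant with the comparison operator flipped
min_mut2 = If(x < y, x, y)    # mutant with a strict comparison

def distinguishing_prestate(e1, e2):
    # Return a prestate on which the two expressions differ, or None.
    s = Solver()
    s.add(e1 != e2)
    return s.model() if s.check() == sat else None

print(distinguishing_prestate(min_orig, min_mut1))  # some model with x != y: the mutant is faulty
print(distinguishing_prestate(min_orig, min_mut2))  # None: a refining (equivalent) mutant

The two examples below establish the same two facts algebraically, via the normal form laws.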
Example 4. Consider the following example of a program Min for computing the minimum of two numbers.

Min =df z := x ◁ x ≤ y ▷ z := y

In mutation testing, an assumption is made that programmers make small mistakes. A common mistake is to mix up operators. The mutant Min1 models such a mistake.

Min1 =df z := x ◁ x ≥ y ▷ z := y

By means of the normal form it is now possible to show that this mutation represents a fault. Thus, we have to prove that

Min ⋢ Min1

Proof. In the following derivations, we will skip trivial simplification steps.

Min
= x, y, z := x, y, x ◁ x ≤ y ▷ x, y, z := x, y, y    [adding identity assignments]
= ((x ≤ y) ∧ x, y, z := x, y, x) ⊓ (¬(x ≤ y) ∧ x, y, z := x, y, y)    [by L3]

Next, we reduce Min1 to normal form:

Min1
= x, y, z := x, y, x ◁ x ≥ y ▷ x, y, z := x, y, y    [adding identity assignments]
= ((x ≥ y) ∧ x, y, z := x, y, x) ⊓ (¬(x ≥ y) ∧ x, y, z := x, y, y)    [by L3]

Assume Min ⊑ Min1; then according to L10 we must show that the two refinements hold:

((x ≤ y) ∧ x, y, z := x, y, x) ⊓ (¬(x ≤ y) ∧ x, y, z := x, y, y) ⊑ (x ≥ y) ∧ x, y, z := x, y, x    (Case 1)

((x ≤ y) ∧ x, y, z := x, y, x) ⊓ (¬(x ≤ y) ∧ x, y, z := x, y, y) ⊑ ¬(x ≥ y) ∧ x, y, z := x, y, y    (Case 2)

We start checking the cases with laws L11 and L12.

Case 1
iff [((x ≤ y ∧ x, y, z := x, y, x) ⇐ (x ≥ y ∧ x, y, z := x, y, x))
    ∨ ((¬(x ≤ y) ∧ x, y, z := x, y, y) ⇐ (x ≥ y ∧ x, y, z := x, y, x))]    [by L11]
= [(x ≥ y ⇒ (x ≤ y ∧ true)) ∨ (x ≥ y ⇒ (x > y ∧ x = y))]    [by L12]
= [(x ≥ y ⇒ x ≤ y) ∨ (x ≥ y ⇒ false)]
= [x ≤ y ∨ x < y]
= false
It follows that refinement does not hold and that the mutation introduces an observable fault.

The next example demonstrates the detection of an equivalent mutant.

Example 5. Consider again the program Min for computing the minimum of two numbers of Example 4. Another mutation regarding the comparison operator is

Min2 =df z := x ◁ x < y ▷ z := y

By means of normal form reduction it is now possible to show that this mutation does not represent a fault. Thus, we show that

Min ⊑ Min2

Proof. Since the normal form of Min has already been computed, we start with normalising Min2.

Min2
= x, y, z := x, y, x ◁ x < y ▷ x, y, z := x, y, y    [adding identity assignments]
= ((x < y) ∧ x, y, z := x, y, x) ⊓ (¬(x < y) ∧ x, y, z := x, y, y)    [by L3]

Again, two refinements must hold according to L10.

((x ≤ y) ∧ x, y, z := x, y, x) ⊓ (¬(x ≤ y) ∧ x, y, z := x, y, y) ⊑ (x < y) ∧ x, y, z := x, y, x    (Case 1)

((x ≤ y) ∧ x, y, z := x, y, x) ⊓ (¬(x ≤ y) ∧ x, y, z := x, y, y) ⊑ ¬(x < y) ∧ x, y, z := x, y, y    (Case 2)

We check the cases.

Case 1
iff [((x ≤ y ∧ x, y, z := x, y, x) ⇐ ((x < y) ∧ x, y, z := x, y, x))
    ∨ ((¬(x ≤ y) ∧ x, y, z := x, y, y) ⇐ ((x < y) ∧ x, y, z := x, y, x))]    [by L11]
= [(x < y ⇒ x ≤ y) ∨ (x < y ⇒ (x > y ∧ x = y))]    [by L12]
= [x ≥ y ∨ x ≤ y ∨ x ≥ y ∨ false]
= [true ∨ x ≥ y ∨ false]
= true
Case 2
iff [((x ≤ y ∧ x, y, z := x, y, x) ⇐ (¬(x < y) ∧ x, y, z := x, y, y))
    ∨ ((¬(x ≤ y) ∧ x, y, z := x, y, y) ⇐ (¬(x < y) ∧ x, y, z := x, y, y))]    [by L11]
= [(x ≥ y ⇒ (x ≤ y ∧ x = y)) ∨ (x ≥ y ⇒ x > y)]    [by L12]
= [x < y ∨ x = y ∨ x < y ∨ x > y]
= [x ≤ y ∨ x > y]
= true

Since both cases are true, we have refinement, and the mistake made, represented by the mutation, cannot be detected. Such mutations must be excluded from the fault-detecting test-case generation process.

These examples demonstrate how normal forms can be used to exclude equivalent mutants from the test-case generation process. In the following, we are going to extend the laws to cover non-termination as well. For the non-termination normal form the laws for testing refinement are below.

(c ∨ Q) ⊑ (b ∨ P) iff [b ⇒ c] and (c ∨ Q) ⊑ P    (L13)

(c ∨ (g1 ∧ P1) ⊓ . . . ⊓ (gn ∧ Pn)) ⊑ (b ∧ Q) iff [c ∨ (∃ i • (gi ∧ Pi) ⇐ (b ∧ Q))]    (L14)
Law L13 is taken from [173]; L14 follows directly from L11. Again an example serves to illustrate the rules for non-termination.

Example 6. Let us again consider the simple problem of returning the minimum of two numbers. If both inputs are natural numbers, the following program computes the minimum of x, y in x.

MinNat =df (x < 0 ∨ y < 0) ∨ (x := x ◁ (x − y) < 0 ▷ x := y)

First, an equivalent mutant is produced, whose equivalence can be detected by a derivation of the normal form.

MinNat1 =df (x < 0 ∨ y < 0) ∨ (x := x ◁ (x − y) < 1 ▷ x := y)
Proof. First, both normal forms are derived.

MinNat
= (x < 0 ∨ y < 0) ∨ ((x, y := x, y) ◁ (x − y) < 0 ▷ (x, y := y, y))
= (x < 0 ∨ y < 0) ∨ (((x − y) < 0 ∧ x, y := x, y) ⊓ (¬((x − y) < 0) ∧ x, y := y, y))

MinNat1
= (x < 0 ∨ y < 0) ∨ (((x − y) < 1 ∧ x, y := x, y) ⊓ (¬((x − y) < 1) ∧ x, y := y, y))

Since both have the same non-termination condition, we have to check according to law L13 that

MinNat ⊑ ((x − y) < 1 ∧ x, y := x, y) ⊓ (¬((x − y) < 1) ∧ x, y := y, y)

According to law L10 we have to show two refinements.

MinNat ⊑ ((x − y) < 1 ∧ x, y := x, y)    (Case 1)
MinNat ⊑ (¬((x − y) < 1) ∧ x, y := y, y)    (Case 2)

We verify the cases.

Case 1
iff [(x < 0 ∨ y < 0)
    ∨ (((x − y) < 0 ∧ x, y := x, y) ⇐ ((x − y) < 1 ∧ x, y := x, y))
    ∨ ((¬((x − y) < 0) ∧ x, y := y, y) ⇐ ((x − y) < 1 ∧ x, y := x, y))]    [by L14]
= [(x < 0 ∨ y < 0) ∨ (((x − y) < 1) ⇒ (x − y < 0 ∧ true)) ∨ (((x − y) < 1) ⇒ (x − y ≥ 0 ∧ x = y))]    [by L12]
= [(x < 0 ∨ y < 0) ∨ ((x − y) ≥ 1) ∨ (x − y < 0) ∨ ((x − y) ≥ 1) ∨ x = y]
= [(x < 0 ∨ y < 0) ∨ x > y ∨ x < y ∨ x = y]
= true

The fact that Case 2 holds can be shown by a similar derivation.
It has been shown that the presented refinement laws can be used to automatically detect equivalent mutants for non-recursive programs. Next, test-case generation is discussed.

5.3 Test-Case Generation

The presented normal form has been developed to facilitate the automatic generation of test cases that are able to detect anticipated faults. Above, it has been demonstrated that algebraic refinement laws solve the problem of equivalent mutants that have an alteration not representing a fault. The above laws also build the foundation of our test-case generation process. The following theorem defines the test equivalence class that will detect an error.

Theorem 14. Let P = (p ⊢ Q) be a program and P m = (p m ⊢ Q m) a faulty mutation of this program with normal forms as follows.

P = c ∨ ⊓{(aj ∧ v := fj) | 1 ≤ j ≤ m}
P m = c m ∨ ⊓{(bk ∧ v := hk) | 1 ≤ k ≤ n}

We assume that the non-termination conditions c and c m represent the preconditions as follows: p = ¬c and p m = ¬c m. Furthermore, Q ≡ (p ⇒ Q). Then, every representative test case of the test equivalence class

T∼ =df d⊥; P,    with d = (¬c ∧ c m) ∨ ⋁_k (¬c ∧ bk ∧ ⋀_j (¬aj ∨ (fj ≠ hk))),

is able to detect the fault in P m.

In order to detect an error, the domains of the test equivalence classes must contain those input values where refinement does not hold. We have two cases of non-refinement: (1) P m does not terminate but P does; (2) both are terminating but with different results.
1. Test cases have to be added where the mutant does not terminate, but the original program does. That is when (¬c ∧ c m) holds.
2. In the terminating case, by Laws L10 and L11, it follows that all combinations of guarded commands must be tested regarding refinement of the original one by the mutated one. Those for which this refinement test fails contribute to the test equivalence class. Law L12 tells us that refinement between two guarded commands holds if, and only if, [bk ⇒ (aj ∧ (fj = hk))]. Negating this gives ∃ v, v′ • bk ∧ (¬aj ∨ (fj ≠ hk)). Since we are only interested in test cases where the output is defined, we add the constraint ¬c. We see that this condition is at the heart of our test domain. Since we have to show non-refinement, this must hold for all the nondeterministic choices of P (⋀_j). Finally, each nondeterministic choice of P m may contribute to non-refinement (⋁_k).

In the case of true nondeterminism, that is, when several guards are true at the same time, detection of the error can only happen if the faulty part is chosen to be executed. Since, by definition of
nondeterminism, a tester has no means to influence this decision, the fault may go undetected for a while. However, under the assumption of a fair selection policy, the fault will eventually be detected. Thus, when we say a test case (or its equivalence class) will detect an error, we really mean that it is able to do so over a period of time.

Example 7. Consider the program and its mutant in Example 4. According to Theorem 14 we have the following fault-detecting domain.

d = (¬false ∧ false) ∨ ⋁_{k∈{1,2}} (¬false ∧ bk ∧ ⋀_{j∈{1,2}} (¬aj ∨ (fj ≠ hk)))
= (x ≥ y ∧ (x > y ∨ false) ∧ (x ≤ y ∨ x ≠ y)) ∨ (x < y ∧ (x > y ∨ x ≠ y) ∧ (x ≤ y ∨ false))
= (x ≥ y ∧ x > y ∧ x ≠ y) ∨ (x < y ∧ x ≠ y ∧ x ≤ y)
= x > y ∨ x < y

Hence, every test case with an input satisfying x ≠ y is able to detect the fault.
5.4 Recursion

Finally, we turn to the recursive statement. By Kleene's theorem, the least fixed point of a continuous function F can be expressed as the least upper bound of its chain of approximations:

μ X • F(X) = ⊔_{n ≥ 0} F^n(true)

where F^0(X) =df true, and F^{n+1}(X) =df F(F^n(X)).
Operators that distribute through least upper bounds of descending chains are called continuous. Fortunately, all operators in our language are continuous and, therefore, this normal form transformation can be applied. Unfortunately, this infinite normal form can never be computed in its entirety; however, for each n, the finite normal form can be readily computed. The normal form for our full programming language is, thus, defined as follows.

Definition 24 (Infinite Normal Form). An infinite normal form for recursive programs is a program theoretically represented as the least upper bound of a descending chain of finite normal forms. Formally, it is of the form

⊔ S    with S = {(cn ∨ Qn) | n ∈ ℕ},

S being a descending chain of approximations and each Qn being a nondeterministic normal form, that is, a disjunction of guarded commands.
For test-case generation, again, refinement between the original and the mutant must be checked. Fortunately, the following law from [173] tells us that we can decompose the problem.
(⊔ S) ⊑ (⊔ T) iff ∀ i : i ∈ ℕ • Si ⊑ (⊔ T)    (L15)

The central idea for dealing with recursive programs in our test-case generation approach is to approximate the normal form of both the program and the mutant until non-refinement can be detected. For equivalent mutants an upper limit n will determine when to stop the computations. Such a decision represents a test hypothesis (that is, a regularity hypothesis according to [137]), where the tester assumes that, if n iterations did not reveal a fault, an equivalent mutant has been produced. An example illustrates the approximation.

Example 8. Assume that we want to find an index t pointing to the smallest element in an array A[1..n], where n is the length of the array and n > 0. A program for finding such a minimum can be expressed in our programming language as follows.

MIN =df k := 2; t := 1; μ X • ((B; X) ◁ k ≤ n ▷ k, t := k, t)
B =df (t := k; k := k + 1) ◁ A[k] < A[t] ▷ k := k + 1

Since the normal form of μ X • F(X) is infinite and has to be approximated, we first convert F(X) into a (finite) normal form.

F(X) = ((k ≤ n ∧ A[k] < A[t]) ∧ (k, t := k + 1, k; X))
     ⊓ ((k ≤ n ∧ A[k] ≥ A[t]) ∧ (k, t := k + 1, t; X))
     ⊓ ((k > n) ∧ k, t := k, t)

Next, the first elements in the approximation chain are computed. According to Kleene's theorem we have

S¹ =df F(true) = (k ≤ n) ∨ ((k > n) ∧ k, t := k, t)

The first approximation describes the exact behavior only if the iteration is not entered. The second approximation describes the behavior already more appropriately, taking one iteration into account. Note how the non-termination condition gets stronger.

S² =df F(S¹)
= ((k + 1 ≤ n ∧ A[k] < A[t]) ∨ ((k ≤ n ∧ k + 1 > n ∧ A[k] < A[t]) ∧ (k, t := k + 1, k)))
⊓ ((k + 1 ≤ n ∧ A[k] ≥ A[t]) ∨ ((k ≤ n ∧ k + 1 > n ∧ A[k] ≥ A[t]) ∧ (k, t := k + 1, t)))
⊓ ((false) ∨ ((k > n) ∧ k, t := k, t))
= (k < n) ∨ (((k = n ∧ A[k] < A[t]) ∧ (k, t := k + 1, k))
          ⊓ ((k = n ∧ A[k] ≥ A[t]) ∧ (k, t := k + 1, t))
          ⊓ ((k > n) ∧ k, t := k, t))

The third approximation describes MIN up to two iterations, leading to more choices.

S³ =df F(S²)
= (k + 1 < n) ∨ (((k + 1 = n ∧ A[k] < A[t] ∧ A[k + 1] < A[k]) ∧ (k, t := k + 2, k + 1))
             ⊓ ((k + 1 = n ∧ A[k] < A[t] ∧ A[k + 1] ≥ A[k]) ∧ (k, t := k + 2, k))
             ⊓ ((k + 1 = n ∧ A[k] ≥ A[t] ∧ A[k + 1] < A[t]) ∧ (k, t := k + 2, k + 1))
             ⊓ ((k + 1 = n ∧ A[k] ≥ A[t] ∧ A[k + 1] ≥ A[t]) ∧ (k, t := k + 2, t))
             ⊓ ((k = n ∧ A[k] < A[t]) ∧ (k, t := k + 1, k))
             ⊓ ((k = n ∧ A[k] ≥ A[t]) ∧ (k, t := k + 1, t))
             ⊓ ((k > n) ∧ k, t := k, t))

It can be seen from the first three approximations that our normal form approximations represent computation paths as guarded commands. As the approximation progresses, more and more paths are included. Obviously, the normal form approximations of the whole program, including the initialisations of k and t, can be easily obtained by substituting 2 for k and 1 for t in S¹, S², . . .

Next, we illustrate our fault-based testing technique, which first introduces a mutation, and then tries to approximate the mutant until refinement does not hold. A common mistake is to get the loop termination condition wrong. We can model this by the following mutant:

MIN1 =df k := 2; t := 1; μ X • ((B; X) ◁ k < n ▷ k, t := k, t)

Its first approximation gives

S₁¹ =df F₁(true) = (k < n) ∨ ((k ≥ n) ∧ k, t := k, t)

where F₁ is the normal form of the mutated loop body, defined analogously to F.
By applying Theorem 14 to find test cases that can distinguish the two first approximations, we realize that such a test case does not exist, because S¹ ⊑ S₁¹. The calculation of the test equivalence class domain predicate d¹ gives false.

d¹ = (¬(k ≤ n) ∧ k < n) ∨ (¬(k ≤ n) ∧ k ≥ n ∧ (¬(k > n) ∨ false))    [by Theorem 14]
= false ∨ false
= false

It is necessary to consider the second approximation of the mutant.

S₁² =df F₁(S₁¹)
= (k + 1 < n) ∨ (((k + 1 = n ∧ A[k] < A[t]) ∧ (k, t := k + 1, k))
             ⊓ ((k + 1 = n ∧ A[k] ≥ A[t]) ∧ (k, t := k + 1, t))
             ⊓ ((k ≥ n) ∧ k, t := k, t))

This time test cases exist. By applying Theorem 14 we get the test equivalence class that can find the error.

d²(k, t) = (¬(k < n) ∧ k + 1 < n)
  ∨ (k ≥ n ∧ k + 1 = n ∧ A[k] < A[t] ∧ . . .)
  ∨ (k ≥ n ∧ k + 1 = n ∧ A[k] ≥ A[t] ∧ . . .)
  ∨ (k ≥ n ∧ k ≥ n ∧ (¬(k = n ∧ A[k] < A[t]) ∨ true) ∧ (¬(k = n ∧ A[k] ≥ A[t]) ∨ true) ∧ (¬(k > n) ∨ false))    [by Theorem 14]
= false ∨ (k ≥ n ∧ k ≤ n)
= (k = n)

By substituting the initialisation values (k = 2 and t = 1), the concrete fault-detecting test equivalence class is:

T∼1 = (n = 2)⊥; MIN
The result is somewhat surprising. The calculated test equivalence class says that every array with two elements can serve as a test case to detect the error. One might have expected that the error of leaving the loop too early could only be revealed if the minimum is the last element (A[2] < A[1]), resulting in different values for t (2 and 1). However, this condition disappears during the calculation. The reason is that the counter variable k is observable and that the two program versions can be distinguished by their different values for k (3 and 2). In practice, k will often be a local variable and not part of the alphabet of the program. In such a case a stronger test equivalence class will be obtained. This illustrates the fact that it is important to fix the alphabet (the observables) before test cases are designed. We observe also that the test equivalence class T∼1 is just an approximation of the complete test equivalence class. More precisely, it has to be an approximation, since the complete test equivalence class is infinite.

Next we investigate a mistake where the programmer forgets to increase the index variable k.

MIN2 =df k := 2; t := 1; μ X • ((B2; X) ◁ k ≤ n ▷ k, t := k, t)
B2 =df (t := k; k := k + 1) ◁ A[k] < A[t] ▷ k := k

Obviously, S¹ = S₂¹ since the mistake has been made inside the loop. Therefore, the second approximation of the mutant, S₂², is presented immediately.

S₂² =df ((k + 1 ≤ n ∧ A[k] < A[t]) ∨ ((k ≤ n ∧ k + 1 > n ∧ A[k] < A[t]) ∧ (k, t := k + 1, k)))
     ⊓ ((k ≤ n ∧ A[k] ≥ A[t]) ∨ ((k ≤ n ∧ k > n ∧ A[k] ≥ A[t]) ∧ (k, t := k, t)))
     ⊓ ((false) ∨ ((k > n) ∧ k, t := k, t))

We see that the second case becomes infeasible (the guard equals false), and that consequently the non-termination condition is weakened.

S₂² = (k < n ∨ (k = n ∧ A[k] ≥ A[t])) ∨ (((k = n ∧ A[k] < A[t]) ∧ (k, t := k + 1, k)) ⊓ ((k > n) ∧ k, t := k, t))

Clearly, a weaker non-termination condition leads to non-refinement. Therefore, Theorem 14 gives us for this case the test equivalence class representing the cases where MIN terminates and MIN2 does not.
T∼2(k, t) = (k = n ∧ A[k] ≥ A[t])⊥; MIN
T∼2 = (n = 2 ∧ A[2] ≥ A[1])⊥; MIN

The calculated test cases are indeed those where MIN2 fails to terminate due to the missing increment of k.

The example demonstrated how to calculate test cases for detecting faulty designs even when recursion is present. However, in cases where refinement cannot be falsified, we have to stop the approximation process at a certain point. An upper limit n must be chosen by the tester to determine how many approximation steps should be computed.

Exercise 11. Prove Lemma 2.

Exercise 12. Show via normal forms that the following mutant of a minimum computation represents a faulty design:
Min3 =df z := x ◁ x ≤ y ▷ z := x

Exercise 13. Show via normal forms that MinNat ⊑ Min.

Exercise 14.
– Translate the triangle specification and the mutant of Example 1 into normal forms.
– Show in the normal form that this mutation represents a fault.

Exercise 15. Calculate from the normal forms of Min and Min3 the fault-detecting equivalence class.
Min3 =df z := x ◁ x ≤ y ▷ z := x

Exercise 16. Calculate from the normal forms of MinNat and MinNat2 the fault-detecting equivalence class.
MinNat2 =df (x < 0 ∨ y < 0) ∨ (x := x ◁ (x − y) > 0 ▷ x := y)

Exercise 17. Calculate a fault-detecting equivalence class by step-wise fixed-point approximation for the following mutant:
MIN =df k := 2; t := 1; μ X • ((B; X) ◁ k ≤ n ▷ k, t := k, t)
B =df (t := k; k := k + 1) ◁ A[t] < A[t] ▷ k := k + 1
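The bounded approximation used above has a direct operational counterpart: execute the original and the mutated program on all sufficiently small inputs and report one on which the observable after-states differ. The following Python sketch is a hypothetical illustration under the assumption that both k and t are observable, as in Example 8; it is not the chapter's fix-point calculation, but it reproduces the finding that any two-element array distinguishes MIN from MIN1.

from itertools import product

def min_index(A):
    # Original MIN over the 1-based array A[1..n]; A[0] is unused.
    n = len(A) - 1
    k, t = 2, 1
    while k <= n:
        if A[k] < A[t]:
            t = k
        k = k + 1
    return k, t                      # both counter and result are observable

def min_index_mut1(A):
    # Mutant MIN1: wrong loop condition (k < n instead of k <= n).
    n = len(A) - 1
    k, t = 2, 1
    while k < n:
        if A[k] < A[t]:
            t = k
        k = k + 1
    return k, t

def distinguishing_input(max_n=3, values=(0, 1)):
    # Bounded search over all arrays of length 2..max_n with entries from values.
    for n in range(2, max_n + 1):
        for elems in product(values, repeat=n):
            A = [None] + list(elems)
            if min_index(A) != min_index_mut1(A):
                return list(elems)
    return None                      # within the bound the mutant looks equivalent

print(distinguishing_input())        # e.g. [0, 0]: any length-2 array detects the fault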
6 Beyond Theory Nothing is as practical as a good theory. Hence, based on the presented theory we have developed some prototypical fault-based test-case generators. In this section, we discuss several possible implementation techniques turning our theory into practice. Furthermore, we discuss application domains. The first two subsections discuss applications to pre-postcondition contracts stated in OCL and Spec#. Both notations can be
interpreted as designs and solvers can be used to find the discriminating test cases for a set of mutants. In the case of OCL, a special-purpose constraint solver has been developed; in the case of Spec#, an existing SMT-solver can be exploited. Next, for executable models or programs, we show how symbolic execution techniques can support the construction of our program normal form. This establishes a novel link between symbolic execution and mutation testing. Finally, we demonstrate that our theory is not limited to notions of UTP-refinement and sequential programs. Extended with an appropriate conformance relation, our technique has been applied to the mutation testing of concurrent protocol models. 6.1 Mutation Testing of OCL Contracts The first prototype tool we developed is a test-case generator for the Object Constraint Language OCL [8]. Here, the user either introduces faults interactively via a GUI or uses a set of standard mutation operators to generate mutant specifications automatically. The tool generates one test case out of the test equivalence class that will detect the error. The theoretical foundation of the tool is Theorem 12. The automation exploits the fact that we are interested in non-refinement. Thus, instead of showing refinement where we need to demonstrate that the implication holds for all possible observations, here the existence of one (counter)example is sufficient. Hence, the problem of finding a test case can be represented as a constraint satisfaction problem (CSP). A CSP consists of a finite set of variables and a set of constraints. Each variable is associated with a set of possible values, known as its domain. A constraint is a relation defined on some subset of these variables and denotes valid combinations of their values. A solution to a constraint satisfaction problem is an assignment of a value to each variable from its domain, such that all the constraints are satisfied. Formally, the conjunction of these constraints forms a predicate for which a solution should be found. We have developed such a constraint solver that searches for an input solution satisfying the domain of the fault-detecting test equivalence class. Here the CSP variables are the observation variables of an OCL specification. The constraints are obtained by applying Theorem 12 to the original and mutated specifications. If an input able to kill the mutant has been found, then the complete test case is produced by generating the expected (set of) output values. We note that constraint solving operates on finite domains. Hence, in case the tool cannot find a test case it is unknown if the mutant refines the original or if a fault outside the search space exists. We say that the mutant refines the original specification in the context of the finite variable domains. In order to compare our fault-based testing approach to more conventional techniques, the tool is also able to generate test cases using DNF partitioning. In this classical testing strategy, first, the disjunctive normal form (DNF) of a formal specification is generated, and then one representative test case from each disjunct is selected [104]. The tool is able to generate test cases for the triangle example (Example 1).
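As a sketch of the constraint-solving step, the search for a killing test input can be mimicked by brute force over a small finite domain. The Python code below is a hypothetical stand-in for the tool's special-purpose solver; ttype encodes the original triangle specification (Example 1) and ttype_mut3 the mutant of Fig. 3 shown below.

def pre(a, b, c):                    # precondition of the Ttype contract
    return (a >= 1 and b >= 1 and c >= 1 and
            a < b + c and b < a + c and c < a + b)

def ttype(a, b, c):                  # original specification
    if a == b and b == c:
        return "equilateral"
    if a == b or a == c or b == c:
        return "isosceles"
    return "scalene"

def ttype_mut3(a, b, c):             # mutant of Fig. 3: nested if-statements swapped
    if a == b or a == c or b == c:
        return "isosceles"
    if a == b and b == c:            # now unreachable for equilateral inputs
        return "equilateral"
    return "scalene"

def killing_test(orig, mut, bound=5):
    # Search the finite domain [1, bound]^3 for an input inside the precondition
    # on which the mutant contradicts the original specification.
    for a in range(1, bound + 1):
        for b in range(1, bound + 1):
            for c in range(1, bound + 1):
                if pre(a, b, c) and orig(a, b, c) != mut(a, b, c):
                    return (a, b, c, orig(a, b, c))
    return None                      # no killing input found: a refining mutant

print(killing_test(ttype, ttype_mut3))   # (1, 1, 1, 'equilateral'), as generated by the tool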
context Ttype(a: int, b: int, c: int): String
pre:  a >= 1 and b >= 1 and c >= 1 and
      a < (b+c) and b < (a+c) and c < (a+b)
post: if((a = b) or (a = c) or (b = c)) then
        result = "isosceles"
      else if ((a = b) and (b = c)) then
        result = "equilateral"
      else
        result = "scalene"
      endif endif

Fig. 3. A triangle with mutated if-statements

context Ttype(a: int, b: int, c: int): String
pre:  a >= 1 and b >= 1 and c >= 1 and
      a < (b+c) and b < (a+c) and c < (a+b)
post: if((a = b) and (b = 1)) then
        result = "equilateral"
      else if((a = b) or (a = c) or (b = c)) then
        result = "isosceles"
      else
        result = "scalene"
      endif endif

Fig. 4. Mutant replacing a variable with a constant
Example 9. The specification of Figure 1 can be mutated in several ways. In addition to variable name mutations as shown in Figure 2, a designer might get the order of the nested if-statements wrong. This is modelled in Figure 3. The two test cases generated by the tool are

a = 1, b = 2, c = 2, result = "isosceles"

for the mutant in Figure 2 and

a = 1, b = 1, c = 1, result = "equilateral"

for the mutant in Figure 3. We can easily see that each test case is able to distinguish its mutant from the original, since the mutants would produce different results. Hence, these test cases are sufficient to detect such faults in any implementation of Ttype.

Alternatively, by choosing the DNF partitioning strategy the tool returns five test cases, one for each partition. We observe that the tool partitions the isosceles case into three cases.

a = 2, b = 2, c = 1, result = "isosceles"
a = 2, b = 1, c = 2, result = "isosceles"
a = 1, b = 2, c = 2, result = "isosceles"
a = 2, b = 3, c = 4, result = "scalene"
a = 1, b = 1, c = 1, result = "equilateral"
context Ttype(a: int, b: int, c: int): String
pre:  a >= 1 and b >= 1 and c >= 1 and
      a < (b+c) and b < (a+c) and c < (a+b)
post: if((a = b) and (b = c)) then
        result = "equilateral"
      else if((a = b) or (a = c) or (b = 2)) then
        result = "isosceles"
      else
        result = "scalene"
      endif endif

Fig. 5. Mutant with a fault in the second conditional
Analysing these test cases generated by the DNF partitioning strategy, we observe that the five test cases are also able to detect the faults presented in Figures 2 and 3. Therefore, one could argue that the fault-based test cases do not add further value. However, in general DNF partitioning may not detect all possible faults. We consider the two additional mutated specifications shown in Figures 4 and 5. We can easily see that the five DNF test cases are not able to reveal these faults, but the fault-based strategy generates precisely the test cases that are needed to reveal the faults in Figures 4 and 5:

a = 2, b = 2, c = 2, result = "equilateral"

covers the mutant in Figure 4, and

a = 3, b = 2, c = 4, result = "scalene"

covers the mutant in Figure 5. It is also possible to integrate the DNF approach and ask the tool to generate all fault-based test cases for every domain partition. Then, the additional test case

a = 1, b = 3, c = 3, result = "isosceles"

for the mutant in Figure 5 is returned as well.

This example, although trivial, demonstrates the automation of our approach to software testing: instead of focusing on covering the structure of a specification, which might be rather different from the structure of the implementation, one focuses on possible faults. Of course, the kind of faults one is able to model depends on the level of abstraction of the specification — obviously one can only test for faults that can be anticipated. It should be added that the test-case generator also helps in understanding the specification. Experimenting with different mutations and generating fault-detecting test cases for them is a valuable vehicle for validation. For further details of the tool's search algorithm we refer to [8].

6.2 Mutation Testing of Spec# Contracts

More recently, we have applied Theorem 12 to Spec# contracts [201]. Spec# is the specification language for C#, providing an annotation language for specifying pre-,
6.2 Mutation Testing of Spec# Contracts

More recently, we have applied Theorem 12 to Spec# contracts [201]. Spec# is the specification language for C#, providing an annotation language for specifying pre- and postconditions and invariants [23]. In our work, we mutate the Spec# pre- and postconditions and generate test cases for testing C# programs. We base the test-case generation on the Spec# verifier, which is able to generate counterexamples, for example when a postcondition is violated by a C# method. This verification tool is composed of two components: (1) Boogie for generating verification conditions [22] and (2) Z3, an SMT solver for finding counterexamples [253].

In order to use the existing Spec# verifier for refinement checking between two contracts, we delegate the implementation of the first contract by calling another method specified by a second contract. Then, Boogie generates a verification condition for checking if the second contract is a refinement of the first. Combining this idea with our mutation strategy, we use the following procedure for generating mutation tests in Spec#. We consider a correct (original, not mutated) contract C^ok and a mutated version C^m. We assume that C^ok passes the static verifier without any error. Also, we use C_pre to refer to the precondition of a contract C, and C_post to refer to its postcondition.

1. Create a method O with C^ok.
2. Create a method M with an empty body. M has initially no contract.
3. Add to M the [Verify(false)] attribute, which switches off the verification of M's method body.
4. If the postcondition was mutated, add C^m_post to M; otherwise add C^ok_post.
5. If the precondition was mutated, add C^m_pre to M; otherwise add C^ok_pre.
6. Within O, place a call to M as the sole element of the method body (Delegation).
7. If O has a non-void return type, assign an arbitrary, but type-correct, return value within the body of M and place a return statement before the call of M in the body of O.
8. Call the static verifier:
   – If one or more counterexamples can be found, create test cases.
   – If no counterexample can be found, C^ok ⊑ C^m holds and the mutation generates valid behaviour, so no test is needed.

Example 10. Listing 8.1 shows the triangle example in Spec# notation. The contract of method TriangleType_O represents the original specification. The method is implemented by calling the mutated version, shown in Listing 8.2. The original specification treats triangles with side lengths j = k = l as equilateral. The mutated contract falsely treats triangles with the property j = k = 1 as equilateral. Obviously, the contract of TriangleType_O is not refined by the mutated contract of TriangleType_M. Because we call TriangleType_M within the body of TriangleType_O, Boogie returns the counterexample j = k = l = 2. Boogie outputs the counterexample in its internal representation form, and therefore a further analysis is needed to construct the test case. Billeter's tool for executing Boogie counterexamples [40] shows that this is feasible. For demonstration, we have also included a commented-out, mutated precondition, as can be seen in Line 16 of Listing 8.2. The mutated precondition says that variable l must be smaller than 1. When it is uncommented, Boogie reports a counterexample (j = k = l = 1) for the precondition. We note that we only mutate either the pre- or the postcondition, never both at the same time.
using Microsoft.Contracts;

public class Program
{
    public enum TriangleEnum { Scalene, Isosceles, Equilateral };

    // This method implements the original contract.
    public static TriangleEnum TriangleType_O(int j, int k, int l)
        // unmodified precondition
        requires (j >= 1) && (k >= 1) && (l >= 1) &&
                 (j < (k+l)) && (k < (j+l)) && (l < (j+k));
        // unmodified postcondition
        ensures ((j == k) && (k == l)) ?
                    (result == TriangleEnum.Equilateral) :
                    (((j == k) || (j == l) || (k == l)) ?
                        (result == TriangleEnum.Isosceles) :
                        (result == TriangleEnum.Scalene));
    {
        // constrain the result so as to lie within
        // incorrect "bounds"
        return TriangleType_M(j, k, l);

        /* If we had a real implementation here, we could
           use a mutated version of the implementation!
           However, as we are not working on model programs
           this time, we ignore the implementation.

           TriangleEnum res;
           if ((j == k) && (k == 1))
               res = TriangleEnum.Equilateral;
           else if ((j == k) || (j == l) || (k == l))
               res = TriangleEnum.Isosceles;
           else
               res = TriangleEnum.Scalene;
           return res;
        */
    }

Listing 8.1. The Triangle-example in Spec#
In practice, the number of possible mutations for a given program specification is large. Hence, a tool is needed to generate and manage the mutants. Recently, we have extended Nester (nester.sourceforge.net/, last visited 1.10.2009), a free tool for mutation testing of C# programs, in order to cover Spec# as well. In our new version of Nester, a GUI allows the user to choose from a set of Spec# mutation operators. Nester then produces all possible mutants and runs a given test suite against all of them. What we would still like to do is to generate these test cases automatically from the Spec# mutations; this link between test-case generation and the extended Nester tool is future work.
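As an illustration of what a contract mutation operator does mechanically, the sketch below applies relational-operator replacement to a contract condition represented as a small expression tree. The representation and the operator set are simplified, hypothetical stand-ins; the extended Nester tool works directly on Spec# source code.

// Toy illustration of a relational-operator-replacement (ROR) mutation operator.
// The expression representation is hypothetical and much simpler than a real Spec# AST.
using System;
using System.Collections.Generic;

abstract class Expr { }

sealed class Cmp : Expr                   // a comparison, e.g. (j == k) or (l < 1)
{
    public string Left, Op, Right;
    public Cmp(string left, string op, string right) { Left = left; Op = op; Right = right; }
    public override string ToString() => $"({Left} {Op} {Right})";
}

sealed class Bin : Expr                   // a boolean connective, e.g. A && B
{
    public Expr Left; public string Op; public Expr Right;
    public Bin(Expr left, string op, Expr right) { Left = left; Op = op; Right = right; }
    public override string ToString() => $"({Left} {Op} {Right})";
}

static class RorMutator
{
    static readonly string[] RelOps = { "==", "!=", "<", "<=", ">", ">=" };

    // Yield one first-order mutant per replaced relational operator.
    public static IEnumerable<Expr> Mutants(Expr e)
    {
        switch (e)
        {
            case Cmp c:
                foreach (var op in RelOps)
                    if (op != c.Op) yield return new Cmp(c.Left, op, c.Right);
                break;
            case Bin b:
                foreach (var m in Mutants(b.Left))  yield return new Bin(m, b.Op, b.Right);
                foreach (var m in Mutants(b.Right)) yield return new Bin(b.Left, b.Op, m);
                break;
        }
    }

    static void Main()
    {
        // A fragment of the triangle precondition: (j >= 1) && (j < (k+l)).
        Expr pre = new Bin(new Cmp("j", ">=", "1"), "&&", new Cmp("j", "<", "(k+l)"));
        foreach (var m in Mutants(pre)) Console.WriteLine(m);   // prints 10 mutants
    }
}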
 1   // We use this method for the sole purpose of
 2   // constructing the counterexample, therefore we
 3   // are not interested in the implementation and say:
 4   [Verify(false)]
 5   public static TriangleEnum TriangleType_M(int j, int k, int l)
 6       // copy original precondition (not necessary)
 7       requires
 8           (j >= 1) && (k >= 1) && (l >= 1)
 9           && (j < (k+l)) && (k < (j+l))
10           && (l < (j+k));
11
12       /* If the precondition is to be mutated, then it has
13          to be included here instead of the original
14          precondition:
15          requires
16              ((j >= 1) && (k >= 1) && (l < 1)
17              && (j < (k+l)) && (k < (j+l)) && (l < (j+k)));
18       */
19
20       // mutate postcondition
21       ensures
22           ((j == k) && (k == 1)) ?
23               (result == TriangleEnum.Equilateral) :
24               (((j == k) || (j == l) || (k == l)) ?
25                   (result == TriangleEnum.Isosceles) :
26                   (result == TriangleEnum.Scalene));
27   {
28       // This body is ignored when checking
29       // TriangleType_O!
30       return TriangleEnum.Equilateral;
31       // keep compiler from complaining..
32   }
33
34   static void Main(string![]! args)
35   {
36   }
37 }

Listing 8.2. A Mutant of the Triangle-example in Spec#
6.3 Program Normal Form via Concolic Execution

In this section we discuss a possible automation technique for the test-case generation approach induced by Theorem 14. This technique for mutation testing of programs or executable specifications is based on a normal form. Here, we show how such a normal form can be computed by means of a special form of symbolic execution, called concolic execution.

We first have a closer look at the normal form and how it is generated. A nondeterministic normal form (see Definition 22) is constructed by pushing the two conditions of each conditional statement outwards. Hence, each guard represents a certain path through the program. For programs without iteration, this normal form is finite, since we have a finite number of paths. As we have seen in Section 5.4, for programs with recursion our normal form becomes infinite, due to the infinite number of paths. Hence, in testing, an upper limit on iterations is often chosen in order to cover a finite number of paths.

Our guards representing a path through the program play a central role in the symbolic execution [193] of programs, where they are called path conditions. Symbolic execution uses symbols to represent arbitrary values during execution. When a conditional statement is encountered, the run is forked, and the positive and negated conditions become part of the respective path conditions, restricting the symbolic values during execution. This results in a symbolic execution tree covering all paths in the program. In contrast, concolic execution (also called dynamic symbolic execution) calculates the symbolic execution in parallel with a concrete run that is actually taken, avoiding the usual problem of eliminating infeasible paths. Decisions on conditional statements are
recorded, resulting in a set of conditions over the symbolic values that have to evaluate to true for the path to be taken. The conjunction of these conditions constitutes the path condition; it represents an equivalence class of concrete input values that execute the same path. Concolic execution starts with an arbitrary input value and calculates the path condition until termination. Then, a new input value satisfying the negated path condition is selected and concolically executed. This continues until no more input values can be found, because the negated path condition is unsatisfiable, meaning that all paths have been covered. The new input values are found by using SMT solvers. This is the technique used by the Pex tool discussed in Chapter 5. Recently, we have applied concolic execution to distributed system models [146].

The most relevant observation here is that concolic execution is a technique for enumerating all paths by successively computing all path conditions. This computation is done by a symbolic execution of the assignment statements along the execution. Hence, at the end of one execution, we have not only the path condition, but also the transition relation in the form of an assignment, the latter representing the final state of a path in terms of the symbolic input values.

Example 11. We consider again the program MIN for finding the index of the smallest element in an array (see Example 8). The input to MIN is an array A of arbitrary length n. Hence, a concolic interpreter will choose to execute A both symbolically and concretely. We use A_s to denote the initial symbolic value of A and n_s its length. In our program A is not updated during execution, so its symbolic value is used in path conditions only and the final state will be concrete.

A concolic interpreter first chooses an arbitrary but valid concrete value as input, that is, an arbitrary non-empty array. The simplest valid array to be chosen is an array of length one, for example, A = [0]. Starting the execution with this array leads to the first conditional test 2 ≤ 1 evaluating to false. This decision is recorded as the path condition ¬(2 ≤ n_s). Not entering the recursive branch, the first run R1 terminates and the concolic execution returns the following path condition plus final state:

R1 = ¬(2 ≤ n_s) ∧ k, t := 2, 1

This run corresponds to the first fixpoint approximation S1 after initializing k and t. Next, a new initial value for A is sought satisfying the negated path condition 2 ≤ n_s. A solver typically returns the next simplest array, of length two, for example, A = [0, 0]. The second run R2 searches the new array and takes the other branch of the conditional, adding the conjunct (2 ≤ n_s) to the path condition. When entering the body B, the second condition A_s[k] < A_s[t], here A_s[2] < A_s[1], does not hold. This is recorded by taking the negated condition as an additional conjunct of the path condition, yielding (2 ≤ n_s) ∧ ¬(A_s[2] < A_s[1]). Then, k is increased to 3; since 3 ≤ 2 fails, the run terminates with the last decision added to the path condition:

R2 = (2 ≤ n_s) ∧ ¬(A_s[2] < A_s[1]) ∧ ¬(3 ≤ n_s) ∧ k, t := 3, 1
   = (2 = n_s) ∧ ¬(A_s[2] < A_s[1]) ∧ k, t := 3, 1
The run R2 corresponds to the second case in the fixpoint approximation S2. For the third concolic run, the array must not satisfy the path conditions executed so far. Hence, an array satisfying

(2 ≤ n_s) ∧ ¬(2 = n_s ∧ ¬(A_s[2] < A_s[1]))
   = (2 ≤ n_s) ∧ (2 ≠ n_s ∨ A_s[2] < A_s[1])
   = (2 < n_s) ∨ (2 ≤ n_s ∧ A_s[2] < A_s[1])

must be found. The disjunction in the last simplified condition indicates that there are three cases to explore in finding the next array: satisfying either the left, the right, or both subconditions. Which condition is chosen determines the search strategy through the symbolic execution tree. Selecting the right subcondition results, for example, in a new concrete array A = [1, 0]. This represents the first test case altering the initial index t. Here, the symbolic execution returns

R3 = (2 ≤ n_s) ∧ (A_s[2] < A_s[1]) ∧ ¬(3 ≤ n_s) ∧ k, t := 3, 2
   = (2 = n_s) ∧ (A_s[2] < A_s[1]) ∧ k, t := 3, 2

This last run R3 corresponds to the first case of the fixpoint approximation S2. We leave it as an exercise to calculate further test cases; the next test case demands an array with three elements, and so on.

It is future work to combine the concolic execution approach with our mutation testing formula. This kind of analysis results in finer-grained equivalence classes, taking the possible faults along a path into account.
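The enumeration of paths just described can be phrased as a small driver loop around a constraint solver. The sketch below is schematic: the run and solve parameters stand for a concolic interpreter and an SMT solver such as Z3, and are assumptions rather than real APIs.

// Schematic concolic-execution driver. The 'run' and 'solve' delegates are
// placeholders for a concolic interpreter and an SMT solver (e.g. Z3).
using System;
using System.Collections.Generic;

static class ConcolicDriver
{
    // Returns one path condition per explored path; each one represents an
    // input-equivalence class, as in runs R1, R2, R3 of Example 11.
    public static List<TPathCond> ExplorePaths<TInput, TPathCond>(
        TInput seed,                                   // arbitrary but valid initial input
        Func<TInput, TPathCond> run,                   // concrete run + symbolic bookkeeping
        Func<IReadOnlyList<TPathCond>, TInput> solve)  // input avoiding all covered paths, or null
        where TInput : class
    {
        var covered = new List<TPathCond>();
        var input = seed;
        while (input != null)
        {
            covered.Add(run(input));   // record the path condition of this run
            input = solve(covered);    // solve ¬(pc1 ∨ ... ∨ pcn); null when unsatisfiable
        }
        return covered;                // all paths covered
    }
}

For the MIN example, run would execute MIN on a concrete array while recording conditions such as 2 ≤ n_s, and solve would hand the negated disjunction of the collected path conditions to the solver to obtain the next concrete array.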
6.4 Mutation Testing of Protocol Specifications

UTP's refinement of designs is not the only correctness relation that can be used for mutation testing. Any precisely defined conformance relation may serve to discriminate a faulty mutant from its original. In recent years we have successfully applied mutation testing to several protocol implementations, including HTTP, the Session Initiation Protocol (SIP) and the Conference Protocol [337]. All these protocols have been modelled in the process algebra LOTOS [240]. These models represent the exchange of messages as events and describe the possible sequences of such events in the form of processes. Consequently, the conformance (refinement) relation between such models is defined on the event level.

The conformance relation of our choice was ioco, the input-output conformance relation of Tretmans [318]. Informally, ioco conformance is given if, for all event traces possible in the specification model, the implementation does not produce output that is not allowed by the specification. Like refinement of designs, it supports incomplete (partial) specification models. This is realized by splitting the events into input and output events, that is, controllable and observable events. Hence, a PUT may react to unspecified input events in an arbitrary way, like programs outside their specification precondition. This is an important feature for industrial application, where it is unrealistic to model the complete behavior of a PUT. In addition, ioco has a special quiescence event for modelling timeouts.

We have implemented an automatic mutator for LOTOS specifications, altering the model according to a set of predefined mutation operators. Furthermore, we have developed an ioco checker that returns a counterexample in the cases where the original model and the mutant do not conform. This counterexample serves as a test purpose for the test-case generator TGV, producing all possible test cases leading to the error. The test cases are then mapped to and executed at the concrete protocol level. Our case studies on real protocol implementations demonstrated that this technique is complementary to other testing approaches. In the case of the SIP protocol, mutation testing revealed an additional fault that was detected neither by random testing nor by scenario-based testing. For further details we refer to [337].

Recently, we have reformulated ioco in the process theory of the UTP [338]. This, together with our practical testing results, demonstrates that our UTP approach to testing is not limited to small sequential programs, but extends to the domain of distributed, concurrent systems.
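To give a flavour of the out-set inclusion that ioco demands, the following is a deliberately simplified sketch for finite, deterministic transition systems; it ignores quiescence, internal actions and nondeterminism, all of which the real ioco relation and our LOTOS-based checker handle. The state and label encoding and the depth bound are illustrative choices, not part of the tools mentioned above.

// Simplified illustration of the ioco idea: along every specification trace,
// the implementation's outputs must be allowed by the specification.
// Deterministic, finite systems only; quiescence and internal actions are ignored.
using System.Collections.Generic;
using System.Linq;

class Lts
{
    // (source state, label) -> target state; labels starting with "!" are outputs,
    // labels starting with "?" are inputs.
    public Dictionary<(int state, string label), int> Trans =
        new Dictionary<(int state, string label), int>();

    public IEnumerable<string> OutputsAt(int s) =>
        Trans.Keys.Where(k => k.state == s && k.label.StartsWith("!")).Select(k => k.label);
}

static class IocoCheck
{
    // Explores specification traces up to 'depth' and checks out-set inclusion.
    public static bool Conforms(Lts impl, Lts spec, int implState, int specState, int depth)
    {
        if (impl.OutputsAt(implState).Any(o => !spec.OutputsAt(specState).Contains(o)))
            return false;                  // forbidden output: a counterexample trace ends here
        if (depth == 0) return true;
        foreach (var k in spec.Trans.Keys.Where(k => k.state == specState))
            if (impl.Trans.TryGetValue((implState, k.label), out var nextImpl)
                && !Conforms(impl, spec, nextImpl, spec.Trans[k], depth - 1))
                return false;
        return true;
    }
}

In the real setting, a trace on which Conforms fails plays the role of the counterexample that our ioco checker passes to TGV as a test purpose.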
7 Conclusions

Summary. This chapter has presented a theory of testing with a focus on fault detection. This fault-based testing theory is a conservative extension of the existing Unifying Theories of Programming [173]. It extends the application domain of Hoare & He's theory of programming to the discipline of testing. It has been demonstrated that the new theory enables formal reasoning about test cases, more precisely about the fault-detecting power of test cases. As a consequence, new test-case generation methods could be developed.

The first test-case generation method (see Definition 18) is a general criterion for fault-detecting test cases. It is not completely new, but has been translated from our previous work [4] to the theory of designs. It states that a test case, in order to find a fault in a design (which can range from specifications to programs), must be an abstraction of the original design and, in addition, must not be an abstraction of the faulty design. No such test case exists if the faulty design is a refinement of the original one. We note that the translation of this criterion from a different mathematical framework was straightforward: since our previous definition was solely based on the algebraic properties of refinement, we just had to change the definition of refinement (from weakest-precondition inclusion to implication). In [5] we applied this technique to labelled transition systems for testing web servers, and in [7] we extended the labelled transition system work to an industrial case study. This demonstrates the generality of our refinement-based testing theory.

The second test-case generation method (see Theorem 12) is more constructive and specialised for designs. It can be applied to specification languages that use pre- and postconditions, including VDM-SL, RSL, Z, B, OCL and Spec#. It is based on the conditions under which refinement between designs does not hold, and it uses the operations on predicates (conditions and relations) to find the test cases. This approach forms the basis for our constraint-solving approach to generating test cases from OCL specifications.

The third approach (see Theorem 14) lifts the test-case generation process to the syntactical level. By using a normal form representation of a given program (or specification), equivalence classes of test cases can be generated or, in the case of recursive
programs, approximated. This is the technique that is most likely to scale up to more complex programming and design languages. We have demonstrated the approach using a small and simple programming language. However, the language is not trivial: it includes nondeterminism and general recursion. A tool that uses this technique will combine constraint solving and symbolic execution.

UTP's aim is simplicity, and our work definitely benefitted from its simple predicative semantics. Having implication as the refinement order made the theories simpler and the proofs shorter than, for instance, using a weakest-precondition semantics. Furthermore, the relational design predicates can be directly fed into a constraint-solving system, a BDD checker, or an SMT solver. Most importantly, UTP's links to other programming paradigms, like, for example, parallel programming or object-orientation, keep our testing theory open for extensions.

Related Work. Fault-based testing was born in practice when testers started to assess the adequacy of their test cases by first injecting faults into their programs, and then observing whether the test cases could detect these faults. This technique of mutating the source code became well known as mutation testing and goes back to the late 1970s [153,101]; since then it has found many applications and has become the major assessment technique in empirical studies on new test-case selection techniques [355]. To our present knowledge, Budd and Gopal were the first who mutated specifications [60]. They applied a set of mutation operators to specifications given in predicate calculus form. Tai and Su [312] proposed algorithms for generating test cases that guarantee the detection of operator errors, but they restrict themselves to the testing of singular Boolean expressions, in which each operand is a simple Boolean variable that cannot occur more than once. Tai [311] extends this work to include the detection of Boolean operator faults, relational operator faults and a type of fault involving arithmetic expressions. However, the functions represented in the form of singular Boolean expressions constitute only a small proportion of all Boolean functions.

Until a few years ago, most of the research on testing from formal specifications ignored fault-based testing. The current approaches generate test cases according to the structural information of a model in a formal specification language, like, for example, VDM [104], Z, B [205], or LOTOS [137]. Only a few noticed the relevance of a fault-based strategy on the specification level. Stocks applied mutation testing to Z specifications [308]. In his work he extends mutation testing to model-based specification languages by defining a collection of mutation operators for Z. An example of his specification mutations is the exchange of the union operator ∪ for sets with intersection ∩. He presented the criteria to generate test cases to discriminate mutants, but did not automate his approach. Later, Simon Burton presented a similar technique as part of his test-case generator for Z specifications [63]. He uses a combination of a theorem prover and a collection of constraint solvers. The theorem prover generates the DNF, simplifies the formulas, and helps to formulate different testing strategies. This is in contrast with our implementation of the OCL test-case generator, where Constraint Handling Rules [127] do the simplification prior to the search; only a constraint satisfaction framework is needed.
Here, it is worth pointing out that it is the use of Constraint Handling Rules that saves us
from having several constraint solvers, like Burton does. As with Stocks' work, Burton's conditions for fault-based testing are instantiations of our general theory.

Fault-based testing has also been discovered by the security community. Wimmel and Jürjens [347] use mutation testing on specifications to extract those interaction sequences that are most likely to find vulnerabilities. Here, mutants of an AutoFocus model are generated. Then, a constraint solver is used to search for a test sequence that satisfies a mutated system model (a predicate over traces) and that does not satisfy a security requirement. If a test case able to kill the mutant can be found, then the mutation introduces a vulnerability and the test case shows how it can be exploited. Again, this approach is a special instantiation of our more general refinement technique, supporting our proposal that a general theory of fault-based testing should be based on refinement.

Black et al. showed how model checkers (for example, SMV) can be applied in mutation testing [44]. There are basically two methods. The first is similar to our approach: the original and a mutated model are integrated as SMV models, and then a temporal formula stipulates that the output variables of the models must always be equivalent. A counterexample, in the form of a trace through the state space, serves as the test case distinguishing the mutation. In the second method, the temporal formulas themselves are mutated. In [125] we have demonstrated how this technique can be extended and optimised for fast regression testing. The limitation of this model-checking approach is that it only works for deterministic models. In the general case of nondeterminism, tree-shaped test cases are needed, as is the case in tools like TGV [5]. Our general observation regarding testing and model checking is that in most cases the work lacks the support of a precise testing theory; for example, the definition of a conformance relation is missing.

A group in York has started to use fault-based techniques for validating their CSP models [304,305]. Their aim is not to generate test cases, but to study the equivalent mutants. Their work demonstrates that semantic issues of complex concurrent models can be detected by understanding why altered (mutated) models are observationally equivalent. Their reported case study in the security domain indicates the relevance of fault-based testing in this area. Similar research is going on in Brazil, with an emphasis on protocol specifications written in the Estelle language [302].

Our testing theory relies on the notion of refinement. Of course, the relation between testing and refinement is not completely new. In [260], de Nicola and Hennessy developed a testing theory that defines the equivalence and refinement of processes in terms of testers. Similarly, the notion of failures-divergences refinement of CSP [172] is inspired by testing, since it is defined via the possible observations of a tester. Later, these theories led to Tretmans' work on conformance testing based on labelled transition systems [319,320]. They are the foundations of Peleska's work on testing as well [274]. However, these theories do not focus on the use of abstraction (the reverse of refinement) in order to select a subset of test cases. Furthermore, these existing testing theories focus on verification.
This restricts their use, either to the study of semantic issues (like the question of observable equivalence of processes [260]), or to the testing of very abstract (finite) models for which exhaustive testing is feasible (like in protocol testing [319]). In contrast, this work focuses on falsification.
It was Stepney, in her work on Object-Z, who first explicitly promoted the use of abstraction for designing test cases [306]. The application of a refinement calculus to define different test-selection strategies is a contribution of the first author's doctoral thesis [3]. It was in this thesis that the idea of applying refinement to mutation testing was presented for the first time. Although others worked on specification-based mutation testing before, the use of a refinement relation and a normal form is completely new.

Future Work. The presented theory is far from being final or stable. It is another step in our research aim to establish a unifying theory of testing. Such a theory will provide semantic links between different testing theories and models. These links will facilitate the systematic comparison of the results in different areas of testing, hopefully leading to new advances in testing. For example, the relationship between abstraction in model checking and the abstraction techniques in test-case selection deserves a careful study. A further research area where a unifying theory might be applied is the combination of testing and formal proofs. This is related to the highly controversial discussion in the philosophy of science on how theories can be confirmed by observations. Currently, we work on models of concurrency and investigate the mutation testing of hybrid systems.

Another branch of future work is automation. We are currently working on extensions of the prototype test-case generators discussed in Section 6. In particular, the proper sequencing of test cases for bringing a system into a target state has to be addressed. The work in [205] demonstrates the use of constraint solving to master this problem. Besides the study of algorithms for automation, another research agenda is language design. We believe that the design of future specification and programming languages will be highly influenced by tool support for static and dynamic analysis, including test-case generation and automatic verification. Therefore, a careful study of the properties of programming languages with respect to automation will gain importance.

All in all, we believe this is a quite challenging and exciting area of research. The authors hope that the presented theory of fault-based testing will inspire the reader to new contributions to testing, and to verification in general.
Practice Areas

In this appendix we briefly relate testing to the other 28 practice areas in the SEI's product line framework, detailing what has been discussed in Chapter 4.
Software Engineering

Architecture Definition. The architecture is defined to a specific level of detail. The test cases for the architecture must be defined to the same level of detail as the architecture. The architecture is "tested" by comparing its definition to the requirements by the techniques defined in the Architecture Evaluation practice.

Architecture Evaluation. This area includes the ATAM technique. Architecture reviewers adopt the testing perspective to help them do an in-depth evaluation. It uses scenarios, as does the Guided Inspection technique. These scenarios play the role of test cases. The scenarios are chosen to sample over the set of requirements. The sampling can be biased toward those portions of the requirements that are a higher priority than some other area. The output of these techniques is not a "correct" or "incorrect" decision. It produces a set of risks that identify places in the architecture where it may be difficult to maintain correctness over the life of the architecture.

Component Development. The explicit interface of a component is used as the basis for functional unit test cases. In a product line it is likely that the interface will include some type of variability mechanism. This will increase the number of test cases needed to thoroughly test the component. Likewise, the implementation may be more complex to support the variations required of the component. A test infrastructure that is compatible with the development method will be needed. For example, JUnit works well in a development method that is iterative and incremental. JUnit supports the evolution of test cases as the component matures.
Using Externally Available Software. Software imported into the core asset base must be tested to the same levels of coverage as in-house built code. Due diligence may find that the supplier of the software has already achieved this level of quality, or the in-house team may have to develop additional test cases. The test cases should be automated so that they can be reapplied when a new release of the external software is delivered to the organization.

Mining Existing Assets. When a fragment is mined from existing assets, it needs to be tested in its new, more isolated context to be certain that the new asset is appropriately complete, with no hidden dependencies.

Requirements Engineering. The requirements are one of the primary sources of input to the system test process. The requirements are also subject to verification by a Guided Inspection. The production method should prescribe the manner in which the requirements will be represented and how they can be linked to system test cases.

Software System Integration. Each level of integration is accompanied by exercising a test suite. It is at this level that previously unanticipated behaviors emerge. Test cases should cover significant interactions between the pieces being integrated to determine that each piece fulfills its responsibilities.

Understanding Relevant Domains. The domains about which our programs compute are an important source of information for testing the requirements, architecture and assembled system. In MDD a domain-specific language forms the basis for automating development. In a software product line, domain models, from which domain-specific languages are created, should be tested before the language is created. A Guided Inspection of the model, using the product line requirements as the context, can identify incomplete, incorrect, or inconsistent models.
Technical Management

Configuration Management. The test assets must be controlled and linked so that there is traceability to the sources of the test cases and to the asset they are intended to test. The configuration management system in a software product line uses a meta-data approach for linking, since each asset will be used in many builds of many different products. A reusable component is stored in a core asset base repository. Then any use of the component in a product is expressed, in the product's CM tree, as a configuration item that refers to the actual component's location. The tests for a product are managed in a similar manner. A product's CM tree contains a reference to the actual test cases that are stored in the core asset storage area. Both the core asset and product developer's guides must provide instructions on how to access needed assets and how to enter new assets into the storage area.

Measurement and Tracking. The test results should be tracked to understand how the defect profile is changing. This data should be used to manage the test process and to provide information for improving the development process. When a threshold value for the number of hours needed to find another defect is crossed, the test activity is halted.

Make/Buy/Mine/Commission Analysis. During the Buy or Commission analyses, there should be due diligence activities that investigate the quality culture of the potential supplier. The costs of testing, based on the supplier's culture, must be factored into these decisions.
Process Discipline. The numerous test processes must be coordinated with each other and with other processes that impact the same personnel and the same assets. The main threads in a software product line are the core asset development and product development efforts. As can be seen in the chain of quality in 1, there are numerous places where testing is conducted. There are also prerequisites in which one type of testing should be conducted before another. The "discipline" in process discipline means that the personnel who plan testing activities will provide a realistic schedule for them, and the personnel assigned to carry out these activities will perform them according to the established process definitions.

Scoping. A well-defined, constrained scope limits the range of values that have to be considered during testing. It also limits the interactions and therefore the interaction tests that must be provided. A change in scope should trigger a review of the various test plans to determine where additional tests are needed.

Technical Planning. Technical planning coordinates the various development and testing activities within the core asset and product building teams, and between them as well. As periodic releases of the core asset base are scheduled, the development schedules must include the appropriate time for test activities.

Technical Risk Management. Technical risk management will be informed by the results of testing. For example, if the number of defects found per hour of testing is not decreasing sufficiently, a risk should be raised. Testing will also identify areas of the architecture or specific types of code assets that are most problematic.
Tool Support. The development tools and the test tools need to be compatible. They both must support the range of binding times required by the selected variation points. The production method defines all of the technologies and related tools that are used for transformations, generation and other construction techniques. It will also specify testing tools that complement those construction techniques.
Organizational Management

Building a Business Case. The business case includes an analysis of the costs and time required for testing to the level expected in the culture and domain of the products being developed. The business case relies on a specific version of the product line scope to determine the levels of effort required for testing.

Customer Interface Management. Test results provide useful data for technical customers. The test results should be linked to the corresponding test cases and assets under test using the CM system. Customers can be engaged as testers early in the development cycle so that their suggestions can be incorporated during later iterations.

Developing an Acquisition Strategy. Test tools and assets may be acquired using this strategy.

Funding. Testing is expensive, but high quality is no accident. Funding is necessary to support the planning, construction and execution of test cases.

Launching and Institutionalizing. Although most organizations test during traditional development, adopting the product line approach will call for a shift, with a heavier testing effort earlier in the cycle, to ensure that the reusable assets are sufficiently robust.

Market Analysis. This analysis may influence the levels of coverage that are set for various types of testing. If customers are dissatisfied, or if competitors are delivering a higher level of quality for a comparable price, this may require an increase in test coverage.

Operations. Each specific test point should be assigned to a team with the appropriate knowledge and responsibility. For example, system test should be done by a team that has product-line-wide scope and responsibility.

Organizational Planning. During organizational planning, issues such as desired levels of quality and product differentiation drive the specification of a production strategy that influences how the testing method is defined. Their efforts are coordinated through an organizational test plan that defines the overall test strategy.
Organizational Risk Management. Testing reduces the risk of litigation by identifying defects so that they can be removed and do not reach the customer. Test artifacts such as test results and reports are evidence that customers can review to understand the levels of quality being produced. In a product line organization, different products will present different levels of risk and will require different levels of testing.

Structuring the Organization. Personnel who are assigned to testing full-time may form a functional team which is then matrixed to specific core asset and product teams as needed. Personnel with part-time responsibility for testing should be trained and placed in the appropriate places in the organization to have the authority necessary to be effective. This authority should include a veto over passing defective assets or products on to the next step in development.

Technology Forecasting. The development technologies that will be used in the future may require different testing technology. Planners use the forecast to determine future needs for training, process definition and tool support.

Training. Most personnel will be trained in the testing perspective and Guided Inspection techniques. Developers will be trained to effectively test their units. Personnel assigned responsibility for other types of testing may need training in the tools used to generate test cases.
References 1. Abrial, J.-R.: The B Book: Assigning Programs to Meanings. Cambridge University Press, Cambridge (1996) 2. Agrawal, H., DeMillo, R.A., Hathaway, R., Hsu, W., Hsu, W., Krauser, E.W., Martin, R.J., Mathur, A.P., Spafford, E.H.: Design of mutant operators for the C programming language. Technical Report SERC-TR41-P, Software Engineering Research Center, Purdue University (1989) 3. Aichernig, B.K.: Systematic Black-Box Testing of Computer-Based Systems through Formal Abstraction Techniques. PhD thesis, Institute for Software Technology (2001) 4. Aichernig, B.K.: Mutation Testing in the Refinement Calculus. Formal Aspects of Computing Journal 15(2), 280–295 (2003) 5. Aichernig, B.K., Delgado, C.C.: From faults via test purposes to test cases: on the faultbased testing of concurrent systems. In: Baresi, L., Heckel, R. (eds.) FASE 2006. LNCS, vol. 3922, pp. 324–338. Springer, Heidelberg (2006) 6. Aichernig, B.K., Jifeng, H.: Mutation Testing in UTP. Formal Aspects of Computing Journal 21(1-2), 33–64 (2009) 7. Aichernig, B.K., Peischl, B., Weiglhofer, M., Wotawa, F.: Protocol conformance testing a SIP registrar: An industrial application of formal methods. In: Hinchey, M., Margaria, T. (eds.) 5th IEEE International Conference on Software Engineering and Formal Methods, pp. 215–224. IEEE Press, Los Alamitos (2007) 8. Aichernig, B.K., Salas, P.A.P.: Test case generation by OCL mutation and constraint solving. In: Cai, K.-Y., Ohnishi, A., Lau, M.F. (eds.) 5th International Conference on Quality Software, pp. 64–71. IEEE Computer Society Press, Los Alamitos (2005) 9. Aiguier, M., Arnould, A., Boin, C., Le Gall, P., Marre, B.: Testing from algebraic specifications: Test data set selection by unfolding axioms. In: Grieskamp, W., Weise, C. (eds.) FATES 2005. LNCS, vol. 3997, pp. 203–217. Springer, Heidelberg (2006) 10. Alexander, R.T., Offutt, J., Bieman, J.M.: Syntactic fault patterns in oo programs. In: 8th International Conference on Engineering of Complex Computer Software, p. 84. IEEE Computer Society, Los Alamitos (2002) 11. Anand, S., Godefroid, P., Tillmann, N.: Demand-driven compositional symbolic execution. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 367–381. Springer, Heidelberg (2008) 12. Anand, S., Pasareanu, C.S., Visser, W.: JPF-SE: A symbolic execution extension to java pathfinder. In: Grumberg, O., Huth, M. (eds.) TACAS 2007. LNCS, vol. 4424, pp. 134– 138. Springer, Heidelberg (2007) 13. Andersen, M., Barnett, M., Fhndrich, M., Grunkemeyer, B., King, K., Logozzo, F.: (2008), http://research.microsoft.com/en-us/projects/contracts/, http://research.microsoft.com/projects/contracts/ 14. Andrews, J.H., Briand, L.C., Labiche, Y.: Is mutation an appropriate tool for testing experiments? In: 27th International Conference on Software Engineering, pp. 402–411. ACM Press, New York (2005) 15. Astels, D.: Test-Driven Development: A Practical Guide. Prentice-Hall, Englewood Cliffs (2003) 16. Atkinson, C., Gross, H.-G.: Built-in contract testing in model-driven, component-based development. In: ICSR Workshop on Component-Based Development Processes (2002) 17. Bach, J.: Exploratory testing explained (2003), www.satisfice.com/articles/ 18. Back, R.-J., von Wright, J.: Refinement Calculus, a Systematic Introduction. Graduate Texts in Computer Science. Springer, Heidelberg (1998)
19. Barbey, S., Buchs, D.: Testing Ada abstract data types using formal specifications. In: Toussaint, M.-J. (ed.) Ada-Europe 1994. LNCS, vol. 887, pp. 76–89. Springer, Heidelberg (1994) 20. Barbosa, D.L., Lima, H.S., Machado, P.D.L., Figueiredo, J.C.A., Jucá, M.A., Andrade, W.L.: Automating functional testing of components from uml specifications. International Journal of Software Engineering and Knowledge Engineering 17, 339–358 (2007) 21. Barbosa, E.F., Maldonado, J.C., Vincenzi, A.M.R.: Towards the determination of sufficient mutant operators for C. Software Testing, Verification and Reliability 11(2), 113–136 (2001) 22. Barnett, M., Chang, B.-Y.E., DeLine, R., Jacobs, B., Leino, K.R.M.: Boogie: A modular reusable verifier for object-oriented programs. In: de Boer, F.S., Bonsangue, M.M., Graf, S., de Roever, W.-P. (eds.) FMCO 2005. LNCS, vol. 4111, pp. 364–387. Springer, Heidelberg (2006) 23. Barnett, M., Leino, K.R.M., Schulte, W.: The Spec# programming system: An overview. In: Barthe, G., Burdy, L., Huisman, M., Lanet, J.-L., Muntean, T. (eds.) CASSIS 2004. LNCS, vol. 3362, pp. 49–69. Springer, Heidelberg (2005) 24. Barroca, L.M., McDermid, J.A.: Formal methods: Use and relevance for the development of safety-critical systems. Computer Journal 35(6), 579–599 (1992) 25. Basanieri, F., Bertolino, A., Marchetti, E.: The Cow Suite Approach to Planning and Deriving Test Suites in UML Projects. In: Jézéquel, J.-M., Hussmann, H., Cook, S. (eds.) UML 2002. LNCS, vol. 2460, pp. 383–397. Springer, Heidelberg (2002) 26. Basili, V.R., Selby, R.W.: Comparing the effectiveness of software testing strategies. IEEE Transactions on Software Engineering 13(12), 1278–1296 (1987) 27. Basili, V.R., Shull, F., Lanubille, F.: Building Knowledge Through Families of Experiments. IEEE Transactions on Software Engineering 25(4), 456–473 (1999) 28. Bass, L., Clements, P., Kazman, R.: Software Architecture in Practice. Addison-Wesley, Reading (2003) 29. Beck, K.: Smalltalk Best Practice Patterns, 1st edn. Prentice-Hall, Englewood Cliffs (1996) 30. Beck, K.: Test Driven Development: By Example. Addison-Wesley, Reading (2003) 31. Beck, K., Andres, C.: Extreme Programming Explained: Embrace Change, 2nd edn. Addison-Wesley Professional, Reading (2004) 32. Beizer, B.: Software Testing Techniques, 2nd edn. Van Nostrand Reinhold (1990) 33. Beizer, B.: Black-Box Testing: Techniques for Functional Testing of Software and Systems. John Wiley & Sons, Chichester (1995) 34. Berman, B.L., Gibson, B.F. (eds.): The three-body force in the three-nucleon system. Lecture Notes in Physics, vol. 260. Springer, Heidelberg (1986) 35. Bernot, G.: Testing against formal specifications: a theoretical view. In: Abramsky, S. (ed.) TAPSOFT 1991, CCPSD 1991, and ADC-Talks 1991. LNCS, vol. 494, pp. 99–119. Springer, Heidelberg (1991) 36. Bernot, G., Gaudel, M.-C., Marre, B.: Software testing based on formal specifications: A theory and a tool. Software Engineering Journal 6(6), 387–405 (1991) 37. Bernot, G., Gaudel, M.-C., Marre, B.: A formal approach to software testing. In: 2nd International Conference on Algebraic Methodology and Software Technology (AMAST). Worshops in Computing Series, vol. 670, pp. 243–253. Springer, Heidelberg (1992) 38. Bidoit, M., Kreowski, H., Lescanne, P., Orejas, F., Sannella, D. (eds.): Algebraic system specification and development. Springer, New York (1991) 39. Bidoit, M., Mosses, P.D. (eds.): CASL User Manual. LNCS, vol. 2900, pp. 221–223. Springer, Heidelberg (2004) 40. Billeter, J.: Counterexample execution. 
Master’s thesis, ETH Zuerich (2008) 41. Binder, R.V.: Testing Object-Oriented Systems—Models, Patterns and Tools. AddisonWesley, Reading (2000)
42. Biolchini, J., Mian, P.G., Natali, A.C.C., Travassos, G.H.: Systematic Review in Software Engineering. Technical report RT-ES 679/05, PESC - COPPE/UFRJ (2005) 43. Bjørner, N., Tillmann, N., Voronkov, A.: Path feasibility analysis for string-manipulating programs. In: Kowalewski, S., Philippou, A. (eds.) Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2009). LNCS, vol. 5505, pp. 307–321. Springer, Heidelberg (2009) 44. Black, P.E., Okun, V., Yesha, Y.: Mutation of model checker specifications for test generation and evaluation. In: Mutation Testing For The New Century, pp. 14–20. Kluwer Academic Publishers, Dordrecht (2001) 45. Böckle, G., Clements, P., McGregor, J.D., Muthig, D., Schmid, K.: A cost model for software product lines. In: 5th International Workshop on Product Family Engineering, pp. 310–316 (2003) 46. Böckle, G., Clements, P., McGregor, J.D., Muthig, D., Schmid, K.: Calculating roi for software product lines. IEEE Software 21(3), 23–31 (2004) 47. Boehm, B., Basili, V.R.: Software defect reduction top 10 list. Computer 34(1), 135–137 (2001) 48. Boehm, B.W.: Industrial software metrics top 10 list. IEEE Software 4(5), 84–85 (1987) 49. Boonstoppel, P., Cadar, C., Engler, D.: Rwset: Attacking path explosion in constraint-based test generation. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 351–366. Springer, Heidelberg (2008) 50. Bougé, L.: Modélisation de la notion de test de programmes, application à la production de jeux de test. Ph. D. thesis, Université de Paris 6 (1982) 51. Bougé, L., Choquet, N., Fribourg, L., Gaudel, M.-C.: Test set generation from algebraic specifications using logic programming. Journal of Systems and Software 6(4), 343–360 (1986) 52. Brace, K.S., Rudell, R.L., Bryant, R.E.: Efficient implementation of a BDD package. In: DAC 1990: Proceedings of the 27th ACM/IEEE Conference on Design Automation, pp. 40–45. ACM Press, New York (1990) 53. Brereton, P., Kitchenham, B.A., Budgen, D., Turner, M., Khalil, M.: Lessons from applying the systematic literature review process within the software engineering domain. The Journal of Systems and Software 80(4), 571–583 (2007) 54. Briand, L.C., Labiche, Y., Sówka, M.M.: Automated, contract-based user testing of commercial-off-the-shelf components. In: 28th International Conference on Software Engineering, pp. 92–101. ACM Press, New York (2006) 55. Brinksma, E.: A theory for the derivation of tests. In: Protocol Specification, testing and Verification VIII, pp. 63–74. North-Holland, Amsterdam (1988) 56. Brinksma, E., Tretmans, J.: Testing transition systems: An annotated bibliography. In: Cassez, F., Jard, C., Rozoy, B., Dermot, M. (eds.) MOVEP 2000. LNCS, vol. 2067, pp. 187–195. Springer, Heidelberg (2001) 57. Brucker, A.D., Wolff, B.: Symbolic test case generation for primitive recursive functions. In: Grabowski, J., Nielsen, B. (eds.) FATES 2004. LNCS, vol. 3395, pp. 16–32. Springer, Heidelberg (2005) 58. Budd, T.A.: Mutation Analysis of Program Test Data. PhD thesis, Yale University (1980) 59. Budd, T.A.: Computer Program Testing. In: Mutation Analysis: Ideas, Example, Problems and Prospects, pp. 129–148. North-Holand Publishing Company (1981) 60. Budd, T.A., Gopal, A.S.: Program testing by specification mutation. Computer Languages 10(1), 63–73 (1985) 61. Bueno, P.M.S., Crespo, A.N., Jino, M.: Analysis of the maturity of a testing process oriented to artefacts. In: 7th International Symposium on Improvements of Software Processes (2005) (in Portuguese)
62. Burstall, R.M., Darlington, J.: A transformation system for developing recursive programs. Journal of the Association for Computing Machinery 24(1), 44–67 (1977) 63. Burton, S.: Automated Testing from Z Specifications. Technical Report YCS 329, Department of Computer Science, University of York (2000) 64. Mendonça, M.G., Maldonado, J.C., Oliveira, M.C.F., Carver, J., Fabbri, S.C.P.F., Shull, F., Travassos, G.H., Hohn, E.N., Basili, V.R.: A framework for software engineering experimental replications. In: 13th IEEE International Conference on Engineering of Complex Computer Systems, pp. 203–212. IEEE Press, Los Alamitos (2007) 65. Cabral, G., Sampaio, A.C.A.: Formal specification generation from requirement documents. In: Brazilian Symposium on Formal Methods, pp. 217–232 (2006) 66. Cadar, C., Dunbar, D., Engler, D.: Klee: Unassisted and automatic generation of highcoverage tests for complex systems programs. In: OSDI, pp. 209–224 (2008) 67. Cadar, C., Ganesh, V., Pawlowski, P., Dill, D., Engler, D.: Exe: automatically generating inputs of death. In: CCS 2006: Proceedings of the 13th ACM conference on Computer and Communications Security, pp. 322–335. ACM Press, New York (2006) 68. Calder, M., Kolberg, M., Magill, E.H., Reiff-Marganiec, S.: Feature interaction: a critical review and considered forecast. Computer Networks 41(1), 115–141 (2003) 69. Cartaxo, E.G.: Test case generation by means of UML sequence diagrams and Label Transition System for mobile phone applications. Master’s thesis, Universidade Federal de Campina Grande, UFCG (2006) 70. Cartaxo, E.G., Andrade, W.L., Oliveira Neto, F.G., Machado, P.D.L.: Lts-bt: A tool to generate and select functional test cases for embedded systems. In: ACM Symposium on Applied Computing (2008) 71. Cartaxo, E.G., Machado, P.D.L., Oliveira Neto, F.G.: On the use of a similarity function for test case selection in the context of model-based testing. In: Software Testing, Verification and Reliability (2009) 72. Cartaxo, E.G., Oliveira Neto, F.G., Machado, P.D.L.: Automated test case selection based on a similarity function. In: Model-based Testing Workshop—37th Annual Congress of the Gesellschaft fuer Informatik. LNI, vol. 110, pp. 381–386 (2007) 73. Cavalcanti, A.L.C., Gaudel, M.-C.: Testing for Refinement in CSP. In: Butler, M., Hinchey, M.G., Larrondo-Petrie, M.M. (eds.) ICFEM 2007. LNCS, vol. 4789, pp. 151–170. Springer, Heidelberg (2007) 74. Chastek, G.J., Donohoe, P., McGregor, J.D.: A production system for software product lines. In: 11th Software Product Line Conference, pp. 117–128. IEEE Computer Society, Los Alamitos (2007) 75. Chen, H.Y., Tse, T.H., Chan, F.T., Chen, T.Y.: In black and white: an integrated approach to class-level testing of object-oriented programs. ACM Transactions on Software Engineering and Methodology 7(3), 250–295 (1998) 76. Chen, H.Y., Tse, T.H., Chen, T.Y.: TACCLE: a methodology for object-oriented software testing at the class and cluster levels. ACM Transactions on Software Engineering and Methodology 10(1), 56–109 (2001) 77. Chillarege, R.: Odc crystallizes test effectiveness. In: 3rd International Workshop on Testing Software Product Lines, pp. 31–32 (2006) 78. Choi, B.J., Mathur, A.P., Pattison, A.P.: Pmothra: Scheduling mutants for execution on a hypercube. In: 3rd Symposium on Software Testing, Analysis and Verification, pp. 58–65 (1989) 79. Chow, T.S.: Testing software design modeled by finite-state machines. IEEE Transactions on Software Engineering SE-4(3), 178–187 (1978) 80. 
Claessen, K., Hughes, J.: Quickcheck: a lightweight tool for random testing of haskell programs. In: International Conference on Functional Programming, pp. 268–279 (2000)
81. Cleaveland, R., Hennessy, M.: Priorities in process algebra. Information and Computation 87(1-2), 58–77 (1990) 82. Cleaveland, R., Lüttgen, G., Natarajan, V.: Priority and abstraction in process algebra. Information and Computation 205(9), 1426–1458 (2007) 83. Clements, P., Northrop, L.M.: Software Product Lines: Practices and Patterns. AddisonWesley, Reading (2002) 84. Cockburn, A.: Writing Effective Use Cases. Addison-Wesley, Reading (2000) 85. Copeland, L.: A Practitioner’s Guide to Software Test Design. Artech House Publishers (2004) 86. Crispin, L., House, T.: Testing Extreme Programming. Addison-Wesley, Reading (2002) 87. Csallner, C., Tillmann, N., Smaragdakis, Y.: Dysy: dynamic symbolic execution for invariant inference. In: ICSE, pp. 281–290 (2008) 88. Dan, L., Aichernig, B.K.: Combining algebraic and model-based test case generation. In: Liu, Z., Araki, K. (eds.) ICTAC 2004. LNCS, vol. 3407, pp. 250–264. Springer, Heidelberg (2005) 89. Dauchy, P., Gaudel, M.-C., Marre, B.: Using algebraic specifications in software testing: a case study on the software of an automatic subway. Journal of Systems and Software 21(3), 229–244 (1993) 90. de Halleux, J.: Mbunit (2007), http://mbunit.com 91. de Halleux, J., Tillmann, N.: Parameterized unit testing with Pex (tutorial). In: Beckert, B., Hähnle, R. (eds.) TAP 2008. LNCS, vol. 4966, pp. 171–181. Springer, Heidelberg (2008) 92. Decker, S., Dager, J.: Software product lines beyond software development. In: 11th International Software Product Line Conference, pp. 275–280. IEEE Computer Society, Los Alamitos (2007) 93. Dehlinger, J., Humphrey, M., Suvorov, L., Padmanabhan, P., Lutz, R.R.: Decimal and plfaultcat: From product-line requirements to product-line member software fault trees. In: ICSE COMPANION 2007: Companion to the Proceedings of the 29th International Conference on Software Engineering, pp. 49–50. IEEE Computer Society Press, Los Alamitos (2007) 94. Dehlinger, J., Lutz, R.R.: Plfaultcat: A product-line software fault tree analysis tool. Automated Software Engg. 13(1), 169–193 (2006) 95. Delamaro, M.E., Maldonado, J.C.: Proteum—a tool for the assessment of test adequacy for C programs. In: Conference on Performability in Computing Systems, pp. 79–95 (1996) 96. Delamaro, M.E., Maldonado, J.C., Mathur, A.P.: Interface mutation: An approach for integration testing. IEEE Transactions on Software Engineering 27(3), 228–247 (2001) 97. Delamaro, M.E., Maldonado, J.C., Vincenzi, A.M.R.: Proteum/IM 2.0: An integrated mutation testing environment. In: Mutation 2000 Symposium, pp. 91–101. Kluwer Academic Publishers, Dordrecht (2000) 98. Delamaro, M.E., Pezzè, M., Vincenzi, A.M.R., Maldonado, J.C.: Mutant operators for testing concurrent Java programs. In: Brazilian Symposium on Software Engineering, pp. 272– 285 (2001) 99. Delamaro, M.E., Vincenzi, A.M.R.: Structural testing of mobile agents. In: Guelfi, N., Astesiano, E., Reggio, G. (eds.) 3rd International Workshop on Scientific Engineering of Java Distributed Applications. LNCS, pp. 73–85. Springer, Heidelberg (2003) 100. Delamaro, M.E., Vincenzi, A.M.R., Maldonado, J.C.: A strategy to perform coverage testing of mobile applications. In: 1st International Workshop on Automation of Software Test, pp. 118–124. ACM Press, New York (2006) 101. DeMillo, R., Lipton, R., Sayward, F.: Hints on test data selection: Help for the practicing programmer. IEEE Computer 11(4), 34–41 (1978) 102. DeMillo, R.A.: Mutation analysis as a tool for software quality assurance. 
In: IEEE International Computer Software and Applications Conference (1980)
300
References
103. DeMillo, R.A., Offutt, A.J.: Constraint based automatic test data generation. IEEE Transactions on Software Engineering 17(9), 900–910 (1991) 104. Dick, J., Faivre, A.: Automating the generation and sequencing of test cases from modelbased specifications. In: Larsen, P.G., Woodcock, J.C.P. (eds.) FME 1993. LNCS, vol. 670, pp. 268–284. Springer, Heidelberg (1993) 105. Dijkstra, E.W.: Notes on structured programming. Technical Report 70-WSK-03, Technological University of Eindhoven (1970), http://www.cs.utexas.edu/users/EWD/ewd02xx/EWD249.PDF 106. DO-178B. Software considerations in airborne systems and equipment certification. requirements and technical concepts for aviation (December 1992) 107. Doche, M., Wiels, V.: Extended institutions for testing. In: Rus, T. (ed.) AMAST 2000. LNCS, vol. 1816, pp. 514–528. Springer, Heidelberg (2000) 108. Dong, R.K., Frankl, P.G.: The ASTOOT approach to testing object-oriented programs. ACM Transactions on Software Engineering and Methodology 3(2), 103–130 (1994) 109. Dyba, T., Kitchenham, B.A., Jorgensen, M.: Evidence-based software engineering for practitioners. IEEE Software 22(1), 58–65 (2005) 110. Dybjer, P., Haiyan, Q., Takeyama, M.: Combining testing and proving in dependent type theory. In: Basin, D., Wolff, B. (eds.) TPHOLs 2003. LNCS, vol. 2758, pp. 188–203. Springer, Heidelberg (2003) 111. El-Far, I.K., Whittaker, J.A.: Model-based software testing. In: Encyclopedia on Software Engineering (2001) 112. Ernst, M.D., Perkins, J.H., Guo, P.J., McCamant, S., Pacheco, C., Tschantz, M.S., Xiao, C.: The Daikon system for dynamic detection of likely invariants. Science of Computer Programming (2007) 113. Pohl, C., et al.: Survey of existing implementation techniques with respect to their support for the requirements identified in m3.2. Technical Report AMPLE D3.1, AMPLE consortium (2007) 114. Fabbri, S.C.P.F.: The Mutation Analysis in the Context of Reactive Systems: a Constribution on the Establishment of Validation and Testing Strategies. PhD thesis, IFSC-USP (1996) (in Portuguese) 115. Fabbri, S.C.P.F., Maldonado, J.C., Masiero, P.C., Delamaro, M.E.: Mutation analysis testing for finite state machines. In: 5th International Symposium on Software Reliability Engineering (ISSRE 1994), pp. 220–229. IEEE Computer Society Press, Los Alamitos (1994) 116. Fabbri, S.C.P.F., Maldonado, J.C., Masiero, P.C., Delamaro, M.E.: Mutation analysis applied to validate specifications based on petri nets. In: 8th IFIP Conference on Formal Descriptions Techniques for Distribute Systems and Communication Protocols, pp. 329–337. Kluwer Academic Publishers, Dordrecht (1995) 117. Fairfield, P., Hennell, M.A.: Data flow analysis of recursive procedures. SIGPLAN Notices 23(1), 48–57 (1988) 118. Fernandez, J.-C., Jard, C., Jéron, T., Viho, G.: An experiment in automatic generation of conformance test suites for protocols with verification technology. Science of Computer Programming 29, 123–146 (1997) 119. Ferrari, F.C., Maldonado, J.C., Rashid, A.: Mutation testing for aspect-oriented programs. In: International Conference on Software Testing, Verification, and Validation, pp. 52–61. IEEE Computer Society, Los Alamitos (2008) 120. Figueiredo, A.L.L., Andrade, W.L., Machado, P.D.L.: Generating interaction test cases for mobile phone systems from use case specifications. In: 2nd International Workshop on Advances in Model-based Software Testing, ACM Press, New York (2006) 121. 
Flanagan, C., Leino, K.R.M., Lillibridge, M., Nelson, G., Saxe, J.B., Stata, R.: Extended static checking for Java. In: Proc. the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, pp. 234–245. ACM Press, New York (2002)
References
301
122. Fowler, M.: Inversion of control containers and the dependency injection pattern. web (January 2004), http://martinfowler.com/articles/injection.html 123. Franchin, I.G.: Structural integration testing of object and aspect-oriented programs: criteria and automation. Master’s thesis, ICMC-USP (2007) (in Portuguese), http://www. teses.usp.br/teses 124. Frankl, P.G., Weyuker, E.J.: A formal analysis of the fault-detecting ability of testing methods. IEEE Transactions on Software Engineering 19(3), 202–213 (1993) 125. Fraser, G., Aichernig, B.K., Wotawa, F.: Handling model changes: Regression testing and test-suite update with model-checkers. Electronic Notes in Theoretical Computer Science 190, 33–46 (2007) 126. Friedman, A.D.: Logical Design of Digital Systems. Computer Science Press, Rockville (1975) 127. Frühwirth, T., Abdennadher, S.: Essentials of Constraint Programming. Springer, Heidelberg (2003) 128. Fuchs, N., Schwertel, U., Schwitter, R.: Attempto Controlled English - not just another logic specification language. In: 8th International Workshop on Logic Programming Synthesis and Transformation, pp. 1–20. Springer, Heidelberg (1990) 129. Le Gall, P.: Les algèbres étiquetées: une sémantique pour les spécifications algébriques fondée sur une utilisation systématique des termes. Application au test de logiciel avec traitement d’exceptions. PhD thesis, Université de Paris XI, Orsay (1993) 130. Le Gall, P., Arnould, A.: Formal specification and test: correctness and oracle. In: Haveraaen, M., Dahl, O.-J., Owe, O. (eds.) Abstract Data Types 1995 and COMPASS 1995. LNCS, vol. 1130, pp. 342–358. Springer, Heidelberg (1996) 131. Gälli, M., Greevy, O., Nierstrasz, O.: Composing unit tests. In: 2nd International Workshop on Testing Software Product Lines, pp. 16–22 (2005) 132. Gamesan, D., Knodel, J., Kolb, R., Haury, U., Meier, G.: Comparing costs and benefits of different test strategies for a software product line: A study from testo ag. In: 11th Software Product Line Conference, pp. 74–83. IEEE Computer Society, Los Alamitos (2007) 133. Ganesan, D., Muthig, D., Yoshimura, K.: Predicting return-on-investment for product line generations. In: 10th Software Product Line Conference, pp. 13–22. IEEE Computer Society, Los Alamitos (2006) 134. Gannon, J., McMullin, P., Hamlet, R.: Data abstraction implementation, specification and testing. ACM Transactions on Programming Languages and Systems 3(3), 211–223 (1981) 135. Gaudel, M.-C.: Testing can be formal, too. In: Mosses, P.D., Schwartzbach, M.I., Nielsen, M. (eds.) CAAP 1995, FASE 1995, and TAPSOFT 1995. LNCS, vol. 915, pp. 82–96. Springer, Heidelberg (1995) 136. Gaudel, M.-C., Le Gall, P.: Testing data types implementations from algebraic specifications. In: Hierons, R.M., Bowen, J.P., Harman, M. (eds.) FORTEST. LNCS, vol. 4949, pp. 209–239. Springer, Heidelberg (2008) 137. Gaudel, M.C., James, P.R.: Testing algebraic data types and processes: a unifying theory. Formal Aspects of Computing 10(5 & 6), 436–451 (1998) 138. Ghosh, S., Mathur, A.P.: Interface mutation. Software Testing, Verification and Reliability 11(4), 227–247 (2001) 139. Godefroid, P.: Compositional dynamic test generation. In: Proc. of POPL 2007, pp. 47–54. ACM Press, New York (2007) 140. Godefroid, P., de Halleux, P., Nori, A.V., Rajamani, S.K., Schulte, W., Tillmann, N., Levin, M.Y.: Automating software testing using program analysis. IEEE Software 25(5), 30–37 (2008) 141. Godefroid, P., Klarlund, N., Sen, K.: DART: directed automated random testing. 
SIGPLAN Notices 40(6), 213–223 (2005)
302
References
142. Godefroid, P., Levin, M., Molnar, D.: Automated whitebox fuzz testing. In: Proceedings of NDSS 2008 (Network and Distributed Systems Security), pp. 151–166 (2008) 143. Goguen, J.A., Thatcher, J.W., Wagner, E.G.: An initial algebra approach to the specification, correctness and implementation of abstract data types. In: Current Trends in Programming Methodology IV: Data structuring, pp. 80–144. Prentice-Hall, Englewood Cliffs (1978) 144. Goodenough, J.B., Gerhart, S.L.: Towards a theory of test data selection. IEEE Transactions on Software Engineering 2(3), 156–173 (1975) 145. Grieskamp, W., Tillmann, N., Schulte, W.: XRT - Exploring Runtime for .NET - Architecture and Applications. In: SoftMC 2005: Workshop on Software Model Checking. Electronic Notes in Theoretical Computer Science (July 2005) 146. Griesmayer, A., Aichernig, B.K., Johnsen, E.B., Schlatte, R.: Dynamic symbolic execution for testing distributed objects. In: Dubois, C. (ed.) Tests and Proofs. LNCS, vol. 5668, pp. 105–120. Springer, Heidelberg (2009) 147. Object Management Group. Uml testing profile. OMG Document Formal/05-07-07 (2005), http://www.omg.org 148. Object Management Group. Model-level testing and debugging specification. Final Adopted Specification, ptc/07-05-14 (2007), http://www.omg.org 149. The RAISE Method Group. The RAISE Development Method. Prentice Hall International, UK (1995) 150. Grütter, G.: Challenges for testing in software product lines. In: 3rd International Workshop on Testing Software Product Lines, pp. 1–4 (2006) 151. Guttag, J.V., Horning, J.J.: The algebraic specification of abstract data types. Acta Informatica 10(1), 27–52 (1978) 152. Haley, A., Zweben, S.: Development and application of a white box approach to integration testing. The Journal of Systems and Software 4, 309–315 (1984) 153. Hamlet, R.G.: Testing programs with the aid of a compiler. IEEE Transactions on Software Engineering 3(4), 279–290 (1977) 154. Harman, M., Danicic, S.: Amorphous program slicing. In: 5th IEEE International Workshop on Program Comprehesion (IWPC 1997), pp. 70–79. IEEE Computer Society Press, Los Alamitos (1997) 155. Harman, M., Hierons, R.M., Danicic, S.: The relationship between program dependence and mutation analysis. In: Mutation 2000 Symposium, pp. 5–12. Kluwer Academic Publishers, Dordrecht (2000) 156. Harold, E.R.: Test your tests with jester. IBM developerWorks - On-line article (2005), http://www.ibm.com/developerworks/java/library/j-jester/ 157. Harrold, M.J.: Testing: A roadmap. In: 22th International Conference on Software Engineering – Future of SE Track, pp. 61–72 (2000) 158. Harrold, M.J., Gupta, R., Soffa, M.L.: A methodology for controlling the size of a test suite. ACM Transactions on Software Engineering Methodology 2(3), 270–285 (1993) 159. Harrold, M.J., Rothermel, G.: Performing data flow testing on classes. In: Second ACM SIGSOFT Symposium on Foundations of Software Engineering, pp. 154–163. ACM Press, New York (1994) 160. Harrold, M.J., Soffa, M.L.: Selecting and using data for integration test. IEEE Transactions on Software Engineering 8(2), 58–65 (1991) 161. Hartman, A., Nagin, K.: The agedis tools for model based testing. In: ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 129–132. ACM Press, New York (2004) 162. Hartmann, J., Robson, D.J.: Techniques for selective revalidation. IEEE Transactions on Software Engineering 7(1), 31–36 (1990) 163. Hedley, D., Hennell, M.A.: The causes and effects of infeasible paths in computer programs. In: ICSE, pp. 
259–267 (1985)
References
303
164. Helke, S., Neustupny, T., Santen, T.: Automating test case generation from Z specifications with Isabelle. In: Till, D., P. Bowen, J., Hinchey, M.G. (eds.) ZUM 1997. LNCS, vol. 1212, pp. 52–71. Springer, Heidelberg (1997) 165. Henkel, J., Diwan, A.: Discovering algebraic specifications from Java classes. In: Cardelli, L. (ed.) ECOOP 2003. LNCS, vol. 2743, pp. 431–456. Springer, Heidelberg (2003) 166. Herman, P.M.: A data flow analysis approach to program testing. Australian Computer Journal 8(3), 92–96 (1976) 167. Hetzel, W.C.: The Complete Guide to Software Testing. In: QED Information Sciences, Wellesly (1984) 168. Hierons, R.M.: Comparing test sets and criteria in the presence of test hypotheses and fault domains. ACM Trans. Softw. Eng. Methodol. 11(4), 427–448 (2002) 169. Hierons, R.M.: Testing from a nondeterministic finite state machine using adaptive state counting. IEEE Transactions on Computers 53(10), 1330–1342 (2004) 170. Hierons, R.M.: Avoiding coincidental correctness in boundary value analysis. ACM Transactions on Software Engineering Methodology 15(3), 227–241 (2006) 171. Hierons, R.M., Harman, M., Danicic, S.: Using program slicing to assist in the detection of equivalent mutants. Software Testing, Verification and Reliability 9(4), 233–262 (1999) 172. Hoare, C.A.R.: Communicating Sequential Processes. Prentice-Hall, Englewood Cliffs (1985) 173. Hoare, C.A.R., He, J.: Unifying Theories of Programming. Prentice-Hall International, Englewood Cliffs (1998) 174. Holt, A.: Formal verification with natural language specifications: guidelines, experiments and lessons so far. South African Computer Journal 257, 253–257 (1999) 175. Howden, W.E.: Functional Program Testing and Analysis. In: Software Engineering and Technology, McGraw-Hill, New York (1987) 176. IEEE. IEEE standard for software test documentation. Standard 829-1998. IEEE Computer Society Press, Los Alamitos (1998) 177. IEEE. IEEE standard glossary of software engineering terminology. Standard 610.12-1990 (R2002). IEEE Computer Society Press, Los Alamitos (2002) 178. Im, K., Im, T., McGregor, J.D.: Automating test case definition using a domain specific language. In: 46th Annual ACM Southeast Conference, pp. 180–185. ACM, New York (2008) 179. Jacobson, I., Ng, P.W., Spence, I.: Enough process — let’s do practices. Journal of Object Technology 6(6), 41–66 (2007) 180. Jard, C., Jéron, T.: Tgv: theory, principles and algorithms: A tool for the automatic synthesis of conformance test cases for non-deterministic reactive systems. International Journal on Software Tools for Technology Transfer 7(4), 297–315 (2005) 181. Jin, Z., Offut, A.J.: Integration testing based on software couplings. In: 10th Annual Conference on Computer Assurance, pp. 13–23 (1995) 182. Jones, C.B.: Systematic Software Development Using VDM. Prentice-Hall, Englewood Cliffs (1986) 183. Jorgensen, P.C.: Software Testing: A Craftsman’s Approach, 2nd edn. CRC Press, Boca Raton (2002) 184. Jorgensen, P.C., Erickson, C.: Object oriented integration testing. Communications of the ACM 37(9), 30–38 (1994) 185. Jouault, F., Kurtev, I.: Transforming models with ATL. In: Bruel, J.-M. (ed.) MoDELS 2005. LNCS, vol. 3844, pp. 128–138. Springer, Heidelberg (2006) 186. JUnit development team. JUnit, http://www.junit.org/ 187. Juristo, N., Moreno, A.M., Vegas, S.: Towards building a solid empirical body of knowledge in testing techniques. SIGSOFT Software Engineering Notes 29(5), 1–4 (2004)
304
References
188. Kamsties, E., Lott, C.: An empirical evaluation of three detection techniques. Technical Report ISERN-95-02, International Software Engineering Network Technical Report (1995) 189. Kaner, C., Falk, J., Nguyen, H.Q.: Testing Computer Software, 2nd edn. International Thomson Computer Press (1993) 190. Kauppinen, R., Taina, J., Tevanlinna, A.: Hook and template coverage criteria for testing framework-based software product families. In: 1st International Workshop on Testing Software Product Lines, pp. 7–12 (2004) 191. Kiczales, G., Hilsdale, E., Hugunin, J., Kersten, M., Palm, J., Griswold, W.G.: Getting started with AspectJ. Communications of the ACM 44(10), 59–65 (2001) 192. Kim, S., Clark, J.A., Mcdermid, J.A.: The rigorous generation of Java mutation operators using HAZOP. In: 12th International Conference on Software & Systems Engineering and their Applications (1999) 193. King, J.C.: A new approach to program testing. ACM SIGPLAN Notices 10(6), 228–233 (1975) 194. King, J.C.: Symbolic execution and program testing. Communications of the ACM 19(7), 385–394 (1976) 195. Kitchenham, B.A.: Procedures for Performing Systematic Reviews. Joint Technical Report, Software Engineering Group, Keele University and Empirical Software Engineering, National ICT Australia Ltd. (2004) 196. Kitchenham, B.A., Dyba, T., Jorgensen, M.: Evidence-based software engineering. In: 26th International Conference on Software Engineering, pp. 273–281. IEEE Computer Society, Los Alamitos (2004) 197. Knauber, P., Schneider, J.: Tracing variability from implementation to test using aspectoriented programming. In: 1st International Workshop on Testing Software Product Lines, pp. 36–44 (2004) 198. Korel, B.: A dynamic approach of test data generation. In: IEEE Conference On Software Maintenance, pp. 311–317 (November 1990) 199. Korel, B., Al-Yami, A.M.: Assertion-oriented automated test data generation. In: Proc. of the 18th International Conference on Software Engineering, pp. 71–80. IEEE Computer Society, Los Alamitos (1996) 200. Krauser, E.W., Mathur, A.P., Rego, V.J.: High performance software testing on simd machines. IEEE Transactions on Software Engineering 17(5), 403–422 (1991) 201. Krenn, W., Aichernig, B.: Test case generation by contract mutation in Spec#. In: 5th Workshop on Model Based Testing. Electronic Notes in Theoretical Computer Science. Elsevier, Amsterdam (2009) 202. Kung, D., Gao, J., Hsia, P., Toyoshima, Y., Chen, C.: On Regression Testing of ObjectOriented Programs. The Journal of Systems and Software 32, 21–40 (1996) 203. Lee, B.-S., Bryant, B.: Automated conversion from requirements documentation to an object-oriented formal specification language. In: ACM Symposium on Applied Computing, pp. 932–936. ACM Press, New York (2002) 204. Lee, D., Yannakakis, M.: Principles and methods of testing finite state machines - A survey. In: Proceedings of the IEEE, vol. 84, pp. 1090–1126 (1996) 205. Legeard, B., Peureux, F., Utting, M.: Automated boundary testing from Z and B. In: Eriksson, L.-H., Lindsay, P.A. (eds.) FME 2002. LNCS, vol. 2391, pp. 21–40. Springer, Heidelberg (2002) 206. Legeard, B., Peureux, F., Utting, M.: Controlling test case explosion in test generation from B formal models. Softw. Test., Verif. Reliab. 14(2), 81–103 (2004) 207. Lemos, O.A.L., Vincenzi, A.M.R., Maldonado, J.C., Masiero, P.C.: Control and data flow structural testing criteria for aspect-oriented programs. The Journal of Systems and Software 80(6), 862–882 (2007)
References
305
208. Lestiennes, G.: Contributions au test de logiciel basé sur des spécifications formelles. Ph. D. thesis, Université de Paris-Sud (2005) 209. Lestiennes, G., Gaudel, M.-C.: Testing processes from formal specifications with inputs, outputs and data types. In: ISSRE, pp. 3–14 (2002) 210. Lestiennes, G., Gaudel, M.-C.: Test de systèmes réactifs non réceptifs. Journal Européen des Systèmes automatisés 36(1-2-3), 255–270 (2005) (in French) 211. Lestiennes, G., Gaudel, M.-C.: Modélisation et test de systèmes comportant des actions prioritaires. Rapport LRI 1434, Université de Paris-Sud (March 2006) 212. Lima, H., Ramalho, F., Machado, P.D.L., Galdino, E.: Automatic generation of built-in contract testers based on model-driven engineering techniques. In: Brazilian Symposium on Software Components, Architectures and Reuse (2007) 213. Lin, J.-C., Yeh, P.-L.: Automatic test data generation for path testing using gas. Information Science 131(1-4), 47–64 (2001) 214. Linkman, S., Vincenzi, A.M.R., Maldonado, J.C.: An evaluation of systematic functional testing using mutation testing. In: 7th International Conference on Empirical Assessment in Software Engineering, IEE (2003) 215. Linnenkugel, U., Müllerburg, M.: Test Data Selection Criteria for (Software) Integration Testing. In: 1st International Conference on Systems Integration, pp. 709–717. IEEE Computer Society, Los Alamitos (1990) 216. Littlewood, B., Popov, P.T., Strigini, L., Shryane, N.: Modeling the effects of combining diverse software fault detection techniques. IEEE Transactions on Software Engineering 26(12), 1157–1167 (2000) 217. Lorentsen, L., Tuovinen, A.-P., Xu, J.: Modelling feature interactions in mobile phones. In: Knudsen, J.L. (ed.) ECOOP 2001. LNCS, vol. 2072, pp. 7–13. Springer, Heidelberg (2001) 218. Lu, D., Lutz, R.R.: Fault contribution trees for product families. In: 13th IEEE International Symposium on Software Reliability Engineering, pp. 231–242. IEEE Computer Society, Los Alamitos (2002) 219. Lynch, N., Tuttle, M.: An introduction to input-output automata. Technical Report MIT/LCS/TM-373, MIT (1988) 220. Ma, Y.-S., Kwon, Y.-R., Offutt, J.: Inter-class mutation operators for Java. In: 13th International Symposium on Software Reliability Engineering, pp. 352–366. IEEE Computer Society Press, Los Alamitos (2002) 221. Ma, Y.-S., Offutt, J., Kwon, Y.R.: MuJava: an automated class mutation system: Research Articles. Software Testing, Verification and Reliability 15(2), 97–133 (2005) 222. Machado, P.: On oracles for interpreting test results against algebraic specifications. In: Haeberer, A.M. (ed.) AMAST 1998. LNCS, vol. 1548, pp. 502–518. Springer, Heidelberg (1998) 223. Machado, P.: Testing from structured algebraic specifications. In: Rus, T. (ed.) AMAST 2000. LNCS, vol. 1816, pp. 529–544. Springer, Heidelberg (2000) 224. Machado, P., Sannella, D.: Unit testing for CASL architectural specifications. In: Diks, K., Rytter, W. (eds.) MFCS 2002. LNCS, vol. 2420, pp. 506–518. Springer, Heidelberg (2002) 225. Machado, P.D.L., Silva, D.A., Mota, A.C.: Towards property oriented testing. Electronic Notes in Theoretical Computer Science 184, 3–19 (2007) 226. Mackinnon, T., Freeman, S., Craig, P.: Endotesting: Unit testing with mock objects. In: XP 2000 (May 2000) 227. Mafra, S.N., Travassos, G.H.: Primary and Secondary Studies supporting the search for Evidence in Software Engineering. Technical report, PESC COPPE/UFRJ. RT-ES 687/06 (2005) (in Portuguese) 228. 
Maldonado, J.C.: Potential-Uses Criteria: A Contribution to the Structural Testing of Software. PhD thesis, DCA/FEE/UNICAMP, Campinas–SP, Brazil (1991) (in Portuguese)
306
References
229. Maldonado, J.C., Carver, J., Shull, F., Fabbri, S., Doria, E., Martimiano, L., Mendonça, M., Basili, V.: Perspective-based reading: A replicated experiment focused on individual reviewer effectiveness. Empirical Software Engineering 11(1), 119–142 (2006) 230. Maldonado, J.C., Delamaro, M.E., Fabbri, S.C.P.F., Simao, A.S., Sugeta, T., Vincenzi, A.M.R., Masiero, P.C.: Proteum: A family of tools to support specification and program testing based on mutation. In: Mutation 2000 Symposium—Tool Session, pp. 113–116. Kluwer Academic Publishers, Dordrecht (2000) 231. Marre, B.: Toward an automatic test data set selection using algebraic specifications and logic programming. In: Furukawa, K. (ed.) Eight International Conference on Logic Programming (ICLP 1991), pp. 25–28. MIT Press, Cambridge (1991) 232. Marre, B.: Loft: a tool for assisting selection of test data sets from algebraic specifications. In: Mosses, P.D., Schwartzbach, M.I., Nielsen, M. (eds.) CAAP 1995, FASE 1995, and TAPSOFT 1995. LNCS, vol. 915, pp. 799–800. Springer, Heidelberg (1995) 233. Marre, B., Arnould, A.: Test sequences generation from Lustre descriptions: GATeL. In: ASE 2000: The 15th IEEE Conference on Automated Software Engineering, Grenoble, September 2000, pp. 229–237. IEEE CS Press, Los Alamitos (2000) 234. Marre, B., Blanc, B.: Test selection strategies for Lustre descriptions in GATeL. In: MBT 2004 joint to ETAPS 2004. ENTCS, vol. 111, pp. 93–111 (2004) 235. Marre, B., Thévenod-Fosse, P., Waeselink, H., Le Gall, P., Crouzet, Y.: An experimental evaluation of formal testing and statistical testing. In: SAFECOMP 1992, pp. 311–316 (1992) 236. Massol, V., Husted, T.: JUnit in Action. Manning (2003) 237. Mathews, P.: Design of Experiements with Minitab. ASQ Quality Press (2004) 238. Mathur, A.P., Krauser, E.W.: Modeling mutation on vector processor. In: 10th International Conference on Software Engineering, pp. 154–161 (1988) 239. Mathur, A.P., Wong, W.E.: An empirical comparison of data flow and mutation based test adequacy criteria. Software Testing, Verification and Reliability 4(1), 9–31 (1994) 240. ISO 8807:1989. LOTOS: A formal description technique based on the temporal ordering of observational behaviour. ISO (1989) 241. McCabe, T.: A complexity measure. IEEE Transactions on Software Engineering 2(4), 308– 320 (1976) 242. McCarthy, J.: Towards a mathematical science of computation. In: IFIP Congress, pp. 21– 28 (1962) 243. McGregor, J.D.: Software product line testing process overview, http://www.cs.clemson.edu/~johnmc/PublishTest/index.htm 244. McGregor, J.D.: Reasoning about the testability of product line components. In: 2nd International Workshop on Testing Software Product Lines, pp. 1–7 (2005) 245. McGregor, J.D.: Test early, test often. Journal of Object-Oriented Programming 6(4), 7–14 (2007), http://www.jot.fm/issues/issue_2007_05/column1 (08-08-2007) 246. McGregor, J.D., Sykes, D.A.: A Practical Guide to Testing Object-Oriented Software. Object Technology Series. Addison-Wesley, Reading (2001) 247. Mendes, E.: A systematic review of Web Engineering research. In: 4th International Symposium on Empirical Software Engineering, pp. 498–507 (2005) 248. Meyer, B.: Eiffel: The Language. Prentice-Hall, New York (1992) 249. Microsoft. Visual Studio (2008), http://msdn2.microsoft.com/vs2008/ 250. Sun MicroSystems. Annotations (2004), http://java.sun.com/j2se/1.5.0/ docs/guide/language/annotations.html 251. Milner, R.: Communication and Concurrency. Prentice-Hall, Englewood Cliffs (1989) 252. 
Morgan, C.: Programming from Specifications, 2nd edn. Prentice-Hall, Englewood Cliffs (1994)
References
307
253. Moura, L.M., Bjørner, N.: Z3: An efficient SMT solver. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 337–340. Springer, Heidelberg (2008) 254. Mresa, E.S., Bottaci, L.: Efficiency of mutation operators and selective mutation strategies: an empirical study. Software Testing, Verification and Reliability 9(4), 205–232 (1999) 255. Musuvathi, M., Qadeer, S., Ball, T., Basler, G., Nainar, P.A., Neamtiu, I.: Finding and reproducing heisenbugs in concurrent programs. In: OSDI, pp. 267–280 (2008) 256. Myers, G.J.: The Art of Software Testing. Wiley, Chichester (1979) 257. Nardi, P., Spoto, E.S., Delamaro, M.E., Vincenzi, A.M.R.: JaBUTi/BD: Use of structural criteria in Java database applications. In: 19th Brazilian Symposium on Software Engineering, pp. 45–50 (2005) 258. Nascimento, L.H.O., Machado, P.D.L.: An experimental evaluation of approaches to feature testing in the mobile phone applications domain. In: Domain-Specific Approaches to Software Test Automation - Satellite workshop of ESEC/FSE 2007, pp. 27–33 (2007) 259. Newkirk, J.W., Vorontsov, A.A.: Test-Driven Development in Microsoft .NET. Microsoft Press, Redmond (April 2004) 260. De Nicola, R., Hennessy, M.C.B.: Testing equivalences for processes. Theoretical Computer Science 34, 83–133 (1984) 261. Nogueira, S., Cartaxo, E.G., Torres, D.G., Aranha, E.H.S., Marques, R.: Model based test generation: An industrial experience. In: 1st Brazilian Workshop on Systematic and Automated Software Testing, Sociedade Brasileira de Computa (2007) 262. Nogueira, S., Sampaio, A.C.A., Mota, A.C.: Guided test generation from csp models. In: Fitzgerald, J.S., Haxthausen, A.E., Yenigun, H. (eds.) ICTAC 2008. LNCS, vol. 5160, pp. 258–273. Springer, Heidelberg (2008) 263. Tillmann, N., Halleux, J.: Pex–White Box Test Generation for .NET. In: Beckert, B., Hähnle, R. (eds.) TAP 2008. LNCS, vol. 4966, pp. 134–153. Springer, Heidelberg (2008) 264. Offutt, A.J., Craft, W.M.: Using compiler optimization techniques to detect equivalent mutants. Software Testing, Verification and Reliability 4, 131–154 (1994) 265. Offutt, A.J., Irvine, A.: Testing object-oriented software using the category-partition method. In: 17th International Conference on Technology of Object-Oriented Languages and Systems, pp. 293–304. Prentice-Hall, Englewood Cliffs (1995) 266. Offutt, A.J., Lee, A., Rothermel, G., Untch, R.H., Zapf, C.: An experimental determination of sufficient mutant operators. ACM Transactions on Software Engineering Methodology 5(2), 99–118 (1996) 267. Offutt, A.J., Pan, J.: Automatically detecting equivalent mutants and infeasible paths. Software Testing, Verification and Reliability 7(3), 165–192 (1997) 268. Offutt, J., Alexander, R.T., Wu, Y., Xiao, Q., Hutchinson, C.: A fault model for subtype inheritance and polymorphism. In: 12th IEEE International Symposium on Software Reliability Engineering, pp. 84–95. IEEE Computer Society, Los Alamitos (2001) 269. Orso, A., Do, H., Rothermel, G., Harrold, M.J., Rosenblum, D.S.: Using component metadata to regression test component-based software: Research Articles. Software Testing, Verification and Reliability 17(2), 61–94 (2007) 270. Ostrand, T.J., Weyuker, E.J.: Using data flow analysis for regression testing. In: 6th Annual Pacific Northwest Software Quality Conference (1988) 271. Pacheco, C., Lahiri, S.K., Ernst, M.D., Ball, T.: Feedback-directed random test generation. 
In: ICSE 2007, Proceedings of the 29th International Conference on Software Engineering, Minneapolis, MN, USA, May 23-25 (2007) 272. Pai, M., McCulloch, M., Gorman, J.D., Pai, N., Enanoria, W., Kennedy, G., Tharyan, P., Colford Jr., J.M.: Clinical Research Methods - Systematic reviews and meta-analyses: An illustrated, step-by-step guide. The National Medical Journal of India 17(2), 86–94 (2004) 273. Parnas, D., Weiss, D.M.: Active design reviews: principles and practices. Journal of Systems and Software 7(4) (1987)
308
References
274. Peleska, J., Siegel, M.: From testing theory to test driver implementation. In: Gaudel, M.-C., Woodcock, J.C.P. (eds.) FME 1996. LNCS, vol. 1051, pp. 538–556. Springer, Heidelberg (1996) 275. Péraire, C., Barbey, S., Buchs, D.: Test selection for object-oriented software based on formal specifications. In: IFIP Working Conference on Programming Concepts and Methods (PROCOMET 1998), Shelter Island, New York, USA, June 1998, pp. 385–403. Chapman Hall, Boca Raton (1998) 276. Pex development team. Pex (2008), http://research.microsoft.com/Pex 277. Pex development team. Stubs, Lightweight Test Stubs and Detours for.NET (2009), http://research.microsoft.com/Stubs 278. Phillips, I.: CCS with priority guards. J. Log. Algebr. Program. 75(1), 139–165 (2008) 279. Pressman, R.S.: Software Engineering—A Practitioner’s Approach, 6th edn. McGraw-Hill, New York (2005) 280. Probert, R.L., Guo, F.: Mutation testing of protocols: Principles and preliminary experimental results. In: IFIP TC6 – Third International Workshop on Protocol Test Systems, pp. 57–76. North-Holland, Amsterdam (1991) 281. Prowell, S.J., Trammell, C.J., Linger, R.C., Poore, J.H.: Cleanroom Software Engineering Technology and Process. The SEI Series in SE. Addison-Wesley, Reading (1999) 282. QualiPSo. Qualipso project (quality platform for open source software). Project Homepage (2007), Europe Comission – Grant Number IST-FP6-IP-034763, Available at http:// www.qualipso.org/ 283. Rapps, S., Weyuker, E.J.: Selecting software test data using data flow information. IEEE Transactions on Software Engineering 11(4), 367–375 (1985) 284. Reis, S., Metzger, A., Pohl, K.: A reuse technique for performance testing of software product lines. In: 3rd International Workshop on Testing Software Product Lines, pp. 5– 10 (2006) 285. Roscoe, A.W.: Modeling and verifying key-exchange protocols using CSP and FDR. In: 8th IEEE Computer Security Foundations Workshop, p. 98. IEEE Computer Society, Los Alamitos (1995) 286. Roscoe, A.W.: The Theory and Practice of Concurrency. Prentice-Hall, Englewood Cliffs (1997) 287. Rothermel, G., Harrold, M.J., Ronne, J.V., Hong, C.: Empirical studies of test-suite reduction. Journal of Software Testing, Verification, and Reliability 12, 219–249 (2002) 288. Rothermel, G., Untch, R.H., Chu, C., Harrold, M.J.: Test case prioritization: an empirical study. In: IEEE International Conference on Software Maintenance, pp. 179–188 (1999) 289. Ruiz, A., Price, Y.W.: Test-driven GUI development with testNG and abbot. IEEE Software 24(3), 51–57 (2007) 290. Rumbaugh, J., Jacobson, I., Booch, G. (eds.): The Unified Modeling Language reference manual. Addison-Wesley, Reading (1999) 291. Saff, D.: Theory-infected: Or how I learned to stop worrying and love universal quantification. In: Companion to Object-Oriented Programming Systems, Languages, and Applications (OOPSLA 2007), Montréal, Canada, October 23–25 (2007) 292. Saff, D., Artzi, S., Perkins, J.H., Ernst, M.D.: Automatic test factoring for Java. In: Proc. 20th ASE, pp. 114–123. ACM Press, New York (2005) 293. Saff, D., Boshernitsan, M., Ernst, M.: Theories in practice: Easy-to-write specifications that catch bugs. Technical Report MIT-CSAIL-TR-2008-002, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, January 14 (2008) 294. Schneider, S.: Abstraction and testing. In: Wing, J.M., Woodcock, J.C.P., Davies, J. (eds.) FM 1999. LNCS, vol. 1708, pp. 738–757. Springer, Heidelberg (1999)
References
309
295. Schwitter, R., Ljungberg, A., Hood, D.: ECOLE - a look-ahead editor for a controlled language. In: Joint Conference combining the 8th International Workshop of the European Association for Machine Translation and the 4th Controlled Language Application Workshop, pp. 141–150 (2003) 296. Sen, K., Agha, G.: CUTE and jCUTE: Concolic unit testing and explicit path modelchecking tools. In: Ball, T., Jones, R.B. (eds.) CAV 2006. LNCS, vol. 4144, pp. 419–423. Springer, Heidelberg (2006) 297. Sen, K., Marinov, D., Agha, G.: CUTE: a concolic unit testing engine for C. In: Proc. of ESEC/FSE 2005, pp. 263–272. ACM Press, New York (2005) 298. Shull, F., Basili, V., Boehm, B., Brown, A.W., Costa, P., Lindvall, M., Port, D., Rus, I., Tesoriero, R., Zelkowitz, M.: What we have learned about fighting defects. In: VIII International Symposium on Software Metrics—METRICS 2002, pp. 249–258. IEEE Computer Society, Los Alamitos (2002) 299. Shull, F., Mendonça, M.G., Basili, V., Carver, J., Maldonado, J.C., Fabbri, S., Travassos, G.H., Ferreira, M.C.: Knowledge-sharing issues in experimental software engineering. Empirical Software Engineering 9(1-2), 111–137 (2004) 300. Shull, F., Rus, I., Basili, V.R.: How perspective-based reading can improve requirements inspections. IEEE Computer 33(7), 73–79 (2000) 301. Sommerville, I.: Software Engineering, 8th edn. Addison-Wesley, Reading (2006) 302. Souza, S.R.S., Maldonado, J.C., Fabbri, S.C.P.F., Souza, W.L.: Mutation testing applied to Estelle specifications. Software Quality Journal 8, 285–301 (1999) 303. Spoto, E.S., Jino, M., Maldonado, J.C.: Structural software testing: An approach to relational database applications. In: 14th Brazilian Symposium on Software Engineering (2000) (in Portuguese) 304. Srivatanakul, T.: Mutation testing for concurrency. Master’s thesis, Department of Computer Science, University of York (2001) 305. Srivatanakul, T., Clark, J., Stepney, S., Polack, F.: Challenging formal specifications by mutation: a CSP security example. In: 10th Asia-Pacific Software Engineering Conference, pp. 340–351. IEEE Press, Los Alamitos (2003) 306. Stepney, S.: Testing as abstraction. In: Bowen, J.P., Hinchey, M.G. (eds.) ZUM 1995. LNCS, vol. 967, pp. 137–151. Springer, Heidelberg (1995) 307. St.Laurent, S., Lenz, E., McRae, M.: Office 2003 XML: Integrating Office with the rest of the world. O’Reilly & Associates, Inc, Sebastopol (2004) 308. Stocks, P.A.: Applying formal methods to software testing. PhD thesis, Department of Computer Science, University of Queensland (1993) 309. Sugeta, T.: Proteum-RS/ST: A tool to support the validation of statecharts based on mutation analysis. Master’s thesis, ICMC-USP (1999) (in Portuguese) 310. Sugeta, T., Maldonado, J.C., Wong, W.E.: Mutation Testing Applied to Validate SDL Specifications. In: Groz, R., Hierons, R.M. (eds.) TestCom 2004. LNCS, vol. 2978, pp. 193–208. Springer, Heidelberg (2004) 311. Tai, K.-C.: Theory of fault-based predicate testing for computer programs. IEEE Transactions on Software Engineering 22(8), 552–562 (1996) 312. Tai, K.-C., Su, H.-K.: Test generation for Boolean expressions. In: 11tth Annual International Computer Software and Applications Conference, pp. 278–284 (1987) 313. Tillmann, N., Chen, F., Schulte, W.: Discovering likely method specifications. In: Liu, Z., He, J. (eds.) ICFEM 2006. LNCS, vol. 4260, pp. 717–736. Springer, Heidelberg (2006) 314. Tillmann, N., Schulte, W.: Parameterized unit tests. 
In: Proceedings of the 10th European Software Engineering Conference held jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 253–262. ACM, New York (2005)
310
References
315. Tillmann, N., Schulte, W.: Unit tests reloaded: Parameterized unit testing with symbolic execution. IEEE Software 23(4), 38–47 (2006) 316. Tiwari, A., Shankar, N., Rushby, J.M.: Invisible formal methods for embedded control systems. Proceedings of the IEEE 91(1), 29–39 (2003) 317. Trammell, C.: Quantifying the reliability of software: statistical testing based on a usage model. In: 2nd IEEE Software Engineering Standards Symposium, p. 208 (1995) 318. Tretmans, G.J.: Test generation with inputs, outputs and repetitive quiescence. Software— Concepts and Tools 3, 103–120 (1996) 319. Tretmans, J.: A Formal Approach to Conformance Testing. PhD thesis, Universiteit Twente (1992) 320. Tretmans, J.: Testing concurrent systems: A formal approach. In: Baeten, J.C.M., Mauw, S. (eds.) CONCUR 1999. LNCS, vol. 1664, pp. 46–65. Springer, Heidelberg (1999) 321. Turner, C.R., Wolf, A.L., Fuggetta, A., Lavazza, L.: Feature engineering. In: 9th International Workshop on Software Specification and Design, p. 162. IEEE Computer Society, Los Alamitos (1998) 322. Two, M., Poole, C., Cansdale, J., Feldman, G., Newkirk, J., Vorontsov, A., Craig, P.: NUnit, http://www.nunit.org/ 323. Ural, H., Yang, B.: A structural test selection criterion. Information Processing Letters 28, 157–163 (1988) 324. Utting, M., Legeard, B.: Practical Model-Based Testing: A Tools Approach. Morgan Kaufmann, San Francisco (2006) 325. Venolia, G., DeLine, R., LaToza, T.: Software development at microsoft observed. Technical Report MSR-TR-2005-140, Microsoft Research, Redmond, WA (October 2005) 326. Vergílio, S.R., Maldonado, J.C., Jino, M.: A strategy for test-data generation. In: 7th Brazilian Symposium on Software Engineering, pp. 307–319 (1993) (in Portuguese) 327. Vilela, P.R.S.: Integration Potencial-Uses Criteria: Definition and Analysis. PhD thesis, DCA/FEEC/UNICAMP (1998) (in Portuguese). 328. Vincenzi, A.M.R., Maldonado, J.C., Delamaro, M.E., Spoto, E.S., Wong, W.E.: Component-Based Software Quality: Methods and Techniques. In: Cechich, A., Piattini, M., Vallecillo, A. (eds.) Component-Based Software Quality. LNCS, vol. 2693, pp. 99–127. Springer, Heidelberg (2003) 329. Vincenzi, A.M.R., Nakagawa, E.Y., Maldonado, J.C., Delamaro, M.E., Romero, R.A.F.: Bayesian-learning based guidelines to determine equivalent mutants. International Journal of Software Engineering and Knowledge Engineering 12(6), 675–689 (2002) 330. Vincenzi, A.M.R., Wong, W.E., Delamaro, M.E., Maldonado, J.C.: JaBUTi: A coverage analysis tool for Java programs. In: 17th Brazilian Symposium on Software Engineering, pp. 79–84 (2003) 331. Vogel, L.: Java emitter template, jet (2009), Available at http://www.vogella.de/articles/EclipseJET/article.html 332. Vries, R.G., Tretmans, J.: Towards formal test purposes. In: 1st International Workshop on Formal Approaches to Testing of Software, pp. 61–76 (2001) 333. W3C OWL Working Group. OWL 2 web ontology language — document overview. World Wide Web Consortium, Working Draft WD-owl2-overview-20090611 (June 2009) 334. Wallnau, K.C.: A technology for predictable assembly from certifiable components. Technical Report CMU/SEI-2003-TR-009, Software Engineering Institute (2003) 335. Wang, X., Cheung, S.C., Chan, W.K., Zhang, Z.: Taming coincidental correctness: Coverage refinement with context patterns to improve fault localization. In: Proceedings of 31st International Conference on Software Engineering, ICSE 2009, Vancouver, Canada, May 16-24, pp. 45–55. IEEE, Los Alamitos (2009) 336. 
Warmer, J., Kleppe, A.: The Object Constraint Language: Getting your models ready for MDA, 2nd edn. Addison-Wesley, Reading (2003)
References
311
337. Weiglhofer, M., Aichernig, B., Wotawa, F.: Fault-based conformance testing in practice. International Journal of Software and Informatics 3(2–3), 375–411 (2009) 338. Weiglhofer, M., Aichernig, B.K.: Unifying input output conformance. In: 2nd International Symposium on Unifying Theories of Programming. LNCS, vol. 5713. Springer, Heidelberg (2009) 339. Weyuker, E.J.: The complexity of data flow for test data selection. Information Processing Letters 19(2), 103–109 (1984) 340. Weyuker, E.J., Jeng, B.: Analyzing partition testing strategies. IEEE Transactions on Software Engineering 17(7), 703–711 (1991) 341. Weyuker, E.J., Weiss, S.N., Hamlet, R.G.: Comparison of program testing strategies. In: 4th Symposium on Software Testing, Analysis and Verification, pp. 154–164. ACM Press, New York (1991) 342. Whittaker, J.A., Thomason, M.G.: A markov chain model for statistical software testing. IEEE Transactions on Software Engineering 20(10), 812–824 (1994) 343. Wiki. Mock objects, http://www.mockobjects.com 344. Wikipedia. Luhn algorithm (2007), http://en.wikipedia.org/wiki/Credit_ card_number [accessed 15-November-2007] 345. Wikipedia. Luhn algorithm (2007), http://en.wikipedia.org/w/index.php?parameterstitle=Luhn_ algorithm&oldid=163863211 (accessed 12-October-2007) 346. Williams, J.J.: Data driven test environment considerations when developing controls product line test architecture. In: 3rd International Workshop on Testing Software Product Lines, pp. 11–16 (2006) 347. Wimmel, G., Jürjens, J.: Specification-based test generation for security-critical systems using mutations. In: George, C.W., Miao, H. (eds.) ICFEM 2002. LNCS, vol. 2495, pp. 471–482. Springer, Heidelberg (2002) 348. Wirsing, M.: Algebraic Specification. In: Handbook of Theoretical Computer Science. Formal models and semantics, vol. B. Elsevier, Amsterdam (1990) 349. Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimentation in software engineering: an introduction. Kluwer Academic Publishers, Dordrecht (2000) 350. Wojcik, R., Hoard, J., Holzhauser, K.: The Boeing Simplified English Checker. In: International Conference on Human Machine Interaction and Artificial Intelligence in Aeronautics and Space, pp. 43–57 (1990) 351. Wong, W.E.: On Mutation and Data Flow. PhD thesis, Department of Computer Science, Purdue University (1993) 352. Wong, W.E., Horgan, J.R., London, S., Mathur, A.P.: Effect of test set minimization on fault detection effectiveness. In: 17th IEEE International Conference on Software Engineering, pp. 41–50 (1995) 353. Wong, W.E., Maldonado, J.C., Delamaro, M.E.: Reducing the cost of regression test by using selective mutation. In: 8th International Conference on Software Technology, pp. 11– 13 (1997) 354. Wong, W.E., Mathur, A.P., Maldonado, J.C.: Mutation versus all-uses: An empirical evaluation of cost, strength, and effectiveness. In: International Conference on Software Quality and Productivity, pp. 258–265. Chapman and Hall, Boca Raton (1994) 355. Eric Wong, W. (ed.): Mutation Testing for the New Century. Kluwer Academic Publishers, Dordrecht (2001) 356. Woodcock, J., Cavalcanti, A., Freitas, L.: Operational semantics for model checking Circus. In: Fitzgerald, J.S., Hayes, I.J., Tarlecki, A. (eds.) FM 2005. LNCS, vol. 3582, pp. 237–252. Springer, Heidelberg (2005)
312
References
357. Woodward, M.R.: Mutation testing—its origin and evolution. Information and Software Technology 35(3), 163–169 (1993) 358. Woodward, M.R., Hedley, D., Hennell, M.A.: Experience with path analysis and testing of programs. IEEE Transactions on Software Engineering 6, 278–286 (1980) 359. Xie, T., Tillmann, N., de Halleux, P., Schulte, W.: Fitness-guided path exploration in dynamic symbolic execution. Technical Report MSR-TR-2008-123, Microsoft Research, Redmond, WA (September 2008) 360. Yang, Q., Li, J.J., Weiss, D.: A survey of coverage based testing tools. In: 1st International Workshop on Automation of Software Test, pp. 99–103. ACM Press, New York (2006) 361. ITU-T Recommendation Z.140. Testing and Test Control Notation version 3 (TTCN-3): TTCN-3 Core Language. ITU-T (2006) 362. Zhu, H.: A note on test oracles and semantics of algebraic specifications. In: QSIC 2003, pp. 91–99. IEEE Computer Society, Los Alamitos (2003)
Author Index
Aichernig, Bernhard K.   243
de Halleux, Jonathan   141
Delamaro, Márcio   18
Gaudel, Marie-Claude   215
Hennell, Michael   203
Höhn, Erika   18
Machado, Patrícia   1, 59
Maldonado, José Carlos   1, 18
McGregor, John D.   104
Sampaio, Augusto   59
Schulte, Wolfram   141
Tillmann, Nikolai   141
Vincenzi, Auri   1, 18